lrnnx.models.ltv package

Linear Time-Varying (LTV) LRNN models.

class LTV_LRNN[source]

Bases: LRNN

Base class for all LTV (Linear Time-Varying) LRNN models.

Note

LTV models support async discretization for event-driven processing where timesteps between events may vary. This is specified via the integration_timesteps parameter in forward().

Example

>>> from lrnnx.models.ltv import LTV_LRNN
>>> my_lrnn = LTV_LRNN("zoh")
>>> # create dummy input tensor and perform forward pass
>>> # in subclass
__init__(discretization: Literal['zoh', 'bilinear', 'dirac', 'async', 'no_discretization'] | None)[source]

Initialize the LTV LRNN base class.

Parameters:

discretization (Literal["zoh", "bilinear", "dirac", "async", "no_discretization"] | None) – Discretization method to use. Can be one of "zoh", "bilinear", "dirac", "async", "no_discretization", or None for models that handle discretization internally.
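As a rough illustration of what these method names usually refer to, the sketch below applies the textbook zero-order-hold and bilinear rules, plus a Dirac-style rule, to a single scalar mode x' = a·x + b·u. The helper name `discretize` and the exact formulas are assumptions made for illustration; the library's kernels may differ in detail.

```python
import math

def discretize(a: float, b: float, dt: float, method: str = "zoh"):
    """Return (a_bar, b_bar) for the scalar system x' = a*x + b*u.

    Hypothetical helper, not part of lrnnx.
    """
    if method == "zoh":
        # Exact solution when the input is held constant over the step.
        a_bar = math.exp(a * dt)
        b_bar = (a_bar - 1.0) / a * b
    elif method == "bilinear":
        # Tustin / trapezoidal rule.
        denom = 1.0 - 0.5 * dt * a
        a_bar = (1.0 + 0.5 * dt * a) / denom
        b_bar = dt * b / denom
    elif method == "dirac":
        # Input treated as an impulse: the state decays, the input is added unscaled.
        a_bar = math.exp(a * dt)
        b_bar = b
    else:
        raise ValueError(f"unknown method: {method!r}")
    return a_bar, b_bar

zoh = discretize(-1.0, 1.0, 0.1, "zoh")
bilinear = discretize(-1.0, 1.0, 0.1, "bilinear")
dirac = discretize(-1.0, 1.0, 0.1, "dirac")
```

For small steps the zero-order-hold and bilinear results are close to each other; the Dirac-style rule differs mainly in that the input gain does not shrink with the step size.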

abstractmethod forward(x: torch.Tensor, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None, inference_cache: Dict[str, Any] | None = None) → torch.Tensor[source]

Forward pass through the LTV model.

Parameters:
  • x (torch.Tensor) – Input tensor, shape (B, L, H).

  • integration_timesteps (torch.Tensor, optional) – Timesteps for async/event-driven discretization (Reference: https://arxiv.org/abs/2404.18508), shape (B, L). If None, uniform timesteps are assumed. Defaults to None.

  • lengths (torch.Tensor, optional) – Lengths of sequences, shape (B,), required for variable-length sequences or bidirectional models. Defaults to None.

  • inference_cache (dict, optional) – Cache containing states and pre-computed values for efficient autoregressive generation. If provided during inference, enables incremental processing. Defaults to None.

Returns:

Output tensor, same shape as input (x), i.e., (B, L, H).

Return type:

torch.Tensor
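To make the role of integration_timesteps concrete, here is a minimal scalar sketch of a recurrence whose decay depends on the time elapsed before each event. The function `zoh_scan`, the use of zero-order hold, and the choice of a unit step when no timesteps are given are all assumptions for illustration, not the library's implementation.

```python
import math

def zoh_scan(a, b, inputs, timesteps=None):
    """Run a scalar ZOH-discretized recurrence over one sequence.

    timesteps[k] is the time elapsed before event k; None means a
    uniform step of 1.0 (an assumption of this sketch).
    Hypothetical helper, not part of lrnnx.
    """
    if timesteps is None:
        timesteps = [1.0] * len(inputs)
    state, outputs = 0.0, []
    for u, dt in zip(inputs, timesteps):
        a_bar = math.exp(a * dt)          # decay depends on the elapsed time
        b_bar = (a_bar - 1.0) / a * b     # ZOH input gain for this step
        state = a_bar * state + b_bar * u
        outputs.append(state)
    return outputs

events = [1.0, 0.0, 0.0]
irregular = zoh_scan(-1.0, 1.0, events, timesteps=[0.1, 0.2, 0.3])
uniform = zoh_scan(-1.0, 1.0, events)
```

With irregular timesteps the state after the last event has decayed over a total gap of 0.2 + 0.3, whereas the uniform run decays over two unit steps.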

abstractmethod step(x: torch.Tensor, inference_cache: Dict[str, Any]) → Tuple[torch.Tensor, Dict[str, Any]][source]

Performs a single recurrent step of the LTV model.

This method is used for autoregressive inference, where inputs are processed one timestep at a time. Unlike LTI models, the dynamics may vary at each step based on the input.

Parameters:
  • x (torch.Tensor) – Input at current timestep, shape (B, 1, H).

  • inference_cache (Dict[str, Any]) – Cache dictionary containing model states. This is the same format returned by allocate_inference_cache(). The cache is updated in-place and also returned for convenience.

Returns:

A tuple containing:
  • y : Output at current timestep, shape (B, 1, H).

  • inference_cache : Updated cache dictionary.

Return type:

tuple[torch.Tensor, Dict[str, Any]]

abstractmethod allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict[str, Any][source]

Allocates cache for efficient autoregressive inference.

For LTV models, this typically includes:

  • Initial hidden state(s)

  • Any auxiliary states (e.g., convolution state for Mamba)

  • Metadata for tracking sequence position

Parameters:
  • batch_size (int) – The batch size for inference.

  • max_seqlen (int) – Maximum sequence length to support.

  • dtype (torch.dtype, optional) – Data type for allocated tensors. If None, uses the model’s default dtype. Defaults to None.

Returns:

Cache dictionary that can be passed to forward() and step().

Should contain at minimum:
  • Model state tensors (e.g., “lrnn_state”, “conv_state”)

  • “seqlen_offset”: Current position in the sequence

Return type:

Dict[str, Any]
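The three abstract methods fit together as a small protocol: allocate a cache, then either process a whole sequence or feed it one timestep at a time. The toy class below sketches that protocol with plain Python floats instead of tensors. `ToyGatedLRNN` and its gating rule are hypothetical; only the cache keys “lrnn_state” and “seqlen_offset” are taken from the text above.

```python
import math
from typing import Any, Dict, List, Tuple

class ToyGatedLRNN:
    """Scalar stand-in for an LTV layer; not a real lrnnx class."""

    def allocate_inference_cache(self, batch_size: int, max_seqlen: int) -> Dict[str, Any]:
        # One scalar state per batch element plus a position counter.
        return {"lrnn_state": [0.0] * batch_size, "seqlen_offset": 0}

    def step(self, x: List[float], cache: Dict[str, Any]) -> Tuple[List[float], Dict[str, Any]]:
        out = []
        for i, u in enumerate(x):
            gate = 1.0 / (1.0 + math.exp(-u))   # input-dependent decay (time-varying)
            h = gate * cache["lrnn_state"][i] + (1.0 - gate) * u
            cache["lrnn_state"][i] = h          # updated in place
            out.append(h)
        cache["seqlen_offset"] += 1
        return out, cache

    def forward(self, xs: List[List[float]]) -> List[List[float]]:
        # xs is indexed [timestep][batch]; runs the whole sequence at once.
        cache = self.allocate_inference_cache(len(xs[0]), len(xs))
        return [self.step(x, cache)[0] for x in xs]

model = ToyGatedLRNN()
sequence = [[0.5, -1.0], [1.5, 0.0], [-0.3, 2.0]]   # 3 timesteps, batch of 2
full = model.forward(sequence)

cache = model.allocate_inference_cache(batch_size=2, max_seqlen=3)
stepwise = []
for x_t in sequence:
    y_t, cache = model.step(x_t, cache)
    stepwise.append(y_t)
```

Stepping through the sequence reproduces the full forward pass, which is the property autoregressive inference relies on.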

class Mamba[source]

Bases: LTV_LRNN

Mamba: Selective State Space Model with optional event-based processing.

When integration_timesteps is provided in forward(), uses asymmetric discretization (separate dtA and dtB) for event-driven processing. Otherwise, uses standard Mamba discretization.

Example

>>> import torch
>>> from lrnnx.models.ltv import Mamba
>>> model = Mamba(d_model=64, d_state=16, d_conv=4)
>>> x = torch.randn(2, 128, 64)
>>> y = model(x)
>>> y.shape
torch.Size([2, 128, 64])
__init__(d_model, d_state=16, d_conv=4, expand=2, dt_rank='auto', dt_min=0.001, dt_max=0.1, dt_init='random', dt_scale=1.0, dt_init_floor=0.0001, conv_bias=True, bias=False, use_fast_path=True, layer_idx=None, device=None, dtype=None, discretization='mamba')[source]

Initialize Mamba model.

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int, optional) – SSM state dimension (N). Defaults to 16.

  • d_conv (int, optional) – Convolution kernel size. Defaults to 4.

  • expand (int, optional) – Expansion factor for inner dimension. Defaults to 2.

  • dt_rank (Union[int, str], optional) – Rank for delta projection, "auto" = ceil(d_model / 16). Defaults to "auto".

  • dt_min (float, optional) – Minimum value for delta initialization. Defaults to 0.001.

  • dt_max (float, optional) – Maximum value for delta initialization. Defaults to 0.1.

  • dt_init (str, optional) – Initialization method ("random" or "constant"). Defaults to "random".

  • dt_scale (float, optional) – Scale factor for dt initialization. Defaults to 1.0.

  • dt_init_floor (float, optional) – Floor value for dt initialization. Defaults to 1e-4.

  • conv_bias (bool, optional) – Whether to use bias in convolution. Defaults to True.

  • bias (bool, optional) – Whether to use bias in linear projections. Defaults to False.

  • use_fast_path (bool, optional) – Whether to use fused CUDA kernels. Defaults to True.

  • layer_idx (int, optional) – Layer index for multi-layer caching. Defaults to None.

  • device (torch.device, optional) – Device for parameters. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for parameters. Defaults to None.

  • discretization (str, optional) – Discretization type. Defaults to "mamba".
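The dt_min, dt_max, and dt_init_floor parameters can be read against the initialisation used in the reference Mamba implementation, which samples a step size log-uniformly and stores its inverse softplus as a bias. The sketch below reproduces that scheme with the standard library; whether this class follows it exactly is an assumption, and `init_dt` is a hypothetical helper that ignores dt_scale and the "constant" mode.

```python
import math
import random

def init_dt(dt_min=0.001, dt_max=0.1, dt_init_floor=1e-4, rng=random):
    """Sample a step size log-uniformly and return (dt, inverse-softplus bias)."""
    log_dt = rng.random() * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
    dt = max(math.exp(log_dt), dt_init_floor)      # clamp from below
    # Inverse of softplus, so that softplus(bias) == dt at initialisation.
    bias = dt + math.log(-math.expm1(-dt))
    return dt, bias

def softplus(x: float) -> float:
    return math.log1p(math.exp(x))

rng = random.Random(0)
samples = [init_dt(rng=rng) for _ in range(100)]
```

Sampling in log space spreads the initial step sizes evenly across orders of magnitude between dt_min and dt_max.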

forward(hidden_states, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None, inference_cache: Dict[str, Any] | None = None)[source]

Forward pass through Mamba.

Parameters:
  • hidden_states (torch.Tensor) – Input tensor, shape (B, L, D).

  • integration_timesteps (torch.Tensor, optional) – Time intervals between events. Shape (B, L). When provided, uses asymmetric discretization with separate dtA and dtB for event-driven processing. Defaults to None.

  • lengths (torch.Tensor, optional) – Not used by Mamba currently. Defaults to None.

  • inference_cache (dict, optional) – Cache for autoregressive generation. If provided, contains “conv_state” and “lrnn_state” tensors. Defaults to None.

Returns:

Output tensor, shape (B, L, D).

Return type:

torch.Tensor
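One plausible reading of "separate dtA and dtB" is that the state decay and the input gain use different step sizes. The scalar sketch below shows that idea next to the usual Mamba update, in which a single step size drives both terms. The helper `selective_step` and the way dt_a and dt_b are chosen here are assumptions; the text above does not specify how the library derives them.

```python
import math

def selective_step(h, u, a, b, dt_a, dt_b):
    """One scalar state update with separate step sizes for A and B.

    a_bar = exp(dt_a * a) controls decay; b_bar = dt_b * b scales the input.
    Hypothetical helper; how lrnnx derives dt_a and dt_b is not shown here.
    """
    a_bar = math.exp(dt_a * a)
    b_bar = dt_b * b
    return a_bar * h + b_bar * u

# Standard Mamba-style update: one step size delta drives both terms.
delta = 0.05
standard = selective_step(h=1.0, u=2.0, a=-1.0, b=1.0, dt_a=delta, dt_b=delta)

# Event-driven variant: a long gap (0.5) decays the state further,
# while the input gain still uses delta.
event = selective_step(h=1.0, u=2.0, a=-1.0, b=1.0, dt_a=0.5, dt_b=delta)
```

Setting dt_a equal to dt_b recovers the symmetric case, so the asymmetric form can be seen as a generalisation of it.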

step(x: torch.Tensor, inference_cache: Dict[str, Any], integration_timesteps: torch.Tensor | None = None) → Tuple[torch.Tensor, Dict[str, Any]][source]

Performs a single recurrent step of Mamba.

Parameters:
  • x (torch.Tensor) – Input at current timestep, shape (B, 1, D).

  • inference_cache (Dict[str, Any]) – Cache dictionary containing “conv_state” (convolution state, shape (B, D_inner, d_conv)), “lrnn_state” (SSM state, shape (B, D_inner, N)), and “seqlen_offset” (current position in the sequence).

  • integration_timesteps (torch.Tensor, optional) – Integration timestep, shape (B, 1) or (B,). When provided, uses event-based asymmetric discretization. Defaults to None.

Returns:

A tuple containing:
  • out : Output at current timestep, shape (B, 1, D).

  • inference_cache : Updated cache dictionary.

Return type:

tuple[torch.Tensor, Dict[str, Any]]

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict[str, Any][source]

Allocates cache for Mamba autoregressive inference.

Parameters:
  • batch_size (int) – The batch size for inference.

  • max_seqlen (int) – Maximum sequence length (not used by Mamba, but kept for interface consistency).

  • dtype (torch.dtype, optional) – Data type for allocated tensors. Defaults to None.

Returns:

Cache dictionary containing:
  • “conv_state”: Convolution state, shape (B, D_inner, d_conv).

  • “lrnn_state”: SSM state, shape (B, D_inner, N).

  • “seqlen_offset”: Current position in the sequence (starts at 0).

Return type:

Dict[str, Any]
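The documented cache shapes can be worked out from the constructor arguments. The sketch below does so under the assumption that D_inner equals expand * d_model, as in the reference Mamba implementation, and also evaluates the documented dt_rank="auto" rule. Both helper names are hypothetical.

```python
import math

def mamba_cache_shapes(batch_size, d_model, d_state=16, d_conv=4, expand=2):
    """Return the documented cache shapes, assuming D_inner = expand * d_model."""
    d_inner = expand * d_model
    return {
        "conv_state": (batch_size, d_inner, d_conv),
        "lrnn_state": (batch_size, d_inner, d_state),
        "seqlen_offset": 0,
    }

def auto_dt_rank(d_model: int) -> int:
    # The documented rule for dt_rank="auto".
    return math.ceil(d_model / 16)

shapes = mamba_cache_shapes(batch_size=2, d_model=64)
rank = auto_dt_rank(64)
```

Neither shape depends on the sequence length, which is consistent with max_seqlen being unused.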

class RGLRU[source]

Bases: LTV_LRNN

RG-LRU block following the Griffin architecture.

Example

>>> import torch
>>> from lrnnx.models.ltv import RGLRU
>>> model = RGLRU(d_model=64, d_conv=4)
>>> x = torch.randn(2, 128, 64)
>>> y = model(x)
>>> y.shape
torch.Size([2, 128, 64])
__init__(d_model: int, d_conv: int = 4, expand: int = 1, c: float = 8.0, a_init_range: Tuple[float, float] = (0.9, 0.999), conv_bias: bool = True, bias: bool = False, use_fast_path: bool = True, layer_idx: int | None = None, device=None, dtype=None)[source]

Initialize RG-LRU block.

Parameters:
  • d_model (int) – Model dimension.

  • d_conv (int, optional) – Temporal convolution kernel size. Defaults to 4.

  • expand (int, optional) – Expansion factor for inner dimension. Defaults to 1.

  • c (float, optional) – Fixed scalar for recurrent gate scaling. Defaults to 8.0.

  • a_init_range (Tuple[float, float], optional) – Range (lo, hi) from which a is initialised; the interval [lo, hi] must lie within (0, 1). Defaults to (0.9, 0.999).

  • conv_bias (bool, optional) – Whether the Conv1D uses a bias term. Defaults to True.

  • bias (bool, optional) – Whether Linear projections use bias. Defaults to False.

  • use_fast_path (bool, optional) – Use the fused CUDA kernel when available. Defaults to True.

  • layer_idx (int, optional) – Layer index (for multi-layer caching). Defaults to None.

  • device (torch.device, optional) – Device for parameters. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for parameters. Defaults to None.
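For orientation, the parameters c and a relate to the RG-LRU update as given in the Griffin paper: a recurrence gate r scales the exponent of a, and the input is weighted by sqrt(1 - a_t²). The scalar sketch below writes that update out. Passing the gate logits in directly is a simplification (in the paper they are linear functions of the input), and `rglru_step` is a hypothetical helper, not this class's implementation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rglru_step(h_prev, x, a, r_logit, i_logit, c=8.0):
    """One scalar RG-LRU update following the Griffin paper's equations."""
    r = sigmoid(r_logit)            # recurrence gate
    i = sigmoid(i_logit)            # input gate
    a_t = a ** (c * r)              # input-dependent decay in (0, 1)
    # The sqrt(1 - a_t^2) factor keeps the state scale bounded.
    h = a_t * h_prev + math.sqrt(1.0 - a_t * a_t) * (i * x)
    return h, a_t

h, a_t = rglru_step(h_prev=0.5, x=1.0, a=0.9, r_logit=0.0, i_logit=0.0)
```

Because a lies in (0, 1) and c * r is positive, the effective decay a_t also stays in (0, 1), which is why a_init_range is restricted to that interval.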

forward(hidden_states: torch.Tensor, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None, inference_cache: Dict[str, Any] | None = None) → torch.Tensor[source]

Forward pass through the RG-LRU block.

Parameters:
  • hidden_states (torch.Tensor) – Input tensor of shape (B, L, D).

  • integration_timesteps (torch.Tensor, optional) – Unused; kept for LTV interface compatibility. Defaults to None.

  • lengths (torch.Tensor, optional) – Unused; kept for interface compatibility. Defaults to None.

  • inference_cache (Dict[str, Any], optional) – Cache dict for autoregressive generation. Defaults to None.

Returns:

Output tensor of shape (B, L, D).

Return type:

torch.Tensor

step(hidden_states: torch.Tensor, inference_cache: Dict[str, Any]) → Tuple[torch.Tensor, Dict[str, Any]][source]

Single recurrent step for autoregressive inference.

Parameters:
  • hidden_states (torch.Tensor) – Input tensor of shape (B, 1, D).

  • inference_cache (Dict[str, Any]) – Must contain conv_state, lrnn_state, and seqlen_offset.

Returns:

Tuple containing:
  • out : Output tensor of shape (B, 1, D).

  • inference_cache : Updated cache dictionary.

Return type:

tuple[torch.Tensor, Dict[str, Any]]

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict[str, Any][source]

Allocate cache for autoregressive inference.

Parameters:
  • batch_size (int) – Batch size.

  • max_seqlen (int) – Unused, kept for interface consistency.

  • dtype (torch.dtype, optional) – Data type for cache tensors. Defaults to None.

Returns:

Cache dictionary containing “conv_state”, “lrnn_state”, and “seqlen_offset”.

Return type:

Dict[str, Any]

class S7[source]

Bases: LTV_LRNN

S7: Selective and Simplified State Space Layers for Sequence Modeling.

Example

>>> import torch
>>> from lrnnx.models.ltv import S7
>>> model = S7(d_model=64, d_state=64)
>>> x = torch.randn(2, 128, 64)
>>> y = model(x)
>>> y.shape
torch.Size([2, 128, 64])
__init__(d_model: int, d_state: int, J: int = 1, use_fast_path: bool = True, layer_idx: int | None = None, device=None, dtype=None)[source]

Initialize S7 model.

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int) – State dimension. Must be divisible by J.

  • J (int, optional) – Number of blocks for initialization. Defaults to 1.

  • use_fast_path (bool, optional) – Whether to use the CUDA fast path if available. Defaults to True.

  • layer_idx (int, optional) – Layer index for multi-layer models, used for caching. Defaults to None.

  • device (torch.device, optional) – Device for the model parameters. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for the model parameters. Defaults to None.

forward(hidden_states: torch.Tensor, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None, inference_cache: Dict[str, Any] | None = None) → torch.Tensor[source]

Forward pass through the S7 layer.

Parameters:
  • hidden_states (torch.Tensor) – Input tensor of shape (B, L, H).

  • integration_timesteps (torch.Tensor, optional) – Timesteps for async/event-driven discretization. Defaults to None.

  • lengths (torch.Tensor, optional) – Lengths of sequences, required for variable-length sequences. Defaults to None.

  • inference_cache (Dict[str, Any], optional) – Cache for autoregressive generation. Defaults to None.

Returns:

Output tensor of shape (B, L, H).

Return type:

torch.Tensor

step(hidden_states: torch.Tensor, inference_cache: Dict[str, Any]) → Tuple[torch.Tensor, Dict[str, Any]][source]

Performs a single recurrent step of S7 for autoregressive inference.

Parameters:
  • hidden_states (torch.Tensor) – Input at current timestep, shape (B, 1, H).

  • inference_cache (Dict[str, Any]) – Cache dictionary containing the model state.

Returns:

A tuple containing:
  • out : Output tensor at the current timestep, shape (B, 1, H).

  • inference_cache : Updated cache dictionary.

Return type:

tuple[torch.Tensor, Dict[str, Any]]

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict[str, Any][source]

Allocates cache for S7 autoregressive inference.

Parameters:
  • batch_size (int) – The batch size for inference.

  • max_seqlen (int) – Maximum sequence length (unused, kept for interface consistency).

  • dtype (torch.dtype, optional) – Data type for allocated tensors. Defaults to None.

Returns:

Cache dictionary containing “lrnn_state” and “seqlen_offset”.

Return type:

Dict[str, Any]

class S5[source]

Bases: LTV_LRNN

S5 SSM with CUDA kernel acceleration. Reference: https://openreview.net/forum?id=Ai8Hw3AXqks

Example

>>> import torch
>>> from lrnnx.models.ltv import S5
>>> model = S5(d_model=64, d_state=64)
>>> x = torch.randn(2, 128, 64)
>>> y = model(x)
>>> y.shape
torch.Size([2, 128, 64])
__init__(d_model: int, d_state: int, discretization: Literal['bilinear', 'zoh', 'dirac'] = 'zoh', conj_sym: bool = False, dt_min: float = 0.001, dt_max: float = 0.1, step_rescale: float = 1.0, use_fast_path: bool = True, device=None, dtype=None)[source]

Initialize S5 model.

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int) – State dimension.

  • discretization (Literal["bilinear", "zoh", "dirac"], optional) – Discretization method. Defaults to "zoh".

  • conj_sym (bool, optional) – If True, uses conjugate symmetry for the state space model. Defaults to False.

  • dt_min (float, optional) – Minimum value for dt initialization. Defaults to 0.001.

  • dt_max (float, optional) – Maximum value for dt initialization. Defaults to 0.1.

  • step_rescale (float, optional) – Rescale factor for step size. Defaults to 1.0.

  • use_fast_path (bool, optional) – Whether to use fused CUDA kernels. Defaults to True.

  • device (torch.device, optional) – Device for parameters. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for parameters. Defaults to None.
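The idea behind conj_sym, as described in the S5 paper, is that when the complex modes come in conjugate pairs only one member of each pair needs to be stored, and the real output is recovered as twice the real part. The sketch below checks that identity with Python complex numbers. The helper names are hypothetical, and whether this class applies the trick in exactly this form is an assumption.

```python
def readout_full(states, c_weights):
    """Sum c_k * x_k over all modes, conjugate partners included."""
    total = sum(c * x for c, x in zip(c_weights, states))
    return total.real

def readout_conj_sym(half_states, half_c):
    """Use only one member of each conjugate pair: y = 2 * Re(sum c_k * x_k)."""
    return 2.0 * sum(c * x for c, x in zip(half_c, half_states)).real

half_states = [0.3 + 0.4j, -1.0 + 0.25j]
half_c = [1.5 - 0.5j, 0.2 + 2.0j]

# Build the full system by appending the complex conjugates.
full_states = half_states + [x.conjugate() for x in half_states]
full_c = half_c + [c.conjugate() for c in half_c]

y_full = readout_full(full_states, full_c)
y_half = readout_conj_sym(half_states, half_c)
```

Both readouts agree, so the conjugate-symmetric form halves the stored state without changing the output.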

forward(x: torch.Tensor, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None, inference_cache: Dict[str, Any] | None = None) → torch.Tensor[source]

Forward pass through S5.

Parameters:
  • x (torch.Tensor) – Input tensor of shape (B, L, H).

  • integration_timesteps (torch.Tensor, optional) – Timesteps for async/event-driven discretization. Defaults to None.

  • lengths (torch.Tensor, optional) – Lengths of sequences, required for variable-length sequences. Defaults to None.

  • inference_cache (Dict[str, Any], optional) – Cache for autoregressive generation. Defaults to None.

Returns:

Output tensor of shape (B, L, H).

Return type:

torch.Tensor

step(x: torch.Tensor, inference_cache: Dict[str, Any], integration_timesteps: torch.Tensor | None = None) → Tuple[torch.Tensor, Dict[str, Any]][source]

Performs a single recurrent step of S5.

When the simplified_state_update Triton kernel is available and the tensors live on CUDA, the state is updated in-place via the kernel (which also fuses discretization, input projection, and output projection into a single launch). Otherwise a pure-PyTorch fallback is used.

Parameters:
  • x (torch.Tensor) – Input at current timestep, shape (B, 1, H) or (B, H).

  • inference_cache (Dict[str, Any]) – Cache dictionary containing SSM state and continuous-time parameters.

  • integration_timesteps (torch.Tensor, optional) – Optional per-step integration timesteps for event/async mode, shape (B,) or (B, 1). Defaults to None.

Returns:

A tuple containing:
  • y : Output tensor at the current timestep.

  • inference_cache : Updated cache dictionary.

Return type:

tuple[torch.Tensor, Dict[str, Any]]

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict[str, Any][source]

Allocates cache for S5 autoregressive inference.

Stores the continuous-time parameters so that simplified_state_update can fuse discretization into the kernel.

Parameters:
  • batch_size (int) – The batch size for inference.

  • max_seqlen (int) – Maximum sequence length (unused, for interface consistency).

  • dtype (torch.dtype, optional) – Data type for allocated tensors. Defaults to None.

Returns:

Cache dictionary containing SSM state and continuous-time matrices.

Return type:

Dict[str, Any]
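Keeping the continuous-time parameters in the cache means each step can discretize with whatever timestep it receives, rather than relying on matrices discretized once up front. The single-mode sketch below illustrates that design with the standard library. The cache keys other than “lrnn_state” and “seqlen_offset”, the helper names, and the use of zero-order hold are assumptions; the real kernel's cache layout is not described above.

```python
import cmath
from typing import Any, Dict, Optional, Tuple

def allocate_cache(lam: complex, b: complex, c: complex, default_dt: float) -> Dict[str, Any]:
    # Keep the continuous-time parameters next to the state so that each
    # step can discretize with whatever timestep it is given.
    return {"lrnn_state": 0j, "lam": lam, "b": b, "c": c,
            "dt": default_dt, "seqlen_offset": 0}

def step(u: float, cache: Dict[str, Any], dt: Optional[float] = None) -> Tuple[float, Dict[str, Any]]:
    dt = cache["dt"] if dt is None else dt
    lam_bar = cmath.exp(cache["lam"] * dt)                # ZOH state transition
    b_bar = (lam_bar - 1.0) / cache["lam"] * cache["b"]   # ZOH input gain
    cache["lrnn_state"] = lam_bar * cache["lrnn_state"] + b_bar * u
    cache["seqlen_offset"] += 1
    y = (cache["c"] * cache["lrnn_state"]).real
    return y, cache

cache = allocate_cache(lam=-0.5 + 2.0j, b=1.0 + 0j, c=1.0 + 0j, default_dt=0.1)
y0, cache = step(1.0, cache)            # uses the default timestep
y1, cache = step(0.0, cache, dt=0.4)    # event arrives after a longer gap
```

The second call reuses the same stored parameters with a different timestep, which would not be possible if only pre-discretized values were cached.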

Submodules