lrnnx.architectures.language_model module

Language Model architecture. Reference: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/models/mixer_seq_simple.py

create_block(d_model: int, d_state: int, d_intermediate: int, mixer_type: str, mixer_kwargs: Dict | None = None, attn_cfg: Dict | None = None, norm_epsilon: float = 1e-05, rms_norm: bool = False, residual_in_fp32: bool = False, fused_add_norm: bool = True, layer_idx: int | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) -> Block

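Create one model block: a sequence mixer of the requested type with normalization and, when d_intermediate > 0, an MLP.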

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int) – State dimension.

  • d_intermediate (int) – Intermediate dimension for MLP layers (0 to disable MLP).

  • mixer_type (str) – Name of the mixer type (e.g., "LRU", "S5", "attn").

  • mixer_kwargs (dict, optional) – Extra keyword arguments for the mixer. Defaults to None.

  • attn_cfg (dict, optional) – Configuration for attention layers. Defaults to None.

  • norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.

  • rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to False.

  • residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.

  • fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.

  • layer_idx (int, optional) – Index of the current layer. Defaults to None.

  • device (torch.device, optional) – Device to place tensors on. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.

Returns:

A configured block module.

Return type:

Block
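The rms_norm flag chooses between two normalizations. The sketch below shows the textbook difference between them in plain Python, with no learnable scale or shift. The function names are illustrative and are not part of lrnnx, and the library's own (possibly fused) kernels may differ in detail.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable parameters: subtract the mean, divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without a learnable scale: divide by the root mean square, no centring."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

features = [1.0, 2.0, 3.0, 4.0]
ln_out = layer_norm(features)   # zero mean, roughly unit variance
rms_out = rms_norm(features)    # roughly unit root mean square, mean not removed
```

In both functions eps plays the role of norm_epsilon: it keeps the denominator away from zero for near-constant inputs.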

class LRNNModel

Bases: Module

Core LRNN backbone.

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int) – State dimension.

  • n_layer (int) – Number of layers in the model.

  • vocab_size (int) – Size of the vocabulary.

  • mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).

  • d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.

  • mlp_cls (type, optional) – MLP class to use. Defaults to None.

  • norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.

  • rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.

  • initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.

  • fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.

  • residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.

  • device (torch.device, optional) – Device to place tensors on. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
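Since mixer_types holds one name per layer, it is natural to build it programmatically for hybrid stacks. The helper below is a hypothetical sketch (make_mixer_types and attn_every are not part of lrnnx) that places an attention layer at a fixed interval and uses an SSM mixer everywhere else. It assumes the list length must equal n_layer.

```python
def make_mixer_types(n_layer, base="S5", attn_every=4):
    """Return one mixer name per layer, using "attn" on every attn_every-th layer."""
    return ["attn" if (i + 1) % attn_every == 0 else base for i in range(n_layer)]

n_layer = 8
mixer_types = make_mixer_types(n_layer)
# Layers 4 and 8 (indices 3 and 7) are attention, the rest are S5.
```

The resulting list would be passed as the mixer_types argument together with the same n_layer.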

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) -> Dict

Allocate inference cache for autoregressive generation.

Parameters:
  • batch_size (int) – Batch size for inference.

  • max_seqlen (int) – Maximum sequence length for inference.

  • dtype (torch.dtype, optional) – Data type for cache tensors.

Returns:

Dictionary mapping layer indices to their allocated caches.

Return type:

dict

step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) -> torch.Tensor

Single-step inference for autoregressive generation.

Parameters:
  • input_ids (torch.Tensor) – Input token IDs of shape (B, 1), i.e. a single token per sequence.

  • caches (Dict) – Dictionary mapping layer indices to their cached states.

  • integration_timesteps (torch.Tensor, optional) – Integration timesteps for linear time-varying (LTV) models (shape: (B, 1) or (B,)). Defaults to None.

Returns:

Hidden states of shape (B, 1, d_model).

Return type:

torch.Tensor
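To make the relationship between the per-layer cache and single-step inference concrete, here is a toy sketch in plain Python. It is not the lrnnx implementation. Each "layer" is assumed to be a scalar recurrence h_t = decay * h_{t-1} + x_t, and the cache is a dictionary keyed by layer index, mirroring the documented return value of allocate_inference_cache. All function names are illustrative.

```python
def allocate_cache(n_layer, batch_size):
    """One recurrent state per layer and per sequence, keyed by layer index."""
    return {layer: [0.0] * batch_size for layer in range(n_layer)}

def step(x, caches, decay=0.5):
    """Consume one input per sequence, updating every layer's state in place."""
    for layer in sorted(caches):
        state = caches[layer]
        for b in range(len(x)):
            state[b] = decay * state[b] + x[b]
        x = list(state)  # this layer's output feeds the next layer
    return x

def forward(seq, n_layer, decay=0.5):
    """Full-sequence reference for a single sequence (batch of one)."""
    for _ in range(n_layer):
        h, out = 0.0, []
        for v in seq:
            h = decay * h + v
            out.append(h)
        seq = out
    return seq

sequence = [1.0, 2.0, 3.0]
caches = allocate_cache(n_layer=2, batch_size=1)
stepwise = [step([v], caches)[0] for v in sequence]  # one token at a time
full = forward(sequence, n_layer=2)                  # whole sequence at once
```

Feeding tokens one at a time through step, with the state carried in caches, yields the same outputs as processing the whole sequence, which is the property autoregressive generation relies on.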

forward(input_ids: torch.Tensor, inference_params: Dict | None = None, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) -> torch.Tensor

Forward pass of the LRNN backbone.

Parameters:
  • input_ids (torch.Tensor) – Input token IDs of shape (B, L).

  • inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.

  • integration_timesteps (torch.Tensor, optional) – Integration timesteps for LTV models (shape: (B, L)). Defaults to None.

  • lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.

Returns:

Hidden states of shape (B, L, d_model).

Return type:

torch.Tensor
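The lengths argument carries one true length per sequence in a padded batch. How lrnnx consumes it internally is not stated here. A common interpretation, assumed in the sketch below, is that positions at or beyond a sequence's length are padding on the right. The helper name is hypothetical.

```python
def lengths_to_mask(lengths, max_len):
    """Return a (B, L) nested list: True for real tokens, False for right padding."""
    return [[t < n for t in range(max_len)] for n in lengths]

lengths = [3, 1]                       # shape (B,)
mask = lengths_to_mask(lengths, max_len=4)
# mask[0] marks three valid positions, mask[1] marks one.
```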

class LRNNLMHeadModel

Bases: Module

LRNN Language Model with a language modeling head.

Parameters:
  • d_model (int) – Model dimension.

  • d_state (int) – State dimension.

  • n_layer (int) – Number of layers in the model.

  • vocab_size (int) – Size of the vocabulary.

  • mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).

  • d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.

  • mlp_cls (type, optional) – MLP class to use. Defaults to None.

  • norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.

  • rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.

  • initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.

  • fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.

  • residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.

  • tie_embeddings (bool, optional) – Whether to tie input and output embeddings. Defaults to True.

  • pad_vocab_size_multiple (int, optional) – Pad the vocabulary size up to a multiple of this value. Defaults to 8.

  • device (torch.device, optional) – Device to place tensors on. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
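The effect of pad_vocab_size_multiple can be illustrated with a few lines of arithmetic. This is a minimal sketch of rounding up to a multiple. The function name is illustrative, and the exact code inside lrnnx may differ.

```python
def pad_vocab_size(vocab_size, multiple=8):
    """Round vocab_size up to the nearest multiple of `multiple`."""
    remainder = vocab_size % multiple
    if remainder:
        vocab_size += multiple - remainder
    return vocab_size

padded = pad_vocab_size(50257)      # 50257 is not divisible by 8
unchanged = pad_vocab_size(50264)   # already a multiple of 8
```

Under this scheme the logits would have the padded size along the vocabulary dimension, so the extra entries correspond to no real token.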

tie_weights() -> None

Tie input and output embeddings.

The embedding layer and the language modeling head then share a single weight matrix, a common practice that reduces the parameter count and often improves performance.
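The parameter saving from tying is easy to quantify. The sketch below counts only the input embedding table and a bias-free output projection. The sizes are example values, and the helper is not part of lrnnx.

```python
def embedding_params(vocab_size, d_model, tied):
    """Parameters in the input embedding plus the output projection (no bias)."""
    table = vocab_size * d_model
    return table if tied else 2 * table

vocab_size, d_model = 50264, 768
saved = (embedding_params(vocab_size, d_model, tied=False)
         - embedding_params(vocab_size, d_model, tied=True))
# Tying removes exactly one vocab_size x d_model matrix.
```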

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) -> Dict

Allocate inference cache for autoregressive generation.

Parameters:
  • batch_size (int) – Batch size for inference.

  • max_seqlen (int) – Maximum sequence length for inference.

  • dtype (torch.dtype, optional) – Data type for cache tensors.

Returns:

Dictionary mapping layer indices to their allocated caches.

Return type:

dict

step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) -> namedtuple

Single-step inference for autoregressive generation.

Parameters:
  • input_ids (torch.Tensor) – Input token IDs of shape (B, 1), i.e. a single token per sequence.

  • caches (Dict) – Dictionary mapping layer indices to their cached states.

  • integration_timesteps (torch.Tensor, optional) – Integration timesteps for LTV models (shape: (B, 1) or (B,)). Defaults to None.

Returns:

Named tuple containing the logits tensor of shape (B, 1, vocab_size).

Return type:

namedtuple

forward(input_ids: torch.Tensor, position_ids: torch.Tensor | None = None, inference_params: Dict | None = None, num_last_tokens: int = 0, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) -> namedtuple

Forward pass of the language model.

Parameters:
  • input_ids (torch.Tensor) – Input token IDs of shape (B, L).

  • position_ids (torch.Tensor, optional) – Position IDs (unused, for compatibility). Defaults to None.

  • inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.

  • num_last_tokens (int, optional) – If greater than 0, return logits only for the last num_last_tokens tokens. Defaults to 0 (logits for all positions).

  • integration_timesteps (torch.Tensor, optional) – Integration timesteps for LTV models (shape: (B, L)). Defaults to None.

  • lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.

Returns:

Named tuple containing the logits tensor of shape (B, L, vocab_size), restricted to the last num_last_tokens positions when num_last_tokens > 0.

Return type:

namedtuple
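The num_last_tokens behaviour described above amounts to slicing along the sequence dimension. The following sketch uses nested lists in place of tensors and a hypothetical helper name. It illustrates the documented semantics and is not the library's code.

```python
def select_last_tokens(hidden, num_last_tokens=0):
    """hidden is a (B, L) nested list; keep the last n positions when n > 0."""
    if num_last_tokens > 0:
        return [seq[-num_last_tokens:] for seq in hidden]
    return hidden

hidden = [[10, 11, 12, 13], [20, 21, 22, 23]]   # B = 2, L = 4
last_only = select_last_tokens(hidden, num_last_tokens=1)
everything = select_last_tokens(hidden)          # default 0 keeps all positions
```

During generation only the final position is needed to pick the next token, so restricting the output this way avoids computing logits for the whole prompt.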

save_pretrained(save_directory: str) -> None

Save the model and configuration to a directory.

Parameters:

save_directory (str) – Directory path where model and config will be saved.

classmethod from_pretrained(pretrained_model_path: str, mixer_kwargs: Dict | None = None, mlp_cls=None, initializer_cfg: Dict[str, Any] | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) -> LRNNLMHeadModel

Load a pretrained model from a directory.

Parameters:
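  • pretrained_model_path (str) – Path to the directory containing the saved model and config.

  • mixer_kwargs (dict, optional) – Extra keyword arguments for the mixers. Defaults to None.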

  • mlp_cls (type, optional) – MLP class to use. Defaults to None.

  • initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.

  • device (torch.device, optional) – Device to place tensors on. Defaults to None.

  • dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.

Returns:

Loaded model instance.

Return type:

LRNNLMHeadModel
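The save and load pair follows the usual pattern of writing a configuration next to the weights and reading it back. The sketch below shows only the configuration half of such a round trip, using the standard library. The file name config.json and the helper functions are assumptions for illustration. The actual file layout used by lrnnx is not specified here.

```python
import json
import os
import tempfile

def save_config(config, save_directory):
    """Write the config as JSON inside save_directory (file name is hypothetical)."""
    os.makedirs(save_directory, exist_ok=True)
    with open(os.path.join(save_directory, "config.json"), "w") as f:
        json.dump(config, f, indent=2)

def load_config(path):
    """Read the config back from the same directory."""
    with open(os.path.join(path, "config.json")) as f:
        return json.load(f)

config = {"d_model": 64, "d_state": 16, "n_layer": 2,
          "vocab_size": 1000, "mixer_types": ["S5", "attn"]}
with tempfile.TemporaryDirectory() as tmp:
    save_config(config, tmp)
    restored = load_config(tmp)
```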