lrnnx.architectures.language_model module

Language Model architecture. Reference: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/models/mixer_seq_simple.py

create_block(d_model: int, d_state: int, d_intermediate: int, mixer_type: str, mixer_kwargs: Dict | None = None, attn_cfg: Dict | None = None, norm_epsilon: float = 1e-05, rms_norm: bool = False, residual_in_fp32: bool = False, fused_add_norm: bool = True, layer_idx: int | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) → Block

Create a single block that wraps a sequence mixer and, when d_intermediate is greater than 0, an MLP.

Parameters:

d_model (int) – Model dimension.
d_state (int) – State dimension.
d_intermediate (int) – Intermediate dimension for MLP layers (0 to disable MLP).
mixer_type (str) – Name of the mixer type (e.g., "LRU", "S5", "attn").
mixer_kwargs (dict, optional) – Additional keyword arguments for the mixer. Defaults to None.
attn_cfg (dict, optional) – Configuration for attention layers. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to False.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
layer_idx (int, optional) – Index of the current layer. Defaults to None.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.

Returns:

A configured block module.

Return type:

Block

class LRNNModel

Bases: Module

Core LRNN backbone.

Parameters:

d_model (int) – Model dimension.
d_state (int) – State dimension.
n_layer (int) – Number of layers in the model.
vocab_size (int) – Size of the vocabulary.
mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).
d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.
mlp_cls (type, optional) – MLP class to use. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict

Allocate inference cache for autoregressive generation.

Parameters:

batch_size (int) – Batch size for inference.
max_seqlen (int) – Maximum sequence length for inference.
dtype (torch.dtype, optional) – Data type for cache tensors.

Returns:

Dictionary mapping layer indices to their allocated caches.

Return type:

dict

step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) → torch.Tensor

Single-step inference for autoregressive generation.

Parameters:

input_ids (torch.Tensor) – Input token IDs of shape (B, 1) — single token.
caches (Dict) – Dictionary mapping layer indices to their cached states.
integration_timesteps (torch.Tensor, optional) – Integration timesteps for linear time-varying (LTV) models (shape: (B, 1) or (B,)). Defaults to None.

Returns:

Hidden states of shape (B, 1, d_model).

Return type:

torch.Tensor
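
The calling pattern (allocate a cache once, then call step once per token) can be illustrated with a toy recurrence. This is a sketch under stated assumptions: ToyRecurrence is a pure-Python stand-in, not the lrnnx implementation, and its signatures are simplified (no max_seqlen, no tensors). It only shows why a per-layer cache lets single-step inference reproduce a full-sequence pass.

```python
from typing import Dict, List


class ToyRecurrence:
    """Stand-in for a recurrent backbone; NOT the lrnnx implementation.

    Each layer applies h_t = a * h_{t-1} + x_t and feeds h_t to the next layer.
    """

    def __init__(self, decays: List[float]):
        self.decays = decays  # one decay factor per layer

    def allocate_inference_cache(self, batch_size: int) -> Dict[int, List[float]]:
        # One state per layer and batch element, keyed by layer index.
        return {i: [0.0] * batch_size for i in range(len(self.decays))}

    def step(self, x: List[float], caches: Dict[int, List[float]]) -> List[float]:
        # x holds one value per batch element; the cache entries are replaced.
        for i, a in enumerate(self.decays):
            caches[i] = [a * h + v for h, v in zip(caches[i], x)]
            x = caches[i]
        return x

    def forward(self, xs: List[List[float]]) -> List[List[float]]:
        # xs has shape (B, L); each layer scans the whole sequence from a zero state.
        out = []
        for seq in xs:
            for a in self.decays:
                h, layer_out = 0.0, []
                for v in seq:
                    h = a * h + v
                    layer_out.append(h)
                seq = layer_out
            out.append(seq)
        return out


model = ToyRecurrence([0.5, 0.25])
batch = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]

caches = model.allocate_inference_cache(batch_size=len(batch))
stepped: List[List[float]] = [[] for _ in batch]
for t in range(3):
    y = model.step([seq[t] for seq in batch], caches)
    for b, value in enumerate(y):
        stepped[b].append(value)

# Token-by-token stepping matches the full-sequence pass.
assert stepped == model.forward(batch)
```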

forward(input_ids: torch.Tensor, inference_params: Dict | None = None, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) → torch.Tensor

Forward pass of the LRNN backbone.

Parameters:

input_ids (torch.Tensor) – Input token IDs of shape (B, L).
inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.
integration_timesteps (torch.Tensor, optional) – Timesteps for LTV models (shape: (B, L)). Defaults to None.
lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.

Returns:

Hidden states of shape (B, L, d_model).

Return type:

torch.Tensor
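
For variable-length batches, a lengths vector of shape (B,) is commonly expanded into a per-position validity mask. The sketch below shows that expansion in plain Python. How lrnnx consumes lengths internally is not documented here, and right-padding is an assumption; padding_mask is a hypothetical helper.

```python
from typing import List


def padding_mask(lengths: List[int], max_len: int) -> List[List[bool]]:
    """True at valid positions, False at padding (right-padded batch assumed)."""
    if any(n < 0 or n > max_len for n in lengths):
        raise ValueError("each length must lie in [0, max_len]")
    return [[t < n for t in range(max_len)] for n in lengths]


# Two sequences padded to length 4: the first has 3 real tokens, the second 1.
mask = padding_mask([3, 1], max_len=4)
for row in mask:
    print(row)
```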

class LRNNLMHeadModel

Bases: Module

LRNN Language Model with a language modeling head.

Parameters:

d_model (int) – Model dimension.
d_state (int) – State dimension.
n_layer (int) – Number of layers in the model.
vocab_size (int) – Size of the vocabulary.
mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).
d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.
mlp_cls (type, optional) – MLP class to use. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
tie_embeddings (bool, optional) – Whether to tie input and output embeddings. Defaults to True.
pad_vocab_size_multiple (int, optional) – Pad vocabulary size to multiple of this value. Defaults to 8.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
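
The effect of pad_vocab_size_multiple can be worked out by hand. The sketch below assumes the vocabulary size is rounded up to the next multiple, which is the usual reading of "pad to a multiple"; pad_vocab_size is a hypothetical helper, not a function exported by lrnnx.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 8) -> int:
    """Round vocab_size up to the nearest multiple (assumed rounding rule)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    remainder = vocab_size % multiple
    if remainder == 0:
        return vocab_size  # already aligned, nothing to pad
    return vocab_size + multiple - remainder


print(pad_vocab_size(50277))      # 50280
print(pad_vocab_size(50280))      # 50280
print(pad_vocab_size(50257, 64))  # 50304
```

Under this assumption, the last dimension of the logits would be the padded size rather than the raw vocab_size whenever the two differ.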

tie_weights() → None

Tie input and output embeddings.

This makes the embedding layer and the language modeling head share the same weights, a common practice that reduces the parameter count and often improves performance.
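
The idea of sharing one weight matrix between the embedding lookup and the output projection can be sketched without any framework. TiedToyLM below is purely illustrative and uses Python lists; the real method presumably shares a tensor between two modules, which is not shown here.

```python
from typing import List


class TiedToyLM:
    """Toy model whose output projection reuses the embedding table.

    Illustrative only; not the lrnnx implementation.
    """

    def __init__(self, embedding: List[List[float]]):
        self.embedding = embedding     # shape (vocab_size, d_model)
        self.lm_head = self.embedding  # tied: the same object, not a copy

    def embed(self, token_id: int) -> List[float]:
        return self.embedding[token_id]

    def logits(self, hidden: List[float]) -> List[float]:
        # One dot product per vocabulary row, i.e. hidden @ lm_head^T.
        return [sum(h * w for h, w in zip(hidden, row)) for row in self.lm_head]


model = TiedToyLM([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Updating the embedding also changes the head, because both names
# refer to the same table.
model.embedding[0][0] = 2.0
assert model.lm_head[0][0] == 2.0

print(model.logits(model.embed(2)))  # [2.0, 1.0, 2.0]
```

With tying, the table of vocab_size × d_model values is stored once instead of twice.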

allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict

Allocate inference cache for autoregressive generation.

Parameters:

batch_size (int) – Batch size for inference.
max_seqlen (int) – Maximum sequence length for inference.
dtype (torch.dtype, optional) – Data type for cache tensors.

Returns:

Dictionary mapping layer indices to their allocated caches.

Return type:

dict

step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) → namedtuple

Single-step inference for autoregressive generation.

Parameters:

input_ids (torch.Tensor) – Input token IDs of shape (B, 1) — single token.
caches (Dict) – Dictionary mapping layer indices to their cached states.
integration_timesteps (torch.Tensor, optional) – Integration timesteps for LTV models (shape: (B, 1) or (B,)). Defaults to None.

Returns:

Named tuple containing the logits tensor of shape (B, 1, vocab_size).

Return type:

namedtuple

forward(input_ids: torch.Tensor, position_ids: torch.Tensor | None = None, inference_params: Dict | None = None, num_last_tokens: int = 0, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) → namedtuple

Forward pass of the language model.

Parameters:

input_ids (torch.Tensor) – Input token IDs of shape (B, L).
position_ids (torch.Tensor, optional) – Position IDs (unused, for compatibility). Defaults to None.
inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.
num_last_tokens (int, optional) – If greater than 0, return logits only for the last num_last_tokens tokens. Defaults to 0.
integration_timesteps (torch.Tensor, optional) – Timesteps for LTV models (shape: (B, L)). Defaults to None.
lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.

Returns:

Named tuple containing the logits tensor of shape (B, L, vocab_size), or (B, num_last_tokens, vocab_size) when num_last_tokens is greater than 0.

Return type:

namedtuple
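
The num_last_tokens behaviour amounts to slicing the sequence axis before the logits are computed. The sketch below uses nested lists in place of tensors; select_last_tokens is a hypothetical helper, and the explicit check for 0 is an assumption about how "0 means all positions" would be handled.

```python
from typing import List


def select_last_tokens(hidden: List[List[int]], num_last_tokens: int = 0) -> List[List[int]]:
    """Keep only the last `num_last_tokens` positions of each sequence.

    0 means "keep everything". The explicit check matters because
    seq[-0:] is the same as seq[0:], which would also return the full
    sequence, but only by accident.
    """
    if num_last_tokens < 0:
        raise ValueError("num_last_tokens must be non-negative")
    if num_last_tokens == 0:
        return hidden
    return [seq[-num_last_tokens:] for seq in hidden]


hidden = [[10, 11, 12, 13], [20, 21, 22, 23]]  # shape (B=2, L=4)
print(select_last_tokens(hidden, 1))  # [[13], [23]]
print(select_last_tokens(hidden, 2))  # [[12, 13], [22, 23]]
```

During generation after a prompt, requesting only the last position avoids projecting every prompt token onto the vocabulary.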

save_pretrained(save_directory: str) → None

Save the model and configuration to a directory.

Parameters:

save_directory (str) – Directory path where model and config will be saved.

classmethod from_pretrained(pretrained_model_path: str, mixer_kwargs: Dict | None = None, mlp_cls=None, initializer_cfg: Dict[str, Any] | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) → LRNNLMHeadModel

Load a pretrained model from a directory.

Parameters:

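pretrained_model_path (str) – Path to directory containing saved model and config.
mixer_kwargs (dict, optional) – Additional keyword arguments for the mixer layers. Defaults to None.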
mlp_cls (type, optional) – MLP class to use. Defaults to None.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.

Returns:

Loaded model instance.

Return type:

LRNNLMHeadModel
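
A save and load round trip for the configuration half of a checkpoint might look like the sketch below. The file name config.json, the JSON format, and the helpers save_config and load_config are all assumptions made for illustration; the on-disk layout used by save_pretrained is not documented here, and the weights are left out entirely.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_config(config: Dict[str, Any], save_directory: str) -> str:
    """Write the config as JSON; the file name is an assumption."""
    os.makedirs(save_directory, exist_ok=True)
    path = os.path.join(save_directory, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


def load_config(pretrained_model_path: str) -> Dict[str, Any]:
    """Read back the JSON config written by save_config."""
    path = os.path.join(pretrained_model_path, "config.json")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


config = {
    "d_model": 64,
    "d_state": 16,
    "n_layer": 2,
    "vocab_size": 1000,
    "mixer_types": ["S5", "attn"],
}

with tempfile.TemporaryDirectory() as tmp:
    save_config(config, tmp)
    restored = load_config(tmp)

# The constructor arguments survive the round trip unchanged.
assert restored == config
```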