lrnnx.architectures.language_model module
Language Model architecture. Reference: https://github.com/state-spaces/mamba/blob/main/mamba_ssm/models/mixer_seq_simple.py
- create_block(d_model: int, d_state: int, d_intermediate: int, mixer_type: str, mixer_kwargs: Dict | None = None, attn_cfg: Dict | None = None, norm_epsilon: float = 1e-05, rms_norm: bool = False, residual_in_fp32: bool = False, fused_add_norm: bool = True, layer_idx: int | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) → Block
Create a block.
- Parameters:
d_model (int) – Model dimension.
d_state (int) – State dimension.
d_intermediate (int) – Intermediate dimension for MLP layers (0 to disable MLP).
mixer_type (str) – Name of the mixer type (e.g., “LRU”, “S5”, “attn”).
mixer_kwargs (dict, optional) – Keyword arguments for the mixer. Defaults to None.
attn_cfg (dict, optional) – Configuration for attention layers. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to False.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
layer_idx (int, optional) – Index of the current layer. Defaults to None.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
- Returns:
A configured block module.
- Return type:
Block
- class LRNNModel
Bases: Module
Core LRNN backbone.
- Parameters:
d_model (int) – Model dimension.
d_state (int) – State dimension.
n_layer (int) – Number of layers in the model.
vocab_size (int) – Size of the vocabulary.
mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).
d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.
mlp_cls (type, optional) – MLP class to use. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
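As a rough illustration of how `mixer_types` lines up with `n_layer`, the sketch below expands a short layer pattern into one mixer name per layer. It does not import lrnnx: `expand_mixer_types` is a hypothetical helper, and the requirement that the list holds exactly `n_layer` entries is an assumption inferred from the parameter description above.

```python
from typing import List


def expand_mixer_types(pattern: List[str], n_layer: int) -> List[str]:
    """Repeat `pattern` cyclically until it covers `n_layer` layers.

    Hypothetical helper for illustration, not part of lrnnx.
    """
    if not pattern:
        raise ValueError("pattern must contain at least one mixer name")
    if n_layer <= 0:
        raise ValueError("n_layer must be positive")
    return [pattern[i % len(pattern)] for i in range(n_layer)]


# Six layers: two recurrent mixers followed by one attention layer, repeated.
mixer_types = expand_mixer_types(["S5", "S5", "attn"], n_layer=6)
assert len(mixer_types) == 6  # assumed: one entry per layer
print(mixer_types)
```

The resulting list could then be passed as `mixer_types` together with `n_layer=6`.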
- allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict
Allocate inference cache for autoregressive generation.
- Parameters:
batch_size (int) – Batch size for inference.
max_seqlen (int) – Maximum sequence length for inference.
dtype (torch.dtype, optional) – Data type for cache tensors.
- Returns:
Dictionary mapping layer indices to their allocated caches.
- Return type:
Dict
- step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) → torch.Tensor
Single-step inference for autoregressive generation.
- Parameters:
input_ids (torch.Tensor) – Input token IDs of shape (B, 1), i.e. a single token per sequence.
caches (Dict) – Dictionary mapping layer indices to their cached states.
integration_timesteps (torch.Tensor, optional) – Integration timesteps for linear time-varying (LTV) models (shape: (B, 1) or (B,)). Defaults to None.
- Returns:
Hidden states of shape (B, 1, d_model).
- Return type:
torch.Tensor
- forward(input_ids: torch.Tensor, inference_params: Dict | None = None, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) → torch.Tensor
Forward pass of the LRNN backbone.
- Parameters:
input_ids (torch.Tensor) – Input token IDs of shape (B, L).
inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.
integration_timesteps (torch.Tensor, optional) – Timesteps for LTV models (shape: (B, L)). Defaults to None.
lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.
- Returns:
Hidden states of shape (B, L, d_model).
- Return type:
torch.Tensor
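The `lengths` argument describes how many positions of each padded row in a `(B, L)` batch hold real tokens. The sketch below shows that shape convention with plain Python lists. `pad_batch`, `valid_mask`, and the pad ID of 0 are assumptions made for illustration; how lrnnx consumes `lengths` internally is not specified here.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Right-pad token ID sequences to a common length L.

    Returns the padded (B, L) batch and the (B,) list of true lengths.
    Hypothetical helper for illustration, not part of lrnnx.
    """
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    return padded, lengths


def valid_mask(lengths: List[int], max_len: int) -> List[List[bool]]:
    """mask[b][t] is True where position t holds a real token of sequence b."""
    return [[t < n for t in range(max_len)] for n in lengths]


input_ids, lengths = pad_batch([[5, 8, 2], [7, 1], [9]])
mask = valid_mask(lengths, max_len=len(input_ids[0]))
print(input_ids, lengths)
```

With torch, the same values would be passed as integer tensors of shape `(B, L)` and `(B,)`.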
- class LRNNLMHeadModel
Bases: Module
LRNN Language Model with a language modeling head.
- Parameters:
d_model (int) – Model dimension.
d_state (int) – State dimension.
n_layer (int) – Number of layers in the model.
vocab_size (int) – Size of the vocabulary.
mixer_types (list) – List of mixer type names for each layer (e.g., ["S5", "S7", "attn", ...]).
d_intermediate (int, optional) – Intermediate dimension for MLP layers (0 to disable MLP). Defaults to 0.
mlp_cls (type, optional) – MLP class to use. Defaults to None.
norm_epsilon (float, optional) – Epsilon value for layer normalization. Defaults to 1e-5.
rms_norm (bool, optional) – Whether to use RMSNorm instead of LayerNorm. Defaults to True.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
fused_add_norm (bool, optional) – Whether to use fused add+norm operations. Defaults to True.
residual_in_fp32 (bool, optional) – Whether to compute residuals in float32. Defaults to False.
tie_embeddings (bool, optional) – Whether to tie input and output embeddings. Defaults to True.
pad_vocab_size_multiple (int, optional) – Pad vocabulary size to multiple of this value. Defaults to 8.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
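The effect of `pad_vocab_size_multiple` can be sketched as a round-up to the next multiple. `pad_vocab_size` below is a hypothetical helper, and the rounding-up rule is an assumption about what "pad to a multiple" means here, not a statement about the exact lrnnx code.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 8) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    remainder = vocab_size % multiple
    if remainder == 0:
        return vocab_size
    return vocab_size + (multiple - remainder)


print(pad_vocab_size(50277))  # 50280
print(pad_vocab_size(50280))  # 50280 (already a multiple of 8)
```

Under this assumption, the extra rows of the embedding table and the extra logit columns correspond to no real token.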
- tie_weights() → None
Tie input and output embeddings.
This makes the embedding layer and the language modeling head share the same weight matrix, a common practice that reduces the parameter count and often improves performance.
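To put a number on the parameter reduction, the sketch below counts the parameters held by the embedding table and the output projection with and without tying. It assumes both are bias-free `(vocab_size, d_model)` matrices; the sizes 50280 and 768 are illustrative values, not lrnnx defaults.

```python
def embedding_param_count(vocab_size: int, d_model: int, tied: bool) -> int:
    """Parameters in the embedding table plus the LM head projection.

    Assumes both are (vocab_size, d_model) matrices with no bias.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table


untied = embedding_param_count(50280, 768, tied=False)
tied = embedding_param_count(50280, 768, tied=True)
print(untied - tied)  # 38615040 parameters saved by tying
```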
- allocate_inference_cache(batch_size: int, max_seqlen: int, dtype: torch.dtype | None = None) → Dict
Allocate inference cache.
- Parameters:
batch_size (int) – Batch size for inference.
max_seqlen (int) – Maximum sequence length for inference.
dtype (torch.dtype, optional) – Data type for cache tensors.
- Returns:
Dictionary mapping layer indices to their allocated caches.
- Return type:
Dict
- step(input_ids: torch.Tensor, caches: Dict, integration_timesteps: torch.Tensor | None = None) → namedtuple
Single-step inference for autoregressive generation.
- Parameters:
input_ids (torch.Tensor) – Input token IDs of shape (B, 1), i.e. a single token per sequence.
caches (Dict) – Dictionary mapping layer indices to their cached states.
integration_timesteps (torch.Tensor, optional) – Integration timesteps for LTV models (shape: (B, 1) or (B,)). Defaults to None.
- Returns:
Named tuple containing the logits tensor of shape (B, 1, vocab_size).
- Return type:
namedtuple
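The interplay of `allocate_inference_cache` and `step` can be sketched with a toy stand-in that mirrors the documented method names. Everything below is hypothetical: `ToyStepModel` uses integers and lists instead of torch tensors, its recurrence is arbitrary, and the `logits` field name of the returned named tuple is an assumption. The point is only the control flow: allocate the caches once, then call `step` with one token at a time while the caches carry the state.

```python
from collections import namedtuple
from typing import Dict, List

StepOutput = namedtuple("StepOutput", ["logits"])  # field name is an assumption


class ToyStepModel:
    """Stand-in with the documented method names; not the lrnnx implementation."""

    def __init__(self, n_layer: int, vocab_size: int) -> None:
        self.n_layer = n_layer
        self.vocab_size = vocab_size

    def allocate_inference_cache(self, batch_size: int, max_seqlen: int) -> Dict[int, List[int]]:
        # One state per layer and batch element, keyed by layer index.
        return {i: [0] * batch_size for i in range(self.n_layer)}

    def step(self, input_ids: List[List[int]], caches: Dict[int, List[int]]) -> StepOutput:
        # input_ids has shape (B, 1); logits have shape (B, 1, vocab_size).
        logits = []
        for b, (token,) in enumerate(input_ids):
            h = token
            for i in range(self.n_layer):
                # Toy recurrence: fold the layer input into the cached state.
                caches[i][b] = (caches[i][b] + h + i) % self.vocab_size
                h = caches[i][b]
            logits.append([[1.0 if v == h else 0.0 for v in range(self.vocab_size)]])
        return StepOutput(logits=logits)


def greedy_generate(model: ToyStepModel, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Feed the prompt token by token, then decode greedily from the caches."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    caches = model.allocate_inference_cache(
        batch_size=1, max_seqlen=len(prompt) + max_new_tokens
    )
    tokens = list(prompt)
    for tok in prompt:
        out = model.step([[tok]], caches)
    for _ in range(max_new_tokens):
        row = out.logits[0][0]
        next_tok = max(range(len(row)), key=row.__getitem__)  # argmax
        tokens.append(next_tok)
        out = model.step([[next_tok]], caches)
    return tokens


model = ToyStepModel(n_layer=2, vocab_size=10)
generated = greedy_generate(model, prompt=[3, 1], max_new_tokens=4)
print(generated)
```

Because the state lives in `caches`, each call only needs the newest token rather than the whole sequence.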
- forward(input_ids: torch.Tensor, position_ids: torch.Tensor | None = None, inference_params: Dict | None = None, num_last_tokens: int = 0, integration_timesteps: torch.Tensor | None = None, lengths: torch.Tensor | None = None) → namedtuple
Forward pass of the language model.
- Parameters:
input_ids (torch.Tensor) – Input token IDs of shape (B, L).
position_ids (torch.Tensor, optional) – Position IDs (unused, kept for compatibility). Defaults to None.
inference_params (Dict, optional) – Parameters for inference mode. Defaults to None.
num_last_tokens (int, optional) – If greater than 0, return logits only for the last num_last_tokens tokens. Defaults to 0.
integration_timesteps (torch.Tensor, optional) – Timesteps for LTV models (shape: (B, L)). Defaults to None.
lengths (torch.Tensor, optional) – Sequence lengths for variable-length sequences (shape: (B,)). Defaults to None.
- Returns:
Named tuple containing the logits tensor of shape (B, L, vocab_size).
- Return type:
namedtuple
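The `num_last_tokens` option amounts to slicing along the sequence axis before the output projection or on its result. The sketch below shows the slicing rule on a plain `(B, L)` list of lists. `select_last_tokens` is a hypothetical helper, and the implied output shape `(B, num_last_tokens, vocab_size)` for a positive value is an inference from the parameter description, not a documented guarantee.

```python
from typing import List


def select_last_tokens(hidden: List[List[int]], num_last_tokens: int = 0) -> List[List[int]]:
    """Keep only the last `num_last_tokens` positions of each (B, L) row.

    A value of 0 keeps every position.
    """
    if num_last_tokens < 0:
        raise ValueError("num_last_tokens must be non-negative")
    if num_last_tokens == 0:
        return [list(row) for row in hidden]
    return [row[-num_last_tokens:] for row in hidden]


batch = [[10, 11, 12, 13], [20, 21, 22, 23]]  # shape (B=2, L=4)
print(select_last_tokens(batch, 1))  # [[13], [23]]
print(select_last_tokens(batch, 0))  # [[10, 11, 12, 13], [20, 21, 22, 23]]
```

During generation only the final position is needed to pick the next token, so `num_last_tokens=1` avoids computing logits for the whole prompt.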
- save_pretrained(save_directory: str) → None
Save the model and configuration to a directory.
- Parameters:
save_directory (str) – Directory path where model and config will be saved.
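The directory-based save and load pattern can be sketched with the standard library alone. The sketch covers only the configuration half of the round trip: the `config.json` file name, the JSON format, and the helper names `save_config` and `load_config` are assumptions for illustration, and the model weights are left out entirely.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_config(config: Dict[str, Any], save_directory: str) -> Path:
    """Write the config as JSON inside save_directory, creating it if needed."""
    directory = Path(save_directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "config.json"  # assumed file name
    path.write_text(json.dumps(config, indent=2))
    return path


def load_config(pretrained_model_path: str) -> Dict[str, Any]:
    """Read back the config written by save_config."""
    return json.loads((Path(pretrained_model_path) / "config.json").read_text())


config = {
    "d_model": 256,
    "d_state": 64,
    "n_layer": 4,
    "vocab_size": 50280,
    "mixer_types": ["S5", "S5", "attn", "S5"],
}

with tempfile.TemporaryDirectory() as tmp:
    save_config(config, tmp)
    restored = load_config(tmp)

print(restored == config)
```

A loader built this way would reconstruct the model from the restored configuration and then load the saved weights into it.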
- classmethod from_pretrained(pretrained_model_path: str, mixer_kwargs: Dict | None = None, mlp_cls=None, initializer_cfg: Dict[str, Any] | None = None, device: torch.device | None = None, dtype: torch.dtype | None = None) → LRNNLMHeadModel
Load a pretrained model from a directory.
- Parameters:
pretrained_model_path (str) – Path to directory containing saved model and config.
mixer_kwargs (dict, optional) – Keyword arguments for the mixer. Defaults to None.
mlp_cls (type, optional) – MLP class to use. Defaults to None.
initializer_cfg (dict, optional) – Configuration for weight initialization. Defaults to None.
device (torch.device, optional) – Device to place tensors on. Defaults to None.
dtype (torch.dtype, optional) – Data type for tensors. Defaults to None.
- Returns:
Loaded model instance.
- Return type:
LRNNLMHeadModel