HSSM
HSSM is a Hierarchical State Space Model for autoregressive language modeling. This public release contains the FineWeb-Edu pretrained checkpoint of the model published by DevHunterAI.
Model Summary
HSSM combines hierarchical chunked sequence processing, selective state space dynamics, and sparse mixture-of-experts routing in a single language model. The design goal is to preserve long-range sequential modeling capacity while keeping feed-forward capacity high through sparse expert activation.
This release corresponds to the pretrained checkpoint:
hssm_fineweb_edu_final.pt
Parameter count:
73.8M parameters
This checkpoint was pretrained on:
HuggingFaceFW/fineweb-edu
Intended Use
This model is intended for:
- research on hierarchical state space models
- experimentation with sparse expert routing for language modeling
- continued fine-tuning on dialogue, instruction, or domain datasets
- architecture analysis and comparison against transformer and recurrent baselines
This checkpoint is pretrained, not fully instruction-tuned. It can produce text continuations, but high-quality conversational behavior generally requires an additional dialogue or instruction fine-tuning stage.
Training Dataset
The pretraining data source selected for this release is:
- Dataset: HuggingFaceFW/fineweb-edu
- Usage mode: streaming pretraining pipeline
- Selection: first 1.5 million samples
- Epochs: 1
FineWeb-Edu is a large educational web-text corpus suitable for language model pretraining and broad text continuation tasks.
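The selection described above (streaming mode, first 1.5 million samples, one pass) could be reproduced roughly as follows. This is a minimal sketch, not the release's training pipeline: the helper names `first_n`, `open_fineweb_edu_stream`, and `iter_pretraining_texts` are hypothetical, and the record field name `text` is an assumption about the dataset schema. It relies on the Hugging Face `datasets` library, which is imported lazily so the slicing helper stays dependency-free.

```python
from itertools import islice
from typing import Iterable, Iterator


def first_n(stream: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most the first n records of a (possibly very long) stream."""
    return islice(stream, n)


def open_fineweb_edu_stream():
    """Open the corpus in streaming mode (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # lazy import: only needed when actually streaming

    return load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)


def iter_pretraining_texts(limit: int = 1_500_000) -> Iterator[str]:
    """Single pass (one epoch) over the first `limit` samples."""
    for record in first_n(open_fineweb_edu_stream(), limit):
        yield record["text"]  # assumed field name
```

Streaming avoids downloading the full corpus before training starts, which matters when only a prefix of the dataset is used.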
Architecture Overview
HSSM is organized as a stacked hierarchical autoregressive architecture with four main stages.
1. Token Embedding Layer
Input token ids are mapped into a dense latent space of dimension d_model=512.
2. Hierarchical Chunker
The embedded token sequence is grouped into fixed-size chunks with:
chunk_size=4
This chunking stage compresses local token neighborhoods into chunk-level representations before they are processed by deeper sequence blocks. The hierarchical view allows the model to reason over short local neighborhoods while reducing sequence-processing burden in later stages.
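To make the chunking step concrete, here is a small dependency-free sketch that groups consecutive token vectors into chunks of 4 and averages each chunk. Mean pooling is an assumption chosen for illustration; this card does not state how HSSM actually compresses a chunk, and `chunk_mean_pool` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def chunk_mean_pool(embeddings: List[Vector], chunk_size: int = 4) -> List[Vector]:
    """Group consecutive token vectors into chunks and average each chunk.

    A trailing partial chunk is averaged over the tokens it actually contains.
    """
    pooled = []
    for start in range(0, len(embeddings), chunk_size):
        chunk = embeddings[start:start + chunk_size]
        dim = len(chunk[0])
        pooled.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return pooled


# Eight toy 2-dimensional "token embeddings" become two chunk representations.
tokens = [[float(t), 1.0] for t in range(8)]
chunks = chunk_mean_pool(tokens, chunk_size=4)
print(len(tokens), "->", len(chunks))  # 8 -> 2
print(chunks)  # [[1.5, 1.0], [5.5, 1.0]]
```

Under this assumption, the deeper blocks see one position per chunk, so a sequence of 8 tokens is processed as 2 positions.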
3. Repeated HSSM Blocks
The model contains:
num_blocks=6
Each HSSM block combines two complementary mechanisms:
a. Selective State Space Modeling
A selective state space module processes the chunked sequence with structured recurrence-like dynamics. Instead of relying purely on attention, it models ordered token evolution through learned state transitions. This helps the model retain sequential inductive bias and capture progression through text.
Key state-space parameter:
d_state=32
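The idea of "learned state transitions" can be illustrated with a generic diagonal, input-dependent linear recurrence. The sketch below is an assumption-heavy toy, not the HSSM implementation: the card does not give the module's parameterization, the input is a scalar sequence for brevity, and `selective_scan` and its weight arguments are hypothetical names.

```python
import math
from typing import List


def selective_scan(xs: List[float], w_decay: List[float],
                   w_in: List[float], w_out: List[float]) -> List[float]:
    """Run a diagonal, input-dependent recurrence over a scalar sequence.

    a_t[i] = sigmoid(w_decay[i] * x_t)              (gate depends on the input)
    h_t[i] = a_t[i] * h_{t-1}[i] + (1 - a_t[i]) * w_in[i] * x_t
    y_t    = sum_i w_out[i] * h_t[i]
    """
    d_state = len(w_decay)
    h = [0.0] * d_state
    ys = []
    for x in xs:
        for i in range(d_state):
            a = 1.0 / (1.0 + math.exp(-w_decay[i] * x))
            h[i] = a * h[i] + (1.0 - a) * w_in[i] * x
        ys.append(sum(w_out[i] * h[i] for i in range(d_state)))
    return ys


# Toy run with a 32-dimensional state, matching d_state=32 in size only.
d_state = 32
w_decay = [0.1 * (i + 1) for i in range(d_state)]
w_in = [1.0] * d_state
w_out = [1.0 / d_state] * d_state
outputs = selective_scan([0.5, -1.0, 2.0, 0.0], w_decay, w_in, w_out)
print(len(outputs))  # 4
```

Because each output depends only on the current input and the carried state, the recurrence is causal, which is what autoregressive generation requires. The gate `a_t` changing with the input is the "selective" part: the state can keep or overwrite information depending on what it reads.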
b. Sparse Mixture-of-Experts Feed-Forward Stage
Each block also contains a sparse mixture-of-experts module:
num_experts=8
top_k=2
expert_dim=1024
For every processed representation, the router activates only the 2 highest-scoring of the 8 experts. This increases total feed-forward capacity without paying the dense compute cost of running all experts on every input.
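The routing step can be sketched in a few lines. This is an illustrative assumption rather than the released router: the weights here are a softmax over the selected logits only, which is one common convention, and the card does not say which normalization HSSM uses. `route_top_k` and `mix_expert_outputs` are hypothetical names, and experts are plain scalar functions for brevity.

```python
import math
from typing import Callable, List, Sequence, Tuple


def route_top_k(router_logits: Sequence[float], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def mix_expert_outputs(selection: List[Tuple[int, float]],
                       experts: Sequence[Callable[[float], float]], x: float) -> float:
    """Evaluate only the selected experts and combine them with the router weights."""
    return sum(weight * experts[index](x) for index, weight in selection)


# Eight toy experts; expert k simply multiplies its input by k.
experts = [lambda x, k=k: k * x for k in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]
selection = route_top_k(logits, top_k=2)
print([index for index, _ in selection])  # [1, 4]
print(mix_expert_outputs(selection, experts, 2.0))
```

Only the two selected expert functions are called, which is where the compute saving comes from; the other six contribute nothing for this input.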
4. Final Normalization and Output Projection
After the stacked HSSM blocks, the model applies final normalization and projects back to vocabulary logits for next-token prediction.
Released Configuration
This release uses the larger Config A style setup:
vocab_size=20000
d_model=512
d_state=32
num_blocks=6
num_experts=8
top_k=2
chunk_size=4
expert_dim=1024
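For experimentation, the released values can be collected in a small config object with a few derived quantities. This is a sketch: the class name `HSSMConfig` and its methods are hypothetical and are not the repository's API. The expert weight count assumes each expert is two bias-free linear layers (d_model to expert_dim and back), which this card does not confirm, so it is a rough estimate and not a breakdown of the reported 73.8M parameters.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HSSMConfig:
    """Released hyperparameters, copied from the configuration listed above."""
    vocab_size: int = 20000
    d_model: int = 512
    d_state: int = 32
    num_blocks: int = 6
    num_experts: int = 8
    top_k: int = 2
    chunk_size: int = 4
    expert_dim: int = 1024

    def chunked_length(self, num_tokens: int) -> int:
        """Number of chunks for a token sequence, counting a trailing partial chunk."""
        return math.ceil(num_tokens / self.chunk_size)

    def active_expert_fraction(self) -> float:
        """Share of experts in a block that run for a single routed representation."""
        return self.top_k / self.num_experts

    def embedding_weights(self) -> int:
        """Size of the token embedding table."""
        return self.vocab_size * self.d_model

    def expert_weights_total(self) -> int:
        """Assumption: each expert is d_model -> expert_dim -> d_model with no biases."""
        per_expert = 2 * self.d_model * self.expert_dim
        return per_expert * self.num_experts * self.num_blocks


cfg = HSSMConfig()
print(cfg.chunked_length(128))       # 32
print(cfg.active_expert_fraction())  # 0.25
print(cfg.embedding_weights())       # 10240000
print(cfg.expert_weights_total())    # 50331648
```

Under these assumptions most of the weights sit in the experts, while only a quarter of them are active for any single routed representation.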
How HSSM Works Internally
At a high level, HSSM processes text as follows:
- Tokens are embedded into a continuous space.
- Neighboring tokens are grouped into chunks.
- Chunk representations are passed through repeated hierarchical blocks.
- Inside each block, selective state space dynamics model ordered sequence behavior.
- Sparse expert routing expands feed-forward capacity using only a small subset of experts per step.
- Final logits are produced for autoregressive next-token generation.
This creates a hybrid inductive bias:
- hierarchical because tokens are compressed into chunk-level structure
- state-space based because sequential dynamics are modeled through learned latent state transitions
- sparse expert based because only a subset of experts is activated for each representation
Known Limitations
Because this is a pretrained checkpoint and not a final instruction-tuned release, users may observe:
- repetitive continuations
- weak dialogue alignment
- unstable chat behavior on open-ended prompts
- sensitivity to tokenizer choice
For stronger conversational quality, this checkpoint should be further fine-tuned on dialogue or instruction data.
Files in This Repository
- hssm_fineweb_edu_final.pt: pretrained HSSM checkpoint
- simple_tokenizer_20k.json: tokenizer file used with this release
- HSSM.png: architecture image shown in this model card
Example Loading (PyTorch)
import torch
from hssm_pretrained_chat import load_pretrained, generate_reply

tokenizer, model = load_pretrained(
    "hssm_fineweb_edu_final.pt",
    "simple_tokenizer_20k.json",
    device="cpu",
)

reply = generate_reply(
    model=model,
    tokenizer=tokenizer,
    prompt="What is machine learning?",
    max_length=48,
    temperature=0.3,
    top_k=12,
    top_p=0.78,
    repetition_penalty=1.45,
    no_repeat_ngram_size=4,
)

print(reply)
Repository / Author
- Model name: HSSM
- Publisher: DevHunterAI
- Checkpoint type: pretrained public release
Citation
If you use this release in experiments, please cite the model repository and mention the FineWeb-Edu pretraining source.
