leonsarmiento/GLM-4.7-Flash-5bit-mlx

This model was converted to MLX format from zai-org/GLM-4.7-Flash using mixed 8/5-bit quantization optimized for Apple Silicon.

GLM-4.7-Flash is a 31B-parameter text-only MoE (Mixture of Experts) model with 64 routed experts (4 active per token + 1 shared expert), MLA-style attention with LoRA-rank Q/KV compression, and speculative decoding support (MTP). Despite 31B total parameters, only ~3B are activated per token for efficient inference.

Mixed Quantization Strategy

This model uses layer-aware mixed-bit quantization that allocates higher precision to sensitive layers and lower precision to bulk parameters, maximizing quality per gigabyte.

Bit Depth Layers Rationale
8-bit embed_tokens, lm_head, router gate + e_score_correction_bias, shared_experts, self_attn (MLA), dense mlp (layer 0), layernorms All critical layers preserved at full 8-bit — embeddings, routing, shared representation, attention, and the dense MLP
5-bit switch_mlp (routed experts) Bulk of parameters, only 4 of 64 experts active per token (6.25%) — natural redundancy tolerates lower precision

Why this matters for MoE

In Mixture of Experts models, the router gate determines which experts handle each token. A poorly quantized router sends tokens to the wrong experts, cascading errors through the entire forward pass. The shared expert processes all tokens regardless of routing, making it equally critical.

By preserving the router, shared expert, and all attention layers at 8-bit, while quantizing the 64 routed experts to 5-bit, we maintain routing accuracy, shared representation quality, and full attention fidelity while achieving significant compression. The 5-bit variant provides higher fidelity for the routed experts compared to the 4-bit version, at the cost of additional ~3 GB.

Quantization Details

Metric Value
Quantization type Mixed 8/5-bit
Average 5.718 bits per weight
Group size 64
Method mlx_lm with custom quant_predicate
Total output size ~20 GB (from ~62.5 GB BF16)
Compression ratio ~3.1×

Recommended Inference Parameters

Parameter Value
temperature 0.2
top_k 50
top_p 0.95
min_p 0.01
repeat_penalty disabled

LM Studio Jinja template or oMLX custom kwargs

add these flags to the top of the jinja template or as custom kwargs to use this model in the way it was intended by GLM:

{%- set enable_thinking = true -%}
{%- set clear_thinking = false -%}

Use with mlx-lm

pip install mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("leonsarmiento/GLM-4.7-Flash-5bit-mlx")

prompt = "Hello, how are you?"

response = generate(model, tokenizer, prompt=prompt, temp=0.2, top_k=50, top_p=0.95)
print(response)
Downloads last month
285
Safetensors
Model size
30B params
Tensor type
BF16
·
U32
·
F32
·
MLX
Hardware compatibility
Log In to add your hardware

5-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for leonsarmiento/GLM-4.7-Flash-5bit-mlx

Quantized
(83)
this model