Instructions to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("leonsarmiento/GLM-4.7-Flash-5bit-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with Pi:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"
```
Configure the model in Pi
```bash
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "leonsarmiento/GLM-4.7-Flash-5bit-mlx" }
      ]
    }
  }
}
```
Run Pi
```bash
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with Hermes Agent:
Start the MLX server
```bash
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"
```
Configure Hermes
```bash
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default leonsarmiento/GLM-4.7-Flash-5bit-mlx
```
Run Hermes
hermes
- MLX LM
How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with MLX LM:
Generate or start a chat session
```bash
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"
```
Run an OpenAI-compatible server
```bash
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"

# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "leonsarmiento/GLM-4.7-Flash-5bit-mlx",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
leonsarmiento/GLM-4.7-Flash-5bit-mlx
This model was converted to MLX format from zai-org/GLM-4.7-Flash using mixed 8/5-bit quantization optimized for Apple Silicon.
GLM-4.7-Flash is a 31B-parameter text-only MoE (Mixture of Experts) model with 64 routed experts (4 active per token + 1 shared expert), MLA-style attention with LoRA-rank Q/KV compression, and speculative decoding support (MTP). Despite 31B total parameters, only ~3B are activated per token for efficient inference.
Mixed Quantization Strategy
This model uses layer-aware mixed-bit quantization that allocates higher precision to sensitive layers and lower precision to bulk parameters, maximizing quality per gigabyte.
| Bit Depth | Layers | Rationale |
|---|---|---|
| 8-bit | embed_tokens, lm_head, router gate + e_score_correction_bias, shared_experts, self_attn (MLA), dense mlp (layer 0), layernorms | All critical layers preserved at full 8-bit: embeddings, routing, shared representation, attention, and the dense MLP |
| 5-bit | switch_mlp (routed experts) | Bulk of parameters; only 4 of 64 experts are active per token (6.25%), so natural redundancy tolerates lower precision |
Why this matters for MoE
In Mixture of Experts models, the router gate determines which experts handle each token. A poorly quantized router sends tokens to the wrong experts, cascading errors through the entire forward pass. The shared expert processes all tokens regardless of routing, making it equally critical.
By preserving the router, shared expert, and all attention layers at 8-bit, while quantizing the 64 routed experts to 5-bit, we maintain routing accuracy, shared representation quality, and full attention fidelity while achieving significant compression. The 5-bit variant gives the routed experts higher fidelity than the 4-bit version, at the cost of roughly 3 GB of additional size.
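The allocation described above can be sketched as a predicate of the kind passed to mlx_lm's quantization step. The card does not publish the actual predicate, so this is a hypothetical reconstruction: the function name `mixed_8_5_predicate`, the substring match on `switch_mlp`, and the example module paths are assumptions, and the exact callback signature expected by mlx-lm may differ between releases.

```python
# Hypothetical reconstruction of a layer-aware quantization predicate.
# Assumption: routed-expert weights live under a module path containing
# "switch_mlp" (the name used in the table above); every other quantizable
# layer is treated as sensitive and kept at 8-bit.

GROUP_SIZE = 64  # group size reported in the card


def mixed_8_5_predicate(path, module=None, *unused):
    """Return per-layer quantization settings for a module path.

    Extra positional arguments are accepted and ignored so the sketch does
    not depend on the exact callback signature of a given mlx-lm release.
    """
    if "switch_mlp" in path:
        # Routed experts: bulk of the parameters, lower precision.
        return {"bits": 5, "group_size": GROUP_SIZE}
    # Embeddings, lm_head, attention, shared experts, dense MLP, router.
    return {"bits": 8, "group_size": GROUP_SIZE}


# Illustrative module paths (assumed, not read from the checkpoint).
example_paths = [
    "model.embed_tokens",
    "model.layers.0.mlp.gate_proj",
    "model.layers.5.mlp.shared_experts.down_proj",
    "model.layers.5.mlp.switch_mlp.up_proj",
    "lm_head",
]
settings = {p: mixed_8_5_predicate(p)["bits"] for p in example_paths}
```

A function along these lines would be supplied as the custom `quant_predicate` mentioned in the quantization details; which modules are actually quantizable depends on the model implementation in mlx-lm.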
Quantization Details
| Metric | Value |
|---|---|
| Quantization type | Mixed 8/5-bit |
| Average | 5.718 bits per weight |
| Group size | 64 |
| Method | mlx_lm with custom quant_predicate |
| Total output size | ~20 GB (from ~62.5 GB BF16) |
| Compression ratio | ~3.1× |
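As a rough cross-check of the figures in the table, the arithmetic can be sketched as follows. It assumes that the BF16 size is in decimal gigabytes, that BF16 uses 2 bytes per weight, and that the reported average of 5.718 bits per weight already includes quantization scales and biases; none of these assumptions is stated in the card.

```python
# Back-of-the-envelope check of the table values (assumptions in comments).
GB = 1e9      # decimal gigabyte
GIB = 2**30   # binary gibibyte

bf16_size_gb = 62.5          # from the table
quantized_size_gb = 20.0     # from the table (approximate)
avg_bits_per_weight = 5.718  # from the table

# Ratio implied directly by the two sizes in the table.
ratio_from_table = bf16_size_gb / quantized_size_gb

# Parameter count implied by the BF16 size (assumes 2 bytes per weight).
param_count = bf16_size_gb * GB / 2

# Size implied by the average bits per weight.
estimated_bytes = param_count * avg_bits_per_weight / 8
estimated_gb = estimated_bytes / GB
estimated_gib = estimated_bytes / GIB

print(f"ratio from table sizes: {ratio_from_table:.3f}")
print(f"implied parameters:     {param_count / 1e9:.2f} B")
print(f"estimated size:         {estimated_gb:.1f} GB ({estimated_gib:.1f} GiB)")
```

Under these assumptions the bits-per-weight estimate lands slightly above 20 in decimal gigabytes and close to it in GiB, so the table's rounded size may be reported in binary units; that reading is a guess, not something the card confirms.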
Recommended Inference Parameters
| Parameter | Value |
|---|---|
| temperature | 0.2 |
| top_k | 50 |
| top_p | 0.95 |
| min_p | 0.01 |
| repeat_penalty | disabled |
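One way to apply these settings is to send them with each request to a locally running `mlx_lm.server`. The sketch below uses only the Python standard library. The base URL assumes the server's default port of 8080, the helper names `build_request` and `chat` are hypothetical, and whether the server honors `top_k` and `min_p` in the request body depends on the installed mlx-lm release, so treat those two fields as an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumed default for mlx_lm.server
MODEL_ID = "leonsarmiento/GLM-4.7-Flash-5bit-mlx"

# Recommended values from the table; repeat_penalty is left out (disabled).
RECOMMENDED = {"temperature": 0.2, "top_k": 50, "top_p": 0.95, "min_p": 0.01}


def build_request(user_message):
    """Build a chat-completions request carrying the recommended settings."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        **RECOMMENDED,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_message):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(user_message)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` would print the model's reply; `build_request` can be inspected on its own without a server.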
LM Studio Jinja template or oMLX custom kwargs
Add these flags to the top of the Jinja template, or pass them as custom kwargs, to run the model the way its GLM authors intended:
{%- set enable_thinking = true -%}
{%- set clear_thinking = false -%}
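For anyone editing a template file by script rather than by hand, the step can be sketched as a small helper that prepends the two flag lines to an existing template string. The helper name `with_thinking_flags` is hypothetical, and where the template is stored depends on the app, so reading and writing the file is left out.

```python
# The two flag lines from the card, exactly as they should appear at the
# top of the Jinja chat template.
THINKING_FLAGS = (
    "{%- set enable_thinking = true -%}\n"
    "{%- set clear_thinking = false -%}\n"
)


def with_thinking_flags(template: str) -> str:
    """Return the template with the flag lines at the top.

    The function is idempotent: a template that already starts with the
    flags is returned unchanged.
    """
    if template.startswith(THINKING_FLAGS):
        return template
    return THINKING_FLAGS + template


# Placeholder template body for illustration only.
patched = with_thinking_flags("{{ messages }}")
```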
Use with mlx-lm
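```bash
pip install --upgrade mlx-lm
```
```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("leonsarmiento/GLM-4.7-Flash-5bit-mlx")

# Wrap the prompt in the model's chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object
sampler = make_sampler(temp=0.2, top_p=0.95, min_p=0.01, top_k=50)
response = generate(model, tokenizer, prompt=prompt, sampler=sampler)
print(response)
```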
Model tree for leonsarmiento/GLM-4.7-Flash-5bit-mlx
Base model
zai-org/GLM-4.7-Flash