Instructions for using leonsarmiento/GLM-4.7-Flash-5bit-mlx with libraries and local apps.

Libraries

How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("leonsarmiento/GLM-4.7-Flash-5bit-mlx")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "leonsarmiento/GLM-4.7-Flash-5bit-mlx"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default leonsarmiento/GLM-4.7-Flash-5bit-mlx

Run Hermes

hermes

MLX LM

How to use leonsarmiento/GLM-4.7-Flash-5bit-mlx with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "leonsarmiento/GLM-4.7-Flash-5bit-mlx"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "leonsarmiento/GLM-4.7-Flash-5bit-mlx",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'

leonsarmiento/GLM-4.7-Flash-5bit-mlx

This model was converted to MLX format from zai-org/GLM-4.7-Flash using mixed 8/5-bit quantization optimized for Apple Silicon.

GLM-4.7-Flash is a 31B-parameter text-only MoE (Mixture of Experts) model with 64 routed experts (4 active per token + 1 shared expert), MLA-style attention with LoRA-rank Q/KV compression, and speculative decoding support (MTP). Despite 31B total parameters, only ~3B are activated per token for efficient inference.

Mixed Quantization Strategy

This model uses layer-aware mixed-bit quantization that allocates higher precision to sensitive layers and lower precision to bulk parameters, maximizing quality per gigabyte.

Bit Depth	Layers	Rationale
8-bit	`embed_tokens`, `lm_head`, router `gate` + `e_score_correction_bias`, `shared_experts`, `self_attn` (MLA), dense `mlp` (layer 0), layernorms	All critical layers preserved at full 8-bit — embeddings, routing, shared representation, attention, and the dense MLP
5-bit	`switch_mlp` (routed experts)	Bulk of parameters, only 4 of 64 experts active per token (6.25%) — natural redundancy tolerates lower precision
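The allocation in the table can be sketched as a small rule that maps a layer path to a bit width. This is a minimal illustration, not the author's actual `quant_predicate`: the function names and the example paths are hypothetical, and only the `switch_mlp` marker and the group size of 64 come from this card. A hook such as mlx_lm's `quant_predicate`, whose exact call signature depends on the installed release, could consume a rule of this shape.

```python
# Marker taken from the table above; how it appears inside full
# parameter paths is an assumption made for illustration.
ROUTED_EXPERT_MARKER = "switch_mlp"
GROUP_SIZE = 64  # group size reported in this card


def bits_for_layer(path: str) -> int:
    """Return 5 bits for routed-expert layers and 8 bits for everything else."""
    return 5 if ROUTED_EXPERT_MARKER in path else 8


def quant_params(path: str) -> dict:
    """Bundle the per-layer settings a quantization hook might expect."""
    return {"bits": bits_for_layer(path), "group_size": GROUP_SIZE}


# Hypothetical layer paths, used only to exercise the rule.
for example in (
    "model.layers.3.mlp.switch_mlp.gate_proj",
    "model.layers.3.mlp.shared_experts.up_proj",
    "model.layers.3.self_attn.o_proj",
):
    print(example, quant_params(example))
```

Keeping the rule as a plain function of the path makes it easy to check which layers land in which bucket before running a full conversion.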

Why this matters for MoE

In Mixture of Experts models, the router gate determines which experts handle each token. A poorly quantized router sends tokens to the wrong experts, cascading errors through the entire forward pass. The shared expert processes all tokens regardless of routing, making it equally critical.

By preserving the router, shared expert, and all attention layers at 8-bit, while quantizing the 64 routed experts to 5-bit, we maintain routing accuracy, shared representation quality, and full attention fidelity while achieving significant compression. The 5-bit variant gives the routed experts higher fidelity than the 4-bit version, at the cost of roughly 3 GB of additional size.
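To make the routing argument concrete, here is a deliberately simplified sketch of top-k expert selection. The scores, the 8-expert layout, and the `top_k_experts` helper are all hypothetical; the real model routes over 64 experts with its own scoring and bias correction. The point is only that a small numeric error in router scores can change which experts are selected.

```python
def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical router scores for one token across 8 experts.
exact_scores = [0.91, 0.12, 0.55, 0.54, 0.05, 0.73, 0.30, 0.62]
# The same scores with small errors, standing in for coarse quantization.
noisy_scores = [0.90, 0.12, 0.53, 0.56, 0.05, 0.74, 0.30, 0.62]

exact_choice = top_k_experts(exact_scores, k=4)
noisy_choice = top_k_experts(noisy_scores, k=4)

print("exact:", exact_choice)
print("noisy:", noisy_choice)
print("experts that changed:", set(exact_choice) ^ set(noisy_choice))
```

Experts 2 and 3 sit only 0.01 apart in the exact scores, so an error of a few hundredths is enough to swap one for the other. Close calls like this are where router precision matters most.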

Quantization Details

Metric	Value
Quantization type	Mixed 8/5-bit
Average	5.718 bits per weight
Group size	64
Method	`mlx_lm` with custom `quant_predicate`
Total output size	~20 GB (from ~62.5 GB BF16)
Compression ratio	~3.1×
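The reported average of 5.718 bits per weight can be related to the 8/5-bit split with some rough arithmetic. The sketch below assumes that each group of 64 weights also stores one 16-bit scale and one 16-bit bias, which adds 0.5 bits per weight, and it ignores any parameters left unquantized. Under those assumptions it estimates the share of weights held at 8-bit; treat the result as an approximation, not a measured figure.

```python
GROUP_SIZE = 64
# Assumption: one 16-bit scale plus one 16-bit bias stored per group.
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE


def effective_bits(nominal_bits):
    """Nominal bit width plus the per-group scale and bias overhead."""
    return nominal_bits + OVERHEAD_BITS


def fraction_at_high_precision(average_bits, low=5, high=8):
    """Solve average = f * eff(high) + (1 - f) * eff(low) for f."""
    low_eff = effective_bits(low)
    high_eff = effective_bits(high)
    return (average_bits - low_eff) / (high_eff - low_eff)


share_8bit = fraction_at_high_precision(5.718)
print(f"Estimated share of weights at 8-bit: {share_8bit:.1%}")
```

The estimate comes out at roughly 7%, which is consistent with the routed experts holding the bulk of the parameters.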

Recommended Inference Parameters

Parameter	Value
`temperature`	0.2
`top_k`	50
`top_p`	0.95
`min_p`	0.01
`repeat_penalty`	disabled
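One way to apply these values when talking to a running `mlx_lm.server` is to send them in the request body. The sketch below uses only the Python standard library and assumes the server listens on port 8080, as in the Pi and Hermes configurations above. The helper names are hypothetical. Only `temperature` and `top_p` are sent, since they are standard OpenAI fields; whether the server honors `top_k` or `min_p` depends on the mlx-lm release, so they are left out here.

```python
import json
import urllib.request

# Assumed endpoint: port 8080, matching the Pi and Hermes configs above.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "leonsarmiento/GLM-4.7-Flash-5bit-mlx"


def build_chat_request(user_text, temperature=0.2, top_p=0.95):
    """Build an OpenAI-style chat request carrying the recommended sampling values."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_text}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server started, `print(chat("Hello"))` would print the model's reply.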

LM Studio Jinja template or oMLX custom kwargs

Add these flags at the top of the Jinja template, or pass them as custom kwargs, to run the model the way its GLM authors intended:

{%- set enable_thinking = true -%}
{%- set clear_thinking = false -%}

Use with mlx-lm

pip install mlx-lm

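from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("leonsarmiento/GLM-4.7-Flash-5bit-mlx")

# Wrap the user message in the model's chat template
messages = [{"role": "user", "content": "Hello, how are you?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings through a sampler object
sampler = make_sampler(temp=0.2, top_p=0.95, min_p=0.01, top_k=50)

response = generate(model, tokenizer, prompt=prompt, sampler=sampler)
print(response)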

Model size: 30B params (Safetensors)
Tensor types: BF16, U32, F32
Base model: zai-org/GLM-4.7-Flash