Update MXFP4 format to compressed-tensors

#3
by mgoin - opened

Hey @GadflyII great work with all your checkpoints and features on vLLM!

I wanted to let you know, specifically for mxfp4, that we'd like to keep mxfp4.py specific to gpt-oss in upstream for a few reasons, but mostly because that model is the only one using it at the moment.

We have support for mxfp4 w4a16, the same as gpt-oss but generalized through the compressed-tensors pathway (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_mxfp4.py). See this example for making a model of your own: https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a16_fp4/mxfp4/qwen3_example.py
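For anyone unfamiliar with the format itself: MXFP4 (from the OCP Microscaling spec) stores each 32-element block as one shared power-of-two scale plus a 4-bit E2M1 value per element. A minimal pure-Python sketch of the per-block quantize/dequantize round trip — the scale choice below (map the block max onto E2M1's max of 6.0) is one common convention, not necessarily the exact one llm-compressor uses:

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately)
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block (spec block size is 32) to MXFP4:
    a shared power-of-two scale plus one E2M1 value per element."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # One common scale choice: smallest power of two so amax/scale <= 6.0
    scale = 2.0 ** math.ceil(math.log2(amax / 6.0))
    q = [
        math.copysign(min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v)), x)
        for x in block
    ]
    return scale, q

def dequantize_block(scale, q):
    """Recover approximate values: element * shared scale."""
    return [scale * v for v in q]

scale, q = quantize_block([0.1, -0.2, 0.3, 0.75])
deq = dequantize_block(scale, q)
```

The shared scale being a pure power of two (E8M0 in the spec) is what keeps dequantization a cheap exponent add on hardware.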

See the uploaded model here, which is tested in CI: https://huggingface.co/nm-testing/Qwen3-30B-A3B-MXFP4A16

Remaking your checkpoint in that format should let you run the model on upstream vLLM as-is. LMK what you think!

Hey, thanks for reaching out. I will absolutely do that and see how it goes. I'm deep in a training run, so it will be a few days at least; I need more hardware 😞
