Qwen3-VL-2B-Instruct-GPTQ-Int4
This version of Qwen3-VL-2B-Instruct-GPTQ-Int4 has been converted to run on the Axera NPU using w4a16 quantization.
Compatible with Pulsar2 version: 5.0
Conversion tool links:
If you are interested in model conversion, you can export the axmodel yourself from the original repos:
- https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
- https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
Pulsar2 link: How to Convert LLM from Huggingface to axmodel
Supported Platforms
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Image Process
| Chip | Input size | Images | Image encoder | TTFT (168 tokens) | Decode (w4a16) | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 384*384 | 1 | 238 ms | 323 ms | 14.1 tokens/sec | 2.5 GiB | 3.3 GiB |
Video Process
| Chip | Input size | Images | Image encoder | TTFT (600 tokens) | Decode (w4a16) | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 384*384 | 8 | 751 ms | 843 ms | 14.1 tokens/sec | 2.5 GiB | 3.3 GiB |
Image Process (Image Encoder U8+U16 Quantization)
| Chip | Input size | Images | Image encoder | TTFT (168 tokens) | Decode (w4a16) | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 384*384 | 1 | 135 ms | 323 ms | 14.1 tokens/sec | 2.5 GiB | 3.3 GiB |
Video Process (Image Encoder U8+U16 Quantization)
| Chip | Input size | Images | Image encoder | TTFT (600 tokens) | Decode (w4a16) | CMM | Flash |
|---|---|---|---|---|---|---|---|
| AX650 | 384*384 | 8 | 466 ms | 843 ms | 14.1 tokens/sec | 2.5 GiB | 3.3 GiB |
The CMM value refers to the CMM (DDR) memory consumed at runtime. Ensure that the CMM memory allocation on the development board is greater than this value.
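As a rough guide to combining the figures above, end-to-end response time is approximately TTFT plus the number of output tokens divided by the decode rate. The sketch below is a back-of-the-envelope estimate using the AX650 image-process row, not a benchmark:

```python
# Rough end-to-end latency estimate from the table numbers above.
# ttft_ms and decode_tps are taken from the AX650 image-process row;
# treat this as an estimate, not a measured result.

def estimated_latency_ms(ttft_ms: float, decode_tps: float, output_tokens: int) -> float:
    """TTFT plus the time to decode `output_tokens` at `decode_tps` tokens/sec."""
    return ttft_ms + output_tokens / decode_tps * 1000.0

if __name__ == "__main__":
    # Image-process row: TTFT 323 ms, decode 14.1 tokens/sec, 100 output tokens.
    print(f"{estimated_latency_ms(323, 14.1, 100):.0f} ms")  # ~7415 ms
```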
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the executable built by GitHub Actions CI (for users without a build environment):
If you do not have a build environment, go to
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
and download the latest CI-built executable (axllm), then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Model Download (Hugging Face)
Create the model directory, enter it, then download into it:
mkdir -p AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4
cd AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4
hf download AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4 --local-dir .
# structure of the downloaded files
tree -L 3
`-- AXERA-TECH
`-- Qwen3-VL-2B-Instruct-GPTQ-Int4
|-- Qwen3-VL-2B-Instruct_vision.axmodel
|-- Qwen3-VL-2B-Instruct_vision_1280x736.axmodel
|-- Qwen3-VL-2B-Instruct_vision_640x640.axmodel
|-- Qwen3-VL-2B-Instruct_vision_u8.axmodel
|-- README.md
|-- config.json
|-- image.png
|-- model.embed_tokens.weight.bfloat16.bin
|-- post_config.json
|-- qwen3_tokenizer.txt
|-- qwen3_vl_text_p128_l0_together.axmodel
...
|-- qwen3_vl_text_p128_l9_together.axmodel
|-- qwen3_vl_text_post.axmodel
`-- vision_cache
3 directories, 39 files
Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or AX650N DEMO Board
Run (CLI)
root@ax650:~# axllm run AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4/
[I][ Init][ 138]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [11.50s<11.88s, 2.61 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][ Init][ 199]: max_token_len : 2047
[I][ Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][ Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][ Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][ Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][ Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][ Init][ 214]: prefill_max_token_num : 1152
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [11.50s<11.50s, 2.70 count/s] embed_selector init ok
[W][ Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][ Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][ Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][ Init][ 672]: VisionModule deepstack enabled: layers=3
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 272]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> who are you
image >>
[I][ SetKVCache][ 406]: prefill_grpid:2 kv_cache_num:128 precompute_len:0 input_num_token:22
[I][ SetKVCache][ 408]: current prefill_max_token_num:1152
[I][ SetKVCache][ 409]: first run
[I][ Run][ 457]: input token num : 22, prefill_split_num : 1
[I][ Run][ 497]: prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=22
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=0
[I][ Run][ 627]: ttft: 174.42 ms
I am Qwen, a large-scale language model developed by the Tongyi Lab of Alibaba Group. I can answer questions, write stories, create essays, and more. I am designed to be helpful, harmless, and honest. I hope to assist you in any way I can!
[N][ Run][ 709]: hit eos,avg 10.48 token/s
[I][ GetKVCache][ 380]: precompute_len:79, remaining:1073
prompt >> describe the image
image >> ./AXERA-TECH/Qwen3-VL-2B-Instruct-AX650-c128_p1152-int4/image.png
[I][ EncodeForContent][ 971]: Qwen-VL pixel_values[0] bytes=884736 min=0 max=241 (w=384 h=384 tp=2 ps=16 sm=2)
[I][ EncodeForContent][ 994]: vision cache store: ./AXERA-TECH/Qwen3-VL-2B-Instruct-AX650-c128_p1152-int4/image.png
[I][ SetKVCache][ 406]: prefill_grpid:3 kv_cache_num:256 precompute_len:79 input_num_token:159
[I][ SetKVCache][ 408]: current prefill_max_token_num:1024
[I][ Run][ 457]: input token num : 159, prefill_split_num : 2
[I][ Run][ 497]: prefill chunk p=0 history_len=79 grpid=2 kv_cache_num=128 input_tokens=128
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
[I][ Run][ 497]: prefill chunk p=1 history_len=207 grpid=3 kv_cache_num=256 input_tokens=31
[I][ Run][ 519]: prefill indices shape: p=1 idx_elems=384 idx_rows=3 pos_rows=3
[I][ Run][ 627]: ttft: 379.97 ms
This image depicts three astronauts in white space suits standing in a dense, leafy forest. The scene is set in a dark, shadowy environment, with the astronauts appearing to be in a natural, possibly alien, environment. The image has a monochromatic, almost grayscale color scheme, giving it a mysterious and somber atmosphere. The astronauts are positioned in the center of the frame, with one standing upright and the other two slightly bent, as if they are exploring or searching for something in the dense foliage. The overall mood of the image is mysterious and contemplative.
[N][ Run][ 709]: hit eos,avg 10.33 token/s
[I][ GetKVCache][ 380]: precompute_len:239, remaining:913
prompt >> how many people in the image?
image >>
[I][ EncodeForContent][ 926]: vision cache hit (mem): ./AXERA-TECH/Qwen3-VL-2B-Instruct-AX650-c128_p1152-int4/image.png
[I][ SetKVCache][ 406]: prefill_grpid:4 kv_cache_num:384 precompute_len:239 input_num_token:74
[I][ SetKVCache][ 408]: current prefill_max_token_num:896
[I][ Run][ 457]: input token num : 74, prefill_split_num : 1
[I][ Run][ 497]: prefill chunk p=0 history_len=239 grpid=3 kv_cache_num=256 input_tokens=74
[I][ Run][ 519]: prefill indices shape: p=0 idx_elems=384 idx_rows=3 pos_rows=3
[I][ Run][ 627]: ttft: 193.78 ms
This image depicts three astronauts in white space suits standing in a dense, leafy forest. The scene is set in a dark, shadowy environment, with the astronauts appearing to be in a natural, possibly alien, environment. The image has a monochromatic, almost grayscale color scheme, giving it a mysterious and somber atmosphere. The astronauts are positioned in the center of the frame, with one standing upright and the other two slightly bent, as if they are exploring or searching for something in the dense foliage. The overall mood of the image is mysterious and contemplative.
[N][ Run][ 709]: hit eos,avg 10.48 token/s
[I][ GetKVCache][ 380]: precompute_len:410, remaining:742
prompt >> q
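The sampling settings printed at init ("load config" above) are disabled by default. A minimal sketch for enabling top-p sampling, assuming these fields are read from the model directory's post_config.json (the file appears in the download listing and the field names match the dump, but whether axllm reads exactly this file should be verified against your setup; `enable_top_p` is a helper name used here for illustration):

```python
import json
from pathlib import Path

def enable_top_p(cfg_path: str, top_p: float = 0.8) -> dict:
    """Set enable_top_p_sampling/top_p in a config JSON and write it back."""
    p = Path(cfg_path)
    cfg = json.loads(p.read_text())
    cfg["enable_top_p_sampling"] = True
    cfg["top_p"] = top_p
    p.write_text(json.dumps(cfg, indent=2))
    return cfg

if __name__ == "__main__":
    # Assumed location: post_config.json inside the downloaded model directory.
    enable_top_p("AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4/post_config.json")
```

Restart axllm after editing so the new settings are picked up at init.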
Start the Server (OpenAI-Compatible)
root@ax650:~# axllm serve AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4
[I][ Init][ 138]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][ Init][ 199]: max_token_len : 2047
[I][ Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][ Init][ 205]: prefill_token_num : 128
[I][ Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][ Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][ Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][ Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][ Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][ Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][ Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][ Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][ Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][ Init][ 214]: prefill_max_token_num : 1152
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][ Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][ Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][ Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][ Init][ 672]: VisionModule deepstack enabled: layers=3
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4
OpenAI API Example
from openai import OpenAI
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4"
messages = [
{"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
{"role": "user", "content": "hello"},
]
client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
model=MODEL,
messages=messages,
)
print(completion.choices[0].message.content)
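The CLI demo above shows the model accepts images, so a vision request through the server may also work. The sketch below assumes the server accepts the standard OpenAI `image_url` content part with a base64 data URL; that assumption is not confirmed for this server, and `to_data_url` is a helper name introduced here:

```python
import base64
from pathlib import Path

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4"

def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (helper for this sketch)."""
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:image/png;base64,{data}"

if __name__ == "__main__":
    from openai import OpenAI  # requires the server above to be running

    client = OpenAI(api_key="not-needed", base_url=API_URL)
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": "describe the image"},
            # image_url with a data URL is the standard OpenAI content form;
            # support by this particular server is an assumption.
            {"type": "image_url", "image_url": {"url": to_data_url("image.png")}},
        ]},
    ]
    completion = client.chat.completions.create(model=MODEL, messages=messages)
    print(completion.choices[0].message.content)
```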
OpenAI Streaming Example
from openai import OpenAI
API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4"
messages = [
{"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
{"role": "user", "content": "hello"},
]
client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
model=MODEL,
messages=messages,
stream=True,
)
print("assistant:")
for ev in stream:
delta = getattr(ev.choices[0], "delta", None)
if delta and getattr(delta, "content", None):
print(delta.content, end="", flush=True)
print("\n")
Model tree for AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4
Base model: Qwen/Qwen3-VL-2B-Instruct