Qwen3-0.6B-Int8

This version of Qwen3-0.6B-Int8 has been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 4.2 (not yet released)

Conversion tool links:

If you are interested in model conversion, you can export the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-0.6B

Pulsar2 documentation: How to Convert an LLM from Hugging Face to axmodel

AXera NPU LLM Runtime

Supported platforms

| Chips | w8a16 | CMM | Flash |
|-------|-------|-----|-------|
| AX650 | 20 tokens/s | 1.3 GiB | 1.2 GiB |
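These throughput and memory numbers give a quick way to budget reply latency. A back-of-the-envelope sketch (the ~600 ms time-to-first-token figure is taken from the CLI transcript further below; both numbers vary with prompt length):

```python
# Rough reply-latency estimate from the table's decode throughput.
# Assumption: throughput stays constant over the whole reply.
def reply_seconds(num_tokens: int, tokens_per_sec: float = 20.0,
                  ttft_ms: float = 600.0) -> float:
    """Time to first token plus steady-state decode time, in seconds."""
    return ttft_ms / 1000.0 + num_tokens / tokens_per_sec

print(f"{reply_seconds(256):.1f} s for a 256-token reply")  # ~13.4 s
```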

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (defaults to the axllm branch):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

Download the latest CI-built executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

Create the model directory, enter it, and download into it:

mkdir -p AXERA-TECH/Qwen3-0.6B
cd AXERA-TECH/Qwen3-0.6B
hf download AXERA-TECH/Qwen3-0.6B --local-dir .

# structure of the downloaded files (tree run from the directory where mkdir was executed)
tree -L 3
.
└── AXERA-TECH
    └── Qwen3-0.6B
        ├── README.md
        ├── config.json
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_p128_l0_together.axmodel
...
        ├── qwen3_p128_l9_together.axmodel
        ├── qwen3_post.axmodel
        └── qwen3_tokenizer.txt

2 directories, 34 files
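After downloading, a quick check that the key files landed can save a failed init later. A minimal sketch; the required-file list is inferred from the tree output above, not an official manifest:

```python
from pathlib import Path

# Files the runtime appears to need, judging by the tree output above
# (assumption -- not an official manifest).
REQUIRED = [
    "config.json",
    "post_config.json",
    "model.embed_tokens.weight.bfloat16.bin",
    "qwen3_post.axmodel",
    "qwen3_tokenizer.txt",
]

def missing_files(model_dir: str) -> list[str]:
    """Return required files absent from model_dir, plus a marker
    if no per-layer *_together.axmodel files are found."""
    root = Path(model_dir)
    missing = [f for f in REQUIRED if not (root / f).is_file()]
    if not list(root.glob("qwen3_p128_l*_together.axmodel")):
        missing.append("qwen3_p128_l*_together.axmodel")
    return missing

print(missing_files("AXERA-TECH/Qwen3-0.6B"))
```

An empty list means the download looks complete; anything else names what is missing.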

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or an AX650N demo board

Run (CLI)

(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-0.6B/
[I][                            Init][ 127]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [2.35s<2.42s, 12.79 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][                            Init][ 188]: max_token_len : 2559
[I][                            Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 194]: prefill_token_num : 128
[I][                            Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 203]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [2.35s<2.35s, 13.21 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 224]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
----------------------------------------
prompt >> who are you
[I][                      SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:22
[I][                      SetKVCache][ 359]: current prefill_max_token_num:2048
[I][                      SetKVCache][ 360]: first run
[I][                             Run][ 412]: input token num : 22, prefill_split_num : 1
[I][                             Run][ 474]: ttft: 586.40 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Since I'm an AI assistant, I should acknowledge their question and explain my purpose. I should mention that I'm here to help and that I can assist with various tasks. I should keep the response friendly and open-ended to encourage further interaction. Let me make sure the language is clear and natural.
</think>

I'm an AI assistant designed to help you with a wide range of questions and tasks. How can I assist you today? 😊

[N][                             Run][ 554]: hit eos,avg 15.63 token/s

[I][                      GetKVCache][ 331]: precompute_len:130, remaining:1918
prompt >> q
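The `load_config` block in the transcript shows every sampling switch disabled, so decoding is effectively greedy (argmax). When the flags are enabled, the `top_k`/`top_p`/`temperature` values combine roughly as sketched below — an illustration of the standard sampling technique with the config's defaults, not ax-llm's actual implementation:

```python
import math, random

def sample_next(logits, top_k=10, top_p=0.8, temperature=0.9, rng=random):
    """Pick a token id from raw logits using temperature scaling,
    top-k filtering, then top-p (nucleus) truncation.
    A sketch of the standard technique, not ax-llm's code."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    # (probability, token_id) pairs, highest probability first.
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Keep the top-k candidates, then trim to the smallest nucleus
    # whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for p, i in probs[:top_k]:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept candidates and draw one.
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]
```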

Start the server (OpenAI-compatible)

(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-0.6B/
[I][                            Init][ 127]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [2.06s<2.13s, 14.58 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][                            Init][ 188]: max_token_len : 2559
[I][                            Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 194]: prefill_token_num : 128
[I][                            Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 203]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [2.06s<2.06s, 15.07 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 224]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-0.6B'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-0.6B

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-0.6B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)
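As the CLI transcript above shows, Qwen3 replies can begin with a `<think>…</think>` reasoning block. If you only want the final answer, you can strip it client-side; a minimal sketch:

```python
import re

def strip_think(text: str) -> str:
    """Remove a leading <think>...</think> block from a Qwen3 reply."""
    return re.sub(r"^\s*<think>.*?</think>\s*", "", text, flags=re.DOTALL)

reply = "<think>\nReasoning...\n</think>\n\nHello! How can I help?"
print(strip_think(reply))  # -> Hello! How can I help?
```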

OpenAI streaming client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-0.6B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")
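If you need the full reply as a string rather than printed output, the same delta-accumulation pattern can be factored into a helper. The mock chunk objects below only imitate the shape of the OpenAI streaming events for illustration:

```python
from types import SimpleNamespace

def collect_stream(chunks) -> str:
    """Accumulate the delta.content pieces of a chat-completion stream
    into the full reply (same pattern as the streaming loop above)."""
    parts = []
    for ev in chunks:
        delta = getattr(ev.choices[0], "delta", None)
        if delta and getattr(delta, "content", None):
            parts.append(delta.content)
    return "".join(parts)

# Mock chunks shaped like OpenAI streaming events (illustration only).
def chunk(text):
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )

print(collect_stream([chunk("Hel"), chunk("lo"), chunk(None)]))  # -> Hello
```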