# Qwen3-0.6B
This version of Qwen3-0.6B-Int8 has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.2 (not yet released)
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-0.6B
Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel
| Chip | w8a16 (decode speed) | CMM usage | Flash usage |
|---|---|---|---|
| AX650 | 20 tokens/sec | 1.3 GiB | 1.2 GiB |
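Here w8a16 means int8 weights with 16-bit activations. A minimal per-channel weight-quantization sketch for illustration only (Pulsar2's actual quantization scheme may differ):

```python
import numpy as np

def quantize_w8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a weight matrix.

    Returns int8 weights plus one fp16 scale per row; at inference the
    matmul effectively computes (int8 weight * fp16 scale) against
    16-bit activations.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate fp16 weight matrix from int8 + scales."""
    return q.astype(np.float16) * scale

# quick round-trip check on random weights
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_w8(w)
err = np.abs(dequantize(q, s).astype(np.float32) - w).max()
```

This halves weight storage relative to fp16 while keeping activations at full 16-bit precision, which matches the memory figures in the table above.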
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the prebuilt executable from GitHub Actions CI (for users without a build environment). Go to
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
and download the latest CI-built executable (axllm), then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
First create the model directory and cd into it, then download the files there:
mkdir -p AXERA-TECH/Qwen3-0.6B
cd AXERA-TECH/Qwen3-0.6B
hf download AXERA-TECH/Qwen3-0.6B --local-dir .
# structure of the downloaded files (tree run from the parent directory)
tree -L 3
.
└── AXERA-TECH
└── Qwen3-0.6B
├── README.md
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── post_config.json
├── qwen3_p128_l0_together.axmodel
...
├── qwen3_p128_l9_together.axmodel
├── qwen3_post.axmodel
└── qwen3_tokenizer.txt
2 directories, 34 files
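After downloading, you can sanity-check that every per-layer axmodel file is present. A small sketch based on the filename pattern in the listing above; the layer count of 28 is an assumption (28 layer files plus the 6 other files match the 34 files reported by tree):

```python
import re
from pathlib import Path

def check_layer_files(model_dir: str, num_layers: int = 28) -> list:
    """Return the sorted list of missing layer indices, scanning for
    files named qwen3_p128_l<idx>_together.axmodel in model_dir."""
    pat = re.compile(r"qwen3_p128_l(\d+)_together\.axmodel$")
    found = {int(m.group(1))
             for p in Path(model_dir).iterdir()
             if (m := pat.match(p.name))}
    return sorted(set(range(num_layers)) - found)

# usage: an empty list means all layer files were downloaded
# missing = check_layer_files("AXERA-TECH/Qwen3-0.6B")
```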
(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-0.6B/
[I][ Init][ 127]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [2.35s<2.42s, 12.79 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 194]: prefill_token_num : 128
[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][ Init][ 203]: prefill_max_token_num : 2048
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [2.35s<2.35s, 13.21 count/s] embed_selector init ok
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 224]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
----------------------------------------
prompt >> who are you
[I][ SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:22
[I][ SetKVCache][ 359]: current prefill_max_token_num:2048
[I][ SetKVCache][ 360]: first run
[I][ Run][ 412]: input token num : 22, prefill_split_num : 1
[I][ Run][ 474]: ttft: 586.40 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Since I'm an AI assistant, I should acknowledge their question and explain my purpose. I should mention that I'm here to help and that I can assist with various tasks. I should keep the response friendly and open-ended to encourage further interaction. Let me make sure the language is clear and natural.
</think>
I'm an AI assistant designed to help you with a wide range of questions and tasks. How can I assist you today? 😊
[N][ Run][ 554]: hit eos,avg 15.63 token/s
[I][ GetKVCache][ 331]: precompute_len:130, remaining:1918
prompt >> q
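The prefill bookkeeping in the transcript above can be reproduced. Judging purely from the printed log (not from the ax-llm source), the runtime picks the smallest prefill group whose prefill_max_kv_cache_num covers precompute_len plus the new input tokens, and splits the prompt into chunks of prefill_token_num (128):

```python
import math

# group table and chunk size as printed during init
GROUPS = {1: 1, 2: 512, 3: 1024, 4: 1536, 5: 2048}
PREFILL_TOKEN_NUM = 128

def pick_group(precompute_len: int, input_tokens: int) -> int:
    """Smallest group whose kv-cache capacity covers the request."""
    need = precompute_len + input_tokens
    for grpid, cap in sorted(GROUPS.items()):
        if cap >= need:
            return grpid
    raise ValueError("prompt exceeds prefill_max_token_num")

def prefill_split_num(input_tokens: int) -> int:
    """Number of 128-token prefill chunks the prompt is split into."""
    return math.ceil(input_tokens / PREFILL_TOKEN_NUM)
```

For the 22-token prompt in the log this yields group 2 (capacity 512) and a single prefill chunk, matching the `prefill_grpid:2` and `prefill_split_num : 1` lines.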
(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-0.6B/
[I][ Init][ 127]: LLM init start
tokenizer_type = 1
96% | ███████████████████████████████ | 30 / 31 [2.06s<2.13s, 14.58 count/s] init post axmodel ok,remain_cmm(8662 MB)
[I][ Init][ 188]: max_token_len : 2559
[I][ Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][ Init][ 194]: prefill_token_num : 128
[I][ Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][ Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][ Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][ Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][ Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][ Init][ 203]: prefill_max_token_num : 2048
[I][ Init][ 27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [2.06s<2.06s, 15.07 count/s] embed_selector init ok
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": false,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 224]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-0.6B'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-0.6B
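The load_config block printed at startup disables every sampling option, so decoding is greedy. The parameters map onto the usual sampling chain; the following is a sketch of how such a chain typically works, not ax-llm's actual implementation:

```python
import numpy as np

def sample(logits, history, cfg, rng):
    """One decoding step: repetition penalty -> temperature -> top-k -> top-p."""
    logits = logits.astype(np.float64).copy()
    if cfg["enable_repetition_penalty"]:
        # penalize tokens emitted within the last penalty_window outputs
        for t in set(history[-cfg["penalty_window"]:]):
            logits[t] = (logits[t] / cfg["repetition_penalty"] if logits[t] > 0
                         else logits[t] * cfg["repetition_penalty"])
    if cfg["enable_temperature"]:
        logits /= cfg["temperature"]
    if not (cfg["enable_top_k_sampling"] or cfg["enable_top_p_sampling"]):
        return int(np.argmax(logits))  # greedy, as in the printed config
    if cfg["enable_top_k_sampling"]:
        kth = np.sort(logits)[-cfg["top_k"]]
        logits[logits < kth] = -np.inf  # keep only the top_k logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if cfg["enable_top_p_sampling"]:
        # keep the smallest prefix of tokens whose mass reaches top_p
        order = np.argsort(probs)[::-1]
        cut = np.searchsorted(np.cumsum(probs[order]), cfg["top_p"]) + 1
        mask = np.zeros_like(probs)
        mask[order[:cut]] = 1
        probs = probs * mask / (probs * mask).sum()
    return int(rng.choice(len(probs), p=probs))
```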
Non-streaming chat completion example:

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-0.6B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)
print(completion.choices[0].message.content)
Streaming example:

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-0.6B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)
print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print()
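The sampling parameters printed at startup appear to come from post_config.json in the model directory (an assumption based on the filename; the keys match the load_config output exactly). To enable sampling, you would edit that file and restart, for example:

```json
{
    "enable_repetition_penalty": true,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": true,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}
```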