---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- llm
- nanbeige
---
Nanbeige Logo
# 1. Introduction

Nanbeige4-3B-Base is a **3B-parameter base model** in the fourth-generation Nanbeige LLM family. It demonstrates that even a compact model can reach advanced performance through continuous improvements in data quality and training methodology. When supervised fine-tuning (SFT) is performed on the same training data, our model significantly outperforms open-source models of the same size and even surpasses larger models such as Qwen3-8B.

* Technical Report: https://arxiv.org/pdf/2512.06266
# 2. Model Summary

**Training Data**
* We constructed a comprehensive **23T-token** training corpus from web text, books, code, and papers, meticulously filtered through a hybrid strategy of tagging-based scoring and retrieval-based recalling (a toy sketch of this filtering idea appears below). This foundation was then augmented with **knowledge-dense and reasoning-intensive synthetic data**, including Q&A pairs, textbooks, and Long-CoTs, which significantly benefited downstream task performance.
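
The following is a minimal, purely illustrative sketch of the hybrid filtering idea (tag-based quality scoring combined with retrieval-based recalling). The helpers `quality_score` and `embed`, and the thresholds, are hypothetical placeholders standing in for the actual quality taggers, embedding models, and values used to build the Nanbeige4 corpus.

```python
import numpy as np


def quality_score(doc: str) -> float:
    """Placeholder for a learned quality tagger returning a score in [0, 1].
    Here: a toy lexical-diversity proxy, not the real classifier."""
    return min(len(set(doc.split())) / 256, 1.0)


def embed(doc: str) -> np.ndarray:
    """Placeholder for a text-embedding model (random but deterministic per doc)."""
    rng = np.random.default_rng(abs(hash(doc)) % (2**32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)


def keep_document(doc: str, seed_embeddings: np.ndarray,
                  score_thr: float = 0.7, sim_thr: float = 0.8) -> bool:
    """Keep a document if the tagger scores it highly (tag-based scoring) OR it is
    close to a curated set of high-quality seed documents (retrieval-based recall)."""
    if quality_score(doc) >= score_thr:
        return True
    sims = seed_embeddings @ embed(doc)
    return float(sims.max()) >= sim_thr


# Usage: build seed embeddings from a small curated set, then filter candidates.
seeds = np.stack([embed(d) for d in ["curated seed document A", "curated seed document B"]])
print(keep_document("some candidate web document ...", seeds))
```
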
**Training Recipe**

* We designed an **FG-WSD (Fine-Grained Warmup-Stable-Decay)** training scheduler that refines the conventional WSD approach. The scheduler is paired with a **fine-grained, quality-progressive data curriculum** that divides the Stable stage into multiple phases with progressively improved data mixtures; compared with vanilla WSD, this yields notable performance gains. During the Decay stage, we increased the proportion of math, code, synthetic QA, and synthetic Long-CoT data to further enhance reasoning capabilities. A schematic sketch of the schedule and curriculum follows the table below.

| Stage                           | Training Tokens | Learning Rate   |
|---------------------------------|-----------------|-----------------|
| Warmup Stage                    | 0.1T            | 0 → 4.5e-4      |
| Diversity-Enriched Stable Stage | 12.4T           | Constant 4.5e-4 |
| High-Quality Stable Stage       | 6.5T            | Constant 4.5e-4 |
| Decay and Long-Context Stage    | 4T              | 4.5e-4 → 1.5e-6 |
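
As a reading aid, here is a minimal sketch of what an FG-WSD-style schedule with a quality-progressive curriculum could look like, using the token budgets and learning rates from the table above. The phase boundaries follow the table, but the decay shape (cosine here) and the mixture weights are illustrative assumptions, not the authors' exact implementation.

```python
import math

PEAK_LR, FINAL_LR = 4.5e-4, 1.5e-6
WARMUP, STABLE_1, STABLE_2, DECAY = 0.1e12, 12.4e12, 6.5e12, 4.0e12  # tokens per stage


def fg_wsd_lr(tokens_seen: float) -> float:
    """Learning rate as a function of tokens consumed (FG-WSD-style sketch)."""
    if tokens_seen < WARMUP:                      # linear warmup: 0 -> peak
        return PEAK_LR * tokens_seen / WARMUP
    stable_end = WARMUP + STABLE_1 + STABLE_2
    if tokens_seen < stable_end:                  # constant LR across both Stable phases
        return PEAK_LR
    frac = min((tokens_seen - stable_end) / DECAY, 1.0)
    # Decay stage: peak -> final (cosine shape is an assumption)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * frac))


def data_mixture(tokens_seen: float) -> dict:
    """Quality-progressive curriculum: later phases up-weight higher-quality and
    reasoning-heavy sources. The weights below are purely illustrative."""
    if tokens_seen < WARMUP + STABLE_1:           # Diversity-Enriched Stable Stage
        return {"web": 0.6, "books_papers": 0.2, "code": 0.1, "synthetic": 0.1}
    if tokens_seen < WARMUP + STABLE_1 + STABLE_2:  # High-Quality Stable Stage
        return {"web": 0.4, "books_papers": 0.25, "code": 0.15, "synthetic": 0.2}
    # Decay and Long-Context Stage: more math/code/QA/Long-CoT data
    return {"web": 0.2, "books_papers": 0.2, "code": 0.2, "math_code_qa_longcot": 0.4}


# Usage: inspect the schedule at a few points across the 23T-token run.
for t in (0.05e12, 5e12, 15e12, 22e12):
    print(f"{t / 1e12:5.2f}T tokens -> lr={fg_wsd_lr(t):.2e}, mix={data_mixture(t)}")
```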

# 3. Model Performance

For model performance comparison, we fine-tuned both our base model and the Qwen-series base models with the same fine-tuning data and evaluated their downstream task metrics. We believe that, when evaluating base models, this end-to-end validation better reflects a model's ultimate performance in downstream tasks than few-shot testing. To ensure a fair comparison, we ran experiments with three distinct datasets: [Nemotron-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1), [Ring-lite-sft-data](https://huggingface.co/datasets/inclusionAI/Ring-lite-sft-data), and [OpenThoughts3](https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M). For each dataset, we randomly selected 500k training samples for the SFT experiments.

* Finetuned with Nemotron-Dataset-v1

| Model | AIME2024 | AIME2025 | Math-500 | GPQA |
|--------------------|----------|----------|----------|------|
| Qwen3-4B-Base | 24.6 | 25.0 | 90.4 | 44.6 |
| Qwen3-8B-Base | 37.9 | 29.6 | 91.1 | 48.9 |
| **Nanbeige4-3B-Base** | **52.9** | **40.8** | **93.4** | **53.4** |

* Finetuned with Ring-lite-sft-data

| Model | AIME2024 | AIME2025 | Math-500 | GPQA |
|--------------------|----------|----------|----------|------|
| Qwen3-4B-Base | 40.4 | 31.3 | 93.6 | 51.4 |
| Qwen3-8B-Base | 50.0 | 35.8 | 94.4 | 55.1 |
| **Nanbeige4-3B-Base** | **56.8** | **45.3** | **95.5** | **57.7** |

* Finetuned with OpenThoughts3

| Model | AIME2024 | AIME2025 | Math-500 | GPQA |
|--------------------|----------|----------|----------|------|
| Qwen3-4B-Base | 52.9 | 42.1 | 93.2 | 49.6 |
| Qwen3-8B-Base | 60.4 | 47.1 | **95.0** | 55.3 |
| **Nanbeige4-3B-Base** | **62.4** | **49.2** | 94.6 | **56.9** |

The results show that **Nanbeige4-3B-Base** significantly outperforms Qwen3-4B-Base and even surpasses the larger Qwen3-8B-Base, highlighting the greater potential of our base model after fine-tuning. This advantage stems from the optimized training recipe of our Stable stage and the extensive high-quality synthetic data incorporated during the Decay stage.

# 4. Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Base',
    use_fast=False,
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    'Nanbeige/Nanbeige4-3B-Base',
    torch_dtype='auto',
    device_map='auto',
    trust_remote_code=True
)

# Base-model completion: the model continues the prompt rather than chatting.
prompt = "中国的首都是"  # "The capital of China is"
input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids.to(model.device))
resp = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(resp)
```

# 5. Limitations

Although we place great emphasis on model safety during training and strive to ensure that its outputs align with ethical and legal requirements, the model's size and probabilistic nature mean it cannot completely avoid generating unexpected outputs. These outputs may include harmful content such as bias or discrimination. Please do not propagate such content. We do not assume any responsibility for the consequences resulting from the dissemination of inappropriate information.
# 6. Citation

If you find our model useful or want to use it in your projects, please cite it as follows:

```bibtex
@misc{yang2025nanbeige43btechnicalreportexploring,
      title={Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models},
      author={Chen Yang and Guangyue Peng and Jiaying Zhu and Ran Le and Ruixiang Feng and Tao Zhang and Wei Ruan and Xiaoqi Liu and Xiaoxue Cheng and Xiyun Xu and Yang Song and Yanzipeng Gao and Yiming Jia and Yun Xing and Yuntao Wen and Zekai Wang and Zhenwei An and Zhicong Sun and Zongchao Chen},
      year={2025},
      eprint={2512.06266},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.06266},
}
```
# 7. Contact

If you have any questions, please raise an issue or contact us at nanbeige@126.com.