---
language:
- en
license: apache-2.0
tags:
- math
- reasoning
- agent
- qwen
- grpo
- reinforcement-learning
base_model: Qwen/Qwen3-4B-Thinking-2507
datasets:
- nvidia/OpenMathReasoning
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-generation
---

# DeepMath: A Lightweight Math Reasoning Agent

<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/ndb_WmPavW1MONAjsGpYT.jpeg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

## Model Description

**DeepMath** is a 4B-parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with **GRPO (Group Relative Policy Optimization)**, DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, which reduces arithmetic errors and shortens outputs by up to 66% compared to the baseline.

- **Developed by:** Intel AI Labs
- **Model type:** Causal language model with agent capabilities
- **Language:** English
- **Base model:** [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **License:** Apache 2.0
- **Blog:** 🔗 <https://huggingface.co/blog/intel-deepmath>
- **Repository:** 💻 [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)

## Key Features

- ✅ **Code-driven reasoning:** Generates short Python snippets for intermediate computational steps
- ✅ **Sandboxed execution:** No file I/O, no network calls, strict timeouts
- ✅ **Improved accuracy:** Offloading computation reduces arithmetic errors
- ✅ **Reduced verbosity:** Up to 66% shorter outputs compared to baseline
- ✅ **Safe and auditable:** Deterministic execution with readable code snippets

## Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

- **Agent Interface:** Outputs special tokens for Python code execution during reasoning
- **Executor:** Sandboxed Python environment with allow-listed modules
- **Safety Constraints:** Per-snippet timeouts, no file/network access
- **Training Method:** GRPO with accuracy and code generation rewards
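
The executor and safety constraints listed above can be illustrated with a minimal sketch. This is not the DeepMath executor: the allow-list contents, the function names `check_imports` and `run_snippet`, and the 5-second default timeout are assumptions chosen for illustration. The sketch checks a snippet's imports against an allow-list, then runs it in a child interpreter with a per-snippet timeout.

```python
import ast
import subprocess
import sys

# Hypothetical allow-list; the modules DeepMath actually permits may differ.
ALLOWED_MODULES = {"math", "fractions", "itertools"}


def check_imports(code: str, allowed: set[str] = ALLOWED_MODULES) -> None:
    """Raise ValueError if the snippet imports a module outside the allow-list."""
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        for name in names:
            if name.split(".")[0] not in allowed:
                raise ValueError(f"import of '{name}' is not allowed")


def run_snippet(code: str, timeout_s: float = 5.0) -> str:
    """Check imports, then run the snippet in a child interpreter with a timeout."""
    check_imports(code)
    try:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "Error: snippet timed out"
    if result.returncode != 0:
        return f"Error: {result.stderr.strip()}"
    return result.stdout


print(run_snippet("import math\nprint(math.factorial(5))"))
```

An import allow-list plus a timeout does not by itself block file or network access; a real deployment would still need process-level isolation such as containers or VMs.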

<figure>
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/zOcvJ2DY61QZyozarsKbT.png" style="width:400px" alt="Changes to vLLM client and server in TRL library." />
<figcaption><p><em>Figure 1: The TRL vLLM client and server were modified so that candidates are generated by the DeepMath agent while still running on the vLLM backend.</em></p></figcaption>
</figure>

## Training Details

### Training Data

- **Dataset:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) (tool-usage subset)
- **Note:** GRPO training uses only the problems, not the reference solutions
- **In-context Learning:** 4 solved examples demonstrating agent call syntax and patterns
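
Because GRPO training uses only problems, the learning signal comes from comparing several sampled completions for the same problem rather than from reference solutions. The sketch below shows the group-relative advantage under a common formulation, where each reward is normalized by the mean and standard deviation of its own group. The function name, the `eps` term, and the use of the population standard deviation are illustrative assumptions, not details taken from the DeepMath training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one problem: two correct (1.0), two wrong (0.0)
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Completions that beat their group's average receive a positive advantage and the rest a negative one; a group where all rewards are equal contributes no signal.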

### Training Procedure

**GRPO (Group Relative Policy Optimization)** fine-tuning with:

- **Accuracy Reward:** +1 for correct answers
- **Code Generation Reward:** +1 for using code snippets (weighted 10:1 vs. accuracy)
- **Length Constraint:** GRPO completions limited to 5k tokens
- **Temperature Scheduling:** Linear schedule from T=1.2 → T=0.7 during training
- **Infrastructure:** Modified TRL library's vLLM client and server
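
The reward and temperature settings above can be sketched as two small functions. This is an illustration only, not the DeepMath training code: the names `linear_temperature` and `combined_reward` and the `total_steps` parameter are hypothetical, and the reward weights are left as explicit arguments so the sketch makes no assumption about which side of the 10:1 ratio is larger.

```python
def linear_temperature(step: int, total_steps: int,
                       t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linearly interpolate the sampling temperature from t_start to t_end."""
    if total_steps <= 1:
        return t_end
    # Clamp progress to [0, 1] so out-of-range steps stay within the schedule
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def combined_reward(is_correct: bool, used_code: bool,
                    w_accuracy: float, w_code: float) -> float:
    """Weighted sum of the two +1 rewards; the caller supplies the weights."""
    return w_accuracy * float(is_correct) + w_code * float(used_code)


for step in (0, 50, 100):
    print(step, round(linear_temperature(step, total_steps=101), 3))
```

Starting at a higher temperature encourages more varied candidates early in training, and lowering it later makes sampling more conservative.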

### Training Infrastructure

- Base inference engine: [vLLM](https://github.com/vllm-project/vllm)
- Agent framework: Based on [SmolAgents](https://github.com/huggingface/smolagents/)
- Training framework: Modified [TRL](https://github.com/huggingface/trl) GRPO trainer

## Performance

### Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets, reporting **majority@16** accuracy and mean output length:

<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/mBuINzNvjDKdZEuIqzJeO.png" style="width:800px" alt="Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets."/>
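
For readers unfamiliar with the metric, majority@16 samples 16 completions per problem and scores the most frequent final answer. The sketch below assumes answers have already been extracted and normalized to strings; the function names are hypothetical, and ties are broken by first occurrence, which may differ from the evaluation script used here.

```python
from collections import Counter


def majority_answer(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def majority_at_k(samples_per_problem: list[list[str]], references: list[str]) -> float:
    """Fraction of problems whose majority-voted answer equals the reference."""
    hits = sum(
        majority_answer(samples) == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return hits / len(references)


# Two problems with 16 samples each; the vote is right on the first, wrong on the second
samples = [["5050"] * 9 + ["5000"] * 7, ["1"] * 4 + ["2"] * 12]
print(majority_at_k(samples, references=["5050", "1"]))
```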

**Key Findings:**

- **Accuracy:** Improved performance on challenging datasets (AIME, HMMT, HLE)
- **Efficiency:** Up to **66% reduction** in output length
- **Robustness:** Consistent improvements when combining agent + GRPO training

### Evaluation Datasets

- **MATH500:** Subset of the MATH dataset
- **AIME:** American Invitational Mathematics Examination problems
- **HMMT:** Harvard-MIT Mathematics Tournament problems
- **HLE:** Humanity's Last Exam problems

<figure>
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/a-kn3oHdlxTP_L-63N9LX.png" style="width:700px" alt="Output example showing Python code generation and execution." />
<figcaption><p><em>Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.</em></p></figcaption>
</figure>
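
The generate, evaluate, and insert cycle shown in Figure 2 can be sketched in a few lines. The `<code>` and `<output>` delimiters below are hypothetical stand-ins, since the model emits its own special tokens, and plain `exec` is used only to keep the example short; it provides none of the sandboxing the real executor relies on.

```python
import contextlib
import io
import re

# Hypothetical delimiters; the real model uses its own special tokens.
CODE_PATTERN = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def execute(snippet: str) -> str:
    """Run a snippet and capture its stdout. Plain exec is NOT a sandbox."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(snippet, {})
    return buffer.getvalue().strip()


def splice_results(trace: str) -> str:
    """Follow each delimited snippet in the trace with its captured output."""
    def replace(match: re.Match) -> str:
        snippet = match.group(1).strip()
        return f"<code>{snippet}</code>\n<output>{execute(snippet)}</output>"

    return CODE_PATTERN.sub(replace, trace)


trace = "The sum is computed directly. <code>print(sum(range(1, 101)))</code>"
print(splice_results(trace))
```

In an actual agent loop, generation would pause at the end of each snippet, and the spliced output would be fed back to the model before it continues reasoning.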

## Usage

### Installation

```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
```

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem
problem = "What is the sum of the first 100 positive integers?"

# Tokenize the problem, generate up to 3000 new tokens, and print the decoded text
inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=3000)
print(tokenizer.decode(outputs[0]))
```

### Inference with Agent

For full agent capabilities with sandboxed Python execution:

```bash
python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl
```

See the [repository](https://github.com/IntelLabs/DeepMath) for complete usage examples.

## Limitations and Biases

### Limitations

- **Scope:** Optimized for mathematical reasoning tasks; may not generalize to other domains
- **Problem Types:** Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
- **Model Size:** 4B parameters may limit reasoning depth on extremely complex problems
- **Code Execution:** Requires sandboxed environment for full agent capabilities

### Safety Considerations

⚠️ **Code Execution Risk:** This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

- Carefully manage attack surfaces
- Enforce rate limits
- Use proper isolation (containers, VMs)
- Monitor resource usage
- Validate generated code before execution in production
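
As one example of validating generated code before execution, a static check can reject snippets that call obviously risky builtins. The deny-list and the function name `find_violations` are assumptions for illustration and are not part of DeepMath. A check like this complements process isolation; it does not replace it.

```python
import ast

# Hypothetical deny-list of builtins a math snippet should never need.
FORBIDDEN_CALLS = {"open", "eval", "exec", "compile", "input", "__import__"}


def find_violations(code: str) -> list[str]:
    """Return human-readable reasons to reject a snippet (empty list if none)."""
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return [f"syntax error: {err.msg}"]
    problems = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            problems.append(f"line {node.lineno}: call to {node.func.id}()")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder attribute access is a common route around name-based checks
            problems.append(f"line {node.lineno}: dunder attribute access .{node.attr}")
    return problems


print(find_violations("print(2 + 2)"))
print(find_violations("data = open('/etc/passwd').read()"))
```

Static checks of this kind can be bypassed by sufficiently creative code, which is why the isolation and resource-limit items in the list above remain necessary.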

### Ethical Considerations

- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
- Generated code should be reviewed before execution in production environments
- The model may reflect biases present in the training data

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/IntelLabs/DeepMath).