---
language:
- en
license: apache-2.0
tags:
- math
- reasoning
- agent
- qwen
- grpo
- reinforcement-learning
base_model: Qwen/Qwen3-4B-Thinking-2507
datasets:
- nvidia/OpenMathReasoning
metrics:
- accuracy
library_name: transformers
pipeline_tag: text-generation
---

# DeepMath: A Lightweight Math Reasoning Agent

<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/ndb_WmPavW1MONAjsGpYT.jpeg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

## Model Description

**DeepMath** is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with **GRPO (Group Relative Policy Optimization)**, DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

- **Developed by:** Intel AI Labs
- **Model type:** Causal language model with agent capabilities
- **Language:** English
- **Base model:** [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
- **License:** Apache 2.0
- **Blog:** πŸ”— <https://huggingface.co/blog/intel-deepmath>
- **Repository:** πŸ’» [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)

## Key Features

βœ… **Code-driven reasoning:** Generates short Python snippets for intermediate computational steps  
βœ… **Sandboxed execution:** No file I/O, no network calls, strict timeouts  
βœ… **Improved accuracy:** Offloading computation reduces arithmetic errors  
βœ… **Reduced verbosity:** Up to 66% shorter outputs compared to baseline  
βœ… **Safe and auditable:** Deterministic execution with readable code snippets  
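The sandboxing described above can be approximated in plain Python. The sketch below is a hypothetical illustration, not the project's actual executor: it allow-lists imports and restricts builtins (the real executor additionally enforces per-snippet timeouts and blocks file/network access).

```python
import builtins

# Hypothetical allow-list; the real executor defines its own set of safe modules
ALLOWED_MODULES = {"math", "fractions", "itertools"}

def _safe_import(name, *args, **kwargs):
    """Import hook that rejects any module not on the allow-list."""
    if name.split(".")[0] not in ALLOWED_MODULES:
        raise ImportError(f"module {name!r} is not allow-listed")
    return __import__(name, *args, **kwargs)

# Expose only a small set of harmless builtins to the snippet
SAFE_BUILTINS = {
    k: getattr(builtins, k)
    for k in ("abs", "min", "max", "sum", "range", "len", "print", "enumerate", "zip")
}
SAFE_BUILTINS["__import__"] = _safe_import

def run_snippet(code: str) -> dict:
    """Execute a snippet with restricted builtins and return its namespace."""
    env = {"__builtins__": SAFE_BUILTINS}
    exec(code, env)
    env.pop("__builtins__", None)
    return env
```

For example, `run_snippet("import math\nx = math.factorial(5)")` returns a namespace with `x == 120`, while `run_snippet("import os")` raises `ImportError`.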

## Model Architecture

DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

- **Agent Interface:** Outputs special tokens for Python code execution during reasoning
- **Executor:** Sandboxed Python environment with allow-listed modules
- **Safety Constraints:** Per-snippet timeouts, no file/network access
- **Training Method:** GRPO with accuracy and code generation rewards
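The interface between generation and execution can be illustrated with a toy loop. The marker names below are hypothetical stand-ins; the real agent uses its own special tokens and the sandboxed executor described above.

```python
import re

# Hypothetical markers standing in for the agent's special tokens
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def interleave_execution(trace: str, execute) -> str:
    """Replace each code span in a reasoning trace with its executed result."""
    def _run(match):
        return f"<result>{execute(match.group(1))}</result>"
    return CODE_RE.sub(_run, trace)
```

For instance, `interleave_execution("The sum is <code>2 + 3</code>.", lambda c: eval(c))` yields `"The sum is <result>5</result>."`, mimicking how an executed result is inserted back into the reasoning trace.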

<figure>
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/zOcvJ2DY61QZyozarsKbT.png" style="width:400px" alt="Changes to vLLM client and server in TRL library." />
<figcaption><p><em>Figure 1: The TRL vLLM client and server were modified so that candidate completions are generated through the DeepMath agent while still running on the vLLM backend.</em></p></figcaption>
</figure>

## Training Details

### Training Data

- **Dataset:** [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) (tool-usage subset)
- **Note:** GRPO training only uses problems, not solutions
- **In-context Learning:** 4 solved examples demonstrating agent call syntax and patterns

### Training Procedure

**GRPO (Group Relative Policy Optimization)** fine-tuning with:

- **Accuracy Reward:** +1 for correct answers
- **Code Generation Reward:** +1 for using code snippets (weighted 10:1 vs. accuracy)
- **Length Constraint:** GRPO completions limited to 5k tokens
- **Temperature Scheduling:** Linear schedule from T=1.2 β†’ T=0.7 during training
- **Infrastructure:** Modified TRL library's vLLM client and server
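Under one reading of the reward weighting (accuracy dominating the code-usage bonus 10:1 — an assumption, not confirmed by this card), the combined reward and the linear temperature schedule could be sketched as:

```python
def grpo_reward(is_correct: bool, used_code: bool,
                acc_weight: float = 1.0, code_weight: float = 0.1) -> float:
    """Combined reward; the 10:1 accuracy/code weighting is an assumed reading."""
    return acc_weight * float(is_correct) + code_weight * float(used_code)

def temperature(step: int, total_steps: int,
                t_start: float = 1.2, t_end: float = 0.7) -> float:
    """Linear temperature schedule from t_start to t_end over training."""
    frac = step / max(total_steps - 1, 1)
    return t_start + (t_end - t_start) * frac
```

With these defaults, a correct answer that used code scores 1.1, and the sampling temperature decays from 1.2 at the first step to 0.7 at the last.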

### Training Infrastructure

- Base inference engine: [vLLM](https://github.com/vllm-project/vllm)
- Agent framework: Based on [SmolAgents](https://github.com/huggingface/smolagents/)
- Training framework: Modified [TRL](https://github.com/huggingface/trl) GRPO trainer

## Performance

### Benchmark Results

We evaluated DeepMath on four mathematical reasoning datasets using **majority@16** and mean output length metrics:

<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/mBuINzNvjDKdZEuIqzJeO.png" style="width:800px" alt="Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets."/>

**Key Findings:**

- **Accuracy:** Improved performance on challenging datasets (AIME, HMMT, HLE)
- **Efficiency:** Up to **66% reduction** in output length
- **Robustness:** Consistent improvements when combining agent + GRPO training

### Evaluation Datasets

- **MATH500:** Subset of the MATH dataset
- **AIME:** American Invitational Mathematics Examination problems
- **HMMT:** Harvard-MIT Mathematics Tournament problems
- **HLE:** Humanity's Last Exam problems

<figure>
<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/a-kn3oHdlxTP_L-63N9LX.png" style="width:700px" alt="Output example showing Python code generation and execution." />
<figcaption><p><em>Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.</em></p></figcaption>
</figure>

## Usage

### Installation

```bash
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/IntelLabs/DeepMath.git
cd DeepMath

# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .
```

### Basic Inference

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/deepmath-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example problem
problem = "What is the sum of the first 100 positive integers?"

# Apply the chat template so the thinking-mode prompt format is used
messages = [{"role": "user", "content": problem}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=3000)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

### Inference with Agent

For full agent capabilities with sandboxed Python execution:

```bash
python inference.py \
    +model.use_vllm=true \
    +model.math_agent=true \
    +model.examples=deep_math/fewshot.txt \
    model.generation.max_new_tokens=3000 \
    +model.max_agent_output=20000 \
    +model.max_steps=50 \
    model.model_name_or_path=Intel/deepmath-v1 \
    hf_tag=HuggingFaceH4/MATH-500 \
    generated_file=output.jsonl
```

See the [repository](https://github.com/IntelLabs/DeepMath) for complete usage examples.

## Limitations and Biases

### Limitations

- **Scope:** Optimized for mathematical reasoning tasks; may not generalize to other domains
- **Problem Types:** Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
- **Model Size:** 4B parameters may limit reasoning depth on extremely complex problems
- **Code Execution:** Requires sandboxed environment for full agent capabilities

### Safety Considerations

⚠️ **Code Execution Risk:** This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

- Carefully manage attack surfaces
- Enforce rate limits
- Use proper isolation (containers, VMs)
- Monitor resource usage
- Validate generated code before execution in production

### Ethical Considerations

- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
- Generated code should be reviewed before execution in production environments
- The model may reflect biases present in the training data

## Citation

If you use DeepMath in your research, please cite:

```bibtex
@software{deepmath2025,
  author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
  title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
  year = {2025},
  publisher = {Intel AI Labs},
  url = {https://github.com/IntelLabs/DeepMath}
}
```

## Model Card Contact

For questions or issues, please open an issue on the [GitHub repository](https://github.com/IntelLabs/DeepMath).