deepmath-v1 / README.md

Update README.md

15e66a8 verified 12 days ago

7.77 kB

	---
	language:
	- en
	license: apache-2.0
	tags:
	- math
	- reasoning
	- agent
	- qwen
	- grpo
	- reinforcement-learning
	base_model: Qwen/Qwen3-4B-Thinking-2507
	datasets:
	- nvidia/OpenMathReasoning
	metrics:
	- accuracy
	library_name: transformers
	pipeline_tag: text-generation
	---

	# DeepMath: A Lightweight Math Reasoning Agent

	<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/ndb_WmPavW1MONAjsGpYT.jpeg" style="width:600px" alt="An LLM is using a calculator to answer questions." />

	## Model Description

	DeepMath is a 4B parameter mathematical reasoning model that combines a fine-tuned LLM with a sandboxed Python executor. Built on [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) and trained with GRPO (Group Relative Policy Optimization), DeepMath generates concise Python snippets for computational steps instead of verbose text explanations, significantly reducing errors and output length.

	- Developed by: Intel AI Labs
	- Model type: Causal language model with agent capabilities
	- Language: English
	- Base model: [Qwen3-4B Thinking](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507)
	- License: Apache 2.0
	- Blog:: 🔗 <https://huggingface.co/blog/intel-deepmath>
	- Repository: 💻 [https://github.com/IntelLabs/DeepMath](https://github.com/IntelLabs/DeepMath)

	## Key Features

	✅ Code-driven reasoning: Generates short Python snippets for intermediate computational steps
	✅ Sandboxed execution: No file I/O, no network calls, strict timeouts
	✅ Improved accuracy: Offloading computation reduces arithmetic errors
	✅ Reduced verbosity: Up to 66% shorter outputs compared to baseline
	✅ Safe and auditable: Deterministic execution with readable code snippets

	## Model Architecture

	DeepMath uses a LoRA adapter fine-tuned on top of Qwen3-4B Thinking with the following components:

	- Agent Interface: Outputs special tokens for Python code execution during reasoning
	- Executor: Sandboxed Python environment with allow-listed modules
	- Safety Constraints: Per-snippet timeouts, no file/network access
	- Training Method: GRPO with accuracy and code generation rewards

	<figure>
	<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/zOcvJ2DY61QZyozarsKbT.png" style="width:400px" alt="Changes to vLLM client and server in TRL library." />
	<figcaption><p><em>Figure 1: The vLLM client and server were modified to use the DeepMath agent in generating the candidates, while using the vLLM backend.</em></p></figcaption>
	</figure>

	## Training Details

	### Training Data

	- Dataset: [OpenMathReasoning](https://huggingface.co/datasets/nvidia/OpenMathReasoning) (tool-usage subset)
	- Note: GRPO training only uses problems, not solutions
	- In-context Learning: 4 solved examples demonstrating agent call syntax and patterns

	### Training Procedure

	GRPO (Group Relative Policy Optimization) fine-tuning with:

	- Accuracy Reward: +1 for correct answers
	- Code Generation Reward: +1 for using code snippets (weighted 10:1 vs. accuracy)
	- Length Constraint: GRPO completions limited to 5k tokens
	- Temperature Scheduling: Linear schedule from T=1.2 → T=0.7 during training
	- Infrastructure: Modified TRL library's vLLM client and server

	### Training Infrastructure

	- Base inference engine: [vLLM](https://github.com/vllm-project/vllm)
	- Agent framework: Based on [SmolAgents](https://github.com/huggingface/smolagents/)
	- Training framework: Modified [TRL](https://github.com/huggingface/trl) GRPO trainer

	## Performance

	### Benchmark Results

	We evaluated DeepMath on four mathematical reasoning datasets using majority@16 and mean output length metrics:

	<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/mBuINzNvjDKdZEuIqzJeO.png" style="width:800px" alt="Main results table showing performance across MATH500, AIME, HMMT, and HLE datasets."/>

	Key Findings:

	- Accuracy: Improved performance on challenging datasets (AIME, HMMT, HLE)
	- Efficiency: Up to 66% reduction in output length
	- Robustness: Consistent improvements when combining agent + GRPO training

	### Evaluation Datasets

	- MATH500: Subset of the MATH dataset
	- AIME: American Invitational Mathematics Examination problems
	- HMMT: Harvard-MIT Mathematics Tournament problems
	- HLE: High-level exam problems

	<figure>
	<img src="https://huggingface.co/proxy/cdn-uploads.huggingface.co/production/uploads/62d93cd728f9c86a4031562e/a-kn3oHdlxTP_L-63N9LX.png" style="width:700px" alt="Output example showing Python code generation and execution." />
	<figcaption><p><em>Figure 2: Example output where Python code is generated, evaluated, and the result is inserted into the reasoning trace.</em></p></figcaption>
	</figure>

	## Usage

	### Installation

	```bash
	# Install uv package manager
	curl -LsSf https://astral.sh/uv/install.sh \| sh

	# Clone repository
	git clone https://github.com/IntelLabs/DeepMath.git
	cd DeepMath

	# Install dependencies
	uv pip install -r requirements.txt
	uv pip install -e .
	```

	### Basic Inference

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "Intel/deepmath-v1"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(model_name)

	# Example problem
	problem = "What is the sum of the first 100 positive integers?"

	inputs = tokenizer(problem, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=3000)
	print(tokenizer.decode(outputs[0]))
	```

	### Inference with Agent

	For full agent capabilities with sandboxed Python execution:

	```bash
	python inference.py \
	+model.use_vllm=true \
	+model.math_agent=true \
	+model.examples=deep_math/fewshot.txt \
	model.generation.max_new_tokens=3000 \
	+model.max_agent_output=20000 \
	+model.max_steps=50 \
	model.model_name_or_path=Intel/deepmath-v1 \
	hf_tag=HuggingFaceH4/MATH-500 \
	generated_file=output.jsonl
	```

	See the [repository](https://github.com/IntelLabs/DeepMath) for complete usage examples.

	## Limitations and Biases

	### Limitations

	- Scope: Optimized for mathematical reasoning tasks; may not generalize to other domains
	- Problem Types: Evaluated on contest-style math problems; performance on open-ended mathematical creativity or formal proofs is unknown
	- Model Size: 4B parameters may limit reasoning depth on extremely complex problems
	- Code Execution: Requires sandboxed environment for full agent capabilities

	### Safety Considerations

	⚠️ Code Execution Risk: This model generates and executes Python code. While DeepMath uses strict sandboxing and resource limits, any deployment should:

	- Carefully manage attack surfaces
	- Enforce rate limits
	- Use proper isolation (containers, VMs)
	- Monitor resource usage
	- Validate generated code before execution in production

	### Ethical Considerations

	- The model is trained on mathematical problem-solving datasets and should not be used for decision-making in critical applications without human oversight
	- Generated code should be reviewed before execution in production environments
	- The model may reflect biases present in the training data

	## Citation

	If you use DeepMath in your research, please cite:

	```bibtex
	@software{deepmath2025,
	author = {Fleischer, Daniel and Berchansky, Moshe and Wasserblat, Moshe},
	title = {DeepMath: A Lightweight Math Reasoning Agent for LLMs},
	year = {2025},
	publisher = {Intel AI Labs},
	url = {https://github.com/IntelLabs/DeepMath}
	}
	```

	## Model Card Contact

	For questions or issues, please open an issue on the [GitHub repository](https://github.com/IntelLabs/DeepMath).