
Qwen3-0.6B-DPO

Model Card for AIPlans/Qwen3-0.6B-DPO

This model is a fine-tuned variant of Qwen/Qwen3-0.6B, trained with Direct Preference Optimization (DPO) on a preference-formatted version of the nvidia/HelpSteer2 dataset as part of the AIPlans Model Diffing Project.

Model Details

Model Description

This model is a 0.6B parameter language model based on Qwen3-0.6B and fine-tuned using DPO for preference optimization.
The goal of the fine-tuning was to improve helpfulness and harmlessness as measured by the HelpSteer2 preference dataset, while enabling controlled model diffing experiments within the AIPlans research workflow.
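For reference, the DPO objective that the fine-tuning optimizes can be sketched in a few lines. This is a generic illustration of the loss, not the project's training code, and the `beta` value is an assumed common default rather than the documented setting for this model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Inputs are per-example summed log-probabilities of the chosen and
    rejected responses under the policy and the frozen reference model.
    beta=0.1 is a common default (assumed, not the documented value).
    """
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Implied reward margin between chosen and rejected; training pushes it positive.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```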

Special attention was paid to training efficiency, including gradient checkpointing and other memory-saving strategies.
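As a rough illustration of such a setup, below is a minimal TRL-style training sketch with gradient checkpointing enabled. The hyperparameters and some argument names (e.g. `processing_class` vs. `tokenizer`, depending on the TRL version) are assumptions, not the actual configuration used for this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen3-0.6B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs (prompt / chosen / rejected); column names assumed.
dataset = load_dataset("Jennny/helpsteer2-helpfulness-preference", split="train")

config = DPOConfig(
    output_dir="qwen3-0.6b-dpo",
    beta=0.1,                        # assumed; DPO temperature
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,   # assumed
    gradient_checkpointing=True,     # memory-saving strategy mentioned above
    bf16=True,                       # assumed
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,  # called `tokenizer=` in older TRL versions
)
trainer.train()
```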

Developed by: AIPlans
Funded by: AIPlans
Shared by: AIPlans

Model type: Causal decoder-only Transformer (LLM)
Languages: English
License: MIT
Fine-tuned from: Qwen/Qwen3-0.6B
Training Method: Direct Preference Optimization (DPO)
Intended Use: Research on model diffing, preference fine-tuning, evaluation of lightweight LLM behavior changes.
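For the research uses listed above, the model loads like any other Hugging Face causal LM. The snippet below is a minimal inference sketch; the generation settings are illustrative, not tuned.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIPlans/Qwen3-0.6B-DPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Explain DPO in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```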


Training Details

Training Data

The training data comes from Jennny/helpsteer2-helpfulness-preference, a preference-formatted version of nvidia/HelpSteer2. Thanks to Jennny for preparing it.
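A quick way to inspect the preference data before training (the split name and the exact column layout, e.g. `prompt`/`chosen`/`rejected`, are assumptions; print the features to confirm):

```python
from datasets import load_dataset

ds = load_dataset("Jennny/helpsteer2-helpfulness-preference", split="train")  # split name assumed
print(ds)      # number of rows and feature names
print(ds[0])   # one preference example (prompt with chosen/rejected responses)
```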

Evaluation

Below is a comparison between the base Qwen3-0.6B model and our DPO-trained version (trained using HelpSteer2 preference data).

Evaluation Results

The model was evaluated using lm-eval-harness on multiple reasoning and truthfulness benchmarks; the table below compares it against the base Qwen3-0.6B.
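A sketch of how this comparison can be reproduced with lm-eval-harness; the Python entry point and task names follow lm-eval 0.4.x conventions, and the exact settings behind the reported numbers (few-shot counts, batch size, etc.) are not recorded here.

```python
import lm_eval  # EleutherAI lm-evaluation-harness (v0.4.x API assumed)

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=AIPlans/Qwen3-0.6B-DPO",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "truthfulqa_mc2", "winogrande"],
    batch_size=8,  # assumed
)
print(results["results"])
```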

📊 Benchmark Comparison

| Task             | Metric   | Base Model | DPO Model | Change  |
|------------------|----------|------------|-----------|---------|
| ARC-Challenge    | acc      | 0.3148     | 0.3208    | +0.0060 |
| ARC-Challenge    | acc_norm | 0.3447     | 0.3430    | -0.0017 |
| ARC-Easy         | acc      | 0.6044     | 0.6103    | +0.0059 |
| ARC-Easy         | acc_norm | 0.5589     | 0.5589    | 0.0000  |
| HellaSwag        | acc      | 0.3751     | 0.3752    | +0.0001 |
| HellaSwag        | acc_norm | 0.4738     | 0.4740    | +0.0002 |
| TruthfulQA (MC2) | acc      | 0.4275     | 0.4305    | +0.0030 |
| Winogrande       | acc      | 0.5604     | 0.5620    | +0.0016 |

πŸ“ Summary

  • The DPO model shows small but consistent improvements across reasoning benchmarks.
  • TruthfulQA improves, indicating better factuality and reduced hallucination.
  • No regressions observed β€” core reasoning abilities remain stable.
  • These results match expectations for preference-based DPO training using HelpSteer2.

Model Card Authors

Jithesh Pavan D Souza – AIPlans Research Intern

Model Card Contact

Jithesh – [email protected]
