Explore-Execute Chain (E²C) — Qwen3-8B
This repository contains the E²C model weights trained on top of Qwen3-8B.
Paper: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm
Code: GitHub
What is E²C?
Standard chain-of-thought mixes high-level planning and low-level derivation in a single undifferentiated sequence. E²C splits reasoning into two explicit phases inside one model:
- Exploration (<EXPLORATION>...</EXPLORATION>): a short, stochastic plan that outlines the solution strategy (~500 tokens).
- Execution (<EXECUTION>...</EXECUTION>): a deterministic, step-by-step derivation that follows the plan exactly.
The two phases are trained jointly. A causal SFT stage teaches the model the E²C format; two-stage GRPO training then amplifies the gradient weight on exploration tokens (λ > 1) to sharpen planning while keeping execution deterministic.
Model variants
| Name | Base | Training |
|---|---|---|
| 8B-Final | Qwen3-8B | E²C-SFT → E²C-RL (Stage 1 + Stage 2) |
| 4B-Final | Qwen3-4B | E²C-SFT → E²C-RL (Stage 1 + Stage 2) |
Performance
Mathematical reasoning (Pass@1, 8 samples, Qwen3-8B base):
| Benchmark | Qwen3-8B + GRPO | E²C (SFT+RL) |
|---|---|---|
| AIME 2024 | 36.9% | 40.6% |
| AIME 2025 | 34.4% | 33.8% |
| MATH500 | 88.2% | 87.7% |
| AMC 2023 | 79.3% | 80.3% |
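
The table reports Pass@1 with 8 samples per problem. As a minimal sketch, assuming the metric is the per-problem fraction of correct samples averaged over the benchmark (the source does not spell out the estimator), it could be computed as follows. The function name `pass_at_1` and the toy data are hypothetical and are not results from the paper.

```python
def pass_at_1(correct_flags_per_problem):
    """Average, over problems, of the fraction of sampled answers that are correct."""
    per_problem = [sum(flags) / len(flags) for flags in correct_flags_per_problem]
    return sum(per_problem) / len(per_problem)


# Toy data for illustration only: 2 problems, 8 sampled answers each.
toy = [
    [True, True, False, True, False, True, True, True],        # 6 of 8 correct
    [False, False, False, True, False, False, False, False],   # 1 of 8 correct
]
score = pass_at_1(toy)
print(f"Pass@1 = {score:.4f}")
```

Averaging over several samples reduces the variance of the estimate on small benchmarks such as AIME, where a single problem moves the score by several points.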
Test-time scaling on AIME 2024 (K=32):
| Method | Accuracy | Tokens (k) |
|---|---|---|
| Self-Consistency | 50.0% | 86.2 |
| Tree-of-Thoughts | 50.0% | 71.3 |
| E²C-ReAct Loop | 53.3% | 12.4 |
E²C-ReAct Loop reaches higher accuracy than standard test-time scaling (TTS) methods while using roughly 6 to 7× fewer tokens (12.4k versus 71.3k to 86.2k), by running the search over short exploration plans rather than full chains.
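The token savings can be checked directly from the table. The short sketch below only restates the table's numbers and divides them; the variable names are illustrative.

```python
# Accuracy (%) and token usage from the "Tokens (k)" column, AIME 2024, K=32.
results = {
    "Self-Consistency": (50.0, 86.2),
    "Tree-of-Thoughts": (50.0, 71.3),
    "E2C-ReAct Loop": (53.3, 12.4),
}

e2c_acc, e2c_tokens = results["E2C-ReAct Loop"]
ratios = {}
for name, (acc, tokens) in results.items():
    if name == "E2C-ReAct Loop":
        continue
    # How many times more tokens the baseline spends than E2C-ReAct Loop.
    ratios[name] = tokens / e2c_tokens
    print(f"{name}: {ratios[name]:.2f}x the tokens, {e2c_acc - acc:+.1f} points for E2C")
```

The ratio is close to 7× against Self-Consistency and a little under 6× against Tree-of-Thoughts.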
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8B checkpoint and its tokenizer from the "8B-Final" subfolder.
model = AutoModelForCausalLM.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
)

problem = "Find all positive integers n such that n² + 1 divides n³ + 1."
messages = [{"role": "user", "content": problem}]

# Render the chat template to text, then tokenize and move to the model's device.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens (special tokens are kept).
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)

# Parse phases
if "<EXPLORATION>" in response and "<EXECUTION>" in response:
    exploration = response.split("<EXPLORATION>")[1].split("</EXPLORATION>")[0].strip()
    execution = response.split("<EXECUTION>")[1].split("</EXECUTION>")[0].strip()
    print("Plan:\n", exploration)
    print("\nSolution:\n", execution)
else:
    print(response)
See the GitHub repository for full evaluation scripts and test-time scaling experiments.
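The phase-parsing step in the usage snippet can also be factored into a small helper that runs without loading the model. This is a sketch, not code from the repository: the name `split_phases` is hypothetical, and it assumes the tags appear literally in the decoded text. It also tolerates a missing closing tag, which can happen when generation stops at `max_new_tokens`.

```python
import re


def split_phases(response: str):
    """Return (exploration, execution); each is None if its opening tag is absent."""

    def grab(tag):
        # Capture up to the closing tag, or to the end of the text if the
        # output was cut off before the closing tag was generated.
        match = re.search(rf"<{tag}>(.*?)(?:</{tag}>|\Z)", response, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return grab("EXPLORATION"), grab("EXECUTION")


# Hand-written sample text (not model output); the execution phase is truncated.
sample = (
    "<EXPLORATION>Try polynomial division.</EXPLORATION>\n"
    "<EXECUTION>n^3 + 1 = n(n^2 + 1) - (n - 1), so"
)
plan, solution = split_phases(sample)
print("Plan:", plan)
print("Solution:", solution)
```

Returning `None` for a missing phase lets the caller distinguish an unstructured response from an empty plan.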
Training details
| Stage | Description |
|---|---|
| Causal SFT data | Full solutions distilled into (exploration, execution) pairs; execution conditioned on exploration |
| E²C-SFT | Standard cross-entropy on structured output (prompt tokens masked) |
| E²C-RL Stage 1 | GRPO, rollout=32, temp=1.3, 1 epoch — diversifies exploration |
| E²C-RL Stage 2 | GRPO, rollout=8, temp=1.0, adv_coeff=2.0, 2 epochs — sharpens execution determinism |
Exploration tokens receive λ-amplified gradient weight throughout RL training to focus the policy improvement signal on the planning phase.
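One way to picture the λ-amplified weighting is as a per-token weight vector applied to the policy-gradient term. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes each phase tag is a single token, uses an illustrative `lam=2.0`, and omits the ratio clipping and group-normalized advantages that GRPO uses. The names `token_weights` and `weighted_pg_loss` are hypothetical.

```python
def token_weights(tokens, lam=2.0):
    """Weight lam for tokens strictly inside <EXPLORATION>...</EXPLORATION>, 1.0 elsewhere."""
    weights, inside = [], False
    for tok in tokens:
        if tok == "</EXPLORATION>":
            inside = False  # the closing tag itself is not amplified
        weights.append(lam if inside else 1.0)
        if tok == "<EXPLORATION>":
            inside = True   # amplification starts after the opening tag
    return weights


def weighted_pg_loss(logprobs, advantage, weights):
    """Simplified surrogate for one sequence: -(sum_t w_t * A * logp_t) / T."""
    total = sum(w * advantage * lp for w, lp in zip(weights, logprobs))
    return -total / len(logprobs)


# Toy sequence for illustration only.
tokens = ["<EXPLORATION>", "plan", "step", "</EXPLORATION>",
          "<EXECUTION>", "calc", "</EXECUTION>"]
weights = token_weights(tokens, lam=2.0)
loss = weighted_pg_loss([-0.5] * len(tokens), advantage=1.0, weights=weights)
print(weights)
print(round(loss, 4))
```

With λ > 1, the exploration tokens contribute proportionally more to the gradient than the execution tokens for the same advantage.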
Citation
@misc{yang2025e2c,
title = {Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm},
author = {Kaisen Yang and Tinghe Zhang and Rushi Shah and Kaicheng Yang and
Qinwei Ma and Dianbo Liu and Alex Lamb},
year = {2025},
eprint = {2509.23946},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2509.23946}
}