Explore-Execute Chain (E²C) — Qwen3-8B

This repository contains the E²C model weights trained on top of Qwen3-8B.

Paper: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm (https://arxiv.org/abs/2509.23946)
Code: GitHub

What is E²C?

Standard chain-of-thought mixes high-level planning and low-level derivation in a single undifferentiated sequence. E²C splits reasoning into two explicit phases inside one model:

  • Exploration (<EXPLORATION>...</EXPLORATION>): a short, stochastic plan that outlines the solution strategy (~500 tokens).
  • Execution (<EXECUTION>...</EXECUTION>): a deterministic, step-by-step derivation that follows the plan exactly.

The two phases are trained jointly. A causal SFT stage first teaches the model the E²C format. Two-stage GRPO training then amplifies the gradient weight on exploration tokens (λ > 1) to sharpen planning while keeping execution deterministic.
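As a rough illustration of what an SFT example in the E²C format might look like, the sketch below renders an (exploration, execution) pair into one tagged target string and masks the prompt tokens out of the loss. The function names, the newline layout around the tags, and the toy token ids are assumptions made for this example, not the authors' preprocessing code. The value -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def format_e2c_target(exploration: str, execution: str) -> str:
    """Render a plan and its derivation in the tagged E²C layout.

    The exact whitespace around the tags is an assumption of this sketch.
    """
    return (
        f"<EXPLORATION>\n{exploration.strip()}\n</EXPLORATION>\n"
        f"<EXECUTION>\n{execution.strip()}\n</EXECUTION>"
    )


def build_labels(prompt_ids: List[int], target_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and target ids; mask prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


# Toy usage with made-up token ids.
target = format_e2c_target("Factor, then test small n.", "Step 1: ...")
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(target)
print(input_ids, labels)
```

Only the target positions contribute to the loss here, which is one common way to implement "prompt tokens masked".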

Model variants

| Name | Base | Training |
|---|---|---|
| 8B-Final | Qwen3-8B | E²C-SFT → E²C-RL (Stage 1 + Stage 2) |
| 4B-Final | Qwen3-4B | E²C-SFT → E²C-RL (Stage 1 + Stage 2) |

Performance

Mathematical reasoning (Pass@1, 8 samples, Qwen3-8B base):

| Benchmark | Qwen3-8B + GRPO | E²C (SFT+RL) |
|---|---|---|
| AIME 2024 | 36.9% | 40.6% |
| AIME 2025 | 34.4% | 33.8% |
| MATH500 | 88.2% | 87.7% |
| AMC 2023 | 79.3% | 80.3% |
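The model card does not define "Pass@1, 8 samples" further. A common reading, assumed here, is that each problem is sampled 8 times and Pass@1 is the fraction of correct samples averaged over problems. The sketch below shows that computation on made-up outcomes. The function name and the toy data are illustrative and are not taken from the evaluation scripts.

```python
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Average per-problem accuracy.

    results[i][j] is True when sample j of problem i is correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)


# Toy data: 3 problems, 8 samples each (invented outcomes, not benchmark results).
toy = [
    [True] * 8,                 # always solved
    [False] * 8,                # never solved
    [True] * 4 + [False] * 4,   # solved half the time
]
print(f"{pass_at_1(toy):.1%}")  # 50.0%
```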

Test-time scaling on AIME 2024 (K=32):

| Method | Accuracy | Tokens (k) |
|---|---|---|
| Self-Consistency | 50.0% | 86.2 |
| Tree-of-Thoughts | 50.0% | 71.3 |
| E²C-ReAct Loop | 53.3% | 12.4 |

The E²C-ReAct Loop reaches higher accuracy than the two standard test-time scaling (TTS) baselines while using roughly 6× to 7× fewer tokens (12.4k versus 71.3k and 86.2k), because it searches over short exploration plans rather than full chains.
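The token savings can be checked directly from the table. The short script below divides each baseline's token count by the E²C-ReAct Loop count. It uses only the numbers reported above, and the variable names are chosen for this example.

```python
# Token usage in thousands, copied from the test-time scaling table.
tokens_k = {
    "Self-Consistency": 86.2,
    "Tree-of-Thoughts": 71.3,
    "E²C-ReAct Loop": 12.4,
}

reference = tokens_k["E²C-ReAct Loop"]

# How many times more tokens each baseline uses than the E²C-ReAct Loop.
ratios = {
    name: used / reference
    for name, used in tokens_k.items()
    if name != "E²C-ReAct Loop"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the tokens of E²C-ReAct Loop")
```

The ratio is about 6.95 against Self-Consistency and about 5.75 against Tree-of-Thoughts.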

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
)

problem = "Find all positive integers n such that n² + 1 divides n³ + 1."

messages = [{"role": "user", "content": problem}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)

# Parse phases
if "<EXPLORATION>" in response and "<EXECUTION>" in response:
    exploration = response.split("<EXPLORATION>")[1].split("</EXPLORATION>")[0].strip()
    execution   = response.split("<EXECUTION>")[1].split("</EXECUTION>")[0].strip()
    print("Plan:\n", exploration)
    print("\nSolution:\n", execution)
else:
    print(response)

See the GitHub repository for full evaluation scripts and test-time scaling experiments.
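The tag-splitting step in the usage snippet assumes both closing tags are present. If generation stops at max_new_tokens, the closing </EXECUTION> tag may be missing. The helper below is a more tolerant sketch using only the standard library: it returns the text up to the closing tag, or up to the end of the string when the tag never closes. The function names are hypothetical and are not part of the released code.

```python
import re
from typing import Dict, Optional


def extract_phase(response: str, tag: str) -> Optional[str]:
    """Return the text inside <tag>...</tag>, or None if the opening tag is absent.

    If the closing tag is missing (for example, truncated output),
    everything after the opening tag is returned.
    """
    pattern = rf"<{tag}>(.*?)(?:</{tag}>|\Z)"
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_e2c(response: str) -> Dict[str, Optional[str]]:
    """Split a model response into its exploration and execution phases."""
    return {
        "exploration": extract_phase(response, "EXPLORATION"),
        "execution": extract_phase(response, "EXECUTION"),
    }


# Example with a truncated execution phase (no closing tag).
demo = "<EXPLORATION>Use polynomial division.</EXPLORATION>\n<EXECUTION>Step 1: divide"
print(parse_e2c(demo))
```

A caller can treat a None value as "phase not produced" and fall back to printing the raw response, as the snippet above does.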

Training details

| Stage | Description |
|---|---|
| Causal SFT data | Full solutions distilled into (exploration, execution) pairs; execution conditioned on exploration |
| E²C-SFT | Standard cross-entropy on structured output (prompt tokens masked) |
| E²C-RL Stage 1 | GRPO, rollout=32, temp=1.3, 1 epoch; diversifies exploration |
| E²C-RL Stage 2 | GRPO, rollout=8, temp=1.0, adv_coeff=2.0, 2 epochs; sharpens execution determinism |

Exploration tokens receive λ-amplified gradient weight throughout RL training to focus the policy improvement signal on the planning phase.
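One way to picture λ-amplified weighting is a per-token weight vector that equals λ inside the exploration span and 1 elsewhere, applied to a policy-gradient surrogate loss. The sketch below is a simplified illustration under several assumptions: each tag is treated as a single token, λ = 2.0 is an arbitrary example value (the text above only states λ > 1), and the loss is a plain weighted average without the clipping and group normalization that GRPO uses. It is not the authors' training code.

```python
from typing import List


def exploration_token_weights(tokens: List[str], lam: float = 2.0) -> List[float]:
    """Weight lam for tokens strictly inside <EXPLORATION>...</EXPLORATION>, else 1.0.

    Assumes each tag appears as a single token; the tags themselves get weight 1.0.
    """
    weights = []
    inside = False
    for tok in tokens:
        if tok == "</EXPLORATION>":
            inside = False
        weights.append(lam if inside else 1.0)
        if tok == "<EXPLORATION>":
            inside = True
    return weights


def weighted_policy_loss(logprobs: List[float], advantages: List[float], weights: List[float]) -> float:
    """Token-weighted surrogate: -sum(w * A * logp) / sum(w).

    Normalizing by the weight sum is a choice made for this sketch.
    """
    total = sum(w * a * lp for w, a, lp in zip(weights, advantages, logprobs))
    return -total / sum(weights)


tokens = ["<EXPLORATION>", "try", "induction", "</EXPLORATION>", "<EXECUTION>", "n=1", "</EXECUTION>"]
weights = exploration_token_weights(tokens)
print(weights)
```

With these weights, a gradient step moves the log-probabilities of plan tokens λ times as strongly as those of execution tokens for the same advantage.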

Citation

@misc{yang2025e2c,
  title     = {Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm},
  author    = {Kaisen Yang and Tinghe Zhang and Rushi Shah and Kaicheng Yang and
               Qinwei Ma and Dianbo Liu and Alex Lamb},
  year      = {2025},
  eprint    = {2509.23946},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url       = {https://arxiv.org/abs/2509.23946}
}