Explore-Execute Chain (E²C) — Qwen3-8B

This repository contains the E²C model weights trained on top of Qwen3-8B.

Paper: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm
Code: GitHub

What is E²C?

Standard chain-of-thought mixes high-level planning and low-level derivation in a single undifferentiated sequence. E²C splits reasoning into two explicit phases inside one model:

Exploration (<EXPLORATION>...</EXPLORATION>): a short, stochastic plan that outlines the solution strategy (~500 tokens).
Execution (<EXECUTION>...</EXECUTION>): a deterministic, step-by-step derivation that follows the plan exactly.

The two phases are trained jointly. A causal SFT stage first teaches the model the E²C format. A two-stage GRPO phase then amplifies the gradient weight on exploration tokens (λ > 1) to sharpen planning while keeping execution deterministic.

Model variants

Name	Base	Training
`8B-Final`	Qwen3-8B	E²C-SFT → E²C-RL (Stage 1 + Stage 2)
`4B-Final`	Qwen3-4B	E²C-SFT → E²C-RL (Stage 1 + Stage 2)

Performance

Mathematical reasoning (Pass@1, 8 samples, Qwen3-8B base):

Benchmark	Qwen3-8B + GRPO	E²C (SFT+RL)
AIME 2024	36.9%	40.6%
AIME 2025	34.4%	33.8%
MATH500	88.2%	87.7%
AMC 2023	79.3%	80.3%
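
The table reports Pass@1 with 8 samples per problem. One common reading, assumed here and not confirmed by this card, is that each problem's score is the fraction of its 8 samples that are correct, averaged over all problems. A minimal sketch with made-up correctness flags (the helper name `pass_at_1` and the toy data are illustrative only, not taken from the evaluation scripts):

```python
from statistics import mean

def pass_at_1(correct_flags_per_problem):
    """Average per-problem accuracy over sampled completions.

    correct_flags_per_problem: one inner list per problem, one bool per
    sampled completion (assumed layout, not the repository's format).
    """
    per_problem = [
        mean(1.0 if ok else 0.0 for ok in flags)
        for flags in correct_flags_per_problem
    ]
    return mean(per_problem)

# Toy data: 2 problems, 8 samples each (not real benchmark results).
toy = [
    [True] * 6 + [False] * 2,  # 6 of 8 samples correct
    [True] * 2 + [False] * 6,  # 2 of 8 samples correct
]
score = pass_at_1(toy)
print(f"Pass@1 = {score:.3f}")
```

Averaging over several samples reduces the variance that a single sampled completion would introduce on small benchmarks such as AIME.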

Test-time scaling on AIME 2024 (K=32):

Method	Accuracy	Tokens (k)
Self-Consistency	50.0%	86.2
Tree-of-Thoughts	50.0%	71.3
E²C-ReAct Loop	53.3%	12.4

E²C-ReAct Loop reaches higher accuracy than both standard TTS baselines while using roughly 6 to 7× fewer tokens (12.4k versus 71.3k for Tree-of-Thoughts and 86.2k for Self-Consistency), by running the search over short exploration plans rather than full chains.
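
The token savings can be recomputed directly from the test-time scaling table. The short sketch below divides each baseline's token budget by that of E²C-ReAct Loop; the dictionary and variable names are illustrative, and the numbers are copied from the table.

```python
# Token budgets in thousands of tokens, copied from the table above.
tokens_k = {
    "Self-Consistency": 86.2,
    "Tree-of-Thoughts": 71.3,
    "E²C-ReAct Loop": 12.4,
}

e2c_budget = tokens_k["E²C-ReAct Loop"]

# Ratio of each baseline's budget to the E²C-ReAct Loop budget.
ratios = {
    name: budget / e2c_budget
    for name, budget in tokens_k.items()
    if name != "E²C-ReAct Loop"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the tokens of E²C-ReAct Loop")
```

The ratios come out near 6.95 for Self-Consistency and 5.75 for Tree-of-Thoughts, so the reduction depends on which baseline is used for comparison.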

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "TingheOliver/Explore-Execute-Chain-Qwen",
    subfolder="8B-Final",
)

problem = "Find all positive integers n such that n² + 1 divides n³ + 1."

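# Wrap the problem in the chat template so the model sees the expected turn markers.
messages = [{"role": "user", "content": problem}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample with a nonzero temperature so the exploration plan can vary between runs.
output = model.generate(**inputs, max_new_tokens=4096, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, dropping the prompt prefix.
response = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)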

# Parse phases
if "<EXPLORATION>" in response and "<EXECUTION>" in response:
    exploration = response.split("<EXPLORATION>")[1].split("</EXPLORATION>")[0].strip()
    execution   = response.split("<EXECUTION>")[1].split("</EXECUTION>")[0].strip()
    print("Plan:\n", exploration)
    print("\nSolution:\n", execution)
else:
    print(response)

See the GitHub repository for full evaluation scripts and test-time scaling experiments.
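
The inline parsing in the usage snippet assumes both phases are present. When generation is cut off by `max_new_tokens`, a closing tag may be missing. The sketch below is one possible way to handle that case with the standard `re` module. The helper name `split_phases` is hypothetical and not part of the repository; only the tag names come from this card.

```python
import re
from typing import Optional, Tuple

# Each pattern captures lazily up to the closing tag, the start of the next
# phase, or the end of the string (covers truncated generations).
_EXPLORATION_RE = re.compile(
    r"<EXPLORATION>(.*?)(?:</EXPLORATION>|<EXECUTION>|\Z)", re.DOTALL
)
_EXECUTION_RE = re.compile(r"<EXECUTION>(.*?)(?:</EXECUTION>|\Z)", re.DOTALL)

def split_phases(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (exploration, execution); a phase that never opens is None."""
    exploration = _EXPLORATION_RE.search(response)
    execution = _EXECUTION_RE.search(response)
    return (
        exploration.group(1).strip() if exploration else None,
        execution.group(1).strip() if execution else None,
    )

# Toy response whose execution phase was truncated before its closing tag.
sample = "<EXPLORATION>Use polynomial division.</EXPLORATION>\n<EXECUTION>Step 1\nStep 2"
plan, solution = split_phases(sample)
print("Plan:", plan)
print("Solution:", solution)
```

Returning `None` for a missing phase lets the caller decide whether to retry generation or fall back to printing the raw response.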

Training details

Stage	Description
Causal SFT data	Full solutions distilled into (exploration, execution) pairs; execution conditioned on exploration
E²C-SFT	Standard cross-entropy on structured output (prompt tokens masked)
E²C-RL Stage 1	GRPO, rollout=32, temp=1.3, 1 epoch — diversifies exploration
E²C-RL Stage 2	GRPO, rollout=8, temp=1.0, adv_coeff=2.0, 2 epochs — sharpens execution determinism

Exploration tokens receive λ-amplified gradient weight throughout RL training to focus the policy improvement signal on the planning phase.
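
To make the λ-amplified weighting concrete, the sketch below applies per-token weights to a simplified REINFORCE-style surrogate loss. This is an illustration under stated assumptions, not the training code: it omits GRPO's clipping and group normalization, the function names are hypothetical, and `lam=2.0` is an arbitrary value chosen only to satisfy λ > 1.

```python
def exploration_weights(is_exploration, lam):
    """Per-token weights: lam on exploration tokens, 1.0 on all others."""
    return [lam if flag else 1.0 for flag in is_exploration]

def weighted_pg_loss(logprobs, advantage, weights):
    """Token-weighted surrogate: -(A / T) * sum_t w_t * log p_t.

    The gradient with respect to each token's log-probability is scaled by
    its weight, so exploration tokens contribute lam times more.
    """
    if len(logprobs) != len(weights):
        raise ValueError("logprobs and weights must have the same length")
    total = sum(w * lp for w, lp in zip(weights, logprobs))
    return -advantage * total / len(logprobs)

# Toy rollout: 3 exploration tokens followed by 2 execution tokens.
is_exploration = [True, True, True, False, False]
logprobs = [-0.5, -1.0, -0.5, -0.2, -0.3]  # made-up log-probabilities
weights = exploration_weights(is_exploration, lam=2.0)
loss = weighted_pg_loss(logprobs, advantage=1.0, weights=weights)
print(weights, round(loss, 4))
```

With λ = 1 the weights are uniform and the expression reduces to the unweighted surrogate, which makes the effect of the amplification easy to isolate.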

Citation

@misc{yang2025e2c,
  title     = {Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm},
  author    = {Kaisen Yang and Tinghe Zhang and Rushi Shah and Kaicheng Yang and
               Qinwei Ma and Dianbo Liu and Alex Lamb},
  year      = {2025},
  eprint    = {2509.23946},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url       = {https://arxiv.org/abs/2509.23946}
}

Paper: Explore-Execute Chain: Towards an Efficient Structured Reasoning Paradigm (arXiv 2509.23946, published Sep 28, 2025)