S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Abstract
S2D2 is a training-free self-speculative decoding framework that improves the accuracy-speed tradeoff in block-diffusion language models by combining parallel block generation with autoregressive verification.
Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to a 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
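To make the draft-then-verify loop concrete, here is a minimal sketch of self-speculative block decoding with toy stand-in functions. This is an illustrative sketch, not the S2D2 implementation: `toy_next_token` and `toy_draft_block` are hypothetical placeholders for the same pretrained model run in block-size-1 autoregressive mode and in few-step parallel diffusion mode, respectively, and the sketch omits S2D2's routing policies that decide when verification is worth its cost.

```python
def toy_next_token(context):
    # Stand-in for a block-size-1 (autoregressive) forward pass of the model.
    return (sum(context) + len(context)) % 7

def toy_draft_block(context, block_size):
    # Stand-in for a few-step parallel diffusion proposal over a whole block;
    # deliberately imperfect so verification sometimes rejects tokens.
    return [(sum(context) + len(context) + i * i) % 7 for i in range(block_size)]

def speculative_block_step(context, block_size=4):
    """One hybrid decoding step: diffusion drafts a block in parallel,
    then the autoregressive mode verifies it token by token, keeping the
    longest matching prefix (standard speculative-decoding acceptance)."""
    draft = toy_draft_block(context, block_size)
    accepted = []
    for tok in draft:
        target = toy_next_token(context + accepted)
        if tok == target:
            accepted.append(tok)     # draft token survives verification
        else:
            accepted.append(target)  # replace first mismatch, stop early
            break
    return accepted

context = [1, 2, 3]
out = speculative_block_step(context)  # → [2, 5]: one accepted draft token,
                                       #   then the verifier's correction
```

When the drafter and verifier are the same model in two decoding modes, acceptance costs one autoregressive pass per verified token, so verification only pays off when drafts are usually accepted; this is the tradeoff S2D2's routing policies manage.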
Community
S2D2 is a training-free self-speculative decoding method for block-diffusion LLMs: the same pretrained model drafts in diffusion mode and verifies in block-size-1 autoregressive mode, improving the accuracy-speed tradeoff over strong confidence-thresholding baselines.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs (2026)
- Locally Coherent Parallel Decoding in Diffusion Language Models (2026)
- DFlash: Block Diffusion for Flash Speculative Decoding (2026)
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching (2026)
- EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models (2026)
- Advancing Block Diffusion Language Models for Test-Time Scaling (2026)
- TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models (2026)