S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation
Abstract
S2D2 is a training-free self-speculative decoding framework that improves the accuracy-speed tradeoff in block-diffusion language models by combining parallel block generation with autoregressive verification.
Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to a 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
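To make the draft-then-verify loop concrete, here is a minimal sketch of self-speculative block decoding with toy stand-in functions. This is an illustrative sketch, not the S2D2 implementation: `toy_next_token` and `toy_draft_block` are hypothetical placeholders for the same pretrained model run in block-size-1 autoregressive mode and in few-step parallel diffusion mode, respectively, and the sketch omits S2D2's routing policies that decide when verification is worth its cost.

```python
def toy_next_token(context):
    # Stand-in for a block-size-1 (autoregressive) forward pass of the model.
    return (sum(context) + len(context)) % 7

def toy_draft_block(context, block_size):
    # Stand-in for a few-step parallel diffusion proposal over a whole block;
    # deliberately imperfect so verification sometimes rejects tokens.
    return [(sum(context) + len(context) + i * i) % 7 for i in range(block_size)]

def speculative_block_step(context, block_size=4):
    """One hybrid decoding step: diffusion drafts a block in parallel,
    then the autoregressive mode verifies it token by token, keeping the
    longest matching prefix (standard speculative-decoding acceptance)."""
    draft = toy_draft_block(context, block_size)
    accepted = []
    for tok in draft:
        target = toy_next_token(context + accepted)
        if tok == target:
            accepted.append(tok)     # draft token survives verification
        else:
            accepted.append(target)  # replace first mismatch, stop early
            break
    return accepted

context = [1, 2, 3]
out = speculative_block_step(context)  # → [2, 5]: one accepted draft token,
                                       #   then the verifier's correction
```

When the drafter and verifier are the same model in two decoding modes, acceptance costs one autoregressive pass per verified token, so verification only pays off when drafts are usually accepted; this is the tradeoff S2D2's routing policies manage.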
Community
S2D2 is a training-free self-speculative decoding method for block-diffusion LLMs: the same pretrained model drafts in diffusion mode and verifies in block-size-1 autoregressive mode, improving the accuracy-speed tradeoff over strong confidence-thresholding baselines.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs (2026)
- Locally Coherent Parallel Decoding in Diffusion Language Models (2026)
- DFlash: Block Diffusion for Flash Speculative Decoding (2026)
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching (2026)
- EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models (2026)
- Advancing Block Diffusion Language Models for Test-Time Scaling (2026)
- TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models (2026)