arxiv:2506.01979

Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Published on May 16

Abstract

SpecBranch enhances speculative decoding by introducing parallel speculative branches and adaptive draft lengths to reduce waiting times and improve speed without altering the sampling distribution.

AI-generated summary

Speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference: a small draft model proposes draft tokens in advance, and the large target model validates them in parallel. However, existing SD methods remain constrained by serialized execution, which creates mutual waiting bubbles between the draft and target models. To address this challenge, we draw inspiration from branch prediction in modern processors and propose SpecBranch, a novel framework that unlocks branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD and find that the key challenge lies in the trade-off between parallelization and token rollback. Based on this analysis, we introduce parallel speculative branches that preemptively hedge against likely rejections. Meanwhile, to further enhance parallelism, we jointly orchestrate adaptive draft lengths through a hybrid of the draft model's implicit confidence and explicit reuse of target model features. Extensive experiments across various models and benchmarks show that SpecBranch achieves 1.8×–4.5× speedups over auto-regressive decoding and reduces rollback tokens by 50% for poorly aligned models, while maintaining an identical sampling distribution.
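The page includes no code, but the core mechanism is easy to sketch. Below is a minimal, illustrative Python sketch, not the authors' implementation: it assumes Hugging Face-style causal LMs whose forward call returns `.logits`, greedy agreement stands in for the distribution-preserving rejection-sampling check, the confidence threshold and single runner-up hedge branch are simplifying assumptions, and the hedge drafting that would overlap with target verification in a real system runs sequentially here for clarity.

```python
import torch

@torch.no_grad()
def draft_adaptive(draft_model, ids, max_len=8, conf_threshold=0.5):
    """Greedy drafting with an adaptive length: stop early once the draft
    model's top-1 probability (its implicit confidence) drops below a
    threshold, since low-confidence tokens are the likeliest to be rejected."""
    drafted, alts, confs = [], [], []
    for _ in range(max_len):
        logits = draft_model(ids).logits[:, -1, :]
        top2 = torch.softmax(logits, dim=-1).topk(2, dim=-1)
        tok, alt = top2.indices[0, 0], top2.indices[0, 1]
        drafted.append(tok.item())
        alts.append(alt.item())          # runner-up, used by the hedge branch
        confs.append(top2.values[0, 0].item())
        ids = torch.cat([ids, tok.view(1, 1)], dim=-1)
        if confs[-1] < conf_threshold:   # low confidence -> stop drafting
            break
    return ids, drafted, alts, confs

@torch.no_grad()
def verify_greedy(target_model, ids, n_draft):
    """Greedy stand-in for rejection-sampling verification: count accepted
    draft tokens and return the target's correction at the first mismatch."""
    logits = target_model(ids).logits[0]
    preds = logits[-(n_draft + 1):-1].argmax(dim=-1)  # target's pick per draft slot
    drafted = ids[0, -n_draft:]
    accepted = int((preds == drafted).long().cumprod(0).sum().item())
    correction = logits[-(n_draft + 1) + accepted].argmax().item()
    return accepted, correction

@torch.no_grad()
def specbranch_step(draft_model, target_model, ids):
    """One hedged decoding step. In the real system the hedge branch is
    drafted concurrently with target verification (branch parallelism)."""
    prefix = ids.shape[1]
    full, drafted, alts, confs = draft_adaptive(draft_model, ids)
    risky = min(range(len(confs)), key=confs.__getitem__)  # likeliest rejection point
    # Hedge branch: assume the riskiest token gets rejected and pre-draft a
    # continuation from its runner-up alternative.
    alt_tok = torch.tensor([[alts[risky]]], dtype=full.dtype, device=full.device)
    hedge_full, *_ = draft_adaptive(
        draft_model, torch.cat([full[:, : prefix + risky], alt_tok], dim=-1))
    accepted, correction = verify_greedy(target_model, full, len(drafted))
    if accepted == len(drafted):
        return full                      # whole branch accepted
    if accepted == risky and correction == alts[risky]:
        return hedge_full                # hedge guessed right: its tokens come for free
    # Plain rollback: keep the accepted prefix plus the target's correction.
    corr = torch.tensor([[correction]], dtype=full.dtype, device=full.device)
    return torch.cat([full[:, : prefix + accepted], corr], dim=-1)
```

The point of the hedge branch is that a rejection at the low-confidence position no longer stalls the pipeline: by the time verification finishes, an alternative continuation has already been drafted, so the rollback cost is hidden behind the target model's pass.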
