Enterprise adoption of large language models (LLMs) is often constrained by inference latency, the time it takes to generate responses. Diffusion LLMs promise faster generation by producing multiple tokens in parallel, but the decoding step that decides which masked tokens can be committed simultaneously has been a bottleneck. A new research paper from Kasa, Dai, Negi, and Li introduces Fast-dLLM++, a training-free decoding algorithm that improves throughput by up to 37% at accuracy comparable to its predecessor, Fast-dLLM.
The Bottleneck in Diffusion LLM Inference
Diffusion LLMs generate text by starting with a fully masked sequence and iteratively unmasking tokens. The key challenge is determining which tokens can be unmasked in parallel without degrading quality. Prior work, Fast-dLLM, addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory assumed a homogeneous high-confidence threshold. This effectively reduced each candidate set to its weakest selected token, limiting parallelism.
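To make confidence-guided parallel decoding concrete, here is a minimal sketch of a single unmasking step. It is illustrative only, not the Fast-dLLM implementation: the function name `threshold_commit_set`, the 0.9 default, and the fallback to the single most confident token are assumptions made for this example.

```python
from typing import List


def threshold_commit_set(confidences: List[float], threshold: float = 0.9) -> List[int]:
    """Pick which masked positions to unmask in one decoding step.

    `confidences[i]` is the model's top-1 probability for masked position i.
    Hypothetical helper: the name, the default threshold, and the fallback
    below are assumptions for illustration, not the library's API.
    """
    if not confidences:
        return []
    # Commit every position whose confidence clears the shared threshold.
    selected = [i for i, c in enumerate(confidences) if c >= threshold]
    if not selected:
        # Assumed fallback: commit the single most confident position
        # so that every step unmasks at least one token.
        best = max(range(len(confidences)), key=lambda i: confidences[i])
        selected = [best]
    return selected


step_confidences = [0.99, 0.62, 0.95, 0.91, 0.40]
print(threshold_commit_set(step_confidences))  # positions 0, 2 and 3 clear 0.9
```

Every selected token is held to the same bar here, so a set such as (0.99, 0.95, 0.91) is treated no differently from three tokens at 0.91. That is the sense in which a homogeneous rule reduces the set to its weakest member.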
Fréchet Profile Decoding: The Innovation
The authors propose Fréchet profile decoding, which selects parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. This is a heterogeneous-confidence generalization of Fast-dLLM's factor selector — it recovers the previous rule exactly when confidences are equal, and adds a provable heterogeneity bonus when selected tokens have uneven confidences. Importantly, Fast-dLLM++ leaves the model, diffusion process, and cache implementation unchanged, making it a drop-in replacement for existing Fast-dLLM decoding.
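The difference between a weakest-token rule and a profile-aware rule can be sketched in a few lines. This is not the paper's selector. It is an assumed construction based on the Fréchet inequality, which says that the probability that n events all hold is at least 1 minus the sum of their individual failure probabilities. The weakest-token rule below uses the assumed form (n + 1)(1 - c_n) < factor, where c_n is the n-th highest confidence. The profile rule replaces n(1 - c_n) with the summed deficits of the selected tokens. All names and the `factor` parameter are hypothetical.

```python
from typing import List


def weakest_token_count(confidences: List[float], factor: float = 1.0) -> int:
    """Assumed weakest-token rule: largest n with (n + 1) * (1 - c_n) < factor.

    Every selected token is charged the deficit of the weakest one (c_n).
    """
    ranked = sorted(confidences, reverse=True)
    best = 0
    for n, c_n in enumerate(ranked, start=1):
        if (n + 1) * (1.0 - c_n) < factor:
            best = n
    # Assumed fallback: always commit at least one token if any are masked.
    return max(best, 1) if ranked else 0


def profile_count(confidences: List[float], factor: float = 1.0) -> int:
    """Assumed profile rule: largest n with sum_{i<=n}(1 - c_i) + (1 - c_n) < factor.

    Each selected token is charged its own deficit (a Frechet-style bound),
    so confident tokens leave headroom for less confident ones.
    """
    ranked = sorted(confidences, reverse=True)
    best = 0
    deficit = 0.0  # running sum of (1 - c_i) over the top-n tokens
    for n, c_n in enumerate(ranked, start=1):
        deficit += 1.0 - c_n
        if deficit + (1.0 - c_n) < factor:
            best = n
    return max(best, 1) if ranked else 0


equal = [0.85] * 8
uneven = [0.99, 0.98, 0.97, 0.90, 0.85, 0.80]

print(weakest_token_count(equal), profile_count(equal))
print(weakest_token_count(uneven), profile_count(uneven))
```

With eight tokens all at 0.85, both rules select five, because the two conditions are algebraically identical when confidences are equal. On the uneven profile the weakest-token rule selects five tokens and the profile rule selects six. The summed deficit never exceeds n times the largest deficit, so under these assumptions the profile rule never selects fewer tokens.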
Empirical Results
Experiments were conducted using the LLaDA-8B model on four benchmarks:
| Benchmark | Throughput Improvement | Accuracy vs Fast-dLLM |
|---|---|---|
| GSM8K | Up to 37% | Comparable |
| MATH | Up to 37% | Comparable |
| HumanEval | Up to 37% | Comparable |
| MBPP | Up to 37% | Comparable |
As the paper states, "profile-aware selection improves the accuracy–throughput frontier by exploiting safe parallelism that weakest-token rules miss." The theoretical improvement carries over to the measurements, with up to 37% higher throughput at comparable accuracy.
Implications for Enterprise AI
For enterprise technology leaders evaluating LLM deployment, faster inference means lower serving costs and shorter response times. Fast-dLLM++ requires no additional training or hardware changes; it is a drop-in upgrade for systems already using Fast-dLLM. The code is released publicly (see paper for repository link). While the research evaluates only math and code-generation benchmarks, the underlying principle of heterogeneous-confidence decoding could apply to other diffusion-based generative models, such as those used for data synthesis or document processing in supply chain and logistics applications.
The method's ability to improve throughput without degrading accuracy makes it attractive for real-time AI systems where every millisecond counts. As organizations scale AI across customer service, contract analysis, and operational planning, tools like Fast-dLLM++ can help achieve higher efficiency without compromising on quality.