Fast-dLLM++ Boosts Diffusion LLM Inference Up to 37% With Fréchet Profile Decoding

Researchers propose Fast-dLLM++, a training-free extension to Fast-dLLM that uses Fréchet profile decoding to select parallel token commit sets from the full confidence profile. Experiments on LLaDA-8B show up to 37% higher throughput at comparable accuracy on benchmarks including GSM8K, MATH, HumanEval, and MBPP.

iGEN Editorial

June 16, 2026

Enterprise adoption of large language models (LLMs) is often constrained by inference latency, the time it takes to generate responses. Diffusion LLMs promise faster generation by producing multiple tokens in parallel, but the decoding step that decides which masked tokens can be committed simultaneously has been a bottleneck. A new research paper from Kasa, Dai, Negi, and Li introduces Fast-dLLM++, a training-free algorithm that improves throughput by up to 37% at comparable accuracy.

The Bottleneck in Diffusion LLM Inference

Diffusion LLMs generate text by starting with a fully masked sequence and iteratively unmasking tokens. The key challenge is determining which tokens can be unmasked in parallel without degrading quality. Prior work, Fast-dLLM, addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory assumed a homogeneous high-confidence threshold. In effect, every candidate set was judged by its weakest selected token, which limited parallelism.
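The confidence-guided step can be pictured with a toy example. The sketch below is a minimal illustration and not Fast-dLLM's actual implementation: the function name `threshold_commit`, the threshold value, and the fallback of always committing the single most confident token are assumptions made for this example.

```python
def threshold_commit(confidences, masked_positions, tau=0.9):
    """Return the masked positions to unmask in this step (illustrative only)."""
    # Keep every masked position whose top-token confidence clears the threshold.
    chosen = [p for p in masked_positions if confidences[p] >= tau]
    if not chosen and masked_positions:
        # Assumed fallback: commit the single most confident token
        # so that decoding always makes progress.
        chosen = [max(masked_positions, key=lambda p: confidences[p])]
    return chosen
```

With confidences `[0.97, 0.55, 0.93, 0.40]` and all four positions masked, this returns `[0, 2]`: two tokens are committed in one step and the other two wait for a later iteration.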

Fréchet Profile Decoding: The Innovation

The authors propose Fréchet profile decoding, which selects parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. This is a heterogeneous-confidence generalization of Fast-dLLM's factor selector — it recovers the previous rule exactly when confidences are equal, and adds a provable heterogeneity bonus when selected tokens have uneven confidences. Importantly, Fast-dLLM++ leaves the model, diffusion process, and cache implementation unchanged, making it a drop-in replacement for existing Fast-dLLM decoding.
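To see why the full profile can admit more tokens than a weakest-token rule, consider the classical Fréchet lower bound: the probability that n events all hold is at least 1 minus the sum of their individual failure probabilities. This article does not give the paper's exact selection rule, so the sketch below uses that bound as a stand-in. The function names and the `risk_budget` parameter are hypothetical.

```python
def weakest_token_count(confidences, risk_budget):
    """Largest n whose worst-case bound n * (1 - weakest confidence) fits the budget."""
    ranked = sorted(confidences, reverse=True)
    n = 0
    for k, c in enumerate(ranked, start=1):
        # Treat all k tokens as if they were as uncertain as the weakest one.
        if k * (1.0 - c) > risk_budget:
            break
        n = k
    return n


def profile_count(confidences, risk_budget):
    """Largest n whose summed error mass sum(1 - c_i) fits the budget."""
    ranked = sorted(confidences, reverse=True)
    n, total_error = 0, 0.0
    for k, c in enumerate(ranked, start=1):
        # Fréchet lower bound: P(all k correct) >= 1 - total_error.
        total_error += 1.0 - c
        if total_error > risk_budget:
            break
        n = k
    return n
```

For the uneven profile `[0.99, 0.98, 0.90, 0.60]` with a budget of 0.15, the weakest-token count is 2 while the profile count is 3. For four equal confidences of 0.95 with a budget of 0.18, both return 3, which mirrors the claim that the two rules coincide when confidences are equal.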

Empirical Results

Experiments were conducted using the LLaDA-8B model on four benchmarks:

Benchmark	Throughput Improvement	Accuracy vs Fast-dLLM
GSM8K	Up to 37%	Comparable
MATH	Up to 37%	Comparable
HumanEval	Up to 37%	Comparable
MBPP	Up to 37%	Comparable

As the paper states, "profile-aware selection improves the accuracy–throughput frontier by exploiting safe parallelism that weakest-token rules miss." The theoretical improvement translates directly into empirical gains, with up to 37% higher throughput at comparable accuracy.
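A throughput gain does not shrink generation time by the same percentage. As a rough worked example, assuming a fixed workload and that time per token is simply the reciprocal of throughput (the function name is hypothetical):

```python
def latency_reduction(throughput_gain):
    """Fractional drop in time per token for a given fractional throughput gain."""
    # Time per token is the reciprocal of throughput (tokens per second).
    return 1.0 - 1.0 / (1.0 + throughput_gain)


print(f"{latency_reduction(0.37):.1%}")  # 27.0%
```

Under these assumptions, the best-case 37% throughput gain corresponds to roughly 27% less time per generated token.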

Implications for Enterprise AI

For enterprise technology leaders evaluating LLM deployment, inference speed translates directly into lower costs and faster response times. Fast-dLLM++ requires no additional training or hardware changes — it is a drop-in upgrade for systems already using Fast-dLLM. The code is released publicly (see paper for repository link). While the research focuses on language tasks, the underlying principle of heterogeneous-confidence decoding could apply to any diffusion-based generative model used in data synthesis or document processing within supply chain and logistics applications.

The method's ability to improve throughput without degrading accuracy makes it attractive for real-time AI systems where every millisecond counts. As organizations scale AI across customer service, contract analysis, and operational planning, tools like Fast-dLLM++ can help achieve higher efficiency without compromising on quality.
