Running large language model (LLM) inference at scale is expensive, with providers incurring costs exceeding $700,000 per day, according to a new paper posted on arXiv. The challenge lies in GPU scheduling. Each request is generated token by token, and memory constraints from the Key-Value (KV) cache can force evictions of in-progress requests, wasting prior computation. The paper, authored by Ruicheng Ao, Gan Luo, David Simchi-Levi, and Xinshang Wang, formulates inference as a multi-stage online scheduling problem and introduces a fluid-guided approach to optimize batch composition and memory management.
The Memory Growth Challenge
In LLM inference, every generated token adds an entry to the KV cache, so a request's memory footprint grows as it decodes. This endogenous memory growth can cause the cache to overflow, evicting active requests and wasting earlier work. The researchers model this as a multi-stage online scheduling problem with linear iteration times and GPU-resident KV-cache constraints. They introduce a fluid model that characterizes the equilibrium batch composition, the memory requirement, and the stability region.
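The overflow mechanism can be illustrated with a minimal sketch. It assumes a single fixed batch, one new KV-cache entry per active request per decode step, and memory measured in tokens. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def first_overflow_step(prompt_lens, output_lens, capacity_tokens):
    """Return the first decode step at which the batch's KV cache
    exceeds capacity, or None if the batch finishes without overflow.

    Illustrative assumptions (not from the paper): memory is counted
    in tokens, every active request grows by one token per step, and
    a finished request frees its cache after its final step.
    """
    generated = [0] * len(prompt_lens)
    step = 0
    while any(g < o for g, o in zip(generated, output_lens)):
        step += 1
        used = 0
        for i, (p, o) in enumerate(zip(prompt_lens, output_lens)):
            if generated[i] < o:
                generated[i] += 1          # one more token decoded
                used += p + generated[i]   # prompt + tokens decoded so far
        if used > capacity_tokens:
            return step                    # an eviction would be forced here
    return None


print(first_overflow_step([100, 100], [50, 50], capacity_tokens=250))
print(first_overflow_step([100, 100], [50, 50], capacity_tokens=300))
```

With two 100-token prompts that each generate 50 tokens, a 250-token cache overflows at step 26, while a 300-token cache is never exceeded. The batch that fit in memory at admission is the same batch that later overflows, which is why admission decisions need to account for future growth.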
The Fluid-Guided Approach
Guided by the fluid model, the team designed two algorithms:
- WAIT (Waiting for Accumulated Inference Threshold): a threshold-based admission rule for known output lengths.
- Nested WAIT: extends WAIT to unknown output lengths by regulating how requests advance across decode-stage segments. It adds a safety buffer of moderate scale to hedge against memory-overflow-induced evictions.
Both algorithms approximate the fluid benchmark asymptotically under stated memory conditions.
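To make the idea of a threshold-based admission rule concrete, here is a minimal sketch of one simple reading: requests are grouped by type, and a batch is released only once every type has accumulated at least its threshold of waiting requests. The class name, the type labels, and the threshold values are hypothetical, and the paper's actual WAIT policy may differ in how thresholds are set and applied.

```python
from collections import deque


class ThresholdAdmitter:
    """Toy threshold-based admission rule (illustrative, not the paper's code).

    `thresholds` maps a request type to the number of waiting requests
    of that type required before a batch may be admitted.
    """

    def __init__(self, thresholds):
        self.thresholds = dict(thresholds)
        self.queues = {t: deque() for t in self.thresholds}

    def arrive(self, req_type, req_id):
        # Queue the request under its type; nothing is admitted yet.
        self.queues[req_type].append(req_id)

    def try_admit(self):
        # Hold everything back until every type has reached its threshold.
        if any(len(self.queues[t]) < n for t, n in self.thresholds.items()):
            return []
        # Admit exactly the threshold number of requests per type, in FIFO order.
        batch = []
        for t, n in self.thresholds.items():
            for _ in range(n):
                batch.append(self.queues[t].popleft())
        return batch


admitter = ThresholdAdmitter({"short": 2, "long": 1})
admitter.arrive("short", "a")
admitter.arrive("short", "b")
print(admitter.try_admit())   # "long" has not reached its threshold yet
admitter.arrive("long", "c")
print(admitter.try_admit())   # all thresholds met, so a batch is released
```

Holding requests until the thresholds are met trades a short wait for batches whose composition, and therefore memory footprint, is predictable. Under the assumptions of this sketch, choosing the thresholds is the step where a fluid model's equilibrium batch composition would enter.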
Simulation Results
In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms. They also reduce latency, especially in near-overloaded and overloaded regimes.
Implications for Enterprise AI
For technology decision-makers managing LLM inference infrastructure, these findings offer a path to more efficient GPU utilization. By scheduling requests more intelligently, enterprises can reduce costs and improve response times under heavy loads, directly impacting the economics of AI-driven services.