New Fluid-Guided Algorithm Optimizes LLM Inference Scheduling Under Memory Constraints

A new paper from researchers including David Simchi-Levi introduces a fluid-guided online scheduling approach for LLM inference that addresses the memory constraints created by Key-Value cache growth. The WAIT and Nested WAIT algorithms approximate an optimal fluid benchmark and reduce latency in overloaded regimes, according to simulations configured for Llama-2-7B on an A100 GPU.

iGEN Editorial

June 16, 2026

Running large language model inference at scale is expensive: providers incur costs exceeding $700,000 per day, according to a new paper posted on arXiv. Much of the challenge lies in GPU scheduling. Each request is served through token-by-token inference, and memory constraints from the Key-Value (KV) cache can force evictions of in-progress requests, wasting prior computation. The paper, authored by Ruicheng Ao, Gan Luo, David Simchi-Levi, and Xinshang Wang, formulates inference as a multi-stage online scheduling problem and introduces a fluid-guided approach to optimize batch composition and memory management.

The Memory Growth Challenge

In LLM inference, every generated token adds to the KV cache, so a request's memory footprint grows while it is being served. This endogenous memory growth can overflow the cache, forcing the eviction of active requests and wasting the work already done on them. The researchers model this as a multi-stage online scheduling problem with linear iteration times and GPU-resident KV-cache constraints. They introduce a fluid model that characterizes the equilibrium batch composition, the memory requirement, and the stability region.
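
The memory growth described above can be illustrated with a toy simulation. The following is a minimal sketch, not code from the paper: the `Request` class, the `run_batch` helper, the token-count capacity, and the `base` and `per_token` constants of the linear iteration-time model are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int      # tokens in the prompt
    output_len: int      # tokens the request will generate
    generated: int = 0   # tokens generated so far

    @property
    def kv_tokens(self) -> int:
        # Assumption: one KV-cache slot per prompt token and per generated token.
        return self.prompt_len + self.generated

    @property
    def done(self) -> bool:
        return self.generated >= self.output_len


def iteration_time(batch, base=1.0, per_token=0.001):
    """Hypothetical linear model: fixed cost plus a cost per cached token."""
    return base + per_token * sum(r.kv_tokens for r in batch)


def run_batch(batch, capacity):
    """Decode one token per request per iteration.

    Returns (peak_tokens, overflowed, elapsed). Stops at the first iteration
    in which the cached tokens exceed `capacity`.
    """
    active = list(batch)
    peak = sum(r.kv_tokens for r in active)
    elapsed = 0.0
    while active:
        elapsed += iteration_time(active)
        for r in active:
            r.generated += 1
        usage = sum(r.kv_tokens for r in active)
        peak = max(peak, usage)
        if usage > capacity:
            return peak, True, elapsed   # cache overflow: an eviction would occur
        active = [r for r in active if not r.done]
    return peak, False, elapsed
```

In this sketch, four requests with 100-token prompts and 50-token outputs start at 400 cached tokens and end at 600, so a batch that fits at admission time can still overflow a 500-token cache partway through decoding.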

The Fluid-Guided Approach

Guided by the fluid model, the team designed two algorithms:

WAIT (Waiting for Accumulated Inference Threshold): a threshold-based admission rule for known output lengths.
Nested WAIT: extends WAIT to unknown output lengths by regulating how requests advance across decode-stage segments. It adds a safety buffer of moderate scale to hedge against memory-overflow-induced evictions.
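
The idea of a threshold-based admission rule can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's algorithm: the `ThresholdAdmitter` class, the grouping of requests by a type label, and the threshold values are hypothetical, and the article does not say how the paper derives its thresholds from the fluid model.

```python
from collections import defaultdict, deque


class ThresholdAdmitter:
    """Hold arriving requests until enough of the same type have accumulated."""

    def __init__(self, thresholds):
        # Hypothetical mapping: request type -> number of requests to accumulate.
        self.thresholds = dict(thresholds)
        self.waiting = defaultdict(deque)

    def arrive(self, req_type, req_id):
        self.waiting[req_type].append(req_id)

    def admit(self):
        """Release waiting requests, in groups of the threshold size, per type."""
        admitted = []
        for req_type, queue in self.waiting.items():
            n = self.thresholds[req_type]
            while len(queue) >= n:
                admitted.extend(queue.popleft() for _ in range(n))
        return admitted
```

With thresholds of 2 for a "short" type and 3 for a "long" type, a single short request waits, a second one releases both, and long requests are held until three are queued. The safety buffer that Nested WAIT adds for unknown output lengths is not modeled here.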

Both algorithms approximate the fluid benchmark asymptotically under stated memory conditions.

Simulation Results

In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms. They also reduce latency, most notably in near-overloaded and overloaded regimes.

Implications for Enterprise AI

For technology decision-makers managing LLM inference infrastructure, these findings offer a path to more efficient GPU utilization. By scheduling requests more intelligently, enterprises can reduce costs and improve response times under heavy loads, directly impacting the economics of AI-driven services.
