
New Fluid-Guided Algorithm Optimizes LLM Inference Scheduling Under Memory Constraints

A new paper from researchers including David Simchi-Levi introduces a fluid-guided online scheduling approach for LLM inference that addresses memory constraints from Key-Value cache growth. The WAIT and Nested WAIT algorithms approximate an optimal fluid benchmark, reducing latency in overloaded regimes according to simulations on Llama-2-7B with A100 GPUs.

iGEN Editorial
June 16, 2026

Running large language model inference at scale is expensive: providers can incur costs exceeding $700,000 per day, according to a new paper posted on arXiv. The difficulty lies in GPU scheduling. Each request is generated token by token, and when the Key-Value (KV) cache outgrows GPU memory, in-progress requests are evicted and the computation already spent on them is wasted. The paper, authored by Ruicheng Ao, Gan Luo, David Simchi-Levi, and Xinshang Wang, formulates inference as a multi-stage online scheduling problem and introduces a fluid-guided approach to optimize batch composition and memory management.

The Memory Growth Challenge

In LLM inference, every generated token is appended to the KV cache, so a request's memory footprint grows for as long as it keeps decoding. This endogenous memory growth can overflow the cache, forcing active requests to be evicted and their earlier work to be discarded. The researchers model this as a multi-stage online scheduling problem with linear iteration times and GPU-resident KV-cache constraints. They introduce a fluid model that characterizes the equilibrium batch composition, the memory requirement, and the stability region.
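The memory dynamics described above can be illustrated with a small toy model. This is a minimal sketch and not the paper's formulation: the `Request` class, the one-memory-unit-per-token accounting, and the coefficients `d0` and `d1` in the linear iteration time are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request: prompt tokens plus tokens generated so far."""
    prompt_len: int      # tokens in the prompt
    output_len: int      # total tokens to generate
    generated: int = 0   # tokens generated so far

    def kv_tokens(self) -> int:
        # Assumption: each prompt or generated token occupies one KV-cache unit.
        return self.prompt_len + self.generated

    def done(self) -> bool:
        return self.generated >= self.output_len


def iteration_time(batch, d0=1.0, d1=0.01):
    """Linear iteration time: a fixed cost plus a cost per cached token.

    d0 and d1 are made-up illustrative coefficients.
    """
    return d0 + d1 * sum(r.kv_tokens() for r in batch)


def run_until_overflow(batch, capacity):
    """Decode one token per active request per iteration.

    Returns (iterations_run, overflowed). Overflow means the total number of
    cached tokens exceeded `capacity`, which in a real system would force
    an eviction.
    """
    active = list(batch)
    iterations = 0
    while active:
        for r in active:
            r.generated += 1          # the KV cache grows by one token
        iterations += 1
        if sum(r.kv_tokens() for r in active) > capacity:
            return iterations, True   # cache overflow
        # Finished requests release their memory.
        active = [r for r in active if not r.done()]
    return iterations, False
```

With two requests of 10 prompt tokens and 5 output tokens each, a capacity of 25 tokens overflows on the third iteration (26 cached tokens), while a capacity of 30 lets both requests finish. A batch that fits at admission time can therefore still overflow later, which is the effect the scheduling policy has to anticipate.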

The Fluid-Guided Approach

Guided by the fluid model, the team designed two algorithms:

  • WAIT (Waiting for Accumulated Inference Threshold): a threshold-based admission rule for known output lengths.
  • Nested WAIT: extends WAIT to unknown output lengths by regulating how requests advance across decode-stage segments. It adds a safety buffer of moderate scale to hedge against memory-overflow-induced evictions.

Both algorithms approach the fluid benchmark asymptotically under the memory conditions stated in the paper.
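To make the idea of a threshold-based admission rule concrete, here is a minimal sketch. It is an illustrative simplification and not the paper's WAIT algorithm: the `ThresholdAdmitter` class, the grouping of requests into named types, and the threshold values are all hypothetical. In the paper the thresholds are derived from the fluid model, which this article does not spell out.

```python
from collections import deque


class ThresholdAdmitter:
    """Hypothetical admission rule: hold requests of each type in a queue
    and admit them only once that type's queue reaches its threshold."""

    def __init__(self, thresholds):
        # thresholds: mapping from request type to the number of waiting
        # requests needed before that type is admitted (assumed given).
        self.thresholds = dict(thresholds)
        self.queues = {t: deque() for t in self.thresholds}

    def arrive(self, req_type, req_id):
        """Record a newly arrived request of a known type."""
        self.queues[req_type].append(req_id)

    def try_admit(self):
        """Return {type: [request ids]} for every type whose queue has
        reached its threshold, removing those requests from the queue."""
        admitted = {}
        for req_type, threshold in self.thresholds.items():
            queue = self.queues[req_type]
            if len(queue) >= threshold:
                # Admit exactly `threshold` requests in arrival order.
                admitted[req_type] = [queue.popleft() for _ in range(threshold)]
        return admitted


# Usage: "short" requests are admitted in groups of 2, "long" in groups of 3.
admitter = ThresholdAdmitter({"short": 2, "long": 3})
admitter.arrive("short", "a")
first = admitter.try_admit()     # nothing has reached its threshold yet
admitter.arrive("short", "b")
admitter.arrive("long", "c")
second = admitter.try_admit()    # the two short requests are admitted
```

Waiting for a threshold trades a little queueing delay for control over what enters the batch, which is what lets a scheduler keep projected memory use within the cache. Handling unknown output lengths, as Nested WAIT does, would need additional machinery that this sketch does not attempt.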

Simulation Results

In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms. They also reduce latency, especially in near-overloaded and overloaded regimes.

Implications for Enterprise AI

For technology decision-makers managing LLM inference infrastructure, these findings offer a path to more efficient GPU utilization. By scheduling requests more intelligently, enterprises can reduce costs and improve response times under heavy loads, directly impacting the economics of AI-driven services.

