Enterprise deployments of large language models (LLMs) face a hidden performance trap: the very act of serving requests can cause its own congestion. New research published on arXiv presents a discrete-time dynamical model that quantifies how persistent GPU memory consumption from key-value caches degrades throughput under high concurrency, with throughput losses of up to 50%.
According to the paper "Service-Induced Congestion in Memory-Constrained LLM Serving" by Ruicheng Ao, Jing Dong, Gan Luo, and David Simchi-Levi, each LLM request accumulates persistent GPU memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage increases endogenously over time: the service process itself creates future capacity pressure.
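The memory growth described above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes capacity is measured in cached tokens, that every request in a batch starts together, and that each request adds exactly one token per decode step. The function names are hypothetical.

```python
def batch_tokens(num_requests: int, prompt_len: int, step: int) -> int:
    """Cached tokens held by a batch after `step` decode steps.

    Assumes every request started together and grows by one token per step.
    """
    return num_requests * (prompt_len + step)


def steps_until_full(capacity_tokens: int, num_requests: int, prompt_len: int) -> int:
    """Largest number of decode steps the batch can run without exceeding capacity."""
    # Each request may hold at most capacity_tokens // num_requests tokens.
    per_request_budget = capacity_tokens // num_requests
    return max(per_request_budget - prompt_len, 0)


# Eight requests with 100-token prompts in a 1000-token cache.
print(steps_until_full(1000, 8, 100))
print(batch_tokens(8, 100, 25), batch_tokens(8, 100, 26))
```

Under these assumptions the batch fits at admission time but outgrows the cache purely because it is being served, which is the endogenous pressure the paper describes.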
The Memory-Constrained Serving Problem
When memory capacity is exceeded, systems evict active requests, discarding their cached state and restarting them later. This wastes computation and reduces throughput. The researchers model this as a dynamical system that captures admission, memory growth, and eviction under continuous batching. They analyze two regimes: saturated input, where requests arrive faster than they can be served, and input-dominated scaling, where request arrival rates are proportional to service capacity.
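The admission, growth, and eviction loop can be illustrated with a toy discrete-time simulation. This is a minimal sketch under stated assumptions, not the paper's model: the queue is always full, admission is greedy, the newest requests are evicted first, evicted progress is discarded, and all requests share the same hypothetical `prompt_len` and `decode_len`. Only discarded decode tokens are counted as waste; recomputed prompts are ignored.

```python
def simulate(capacity: int, prompt_len: int, decode_len: int, steps: int):
    """Toy saturated-queue serving loop; returns (useful, wasted, evictions)."""
    active = []  # generated-token count per active request, oldest first
    useful = wasted = evictions = 0
    for _ in range(steps):
        used = sum(prompt_len + g for g in active)
        # Admission: add requests while the prompt plus this step's growth fits.
        while used + prompt_len + len(active) + 1 <= capacity:
            active.append(0)
            used += prompt_len
        # Eviction: drop the newest requests until this step's growth fits.
        while active and used + len(active) > capacity:
            g = active.pop()
            used -= prompt_len + g
            wasted += g  # decode work that must be redone later
            evictions += 1
        # Decode: every surviving request generates one token.
        active = [g + 1 for g in active]
        # Completion: finished requests release their memory.
        finished = sum(1 for g in active if g >= decode_len)
        useful += finished * decode_len
        active = [g for g in active if g < decode_len]
    return useful, wasted, evictions


print(simulate(capacity=10, prompt_len=1, decode_len=4, steps=8))
```

With these small numbers the toy falls into a repeating four-step pattern in which some requests are admitted, partially decoded, and then evicted before finishing. The specific figures say nothing about real systems; the point is only that eviction can recur indefinitely even though every admission looked safe when it was made.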
Key Findings on Stability and Throughput Loss
For homogeneous workloads (all requests identical), the paper proves that the eviction-free equilibrium is unstable. Except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set. In this limit cycle, throughput losses can be as large as 50%.
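One rough intuition for why identical, synchronized requests are hard on memory (an illustration, not the paper's proof) is to compare the peak footprint of a cohort that ages in lockstep with the footprint of the same number of requests spread evenly across ages. The sketch assumes one token per step and measures memory in cached tokens; the function names are hypothetical.

```python
def synchronized_peak(num_requests: int, prompt_len: int, decode_len: int) -> int:
    """Peak cached tokens when all requests reach full length at the same step."""
    return num_requests * (prompt_len + decode_len)


def staggered_footprint(prompt_len: int, decode_len: int) -> int:
    """Cached tokens when exactly one request sits at each age 1..decode_len."""
    return sum(prompt_len + age for age in range(1, decode_len + 1))


# Four requests, 1-token prompts, 4 decode tokens each.
print(synchronized_peak(4, 1, 4), staggered_footprint(1, 4))
```

For the same concurrency, the lockstep cohort needs more capacity at its peak than the staggered one needs at any time, which is why desynchronized completions are easier to fit in a fixed cache.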
For heterogeneous workloads (different request lengths), the researchers prove a stability criterion in the two-class common-input setting. They also explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability.
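The coprimality condition itself is easy to check for two request classes. The helper below is a hypothetical sketch that tests only the arithmetic property (greatest common divisor equal to 1); it does not evaluate the paper's stability criterion, and how the condition extends beyond two classes is not assumed here.

```python
from math import gcd


def shared_period(decode_len_a: int, decode_len_b: int) -> int:
    """Greatest common divisor of two decoding lengths."""
    return gcd(decode_len_a, decode_len_b)


def lengths_are_coprime(decode_len_a: int, decode_len_b: int) -> bool:
    """True when the two decoding lengths share no common factor above 1."""
    return shared_period(decode_len_a, decode_len_b) == 1


print(lengths_are_coprime(128, 255))  # coprime pair
print(lengths_are_coprime(128, 256))  # shares a factor of 128
```

A shared factor greater than 1 means completions of the two classes can repeatedly line up on a common period, which matches the article's description of synchronized modes.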
Design Principles for High-Throughput Serving
The results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, the authors identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput. For enterprise CTOs and AI infrastructure teams, this means that careful management of request length distributions and batching policies can help avoid the worst-case throughput loss of up to 50%.
| Workload Type | Eviction-Free Equilibrium | Throughput Loss |
|---|---|---|
| Homogeneous | Unstable (except exact-capture set) | Up to 50% |
| Heterogeneous (two-class) | Stable under survival-polynomial criterion | Varies |
| Coprime decoding lengths (input-dominated) | Stable | Minimal |
| Non-coprime lengths (input-dominated) | Unstable | Significant |
Implications for Enterprise AI Deployments
The findings underscore that memory-constrained LLM serving is not just a hardware capacity issue but a dynamic instability problem. As organizations scale AI inference for applications like chatbots, code generation, and document analysis, understanding service-induced congestion becomes critical. The paper’s scheduling principles, such as ensuring diversity in request lengths and avoiding synchronized completions, offer actionable guidance without requiring additional GPU memory. While the research is theoretical, it points toward practical batching and admission control strategies that can prevent the worst-case throughput degradation.
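As one example of what an admission control strategy could look like, the sketch below reserves each request's final footprint up front, so an admitted request can never be evicted. This is a hypothetical, deliberately conservative policy and is not described in the paper; it assumes an upper bound on each request's decode length is known (for instance, from a generation cap) and it trades concurrency for the absence of evictions.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (prompt_len, max_decode_len), both in tokens


def can_admit(active: List[Request], candidate: Request, capacity_tokens: int) -> bool:
    """Admit only if every request could reach full length without exceeding capacity."""
    # Reserve the worst-case footprint of everything already admitted.
    reserved = sum(prompt + max_decode for prompt, max_decode in active)
    prompt, max_decode = candidate
    return reserved + prompt + max_decode <= capacity_tokens


running = [(100, 50)]
print(can_admit(running, (100, 50), capacity_tokens=300))
print(can_admit(running + [(100, 50)], (100, 50), capacity_tokens=400))
```

A policy like this leaves memory idle whenever requests finish early, so it is a baseline to compare against rather than a recommendation.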