Artificial Intelligence #service-induced congestion #memory-constrained
Service-Induced Congestion Threatens LLM Serving Throughput, New Model Shows
A new mathematical model from researchers at MIT and other institutions shows that in large language model (LLM) serving, the key-value (KV) caches that occupy GPU memory for as long as a request is being served create a "service-induced congestion" effect. Under high concurrency, this effect can cause instability and throughput losses of up to 50%. The paper identifies scheduling design principles for avoiding these losses.
Jun 16, 2026
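
For a sense of scale, the memory pressure behind this effect can be estimated with back-of-the-envelope arithmetic. The sketch below is not the paper's model. It only applies the standard KV cache sizing rule (keys plus values, stored per layer and per KV head) to a hypothetical model and GPU budget. The function names, model dimensions, and the 40 GiB budget are all assumptions chosen for illustration.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 covers keys and values; bytes_per_elem=2 assumes
    16-bit cache entries (an assumption, not a figure from the article).
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrent_requests(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """Worst-case number of requests whose full contexts fit in the budget.

    Assumes every request holds its whole context (prompt plus generated
    tokens) in GPU memory at the same time.
    """
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)

# Hypothetical memory left for KV caches after weights: 40 GiB.
kv_budget = 40 * 1024**3

for tokens in (1024, 4096, 16384):
    limit = max_concurrent_requests(kv_budget, tokens, per_token)
    print(f"{tokens:>6} tokens per request -> at most {limit} concurrent requests")
```

Under these assumed numbers, each token occupies 512 KiB of cache, so the 40 GiB budget holds 80 concurrent requests at 1,024 tokens each but only 5 at 16,384 tokens. Memory held by requests already in service, rather than arrival rate alone, caps how many requests can run at once.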