A new approach to key-value (KV) cache compression, called PolyKV, aims to reduce the memory footprint of large language model (LLM) inference while losing less accuracy than uniform compression. According to the paper published on arXiv, PolyKV treats each transformer layer individually, selecting the most suitable compression policy and allocating a non-uniform budget per layer under a fixed total budget. In experiments on LLaMA-3.1-8B and Qwen3-8B, the method recovered up to 54.5% of the performance gap between the best uniform policy and an uncompressed KV cache (FullKV).
The Problem with Uniform KV Cache Compression
KV cache compression is essential for reducing the memory cost of long-context LLM inference, according to the authors. However, existing approaches typically apply a single compression policy and a uniform cache budget across all transformer layers. The paper argues that this uniform design ignores the fact that different layers play different roles during prefill and decoding, and may require different eviction strategies and cache capacities.
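To give a sense of scale, the memory cost can be estimated with simple arithmetic. The sketch below is illustrative only: the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption based on the publicly documented LLaMA-3.1-8B configuration, not a figure from the paper, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens_per_layer,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * tokens_per_layer * bytes_per_value)


# Assumed shape (not from the paper): 32 layers, 8 KV heads, head dim 128.
full = kv_cache_bytes(32, 8, 128, tokens_per_layer=131072)  # 128K-token context
compressed = kv_cache_bytes(32, 8, 128, tokens_per_layer=512)  # 512 tokens kept per layer

print(full / 2**30)        # size in GiB -> 16.0
print(compressed / 2**20)  # size in MiB -> 64.0
```

Under these assumptions, a single 128K-token sequence needs about 16 GiB of cache, while keeping an average of 512 tokens per layer needs about 64 MiB. Which tokens each layer keeps is the question a compression policy has to answer.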
PolyKV's Heterogeneous Approach
PolyKV is a layer-wise optimization framework that considers both method selection and budget allocation. It routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods, the authors state.
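The two decisions described above, policy routing and budget allocation, can be sketched in a few lines. This is a minimal illustration and not the authors' algorithm: the per-layer `scores` and `recency` signals, the threshold, the proportional allocation rule, and the policy names `sliding_window` and `attention_score_topk` are all hypothetical placeholders, since the article does not specify which signals or policies PolyKV uses.

```python
def allocate_budgets(scores, avg_budget, min_budget=32):
    """Split a fixed total budget across layers in proportion to scores.

    `scores` are hypothetical positive per-layer importance values.
    The returned budgets always sum to avg_budget * len(scores).
    """
    num_layers = len(scores)
    total = avg_budget * num_layers
    spare = total - min_budget * num_layers
    if spare < 0:
        raise ValueError("average budget is below the per-layer minimum")
    weight_sum = sum(scores)
    if weight_sum <= 0:
        raise ValueError("scores must sum to a positive value")
    budgets = [min_budget + int(spare * s / weight_sum) for s in scores]
    # Give tokens lost to rounding down to the highest-scoring layers.
    leftover = total - sum(budgets)
    order = sorted(range(num_layers), key=lambda i: scores[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def route_policy(recency_focus, threshold=0.5):
    """Pick a policy name from a hypothetical layer-level signal."""
    if recency_focus >= threshold:
        return "sliding_window"       # placeholder policy name
    return "attention_score_topk"     # placeholder policy name


def plan_layers(scores, recency, avg_budget):
    """Return one (policy, budget) pair per layer."""
    budgets = allocate_budgets(scores, avg_budget)
    return [(route_policy(r), b) for r, b in zip(recency, budgets)]


# Toy four-layer example with made-up signals.
plan = plan_layers(scores=[1.0, 3.0, 2.0, 2.0],
                   recency=[0.9, 0.2, 0.4, 0.7],
                   avg_budget=512)
for layer, (policy, budget) in enumerate(plan):
    print(layer, policy, budget)
```

In this toy run the four layers receive 272, 752, 512, and 512 tokens, which still averages 512 per layer. A uniform scheme would give every layer 512 tokens and the same policy.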
The framework was tested on two popular LLMs: LLaMA-3.1-8B and Qwen3-8B. Performance was measured on the LongBench benchmark suite, comparing PolyKV against the strongest single-policy baseline and against FullKV (no compression).
Experimental Results
Under the same average KV budget of 512 tokens, PolyKV recovered 54.5% of the performance gap between the strongest single-policy baseline and FullKV on LLaMA-3.1-8B, and 25.7% on Qwen3-8B. Across a broader budget sweep from 128 to 1024 tokens, PolyKV consistently outperformed the strongest baseline by 1.7% to 6.4%, corresponding to a 40.0% to 54.5% recovery of the FullKV gap.
| Metric | Model | Value |
|---|---|---|
| Gap recovery at 512-token budget | LLaMA-3.1-8B | 54.5% |
| Gap recovery at 512-token budget | Qwen3-8B | 25.7% |
| Improvement over strongest baseline (128 to 1024 token budgets) | Not specified | 1.7% to 6.4% |
| FullKV gap recovery (128 to 1024 token budgets) | Not specified | 40.0% to 54.5% |
The results indicate that PolyKV's heterogeneous allocation provides measurable gains over uniform compression strategies. At the 512-token budget, the gain was larger on LLaMA-3.1-8B (54.5% gap recovery) than on Qwen3-8B (25.7%); both models are 8B-parameter models.
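The gap-recovery figures are easiest to interpret with the underlying arithmetic in view. The sketch below assumes the natural reading of the metric: the share of the score difference between the strongest single-policy baseline and FullKV that the method closes. The function name and the example scores are made up for illustration and are not results from the paper.

```python
def gap_recovery(method_score, baseline_score, full_score):
    """Fraction of the baseline-to-FullKV gap closed by the method."""
    gap = full_score - baseline_score
    if gap <= 0:
        raise ValueError("FullKV must score higher than the baseline")
    return (method_score - baseline_score) / gap


# Made-up scores: baseline 40.0, FullKV 44.0, method 42.0.
print(gap_recovery(42.0, 40.0, 44.0))  # -> 0.5, i.e. half the gap recovered
```

Read this way, a recovery of 54.5% means the method closes a little more than half of the accuracy lost by the best uniform policy. It does not mean the method reaches 54.5% of FullKV performance.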
Implications for Enterprise AI Inference
For organizations deploying LLMs in production, KV cache memory is often a bottleneck, especially for long-context tasks such as document analysis, contract review, and extended customer interactions. PolyKV's ability to recover a substantial portion of the gap to FullKV performance at the same cache budget could reduce hardware requirements for a given accuracy target. By using layer-level signals to guide policy and budget decisions, the framework suggests that a one-size-fits-all approach to KV cache compression is suboptimal. As models grow larger and context lengths increase, such fine-grained optimization methods may become essential for cost-effective deployment.