PolyKV: Layer-Wise KV Cache Compression Recovers Up to 54.5% of the Full-Cache Performance Gap

PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, and it outperforms uniform approaches in the reported experiments. On LongBench tasks, PolyKV recovers up to 54.5% of the performance gap between the strongest single-policy baseline and the full KV cache.

iGEN Editorial

June 16, 2026

A new approach to key-value (KV) cache compression, called PolyKV, aims to reduce the memory footprint of large language model (LLM) inference while giving up less accuracy than existing compression schemes. According to the paper, published on arXiv, PolyKV treats each transformer layer individually, selecting a compression policy and a cache budget per layer under a fixed total budget. In experiments on LLaMA-3.1-8B and Qwen3-8B at an average budget of 512 tokens, the method recovered 54.5% and 25.7%, respectively, of the performance gap between the best single-policy baseline and the full KV cache.

The Problem with Uniform KV Cache Compression

KV cache compression is essential for reducing the memory cost of long-context LLM inference, according to the authors. However, existing approaches typically apply a single compression policy and a uniform cache budget across all transformer layers. The paper argues that this uniform design ignores the fact that different layers play different roles during prefill and decoding, and may require different eviction strategies and cache capacities.
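To give a sense of the memory cost involved, here is a rough back-of-the-envelope sketch. The configuration values (32 layers, 8 KV heads, head dimension 128, 16-bit values, a 32,768-token context) are illustrative assumptions for an 8B-class model with grouped-query attention; they are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, tokens_per_layer,
                   bytes_per_value=2):
    """Bytes needed to hold the keys and values of one sequence."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * tokens_per_layer * bytes_per_value)


# Assumed (illustrative) configuration, not from the paper.
full = kv_cache_bytes(32, 8, 128, tokens_per_layer=32_768)
compressed = kv_cache_bytes(32, 8, 128, tokens_per_layer=512)

print(f"full cache:       {full / 2**30:.1f} GiB")
print(f"512-token budget: {compressed / 2**20:.1f} MiB")
```

Under these assumptions, a 512-token budget per layer shrinks the cache for one sequence from gigabytes to tens of megabytes, which is why the choice of what to keep matters so much.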

PolyKV's Heterogeneous Approach

PolyKV is a layer-wise optimization framework that considers both method selection and budget allocation. It routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods, the authors state.
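The article does not describe the layer-level signals, the candidate policies, or the allocation rule, so the following is only a minimal sketch of the general idea under stated assumptions. The function names, the policy labels (`attention_topk`, `recency_window`), the threshold rule, and the proportional allocation are all hypothetical and are not PolyKV's actual method. It also assumes the "average budget" is an average across layers.

```python
import math
from fractions import Fraction


def route_policy(layer_signal, threshold=0.5):
    """Pick a compression policy name for one layer (hypothetical rule)."""
    # Assumption: a low signal means attention concentrates on few tokens,
    # so keep the top-scoring tokens; otherwise keep a recent window.
    return "attention_topk" if layer_signal < threshold else "recency_window"


def allocate_budgets(weights, avg_budget, min_budget=0):
    """Split avg_budget * num_layers tokens across layers.

    Each layer gets min_budget tokens plus a share of the remainder that is
    proportional to its non-negative weight. The budgets always sum to the
    fixed total.
    """
    n = len(weights)
    total = avg_budget * n
    spare = total - min_budget * n
    if spare < 0:
        raise ValueError("min_budget exceeds the average budget")

    # Fractions keep the arithmetic exact, so rounding cannot drift.
    exact = [Fraction(w) for w in weights]
    weight_sum = sum(exact)
    if weight_sum == 0:  # no preference: fall back to a uniform split
        exact = [Fraction(1)] * n
        weight_sum = Fraction(n)

    shares = [spare * w / weight_sum for w in exact]
    budgets = [min_budget + math.floor(s) for s in shares]

    # Hand tokens lost to rounding down to the largest fractional remainders.
    leftover = total - sum(budgets)
    by_remainder = sorted(range(n),
                          key=lambda i: shares[i] - math.floor(shares[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Made-up per-layer signals and weights for a four-layer toy model.
policies = [route_policy(s) for s in [0.2, 0.9, 0.4, 0.7]]
budgets = allocate_budgets([1, 3, 2, 2], avg_budget=512, min_budget=64)
print(policies)
print(budgets, sum(budgets))
```

The point of the sketch is the constraint, not the heuristic: whatever rule assigns the weights, the per-layer budgets must still add up to the same total a uniform scheme would use, so any gain comes from redistribution rather than extra memory.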

The framework was tested on two popular LLMs: LLaMA-3.1-8B and Qwen3-8B. Performance was measured on the LongBench benchmark suite, comparing PolyKV against the strongest single-policy baseline and against FullKV (no compression).

Experimental Results

At the same average KV budget of 512 tokens, PolyKV recovered 54.5% of the gap between the best single-policy baseline and FullKV on LLaMA-3.1-8B, and 25.7% on Qwen3-8B. Across a broader budget sweep from 128 to 1024 tokens, PolyKV consistently outperformed the strongest baseline by 1.7% to 6.4%, corresponding to a 40.0% to 54.5% recovery of the FullKV gap. This summary does not specify which model the sweep figures refer to; the 25.7% figure for Qwen3-8B lies outside that range.

Metric	LLaMA-3.1-8B	Qwen3-8B
Gap recovery at 512-token budget	54.5%	25.7%

Metric (budget sweep, 128-1024 tokens; model not specified)	Reported range
Improvement over strongest baseline	1.7%-6.4%
FullKV gap recovery	40.0%-54.5%
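"Gap recovery" here presumably means the share of the baseline-to-FullKV score difference that the method wins back. A minimal sketch of that arithmetic follows, assuming this definition; the scores in the example are invented for illustration and are not results from the paper.

```python
def gap_recovery(baseline, method, full_kv):
    """Fraction of the baseline-to-FullKV gap closed by the method."""
    gap = full_kv - baseline
    if gap <= 0:
        raise ValueError("FullKV must score above the baseline")
    return (method - baseline) / gap


# Invented scores: the baseline trails FullKV by 4 points and the
# method wins back 2 of them.
recovered = gap_recovery(baseline=40.0, method=42.0, full_kv=44.0)
print(f"{recovered:.1%}")
```

Read this way, a recovery figure says nothing by itself about the absolute score: a 50% recovery of a small gap can be a smaller absolute gain than a 25% recovery of a large one.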

The results indicate that PolyKV's heterogeneous allocation provides gains over uniform compression strategies on both models, with a larger gap recovery on LLaMA-3.1-8B (54.5%) than on Qwen3-8B (25.7%) at the 512-token budget.

Implications for Enterprise AI Inference

For organizations deploying LLMs in production, KV cache memory is often a bottleneck, especially for long-context tasks such as document analysis, contract review, and extended customer interactions. If PolyKV's results carry over to production workloads, recovering a substantial portion of FullKV performance at the same cache size could lower hardware requirements for a given quality target. The reported results also suggest that a one-size-fits-all approach to KV cache compression leaves performance on the table, and that layer-level signals are a useful guide for policy and budget decisions. As models grow larger and context lengths increase, such fine-grained optimization methods may become important for cost-effective deployment.
