Artificial Intelligence #llm #ai
OBCache Prunes KV Cache for Efficient Long-Context LLM Inference with Output-Aware Scoring
A new method called Optimal Brain Cache (OBCache) treats key-value (KV) cache eviction as a layer-wise structured pruning problem. It scores token saliency by the perturbation that pruning induces in the attention outputs. According to the paper, OBCache outperforms heuristic-based approaches on LLaMA and Qwen models and consistently improves long-context accuracy.
Jun 16, 2026 1 source
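To make the idea of output-aware scoring concrete, here is a minimal sketch for a single query and a single attention head. It is an illustration only, not the paper's scoring formula: the function names (`output_aware_scores`, `evict`) and the choice to score by the L2 norm of the output change are assumptions. The sketch relies on a standard softmax identity: if token i has attention weight a_i and the full output is o, then removing that token and renormalizing shifts the output by a_i (v_i - o) / (1 - a_i).

```python
import math

def attention_weights(q, keys):
    """Softmax of scaled dot products between one query and all cached keys."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def output_aware_scores(q, keys, values):
    """Hypothetical saliency: L2 norm of the output change if a token is evicted.

    Uses the closed form a_i * ||v_i - o|| / (1 - a_i), which avoids
    recomputing attention once per candidate token.
    """
    a = attention_weights(q, keys)
    o = attention_output(a, values)
    scores = []
    for a_i, v_i in zip(a, values):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(v_i, o)))
        # Clamp the denominator in case one token takes almost all the weight.
        scores.append(a_i / max(1.0 - a_i, 1e-12) * dist)
    return scores

def evict(q, keys, values, budget):
    """Keep the `budget` highest-scoring tokens, preserving their original order."""
    scores = output_aware_scores(q, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep
```

A score based on attention weight alone would rank a token with a large weight highly even when its value vector is close to the current output, in which case evicting it changes little. Scoring by the output perturbation accounts for both factors. A real implementation would also need to aggregate scores across queries, heads, and layers, which this sketch leaves out.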