Topic
pruning
OBCache Prunes KV Cache for Efficient Long-Context LLM Inference with Output-Aware Scoring
A new method called Optimal Brain Cache (OBCache) treats key-value cache eviction as a layer-wise structured pruning problem. It measures token saliency by the perturbation that removing a token causes in attention outputs. According to the paper, OBCache outperforms heuristic-based approaches on LLaMA and Qwen models and consistently improves long-context accuracy.
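To make the idea of output-aware scoring concrete, here is a minimal sketch that scores each cached token by how much the attention output moves when that token is dropped. It is a brute-force leave-one-out illustration for a single query and a single head, not the OBCache algorithm itself; the function names `output_perturbation_scores` and `tokens_to_keep` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(query, keys, values):
    # Scaled dot-product attention for one query vector.
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(logits)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def output_perturbation_scores(query, keys, values):
    # Assumption: saliency = L2 change in the attention output when a single
    # token is removed. Requires at least two cached tokens.
    full = attention_output(query, keys, values)
    scores = []
    for i in range(len(keys)):
        pruned = attention_output(query, keys[:i] + keys[i + 1:],
                                  values[:i] + values[i + 1:])
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(full, pruned))))
    return scores

def tokens_to_keep(query, keys, values, budget):
    # Keep the `budget` tokens whose removal would perturb the output most.
    scores = output_perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])
```

With identical keys, every token receives the same attention weight, so a score based on attention weights alone cannot rank the tokens. The output-based score still separates them when their value vectors differ, which is the distinction the summary draws between output-aware and heuristic scoring.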
Pruning Optimisations Boost LUT-Based Neural Network Scalability and Efficiency
Researchers propose a pruning-optimised Look-Up Table (LUT) matrix multiplication unit (LUT-MU) to address scalability limits in LUT-based neural networks. Deployed on FPGAs, it delivers up to 1.6x higher throughput and 4.2x better energy efficiency than CUDA-based implementations, with 1.3x to 2.6x resource savings versus the original MADDNESS-based networks.
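For readers unfamiliar with LUT-based multiplication, the sketch below shows the general idea in software: split the input into blocks, map each block to a prototype, and replace the multiply-accumulate with table lookups. It assumes a nearest-prototype encoder for simplicity (MADDNESS uses a learned hashing scheme instead) and does not model the paper's pruning step or its FPGA design; all names are hypothetical.

```python
def nearest_prototype(block, prototypes):
    # Index of the prototype closest to `block` in squared Euclidean distance.
    return min(range(len(prototypes)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(block, prototypes[c])))

def build_tables(prototypes_per_block, weight_blocks):
    # tables[b][c] = dot(prototype c of block b, weight slice for block b).
    # This is computed once, offline, for a fixed weight vector.
    return [[sum(p * w for p, w in zip(proto, w_block)) for proto in protos]
            for protos, w_block in zip(prototypes_per_block, weight_blocks)]

def lut_dot(x, prototypes_per_block, tables, block_size):
    # Approximate dot(x, weights) using one lookup and one add per block.
    total = 0.0
    for b, protos in enumerate(prototypes_per_block):
        block = x[b * block_size:(b + 1) * block_size]
        total += tables[b][nearest_prototype(block, protos)]
    return total

# Toy setup: a 4-dimensional input split into two blocks of size 2,
# with two prototypes per block.
prototypes = [[[0.0, 0.0], [1.0, 1.0]],
              [[0.0, 0.0], [2.0, 2.0]]]
weights = [[1.0, 2.0], [3.0, 4.0]]
tables = build_tables(prototypes, weights)
```

The result is exact when each input block coincides with a prototype and approximate otherwise. In this scheme the table size grows with the number of blocks and prototypes, which gives a rough sense of why table storage becomes a scaling concern.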
Multi-Granular Node Pruning for Efficient Causal Circuit Discovery in LLMs
A research paper introduces a node-level pruning framework for causal circuit discovery in large language models, using learnable masks across multiple granularities. The method finds smaller circuits than prior techniques and cuts memory footprint by 5x to 10x because it avoids storing intermediate activations.
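As a rough illustration of what a learnable node mask does, the toy below attaches a sigmoid gate to each node and trains the gates to preserve the unmasked output while paying a sparsity penalty. It is a single-granularity sketch under strong simplifying assumptions (the output is a plain sum of fixed per-node contributions, gradients are written by hand) and does not reproduce the paper's multi-granular masks or its memory savings; `learn_node_masks` and `circuit_nodes` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def learn_node_masks(contributions, sparsity=0.1, lr=1.0, steps=2000):
    # Assumed toy model: output = sum_i mask_i * contribution_i.
    # Loss = (masked_output - full_output)^2 + sparsity * sum_i mask_i,
    # minimised by plain gradient descent on the mask logits.
    full_output = sum(contributions)
    logits = [2.0] * len(contributions)  # start with every node mostly "on"
    for _ in range(steps):
        masks = [sigmoid(l) for l in logits]
        masked_output = sum(m * c for m, c in zip(masks, contributions))
        error = masked_output - full_output
        for i, (m, c) in enumerate(zip(masks, contributions)):
            # d(loss)/d(logit_i), using d(sigmoid)/d(logit) = m * (1 - m).
            grad = (2.0 * error * c + sparsity) * m * (1.0 - m)
            logits[i] -= lr * grad
    return [sigmoid(l) for l in logits]

def circuit_nodes(masks, threshold=0.5):
    # Nodes whose learned gate stays above the threshold form the circuit.
    return [i for i, m in enumerate(masks) if m > threshold]
```

Nodes that contribute nothing to the output feel only the sparsity penalty, so their gates decay toward zero, while nodes that matter are held open by the fidelity term.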