pruning

3 stories

OBCache Prunes KV Cache for Efficient Long-Context LLM Inference with Output-Aware Scoring

A new method called Optimal Brain Cache (OBCache) treats key-value (KV) cache eviction as a layer-wise structured pruning problem. It scores token saliency by the perturbation that pruning induces in attention outputs. According to the paper, OBCache consistently improves long-context accuracy over heuristic-based approaches on LLaMA and Qwen models.

Jun 16, 2026 1 source
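
To make the output-aware idea in the OBCache summary above concrete, here is a toy sketch for a single query and a single attention head. It scores each cached token by brute force: drop the token, recompute the attention output, and measure how far the output moved. This is an illustration under stated assumptions, not the paper's scoring rule; the function names (`output_perturbation_scores`, `evict`) and the `budget` parameter are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_output(weights, values):
    """Weighted sum of value vectors."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def output_perturbation_scores(query, keys, values):
    """Score token j by the L2 change in attention output when j is evicted.

    Brute-force illustration (hypothetical helper), not the paper's formula.
    """
    if len(keys) < 2:
        raise ValueError("need at least two cached tokens")
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    full = attention_output(softmax(logits), values)
    scores = []
    for j in range(len(keys)):
        # Remove token j and renormalise attention over the remaining tokens.
        kept_logits = logits[:j] + logits[j + 1:]
        kept_values = values[:j] + values[j + 1:]
        pruned = attention_output(softmax(kept_logits), kept_values)
        scores.append(math.dist(full, pruned))
    return scores


def evict(query, keys, values, budget):
    """Return the indices of the `budget` highest-scoring tokens, in order."""
    scores = output_perturbation_scores(query, keys, values)
    ranked = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget]), scores


# Token 0 dominates the attention weights, yet evicting it barely changes
# the output because every value vector is identical.
_, same_value_scores = evict([2.0, 0.0],
                             [[3.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
                             [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
                             budget=1)
```

The last call shows why an output-based score can disagree with a pure attention-weight heuristic: a heavily attended token may still be cheap to evict if its value vector adds nothing the remaining tokens do not already provide.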

Pruning Optimisations Boost LUT-Based Neural Network Scalability and Efficiency

Technology

Artificial Intelligence #neural networks #pruning

Researchers propose a pruning-optimised Look-Up Table (LUT) matrix multiplication unit (LUT-MU) to address scalability limits in LUT-based neural networks. Deployed on FPGAs, it delivers up to 1.6x higher throughput and 4.2x better energy efficiency than CUDA-based implementations, with 1.3x to 2.6x resource savings versus the original MADDNESS-based networks.

Jun 16, 2026 1 source

Multi-Granular Node Pruning for Efficient Causal Circuit Discovery in LLMs

Technology

Artificial Intelligence #causal #circuit

A research paper introduces a node-level pruning framework for causal circuit discovery in large language models, using learnable masks across multiple granularities. The method finds smaller circuits than prior techniques and reduces memory footprint by 5x to 10x by avoiding intermediate activation storage.

Jun 16, 2026 1 source