Topic
inference
SMEPilot Boosts LLM Inference Up to 3.94x on CPUs with Scalable Matrix Extensions
Researchers have developed SMEPilot, an LLM inference engine that uses the Arm Scalable Matrix Extension (SME) to accelerate execution on CPUs. By selecting CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape, SMEPilot speeds up end-to-end inference by up to 3.94x across multiple models and platforms.
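The summary does not say how SMEPilot chooses a mode, so the following is only a minimal sketch of per-shape mode selection, assuming a table of profiled latencies is available. The names `pick_mode` and `build_plan`, the `(M, N, K)` shape key, and all latency figures are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple

# Hypothetical key: (M, N, K) dimensions of a matrix-multiply operator.
Shape = Tuple[int, int, int]

# The three execution modes named in the summary.
MODES = ("cpu", "sme", "sme+cpu")


def pick_mode(latency_ms: Dict[str, float]) -> str:
    """Return the mode with the lowest profiled latency for one operator shape."""
    return min(MODES, key=lambda mode: latency_ms[mode])


def build_plan(profile: Dict[Shape, Dict[str, float]]) -> Dict[Shape, str]:
    """Map every profiled operator shape to its fastest mode."""
    return {shape: pick_mode(latencies) for shape, latencies in profile.items()}


# Made-up latencies, for illustration only (not measurements from the paper).
profile = {
    (1, 4096, 4096): {"cpu": 0.8, "sme": 1.1, "sme+cpu": 0.9},
    (512, 4096, 4096): {"cpu": 40.0, "sme": 14.0, "sme+cpu": 11.5},
}
plan = build_plan(profile)
```

In this toy profile the small decode-like shape stays on the CPU while the large prefill-like shape goes to cooperative execution, which is the kind of per-shape split the summary describes.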
MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance
Researchers propose MA-SBI, a misspecification-aware simulation-based inference framework that leverages unstructured side-channel information—such as regime labels or policy bulletins—to correct posterior estimates without requiring ground-truth parameter pairs. The method matches oracle performance on hide-the-calibration benchmarks and improves log-likelihood on real COVID epidemiological data.
Service-Induced Congestion Threatens LLM Serving Throughput, New Model Shows
A new mathematical model from researchers at MIT and elsewhere shows that in large language model serving, persistent GPU memory consumption from key-value caches creates a 'service-induced congestion' effect. Under high concurrency, this can lead to instability and throughput losses as high as 50%. The paper identifies scheduling design principles to avoid these losses.
PolyKV: Layer-Wise KV Cache Compression Recovers Up to 54.5% of the Performance Gap to a Full Cache
PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40%-54.5% of the performance gap between the strongest single-policy baseline and full KV cache.
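As a reading aid, "recovering a share of the performance gap" can be computed as below. This is a sketch of the standard gap-recovery ratio, assuming that is the metric meant; the function name and the example scores are hypothetical and not taken from the paper.

```python
def gap_recovered(baseline: float, method: float, full: float) -> float:
    """Fraction of the baseline-to-full-cache gap that a method closes.

    baseline: score of the strongest single-policy compression baseline
    method:   score of the method being evaluated
    full:     score with the uncompressed (full) KV cache
    """
    if full == baseline:
        raise ValueError("no gap to recover: baseline equals full-cache score")
    return (method - baseline) / (full - baseline)


# Hypothetical scores: the baseline trails the full cache by 10 points and
# the method closes 5 of them, i.e. it recovers half the gap.
example = gap_recovered(baseline=40.0, method=45.0, full=50.0)
```

Under this reading, a figure such as 54.5% describes how much of the quality lost to compression is won back. It is not a speedup or a memory saving.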
New Framework Reduces Visual Hallucinations in Multimodal AI Systems Without Retraining
A research paper on arXiv introduces a retrieval-augmented, reliability-aware inference framework that reduces visual hallucinations in multimodal large language models. Using an external evidence database and reliability indicators, the system raises accuracy on accepted predictions from 85.84% to 88.88% at 89.04% coverage, without retraining the model.
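The accuracy and coverage figures above are the usual selective-prediction quantities. The sketch below shows how they relate, assuming the system accepts a prediction when a reliability score clears a threshold. The scoring rule, the threshold, and the data are hypothetical, since the summary does not describe the paper's acceptance rule.

```python
from typing import List, Tuple


def selective_metrics(
    scores: List[float], correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (coverage, accepted accuracy) for a reliability threshold.

    coverage: share of all predictions whose score is >= threshold
    accepted accuracy: share of those accepted predictions that are correct
    """
    if not scores or len(scores) != len(correct):
        raise ValueError("scores and correct must be non-empty and equal length")
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(accepted) / len(scores)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Toy data: four predictions, three accepted at threshold 0.5, two of them correct.
coverage, accuracy = selective_metrics(
    scores=[0.9, 0.8, 0.3, 0.7],
    correct=[True, True, False, False],
    threshold=0.5,
)
```

Raising the threshold typically trades coverage for accuracy, which is why the two numbers are reported together.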
New VeriAttn Technique Accelerates Verifiable LLM Inference on TEE-GPU Systems
Researchers propose VeriAttn, a communication-efficient TEE-GPU attention mechanism for verifiable LLM inference. By offloading attention computations to the GPU while the TEE performs verification, VeriAttn achieves 2.60-3.38x acceleration for prefill and 3.86-5.42x for decoding over the TSDP baseline on Intel TDX.