Topic

inference

15 stories

Artificial Intelligence #artificial intelligence #llm

SafeSpec: New Framework Boosts LLM Safety Without Sacrificing Inference Speed

Researchers propose SafeSpec, a safety-aware speculative inference framework that attaches a latent safety head to jointly evaluate semantic validity and safety in a single forward pass. On Qwen3-32B, it reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, addressing the fundamental incompatibility between existing safety methods and speculative decoding.

Jun 21, 2026 1 source

New Framework MACR Resolves Knowledge Conflicts in LLMs Using Multi-Agent Reasoning

Technology

Artificial Intelligence #llms #knowledge conflict

A research paper proposes MACR, a novel framework for resolving knowledge conflicts in large language models (LLMs). Unlike existing approaches that privilege either internal parametric knowledge or external context, MACR uses an adaptive knowledge assessment and a multi-agent reasoning system to explicitly identify and resolve inconsistencies. Empirical results show MACR significantly outperforms state-of-the-art baselines while providing interpretable conflict resolutions.

Jun 20, 2026 1 source

M*: A Modular, Extensible Serving System for Efficient Multimodal AI Inference

Technology

Artificial Intelligence #multimodal #ai serving

Researchers have developed M*, a universal serving system for composite AI models that integrates diverse components like vision encoders and language backbones. Using a novel 'Walk Graph' abstraction, M* achieves significant performance improvements: 20% lower latency for text-to-image, up to 2.7x higher throughput for text-to-speech, and 12.5x faster robotic planning rollouts compared to existing baselines.

Jun 16, 2026 1 source

New Frontier Simulator Cuts LLM Inference Latency Error to Under 3% for Disaggregated Serving

Technology

Artificial Intelligence #llm #inference

Researchers introduce Frontier, a discrete-event simulator for modern LLM inference serving that models disaggregated execution, runtime optimizations, and stateful workloads. On a testbed of 16 H800 GPUs, Frontier keeps average throughput error below 4% and reduces end-to-end latency error from 44.9% to 6.4% under co-location, and from 51.7% to 2.6% under disaggregation. The simulator scales to over 1K GPUs on commodity CPUs and enables new use cases like SLA-dependent Pareto frontier exploration.

Jun 16, 2026 1 source

Multi-Sequence Verifiers Cut Inference Latency in Half for LLM Reasoning

Technology

Artificial Intelligence #parallel test-time scaling #multi-sequence verifiers

A new paper by Kim et al. introduces the Multi-Sequence Verifier (MSV), a lightweight verifier that improves calibration for parallel test-time scaling in large language models. MSV enhances best-of-N selection accuracy by up to 6% and enables early-stopping strategies that achieve the same accuracy with less than half the inference latency.
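
The early-stopping idea can be illustrated with a minimal sketch. It assumes a hypothetical `verify` callable that scores all candidates drawn so far together, and it draws samples one at a time for simplicity rather than in parallel. The names `best_of_n_early_stop`, `sample`, `verify`, and the toy agreement-based scorer are illustrative assumptions, not the paper's verifier.

```python
from typing import Callable, List, Sequence, Tuple


def best_of_n_early_stop(
    sample: Callable[[int], str],
    verify: Callable[[Sequence[str]], List[float]],
    max_samples: int,
    threshold: float,
) -> Tuple[str, int]:
    """Return (chosen answer, number of samples drawn)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    candidates: List[str] = []
    best_idx = 0
    for i in range(max_samples):
        candidates.append(sample(i))
        # Score every candidate seen so far in one call (hypothetical verifier).
        scores = verify(candidates)
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        # Stop drawing samples once the best candidate looks confident enough.
        if scores[best_idx] >= threshold:
            break
    return candidates[best_idx], len(candidates)


# Toy stand-ins: fixed answers and an agreement-based score (assumptions).
answers = ["41", "42", "42", "7"]
answer, drawn = best_of_n_early_stop(
    sample=lambda i: answers[i],
    verify=lambda cands: [cands.count(c) / len(answers) for c in cands],
    max_samples=len(answers),
    threshold=0.5,
)
print(answer, drawn)
```

In this toy run the loop stops after the third sample, so the fourth is never generated; any latency saving in practice depends on how well calibrated the verifier scores are.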

Jun 16, 2026 1 source

OBCache Prunes KV Cache for Efficient Long-Context LLM Inference with Output-Aware Scoring

Technology

Artificial Intelligence #llm #ai

A new method called Optimal Brain Cache (OBCache) treats key-value cache eviction as a layer-wise structured pruning problem. By measuring token saliency through perturbation in attention outputs, OBCache outperforms heuristic-based approaches on LLaMA and Qwen models, consistently improving long-context accuracy according to the paper.

Jun 16, 2026 1 source

PASTE System Cuts AI Agent Latency by 43.5% via Parallel Tool Execution and LLM Generation

Technology

Artificial Intelligence #llm #tool execution

A new system called PASTE reduces average task completion time for AI agents by 43.5% by parallelizing tool execution with LLM generation. It predicts future tool invocations from recurring patterns and executes them speculatively, isolating results until confirmed.
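
A rough sketch of speculative tool execution follows. It is not PASTE's implementation: `run_step`, `generate`, `predict`, and the `tools` mapping are hypothetical names, and the sketch assumes tools have no side effects, so a mispredicted call can simply be discarded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument)


def run_step(
    generate: Callable[[], ToolCall],
    predict: Callable[[], ToolCall],
    tools: Dict[str, Callable[[str], str]],
) -> Tuple[str, bool]:
    """Run one agent step; return (tool result, whether the guess was right)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        guess = predict()
        # Start the predicted tool call while the model is still generating.
        speculative = pool.submit(tools[guess[0]], guess[1])
        actual = pool.submit(generate).result()
        if actual == guess:
            # Confirmed: only now is the held-back result released.
            return speculative.result(), True
        # Mispredicted: drop the speculative result and run the real call.
        speculative.cancel()
        return tools[actual[0]](actual[1]), False


# Toy tool and stand-in callables (assumptions for illustration).
tools = {"search": lambda query: f"results for {query}"}
result, hit = run_step(
    generate=lambda: ("search", "kv cache"),
    predict=lambda: ("search", "kv cache"),
    tools=tools,
)
print(result, hit)
```

The speculative result stays inside its future until the generated call matches the guess, which mirrors the isolate-until-confirmed behaviour described above in a very simplified form.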

Jun 16, 2026 1 source

Technology

Artificial Intelligence #diffusion llm #inference

Fast-dLLM++ Boosts Diffusion LLM Inference Up to 37% With Fréchet Profile Decoding

Researchers propose Fast-dLLM++, a training-free extension to Fast-dLLM that uses Fréchet profile decoding to select parallel token commit sets from the full confidence profile. Experiments on LLaDA-8B show up to 37% higher throughput at comparable accuracy on benchmarks including GSM8K, MATH, HumanEval, and MBPP.

Jun 16, 2026 1 source

New Fluid-Guided Algorithm Optimizes LLM Inference Scheduling Under Memory Constraints

Technology

Artificial Intelligence #llm #inference

A new paper from researchers including David Simchi-Levi introduces a fluid-guided online scheduling approach for LLM inference that addresses memory constraints from Key-Value cache growth. The WAIT and Nested WAIT algorithms approximate an optimal fluid benchmark, reducing latency in overloaded regimes according to simulations on Llama-2-7B with A100 GPUs.

Jun 16, 2026 1 source

SMEPilot Boosts LLM Inference Up to 3.94x on CPUs with Scalable Matrix Extensions

Technology

Artificial Intelligence #llm #inference

Researchers have developed SMEPilot, an LLM inference engine that leverages Arm Scalable Matrix Extension (SME) to optimize execution on CPUs. By selecting CPU-only, SME-only, or cooperative SME+CPU execution per operator shape, SMEPilot improves end-to-end inference by up to 3.94x across multiple models and platforms.

Jun 16, 2026 1 source

MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance

Technology

Software #simulation-based inference #misspecification

Researchers propose MA-SBI, a misspecification-aware simulation-based inference framework that leverages unstructured side-channel information—such as regime labels or policy bulletins—to correct posterior estimates without requiring ground-truth parameter pairs. The method matches oracle performance on hide-the-calibration benchmarks and improves log-likelihood on real COVID epidemiological data.

Jun 16, 2026 1 source

Service-Induced Congestion Threatens LLM Serving Throughput, New Model Shows

Technology

Artificial Intelligence #service-induced congestion #memory-constrained

A new mathematical model from researchers at MIT and elsewhere shows that in large language model serving, persistent GPU memory consumption from key-value caches creates a 'service-induced congestion' effect. Under high concurrency, this can lead to instability and throughput losses as high as 50%. The paper identifies scheduling design principles to avoid these losses.

Jun 16, 2026 1 source

PolyKV: Layer-Wise KV Cache Compression Recovers Up to 54.5% of the Gap to Full-Cache Performance

Technology

Artificial Intelligence #kv cache #compression

PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40%-54.5% of the performance gap between the strongest single-policy baseline and full KV cache.
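
The gap-recovery figure can be read as a simple ratio: how much of the distance between the baseline score and the full-cache score a method closes. The sketch below only illustrates that arithmetic; the function name `gap_recovered` and all scores are hypothetical and are not taken from the paper.

```python
def gap_recovered(baseline: float, method: float, full: float) -> float:
    """Fraction of the baseline-to-full-cache gap that a method closes."""
    gap = full - baseline
    if gap <= 0:
        raise ValueError("full-cache score must exceed the baseline score")
    return (method - baseline) / gap


# Hypothetical scores, chosen only to illustrate the calculation.
baseline_score = 40.0  # strongest single-policy baseline
full_score = 50.0      # uncompressed KV cache
method_score = 45.0    # compressed cache under evaluation

print(f"{gap_recovered(baseline_score, method_score, full_score):.1%}")  # prints 50.0%
```

Under this reading, a 54.5% recovery means the method's score sits slightly more than halfway between the baseline and the full cache, not that inference is 54.5% faster.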

Jun 16, 2026 1 source

New Framework Reduces Visual Hallucinations in Multimodal AI Systems Without Retraining

Technology

Artificial Intelligence #artificial intelligence #multimodal systems

A research paper on arXiv introduces a retrieval-augmented reliability-aware inference framework that reduces visual hallucinations in multimodal large language models. By using an external evidence database and reliability indicators, the system improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage, without retraining the model.
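
Coverage and accepted-prediction accuracy are selective-prediction metrics: the system answers only when a reliability score clears a threshold, and accuracy is measured on the answers it keeps. A minimal sketch of how the two numbers relate, with the function name, scores, labels, and threshold all hypothetical:

```python
from typing import Sequence, Tuple


def selective_metrics(
    reliability: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy on accepted predictions)."""
    if len(reliability) != len(correct) or not reliability:
        raise ValueError("inputs must be non-empty and equal in length")
    # Keep only predictions whose reliability score clears the threshold.
    accepted = [ok for score, ok in zip(reliability, correct) if score >= threshold]
    coverage = len(accepted) / len(reliability)
    accuracy = sum(accepted) / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Hypothetical reliability scores and correctness labels.
scores = [0.95, 0.90, 0.40, 0.85, 0.30]
labels = [True, True, False, False, True]
coverage, accuracy = selective_metrics(scores, labels, threshold=0.5)
print(f"coverage={coverage:.2f} accepted accuracy={accuracy:.2f}")
```

Raising the threshold usually trades coverage for accuracy, which is why the two figures are reported together.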

Jun 16, 2026 1 source

New VeriAttn Technique Accelerates Verifiable LLM Inference on TEE-GPU Systems

Technology

Artificial Intelligence #llm #inference

Researchers propose VeriAttn, a communication-efficient TEE-GPU attention mechanism for verifiable LLM inference. By offloading attention computations to the GPU while the TEE performs verification, VeriAttn achieves 2.60-3.38x acceleration for prefill and 3.86-5.42x for decoding over the TSDP baseline on Intel TDX.

Jun 16, 2026 2 sources