iGEN

Topic

inference

6 stories
SMEPilot Boosts LLM Inference Up to 3.94x on CPUs with Scalable Matrix Extensions
Technology | Artificial Intelligence | #llm #inference

Researchers have developed SMEPilot, an LLM inference engine that leverages Arm Scalable Matrix Extension (SME) to optimize execution on CPUs. By selecting CPU-only, SME-only, or cooperative SME+CPU execution per operator shape, SMEPilot improves end-to-end inference by up to 3.94x across multiple models and platforms.

Jun 16, 2026 1 source
MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance
Technology | Software | #simulation-based inference #misspecification

Researchers propose MA-SBI, a misspecification-aware simulation-based inference framework that leverages unstructured side-channel information—such as regime labels or policy bulletins—to correct posterior estimates without requiring ground-truth parameter pairs. The method matches oracle performance on hide-the-calibration benchmarks and improves log-likelihood on real COVID epidemiological data.

Jun 16, 2026 1 source
Service-Induced Congestion Threatens LLM Serving Throughput, New Model Shows
Technology | Artificial Intelligence | #service-induced congestion #memory-constrained

A new mathematical model from researchers at MIT and elsewhere shows that in large language model serving, persistent GPU memory consumption from key-value caches creates a 'service-induced congestion' effect. Under high concurrency, this can lead to instability and throughput losses as high as 50%. The paper identifies scheduling design principles to avoid these losses.

Jun 16, 2026 1 source
PolyKV: Layer-Wise KV Cache Compression Boosts LLM Inference Efficiency by Up to 54.5%
Technology | Artificial Intelligence | #kv cache #compression

PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40%-54.5% of the performance gap between the strongest single-policy baseline and full KV cache.

Jun 16, 2026 1 source
New Framework Reduces Visual Hallucinations in Multimodal AI Systems Without Retraining
Technology | Artificial Intelligence | #artificial intelligence #multimodal systems

A research paper on arXiv introduces a retrieval-augmented reliability-aware inference framework that reduces visual hallucinations in multimodal large language models. By using an external evidence database and reliability indicators, the system improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage, without retraining the model.

Jun 16, 2026 1 source
New VeriAttn Technique Accelerates Verifiable LLM Inference on TEE-GPU Systems
Technology | Artificial Intelligence | #llm #inference

Researchers propose VeriAttn, a communication-efficient TEE-GPU attention mechanism for verifiable LLM inference. By offloading attention computations to the GPU while the TEE performs verification, VeriAttn achieves 2.60-3.38x acceleration for prefill and 3.86-5.42x for decoding over the TSDP baseline on Intel TDX.

Jun 16, 2026 2 sources