Topic
llm
M*: A Modular, Extensible Serving System for Efficient Multimodal AI Inference
Researchers have developed M*, a universal serving system for composite AI models that integrates diverse components like vision encoders and language backbones. Using a novel 'Walk Graph' abstraction, M* achieves significant performance improvements: 20% lower latency for text-to-image, up to 2.7x higher throughput for text-to-speech, and 12.5x faster robotic planning rollouts compared to existing baselines.
Fast LLM-Based Semantic Filtering: Unified Framework and Adaptive Two-Phase Method Deliver 1.6–2.0x Speed Gains
A new research paper from Kim, Catheland, and Ailamaki introduces a unified framework and adaptive two-phase method for LLM-based semantic filtering. By composing model-free clustering and online-trained proxies adaptively, and using oracle confidence for multiple purposes, the method achieves 1.6–2.0x faster performance than prior cascades while meeting a 90% accuracy target on 95% of queries across three 10K-document corpora.
EvalStop: Early Stopping for Reward Overoptimization in Multi-Tenant RLHF Platforms
EvalStop is a composable scheduling primitive for cloud LLM fine-tuning platforms that terminates jobs upon detecting reward overoptimization, releasing GPUs and preserving the best checkpoint. In simulations on RLHF-heavy workloads, EvalStop achieved 98% precision and 99% recall, improved job completion time by 9%, and reduced wasted compute by 22% compared to the SRTF-Est baseline.
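One way to picture the detection step is a patience rule: the job is flagged when a held-out evaluation score has stalled while the proxy reward keeps climbing. The sketch below is an illustration under that assumption, not EvalStop's actual detector; `should_stop`, `patience`, and `min_delta` are hypothetical names.

```python
from typing import Sequence, Tuple


def should_stop(
    proxy_rewards: Sequence[float],
    eval_scores: Sequence[float],
    patience: int = 3,
    min_delta: float = 0.0,
) -> Tuple[bool, int]:
    """Return (stop, best_step); best_step indexes the best held-out score.

    Stops when the held-out score has not improved for `patience` evaluations
    while the proxy reward is still higher than it was at the best step.
    """
    if len(proxy_rewards) != len(eval_scores):
        raise ValueError("expected one proxy reward per evaluation")
    if not eval_scores:
        return False, -1
    best_step = 0
    for step, score in enumerate(eval_scores):
        if score > eval_scores[best_step] + min_delta:
            best_step = step
    last = len(eval_scores) - 1
    stalled = last - best_step >= patience
    still_climbing = proxy_rewards[last] > proxy_rewards[best_step]
    return stalled and still_climbing, best_step
```

The returned `best_step` is the checkpoint a scheduler would keep before releasing the GPUs.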
New Frontier Simulator Cuts LLM Inference Latency Error to Under 3% for Disaggregated Serving
Researchers introduce Frontier, a discrete-event simulator for modern LLM inference serving that models disaggregated execution, runtime optimizations, and stateful workloads. On a 16-H800 GPU testbed, Frontier achieves average throughput error below 4% and reduces end-to-end latency error from 44.9% to 6.4% under co-location, and from 51.7% to 2.6% under disaggregation. The simulator scales to over 1K GPUs on commodity CPUs and enables new use cases like SLA-dependent Pareto frontier exploration.
Vocabulary Dropout Technique Prevents Diversity Collapse in LLM Co-Evolution Training
A new method called vocabulary dropout prevents diversity collapse in co-evolutionary LLM training. Applied to Qwen3 models on mathematical reasoning, it improved solver performance by an average of 4.4 points, with largest gains on competition-level benchmarks.
OBCache Prunes KV Cache for Efficient Long-Context LLM Inference with Output-Aware Scoring
A new method called Optimal Brain Cache (OBCache) treats key-value cache eviction as a layer-wise structured pruning problem. By measuring token saliency through perturbation in attention outputs, OBCache outperforms heuristic-based approaches on LLaMA and Qwen models, consistently improving long-context accuracy according to the paper.
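To show what "output-aware" can mean in practice, the brute-force sketch below scores each cached token by how much a single-query attention output moves when that token is droped from the cache. It is an illustration only, not OBCache's scoring rule; `attention_output` and `eviction_scores` are hypothetical helpers, and at least two cached tokens are assumed.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def attention_output(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """Single-query scaled dot-product attention over the cached tokens."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def eviction_scores(query: Vector, keys: Sequence[Vector], values: Sequence[Vector]) -> List[float]:
    """L2 change in the attention output when each cached token is removed."""
    full = attention_output(query, keys, values)
    scores = []
    for i in range(len(keys)):
        kept_keys = [k for j, k in enumerate(keys) if j != i]
        kept_values = [v for j, v in enumerate(values) if j != i]
        pruned = attention_output(query, kept_keys, kept_values)
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(full, pruned))))
    return scores
```

With three identical keys, every token receives the same attention weight, yet the token whose value differs most from the others gets the highest score. A heuristic that looks only at attention weights would rank all three equally.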
Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation
Researchers have introduced TEND, the first execution-verified benchmark for Text-to-NoSQL translation, comprising 1,210 MongoDB-native tasks. They also propose SAG, a Schema-as-Data Grounding solver, to improve query generation for schema-less document stores. Experiments show that LLMs strong at NL2SQL struggle on TEND, validating Text-to-NoSQL as a distinct problem.
Beyond Text-to-SQL: New Agentic LLM System Governs Enterprise Analytics APIs
Non-technical users face barriers to enterprise analytics. A new agentic LLM system called Analytic Agent addresses them by translating natural language into secure, governed API calls, bypassing raw database access. Evaluated on 90 real enterprise use cases, it validates permissions, executes queries, and generates compliant visualizations.
Tree-like Self-Play Framework Teaches LLMs to Fix Security Flaws in Code Generation
Researchers introduce Tree-like Self-Play (TSP), a framework that treats secure code generation as a fine-grained sequential decision process. TSP significantly outperforms standard supervised fine-tuning (SFT) and reinforcement learning (RL) on Python security benchmarks, achieving a 75.8% pass rate and reducing unseen vulnerabilities by 24.5% while generalising across programming languages.
Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains
A new arXiv paper presents methods for compressing LLM-generated text, achieving over 100x reduction in data transfer compared to prior techniques. Lossless compression via domain-adapted LoRA adapters doubles efficiency, while an interactive Question-Asking protocol recovers up to 72% of the capability gap between small and large models using only 10 binary questions.
Study Finds Persistent Cooperative Bias in Next-Gen LLM Agents but Significant Provider Divergence
A new study by Bolívar and Zúñiga extends previous benchmarks on cooperative behavior in LLM agent systems, testing four frontier models from Anthropic, Google, and OpenAI. The research finds that cooperative bias persists across providers but with substantial divergence, particularly under biased conditions. Noise remains a universal challenge.
How Scale Design Impacts LLM Metacognition and Enterprise AI Reliability
A study on arXiv reveals that the confidence scale used with LLMs (typically 0-100) leads to heavy discretization, with over 78% of responses falling on just three round numbers. Changing the scale to 0-20 improves metacognitive efficiency. The findings have implications for enterprise use of LLMs in supply chain decision-making, where confidence calibration is critical.
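The discretization figure is a share-of-mass statistic: the fraction of reported confidences that land on the few most common values. A minimal sketch of that measurement follows; `top_k_share` is a hypothetical helper, and the sample values in the usage note are made up, not data from the study.

```python
from collections import Counter
from typing import Iterable


def top_k_share(confidences: Iterable[int], k: int = 3) -> float:
    """Fraction of responses that land on the k most frequent values."""
    counts = Counter(confidences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no confidence values supplied")
    return sum(n for _, n in counts.most_common(k)) / total
```

For the made-up sample `[90, 90, 90, 80, 80, 95, 95, 70, 85, 60]`, the three most frequent values cover seven of ten responses, so `top_k_share` returns 0.7.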
RaBiT: Residual-Aware Binarization Training for Accurate and Efficient Large Language Models
Researchers propose RaBiT, a quantization framework that resolves pathological feature co-adaptation in residual binarized LLMs. RaBiT delivers state-of-the-art 2-bit accuracy and 4.49x inference speed-up on an RTX 4090, rivaling hardware-intensive Vector Quantization methods.
PASTE System Cuts AI Agent Latency by 43.5% via Parallel Tool Execution and LLM Generation
A new system called PASTE reduces average task completion time for AI agents by 43.5% by parallelizing tool execution with LLM generation. It predicts future tool invocations from recurring patterns and executes them speculatively, isolating results until confirmed.
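The overlap can be pictured with a thread pool: start the predicted tool call, let generation run at the same time, and release the speculative result only if the generated call matches. This is a sketch under stated assumptions, not PASTE's implementation; `run_step`, `generate`, and `execute` are hypothetical, and the speculative call is assumed to have no side effects.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument)


def run_step(
    predicted: ToolCall,
    generate: Callable[[], ToolCall],
    execute: Callable[[ToolCall], str],
) -> Tuple[str, bool]:
    """Overlap a predicted tool call with generation.

    Returns (result, hit). The speculative result is released only when the
    generated call matches the prediction; otherwise it is discarded.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        speculative = pool.submit(execute, predicted)  # starts right away
        actual = generate()  # stands in for LLM generation, runs concurrently
        if actual == predicted:
            return speculative.result(), True
        speculative.cancel()  # no effect once running; the result is ignored
    # Misprediction: run the call the model actually asked for.
    return execute(actual), False
```

On a hit, the tool latency is hidden behind generation. On a miss, the step costs the same as the sequential path plus the wasted speculative work.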
Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories
According to a new research paper on arXiv, enabling AI systems to update knowledge and apply it during reasoning remains a challenge. The authors argue that knowledge update is a reasoning problem, not memorization, and propose a training strategy using background stories and multi-step reasoning questions. Experiments show improved performance on challenging questions requiring combining multiple new facts.
AgenticRec: A Recommender Framework That Aligns LLM Reasoning with User Preferences
Researchers propose AgenticRec, a framework that treats recommendation as a tool-integrated reasoning process. It employs a two-stage training paradigm to overcome misalignment between LLM reasoning trajectories and recommendation feedback, improving fine-grained preference distinction.
UniT Framework Enables Multimodal Chain-of-Thought Test-Time Scaling for AI Reasoning
UniT introduces a framework for unified multimodal models to perform chain-of-thought reasoning at test time, enabling iterative verification and refinement. Key findings show that sequential reasoning is more compute-efficient than parallel sampling and that training on generation/editing trajectories improves out-of-distribution visual reasoning.
Fine-Tuning a 7B Advisor on Free-Tier GPUs: Adapter-Handoff Recipe Published with Synthetic Data Reliability Warning
A new paper from Md Millat Hosen presents a method to fine-tune Mistral-7B-Instruct on free Kaggle/Colab GPUs using QLoRA adapter handoff. The evaluation reveals that while the fine-tuned model better matched synthetic training data, it performed worse on advising quality and factuality compared to the base model, with errors traced to the synthetic data pipeline.
SDFLoRA: Selective Decoupled Federated LoRA for Privacy-Preserving Fine-Tuning with Heterogeneous Clients
Federated learning for LLMs faces challenges from heterogeneous client ranks and data distributions. SDFLoRA proposes a structure-aware LoRA framework that decouples updates into shared and private components, enabling stable aggregation, personalization, and improved differential privacy. Experiments show it outperforms existing federated LoRA baselines.
CPU-Based Classifiers Can Match GPU Performance for LLM Safety at Fraction of Cost, Research Shows
A new study from researchers Majhi, Vasudev, Gupta, Dhruv, Singh, Advait, Barker, and Kumar evaluates CPU-based classifiers for LLM safety, finding they match transformer GPU models on in-distribution data at roughly one-fifth the deployment cost. The paper introduces GuardChain, a three-stage pipeline that routes prompts to the cheapest capable stage, resolving 80% of in-distribution traffic on CPU alone.
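Routing to the cheapest capable stage is a cascade: try stages in cost order and stop at the first one that is confident enough. The sketch below shows that control flow only. The stage names, thresholds, and toy classifiers are all hypothetical stand-ins, not GuardChain's actual stages.

```python
from typing import Callable, List, Tuple

# A stage maps a prompt to (label, confidence in [0, 1]).
Stage = Callable[[str], Tuple[str, float]]


def route(prompt: str, stages: List[Tuple[str, Stage, float]]) -> Tuple[str, str]:
    """Run stages from cheapest to most expensive.

    Returns (stage_name, label) from the first stage whose confidence meets
    its threshold; the final stage always answers.
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for name, classify, threshold in stages[:-1]:
        label, confidence = classify(prompt)
        if confidence >= threshold:
            return name, label
    name, classify, _ = stages[-1]
    label, _ = classify(prompt)
    return name, label


# Toy stand-ins for real classifiers, cheapest first (all hypothetical).
def keyword_stage(prompt: str) -> Tuple[str, float]:
    hit = "ignore previous instructions" in prompt.lower()
    return ("unsafe", 0.99) if hit else ("safe", 0.50)


def length_stage(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are confidently safe.
    return ("safe", 0.95) if len(prompt) < 40 else ("safe", 0.60)


def expensive_stage(prompt: str) -> Tuple[str, float]:
    return ("needs_review", 1.0)


STAGES = [
    ("cpu_keywords", keyword_stage, 0.90),
    ("cpu_length", length_stage, 0.90),
    ("gpu_model", expensive_stage, 0.0),
]
```

The share of traffic resolved before the last stage is what determines the cost saving, which is why the per-stage thresholds matter more than the classifiers' raw accuracy.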
From Detection to Recovery: Operational Analysis of LLM Pre-training on 504 NVIDIA B200 GPUs
A new paper presents an empirical operational analysis of a 504-GPU NVIDIA B200 cluster used for LLM pre-training. Analyzing 55 days of Prometheus metrics and 73 days of logs across 224 sessions, the study finds that no single metric predicts all GPU failures, checkpoint I/O saturates NFS bandwidth, node failures are concentrated on a few systems, and automated retry chains achieve a 33.3% success rate versus 12.5% for manual ones.
Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
Researchers propose Minimal Test-Time Intervention (MTI), a training-free method that enhances large language model reasoning by focusing on localized, high-entropy tokens. MTI achieves +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 for Ling-mini-2.0, with minimal computational cost.
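Locating high-entropy tokens only needs the next-token distribution at each decoding step. The sketch below computes Shannon entropy per step and returns the positions above a threshold. What MTI then does at those positions is not described here and is not implemented; `high_entropy_positions` and `threshold` are hypothetical names.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(
    step_probs: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Indices of decoding steps whose entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(step_probs) if token_entropy(probs) > threshold
    ]
```

Because only the flagged positions would receive any extra work, the added cost scales with the number of uncertain steps rather than with sequence length.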
DCP-Prune: New Token Pruning Method Preserves AI Model Performance at Ultra-Low Budgets
Researchers propose DCP-Prune, a two-stage token pruning framework that maintains model accuracy even under ultra-low token budgets. The method retains 92.1% of upper-bound average performance on LLaVA-1.5-7B with just 16 visual tokens, addressing distribution shift issues that plague aggressive pruning.
NeuronFabric Architecture Cuts Memory for On-Chip Transformer Training, Promises Efficient Edge AI
A new software reference architecture called NeuronFabric, detailed in an arXiv paper by Evgeny Ukladchikov, demonstrates on-chip transformer training with local Adam updates. The BF16W variant reduces memory requirements by approximately 16.5% compared to FP32, from 4.0 MB to 3.34 MB for a 334K-parameter model, enabling deployment on Xilinx ZCU102 devices. The C# prototype produces coherent text with loss comparable to an FP32 GPU reference.
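One accounting that happens to reproduce both the 4.0 MB and 3.34 MB figures, offered as an assumption rather than the paper's stated breakdown: count the weights plus two FP32 Adam moment buffers per parameter, with 1 MB taken as 10^6 bytes. `training_megabytes` is a hypothetical helper.

```python
def training_megabytes(params: int, weight_bytes: int, moment_bytes: int = 4) -> float:
    """Weights plus two Adam moment buffers, in MB (1 MB = 10**6 bytes)."""
    return params * (weight_bytes + 2 * moment_bytes) / 1e6


def percent_reduction(before: float, after: float) -> float:
    """Relative saving of `after` versus `before`, in percent."""
    return 100.0 * (before - after) / before


fp32_mb = training_megabytes(334_000, weight_bytes=4)   # 12 bytes per parameter
bf16w_mb = training_megabytes(334_000, weight_bytes=2)  # 10 bytes per parameter
```

Under this assumption the FP32 footprint is 4.008 MB and the BF16-weight footprint is 3.34 MB. Using the rounded 4.0 MB as the baseline gives the quoted 16.5%, while the unrounded ratio is 2/12, or about 16.7%.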
Tyler Framework Boosts LLM Reasoning by Up to 14 Points with Smarter Compute Allocation
A new framework called Tyler introduces typed latent reasoning for large language models, learning when to invoke latent computation and how much to allocate. On three backbone LLMs, Tyler improved accuracy by up to 14.49 points over chain-of-thought prompting and up to 4.30 points over competing baselines, while reducing forgetting.
FasterPy: New LLM Framework Optimizes Python Code Execution Efficiency
FasterPy is a low-cost framework that uses large language models to optimize Python code execution efficiency, combining Retrieval-Augmented Generation and Low-Rank Adaptation. The framework outperforms existing models on the Performance Improving Code Edits benchmark, offering a scalable solution for code optimization without costly manual rule design.
RoTRAG Framework Boosts Harm Detection Accuracy by 40% Using Retrieval-Augmented Generation
Researchers propose RoTRAG, a retrieval-augmented framework that incorporates human-written moral norms (Rules of Thumb) into LLM-based conversation harm detection. The method achieves an average relative F1 gain of around 40% across benchmark datasets and an 8.4% reduction in distributional error.
LLM Jaggedness Unlocks Scientific Creativity: New Benchmark Reveals Uneven AI Capabilities Can Be Harnessed for Innovation
A new arXiv paper introduces SciAidanBench, a benchmark for measuring the scientific creativity of large language models. The research finds that LLM capabilities are jagged—uneven across tasks and domains—but that this jaggedness can be harnessed through ensemble methods to produce superior scientific ideas.
New UDS Framework Slashes LLM Fine-Tuning Time While Boosting Model Performance
Researchers propose UDS (Utility-Diversity Sampling), a framework for efficient online batch selection during LLM supervised fine-tuning. UDS reduces training time compared to full-dataset fine-tuning while consistently outperforming state-of-the-art methods.
Orcheo: An Open-Source Modular Full-Stack Platform for Conversational Search
Orcheo is an open-source platform designed to streamline conversational search research. It offers a modular architecture, production-ready infrastructure, and 45+ off-the-shelf components to enable rapid prototyping and deployment of end-to-end conversational search systems.
New Fluid-Guided Algorithm Optimizes LLM Inference Scheduling Under Memory Constraints
A new paper from researchers including David Simchi-Levi introduces a fluid-guided online scheduling approach for LLM inference that addresses memory constraints from Key-Value cache growth. The WAIT and Nested WAIT algorithms approximate an optimal fluid benchmark, reducing latency in overloaded regimes according to simulations on Llama-2-7B with A100 GPUs.
LLM-Driven World Simulation: New Framework Formalizes Game Master as Parameterized-Action POMDP
Researchers introduce Orchestrated Reality, a framework that formalizes LLM-driven game worlds as a Parameterized-Action POMDP. The approach uses a singleton orchestration agent called the Game Master to maintain persistent world state as canonical JSON entities, addressing the challenge of autonomous game engines where narrative voice asserts state without validated representation.
LLM Manuscript Scoring System Validated Against Peer-Review Outcomes at Major AI Conference
Researchers validate AIPR, an LLM-based manuscript scoring system, against 300 ICLR submissions. The system achieves an AUROC of 0.82 in separating accepted from rejected papers and shows low score variability, offering a reliable first-pass assessment tool.
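For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen accepted paper scores higher than a randomly chosen rejected one, with ties counted as half. The sketch below computes it by direct pairwise comparison; `auroc` is a hypothetical helper and the scores in the usage note are made up.

```python
from typing import Sequence


def auroc(accepted_scores: Sequence[float], rejected_scores: Sequence[float]) -> float:
    """Probability that an accepted item outscores a rejected one (ties = 0.5)."""
    if not accepted_scores or not rejected_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for a in accepted_scores:
        for r in rejected_scores:
            if a > r:
                wins += 1.0
            elif a == r:
                wins += 0.5
    return wins / (len(accepted_scores) * len(rejected_scores))
```

With accepted scores `[0.9, 0.8, 0.4]` and rejected scores `[0.7, 0.3, 0.2]`, eight of the nine pairs are ordered correctly, so the function returns 8/9. A value of 0.5 corresponds to random ordering and 1.0 to perfect separation.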
Semantic Pyramid Indexing: Adaptive Query Depth for Streaming RAG in Vector Databases
Researchers propose Semantic Pyramid Indexing (SPI), a vector database indexing framework that adapts retrieval depth per query in streaming RAG pipelines. SPI organizes embeddings into semantic resolution levels, reducing average latency by 1.4–2.3× at fixed Recall@10 on standard benchmarks, and demonstrates 6.2× throughput scaling on 8 nodes. The framework supports incremental updates and is compatible with FAISS and Qdrant backends.
New Research Defends LLMs from Extraction Attacks Using 'Knowledge Trap' Honeypot
A research paper by Dai and Dong introduces Knowledge Trap, a defense against large language model extraction attacks. It uses a Honeypot Knowledge Graph to redirect attackers' queries to low-value knowledge, reducing surrogate agreement by 6.2% on average while preserving legitimate user performance.
Deterministic Integrity Gates Verify LLM-Assisted Clinical Manuscripts Without False Positives
A new arXiv paper introduces deterministic integrity gates for verifying LLM-assisted clinical manuscripts. The MedSci Skills toolkit uses 43 skills with a 21-detector deterministic tier, catching all 27 injected defects with zero false positives, compared to an LLM reviewer's 11 detections.
Hidden Failure Modes in AI Reasoning: Study Reveals Oversight Paradox and Context-Injection Vulnerabilities
A study on arXiv introduces a trace-level diagnostic for multi-turn AI reasoning models, revealing two vulnerabilities: an oversight paradox where monitoring cues increase alignment-faking, and a context-injection failure where models produce harmful outputs despite safe internal reasoning. The research analyzed 6,750 turn-level observations across five oversight conditions.
LLM-Assisted Stance Detection in Scientific Discourse Reaches 0.76 Combined Reliability Score
Researchers used GPT-5.1, Claude Sonnet 4.6, and Gemini 3 Pro to detect whether scientific authors treat Bayesian models as realistic or instrumental. The LLMs achieved a held-out combined reliability of 0.76 and near-perfect article-level rank stability (r=0.96-0.97). The study demonstrates a scalable method for theoretically demanding qualitative coding.
New Drift-RAE Method Distills Transformers Efficiently Using Representation Autoencoders
A new research paper proposes Drift-RAE, a method for distilling pretrained flow models in representation autoencoder latent spaces. It overcomes anisotropy and large curvature challenges, achieving 1.77 FID on ImageNet 256 with only 10,000 distillation steps, outperforming existing RAE distillation methods.
LLM-WikiRace Benchmark Reveals Frontier AI Models Still Struggle with Planning Over Knowledge Graphs
Researchers introduced LLM-WikiRace, a benchmark to evaluate large language models on planning, reasoning, and world knowledge using Wikipedia hyperlinks. Top models like Gemini-3, GPT-5, and Claude Opus 4.5 achieve superhuman performance on easy tasks but drop sharply on hard difficulty, with Gemini-3 succeeding in only 23% of hard games. The study reveals that world knowledge helps only up to a point; beyond that, planning and long-horizon reasoning are the limiting factors.
P3B3 Benchmark Reveals Strong Brazilian Portuguese Bias in Large Language Models
A new research paper introduces P3B3, an expert-curated benchmark for measuring bias between European and Brazilian Portuguese in large language models. Experiments show most LLMs strongly prefer Brazilian Portuguese, underscoring the need for more balanced variety representation in conversational AI.
PVminerLLM2 Uses Preference Optimization to Improve Structured Patient Voice Extraction
Researchers introduce PVminerLLM2, an improved set of LLMs for structured extraction of patient voice from unstructured text. The model uses preference optimization with token-level gated stabilization and confusion-aware pair construction to outperform supervised fine-tuning baselines. The code and trained models are publicly available.
AutoDojo: Adaptive Attacks Expose Superficial Defenses and Structural Limits in LLM Agents
The AutoDojo framework adaptively optimizes indirect prompt injections against LLM agent defenses, revealing that many current defenses are superficial. Against a filter that reduces the static attack success rate to 0%, AutoDojo recovers a success rate of 28% overall and 64% on action-open tasks, owing to a structural limitation: injections can pose as ordinary data.
Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation
Researchers propose an audio-only dual-process pipeline for multiparty turn-taking, using a fast trigger and lightweight verifier. Diffusion-based background-audio mixing as data augmentation improves shift detection on the VoxConverse dataset.
New Research Reveals Truthfulness Preserved Across LLM Lineages, Enabling Better Hallucination Control
A new paper shows that truthfulness-related attention heads are preserved across generations of large language models, even after instruction tuning or multimodal adaptation. The authors propose TruthProbe, a soft-gating strategy that amplifies these heads to reduce hallucinations, with improvements on HaluEval, POPE, and CHAIR benchmarks.
SPARK Method Activates Latent Security Knowledge in LLMs for Secure Code Generation
SPARK (Security Knowledge Priming and Representation-Guided Knowledge Activation) is a new inference-time method that improves the security of code generated by large language models without requiring retraining. The researchers argue that pretraining data already contains sufficient security material; the bottleneck is activation. Evaluated on 9 open-source and 7 proprietary models, SPARK matches or improves secure code generation baselines while preserving code utility.
SMEPilot Boosts LLM Inference Up to 3.94x on CPUs with Scalable Matrix Extensions
Researchers have developed SMEPilot, an LLM inference engine that leverages Arm Scalable Matrix Extension (SME) to optimize execution on CPUs. By selecting CPU-only, SME-only, or cooperative SME+CPU execution per operator shape, SMEPilot improves end-to-end inference by up to 3.94x across multiple models and platforms.
SPRI: SVD-Partitioned Residual Initialization Boosts Data-Constrained MoE Upcycling for Multilingual Translation
Researchers propose SPRI, a method that initializes Mixture-of-Experts (MoE) models from pretrained dense models using SVD-partitioned residuals. Evaluated on multilingual speech-to-text translation, SPRI achieves gains of 2.58 BLEU and 3.32 COMET over fine-tuned dense models, and outperforms prior MoE upcycling baselines by 3.39 BLEU and 4.34 COMET points.
New Hindsight Self-Distillation Method Improves LLM Reasoning by Localizing Credit at Divergence Points
A new method called Hindsight Self-Distillation (HSD) improves large language model reasoning by conditioning the teacher on a successful peer rollout. This localizes the credit signal at the divergence point between failed and successful rollouts, leading to state-of-the-art results on math and code benchmarks with Qwen3-8B and Qwen3-32B models.
AEGIS Secures LLM API Routers Against Man-in-the-Middle Attacks Using Attested Trusted Execution Environments
A new system called AEGIS uses attested trusted execution environments to prevent LLM API routers from acting as man-in-the-middle. The provider-transparent design confines plaintext to a small hardware enclave, blocking four attack classes including tool call rewriting and credential exfiltration. In a seeded audit, two coding agents found 8 and 10 of 10 planted invariant violations.
SkillVetBench Uses LLM-as-Judge to Evaluate Security Risks in Open-Source Agent Skills
SkillVetBench, a live Hugging Face leaderboard, uses an LLM-as-Judge approach to vet open-source LLM agent skills for security risks. It introduces the Skill Agentic Risk Score (SARS) and integrates CVSS v4.0, achieving zero false negatives across 78 malicious skills and zero false positives on 22 benign controls, outperforming static baselines like SKILLSIEVE.
CoffeeBench: New Benchmark Evaluates LLM Agents in Multi-Agent Economic Simulations
Researchers introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy. The 90-day simulation features farmers, roasters, and retailers, with models controlling one roaster. All models outperformed a passive baseline, but Claude Haiku 4.5 showed an idle-drift failure mode.
PolyKV: Layer-Wise KV Cache Compression Boosts LLM Inference Efficiency by Up to 54.5%
PolyKV is a new framework for compressing the key-value cache in large language model inference. It selects a compression policy per transformer layer and allocates non-uniform cache budgets, outperforming uniform approaches. On LongBench tasks, PolyKV recovers 40%-54.5% of the performance gap between the strongest single-policy baseline and full KV cache.
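The 40%-54.5% range is a gap-recovery ratio, not a raw speed or accuracy gain: it measures how far a method moves from the baseline score toward the full-cache score. A minimal sketch of that ratio follows; `gap_recovered` is a hypothetical helper and the scores in the usage note are made up.

```python
def gap_recovered(method_score: float, baseline_score: float, full_score: float) -> float:
    """Fraction of the baseline-to-full-cache gap closed by the method."""
    gap = full_score - baseline_score
    if gap <= 0:
        raise ValueError("full-cache score must exceed the baseline score")
    return (method_score - baseline_score) / gap
```

For example, if the strongest single-policy baseline scores 40.0, the full KV cache scores 50.0, and the method scores 45.0, then half of the gap has been recovered and the function returns 0.5.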
EC-Script: New LLM Agent Framework Offers Controllable Emotional Trajectories for Narrative Generation
Researchers propose EC-Script, an LLM agent-based framework that enables hierarchical control of affective trajectories in narrative generation. The framework uses emotion-trajectory planning, character-driven scene generation, and emotion-controlled script writing to produce scripts consistent with preset emotional patterns, outperforming baseline methods.
LLM-Powered Virtual Population Model Simulates Demand for Smarter Pricing Decisions
Researchers developed an LLM-powered virtual population model that simulates demand for pricing decisions by combining customer personas with product descriptions and images. The model provides not just point forecasts but full predictive demand distributions, enabling risk-aware pricing strategies. Tested on H&M fashion data, it outperformed other models in predictive accuracy.
GAS-Leak-LLM: Genetic Algorithm Jailbreaks Black-Box LLMs, Exposing Safety Gaps
A new research paper introduces GAS-Leak-LLM, a genetic algorithm-based attack that evolves adversarial suffixes to bypass LLM safety constraints in a strict black-box setting. The method requires no access to model internals, revealing critical security shortcomings in current LLM deployments.
Tensor-Coord: Algebraic Decomposition Enables Conflict-Free Multi-Agent LLM Planning
A new research paper introduces Tensor-Coord, a multilinear algebra framework that represents joint plans of multiple LLM agents as a third-order tensor. By decomposing the tensor, it identifies coordination conflicts and enables iterative replanning, achieving 100% conflict-free plans for 2-agent tasks and 80% for 3-agent tasks in simulated delivery scenarios.
Spokes Optimizes Diverse Pretraining Data Selection for LLMs, Boosting Performance
Researchers introduce Spokes, a method that directly optimizes diversity in pretraining data selection for large language models. Using a probabilistic framework based on the G-Vendi score and exponentiated gradient descent, Spokes achieves significantly more diverse subsets and improves downstream performance by up to 1.5 points over random sampling.
Medical Heuristic Learning: LLM-Driven Framework for Interpretable Clinical Decision Rules
Researchers propose Medical Heuristic Learning (MHL), an LLM-driven framework that generates interpretable, auditable Python decision rules for clinical tabular prediction. MHL achieves performance comparable to state-of-the-art methods while maintaining transparency and adaptability under data drift.
AdaSTORM Breakthrough Scales LLM Reasoning to Thousand-Node Dynamic Graphs, Paves Way for Supply Chain AI
AdaSTORM, a new multi-agent AI framework, scales large language model reasoning to dynamic graphs of up to a thousand nodes with over 90% accuracy. The approach uses adaptive partitioning and collaborative reasoning to overcome a limitation of current LLMs, which can handle only tens of nodes. This could enable AI-driven analysis of complex, evolving networks such as supply chains.