
PASTE System Cuts AI Agent Latency by 43.5% via Parallel Tool Execution and LLM Generation

A new system called PASTE reduces average task completion time for AI agents by 43.5% by parallelizing tool execution with LLM generation. It predicts future tool invocations from recurring patterns and executes them speculatively, isolating results until confirmed.

iGEN Editorial
June 16, 2026

Enterprise AI agents that execute complex tasks rely on a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, so tool latency stays on the critical path and slows every task. For technology leaders deploying AI agents in high-throughput environments, each millisecond of latency compounds across thousands of sessions. A new research system called PASTE aims to change that by overlapping tool execution with model generation.

The Latency Problem in AI Agents

According to the paper, posted on arXiv (paper ID 2603.18897), modern LLM-powered agents operate through a sequential loop: the model generates a response, then a tool executes, then the model generates again. This serialization leaves tool execution exposed on the task's critical path. As a result, any delay in tool execution, whether from API calls, database queries, or external service invocations, directly increases end-to-end task completion time.

The researchers note that this is a structural inefficiency inherent in current agent serving architectures. The problem is especially acute for workloads that involve frequent tool calls, such as deep research, coding, and scientific-agent tasks.
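The cost of serialization can be illustrated with a small timing model. The sketch below is not taken from the paper: the `Step` type, the function names, and the step durations are all hypothetical, and the overlap model is idealized (a correctly predicted tool call is assumed to start at the same moment as the generation that requests it).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    gen_s: float   # seconds of LLM generation in this step (made-up value)
    tool_s: float  # seconds of tool execution that follows (made-up value)


def serialized_time(steps: List[Step]) -> float:
    """Generation and tool execution run back to back, so both add up."""
    return sum(s.gen_s + s.tool_s for s in steps)


def overlapped_time(steps: List[Step], predicted: List[bool]) -> float:
    """Idealized overlap: a correctly predicted tool call runs concurrently
    with generation, so the step costs only the longer of the two.
    A mispredicted step falls back to the serialized cost."""
    total = 0.0
    for step, hit in zip(steps, predicted):
        if hit:
            total += max(step.gen_s, step.tool_s)
        else:
            total += step.gen_s + step.tool_s
    return total


# Illustrative durations only; they are not measurements from the paper.
steps = [Step(2.0, 1.5), Step(1.0, 3.0), Step(2.5, 0.5)]
print(serialized_time(steps))                       # 10.5
print(overlapped_time(steps, [True, True, True]))   # 7.5
print(overlapped_time(steps, [True, False, True]))  # 8.5
```

In this toy model the saving depends on how often predictions are correct and on how much of each tool call can hide behind generation, which is why workloads with frequent tool calls have the most to gain.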

How PASTE Works

PASTE, which the researchers describe as a tool-aware agent-serving system, addresses this inefficiency by running tool execution in parallel with LLM generation. The system predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. Importantly, PASTE isolates speculative results until the LLM confirms them, so the agent never acts on an incorrect prediction.
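A minimal sketch of the speculate-then-confirm idea, using Python's `asyncio`, might look like the following. It is an illustration under stated assumptions, not PASTE's implementation: `run_tool`, `generate`, and `agent_step` are hypothetical names, the predicted call is supplied by hand rather than learned from patterns, and the tools are assumed to be free of side effects so that a discarded speculative run is harmless.

```python
import asyncio
from typing import Tuple

ToolCall = Tuple[str, str]  # (tool name, argument); hypothetical shape


async def run_tool(name: str, arg: str) -> str:
    """Hypothetical stand-in for a real tool (API call, query, etc.)."""
    await asyncio.sleep(0.05)
    return f"{name}({arg!r}) result"


async def generate(actual: ToolCall) -> ToolCall:
    """Hypothetical stand-in for LLM generation. It returns the tool
    call the model really asks for once generation finishes."""
    await asyncio.sleep(0.05)
    return actual


async def agent_step(predicted: ToolCall, actual: ToolCall) -> Tuple[str, bool]:
    # Launch the predicted tool call while the model is still generating.
    speculative = asyncio.create_task(run_tool(*predicted))
    requested = await generate(actual)

    if requested == predicted:
        # Confirmed: only now is the speculative result released.
        return await speculative, True

    # Misprediction: the speculative result is never exposed to the agent.
    speculative.cancel()
    await asyncio.gather(speculative, return_exceptions=True)
    return await run_tool(*requested), False


if __name__ == "__main__":
    print(asyncio.run(agent_step(("search", "q"), ("search", "q"))))
    print(asyncio.run(agent_step(("search", "q"), ("read", "f"))))
```

The speculative result stays inside the task object until the model's actual request matches the prediction. On a mismatch the task is cancelled and the requested tool runs as it would have in a serialized loop.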

To prevent shifting bottlenecks to the GPU, PASTE jointly schedules tool execution and returning LLM sessions. This coordination ensures that parallel execution does not degrade model throughput.

Results and Metrics

Across workloads including deep research, coding, and scientific-agent tasks, PASTE delivers significant latency reductions. The key findings are:

PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.

Metric | Result
Average task completion time reduction | 43.5%
Observed tool latency reduction | 1.8x
Workload types | Deep research, coding, scientific-agent
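The two headline figures use different units, a percentage reduction and a multiplicative factor, and a short calculation puts them on the same scale. The helper names below are hypothetical, and reading "1.8x lower" as "divided by 1.8" is an assumption about how the figure was reported.

```python
def speedup_from_reduction(reduction: float) -> float:
    """A fractional time reduction r means the new time is (1 - r) of the
    old time, so the speedup factor is 1 / (1 - r)."""
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(factor: float) -> float:
    """A factor-f improvement means the new time is 1/f of the old time,
    so the fractional reduction is 1 - 1/f."""
    return 1.0 - 1.0 / factor


print(round(speedup_from_reduction(0.435), 2))  # 1.77
print(round(reduction_from_speedup(1.8), 3))    # 0.444
```

Under that reading, a 43.5% cut in task completion time corresponds to tasks finishing about 1.77 times faster, and a 1.8x drop in observed tool latency corresponds to roughly a 44% reduction.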

Implications for Enterprise AI Deployments

For CTOs and technology procurement leaders evaluating agent-based systems, the PASTE approach suggests that significant latency gains are achievable by changing only the orchestration layer, without modifying the underlying LLM or tools. On workloads similar to those tested, the 43.5% reduction in average task completion time should translate into faster responses for users.

Enterprises deploying AI agents for customer support, data analysis, or process automation stand to benefit from similar parallelization strategies. While PASTE is a research system, its principles — speculative execution, pattern prediction, and joint scheduling — are applicable to production agent serving infrastructure.

The paper is available on arXiv and is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.

