PASTE System Cuts AI Agent Latency by 43.5% via Parallel Tool Execution and LLM Generation

A new system called PASTE reduces average task completion time for AI agents by 43.5% by parallelizing tool execution with LLM generation. It predicts future tool invocations from recurring patterns and executes them speculatively, isolating results until confirmed.

iGEN Editorial

June 16, 2026

Enterprise AI agents that execute complex tasks rely on a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, so tool latency stays on the critical path and slows every task. For technology leaders deploying AI agents in high-throughput environments, each millisecond of latency compounds across thousands of sessions. PASTE, a research system described in a recent arXiv paper, aims to change that by rethinking how tools and language models interact.

The Latency Problem in AI Agents

According to the paper posted to arXiv (paper ID 2603.18897), modern LLM-powered agents operate through a sequential loop: the model generates a response, then a tool executes, then the model generates again. This serialization leaves tool execution exposed on the task's critical path. As a result, any delay in tool execution, whether from API calls, database queries, or external service invocations, directly increases end-to-end task completion time.

The researchers note that this is a structural inefficiency inherent in current agent serving architectures. The problem is especially acute for workloads that involve frequent tool calls, such as deep research, coding, and scientific-agent tasks.
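A rough way to see why serialization hurts is to model each agent step as a generation time plus a tool time. The Python sketch below is illustrative only: the timings are hypothetical, the function names are invented for this example, and the assumption that a correctly predicted tool call costs max(generation, tool) is a simplification, not a formula taken from the paper.

```python
def serialized_time(steps):
    """Total latency when each tool call waits for generation to finish."""
    return sum(gen + tool for gen, tool in steps)


def overlapped_time(steps, predicted):
    """Total latency when correctly predicted tool calls run during generation.

    Assumption: a predicted step costs max(gen, tool) because the two run
    side by side; a mispredicted step falls back to gen + tool.
    """
    total = 0.0
    for (gen, tool), hit in zip(steps, predicted):
        total += max(gen, tool) if hit else gen + tool
    return total


# Hypothetical per-step timings in seconds: (generation, tool execution).
steps = [(2.0, 1.5), (1.0, 3.0), (2.5, 0.5)]
predicted = [True, False, True]  # which steps were predicted correctly

print(serialized_time(steps))             # 10.5
print(overlapped_time(steps, predicted))  # 8.5
```

In this toy model the savings come entirely from the steps where the prediction holds, which is why workloads with frequent, repetitive tool calls have the most to gain.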

How PASTE Works

PASTE, described as a tool-aware agent-serving system, addresses this inefficiency by running tool execution in parallel with LLM generation. The system predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE keeps speculative results isolated until the LLM's actual tool call confirms the prediction, so the agent never acts on an incorrect guess.
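The pattern described above can be sketched with Python's asyncio. This is a minimal illustration, not PASTE's implementation: `run_tool`, `llm_generate`, and `agent_step` are hypothetical stand-ins, the prediction is passed in by hand instead of being learned from agent patterns, and the tool is assumed to be side-effect-free so that a discarded speculative run is harmless.

```python
import asyncio


async def run_tool(name, arg):
    """Hypothetical stand-in for a slow, side-effect-free tool call."""
    await asyncio.sleep(0.05)
    return f"{name}({arg})"


async def llm_generate():
    """Hypothetical stand-in for LLM decoding; returns the chosen tool call."""
    await asyncio.sleep(0.05)
    return ("search", "agent latency")


async def agent_step(predicted_call):
    # Start the predicted tool call now so it overlaps with generation.
    speculative = asyncio.create_task(run_tool(*predicted_call))

    # The result stays inside the task (isolated) while the LLM generates.
    actual_call = await llm_generate()

    if actual_call == predicted_call:
        # Prediction confirmed: release the buffered result to the agent.
        return await speculative, True

    # Misprediction: discard the speculative work and run the real call.
    speculative.cancel()
    return await run_tool(*actual_call), False


async def main():
    hit = await agent_step(("search", "agent latency"))
    miss = await agent_step(("search", "something else"))
    return hit, miss


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both steps return the same tool output, but only the confirmed prediction hides the tool's latency behind generation; the mispredicted step pays the full serialized cost.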

To avoid shifting the bottleneck to the GPU, PASTE jointly schedules tool execution and returning LLM sessions. This coordination ensures that parallel execution does not degrade model throughput.

Results and Metrics

Across workloads including deep research, coding, and scientific-agent tasks, PASTE delivers significant latency reductions. The key findings are:

PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.

Metric	Reported result
Average task completion time reduction	43.5%
Observed tool latency reduction	1.8x
Workload types	Deep research, coding, scientific-agent
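The two headline figures use different units (a percentage and a multiplier), so a short conversion helps put them on the same footing. The snippet below is plain arithmetic, not data from the paper: the 100-second baseline is hypothetical, and it assumes that "1.8x" means observed tool latency is divided by 1.8.

```python
def time_after_reduction(baseline, reduction):
    """Remaining time after cutting `reduction` (a fraction) from baseline."""
    return baseline * (1.0 - reduction)


def speedup_from_reduction(reduction):
    """Convert a fractional time reduction into a speedup multiplier."""
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup):
    """Convert a speedup multiplier into a fractional time reduction."""
    return 1.0 - 1.0 / speedup


# Hypothetical 100-second task, reduced by the reported 43.5%.
print(round(time_after_reduction(100.0, 0.435), 1))  # 56.5
# A 43.5% reduction corresponds to roughly a 1.77x speedup.
print(round(speedup_from_reduction(0.435), 2))       # 1.77
# A 1.8x latency improvement corresponds to roughly a 44.4% reduction.
print(round(reduction_from_speedup(1.8) * 100, 1))   # 44.4
```

Read this way, the end-to-end gain (about 1.77x) and the tool-latency gain (1.8x) are of similar size.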

Implications for Enterprise AI Deployments

For CTOs and technology procurement leaders evaluating agent-based systems, PASTE shows that significant latency gains are achievable by changing only the orchestration layer, without modifying the underlying LLM or tools. The reported 43.5% reduction in average task completion time translates to faster user responses and could raise throughput per GPU.

Enterprises deploying AI agents for customer support, data analysis, or process automation may benefit from similar parallelization strategies. Although PASTE is a research system, its principles (speculative execution, pattern prediction, and joint scheduling) could be applied to production agent-serving infrastructure.

The paper is available on arXiv and is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.
