Enterprise AI agents that execute complex tasks rely on a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, meaning tool latency remains on the critical path, slowing down overall performance. For technology leaders deploying AI agents in high-throughput environments, each millisecond of latency compounds across thousands of sessions. A new system from researchers aims to change that by rethinking how tools and language models interact.
The Latency Problem in AI Agents
According to the paper presented on arXiv (paper ID 2603.18897), modern LLM-powered agents operate through a sequential loop: the model generates a response, then a tool executes, then the model generates again. This serialization leaves tool execution exposed on the task critical path. The result is that any delay in tool execution — whether from API calls, database queries, or external service invocations — directly increases end-to-end task completion time.
The researchers note that this is a structural inefficiency inherent in current agent serving architectures. The problem is especially acute for workloads that involve frequent tool calls, such as deep research, coding, and scientific-agent tasks.
How PASTE Works
PASTE — short for tool-aware agent-serving system — addresses this inefficiency by parallelizing tool execution and LLM generation. The system predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. Importantly, PASTE isolates speculative results until confirmed by the LLM, avoiding the risk of acting on incorrect predictions.
To prevent shifting bottlenecks to the GPU, PASTE jointly schedules tool execution and returning LLM sessions. This coordination ensures that parallel execution does not degrade model throughput.
Results and Metrics
Across workloads including deep research, coding, and scientific-agent tasks, PASTE delivers significant latency reductions. The key findings are:
PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
| Metric | Improvement |
|---|---|
| Average task completion time reduction | 43.5% |
| Observed tool latency reduction | 1.8x |
| Workload types | Deep research, coding, scientific-agent |
Implications for Enterprise AI Deployments
For CTOs and technology procurement leaders evaluating agent-based systems, the PASTE approach demonstrates that significant latency gains are achievable without fundamentally changing the underlying LLM or tools — only the orchestration layer. The 43.5% reduction in task completion time directly translates to faster user responses and higher throughput per GPU.
Enterprises deploying AI agents for customer support, data analysis, or process automation stand to benefit from similar parallelization strategies. While PASTE is a research system, its principles — speculative execution, pattern prediction, and joint scheduling — are applicable to production agent serving infrastructure.
The paper is available on arXiv and is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.