Enterprises deploying AI agents for automation tasks often encounter reliability issues: agents go off-track, consume excessive tokens, or fail to complete sequences. A new paper on arXiv argues these problems are not merely implementation bugs but architectural consequences of the dominant design pattern that gives the LLM the role of orchestrator.
According to the paper titled "LLM-as-Code Agentic Programming for Agent Harness," every major LLM agent framework allows the model to decide what to do next, when to call tools, and when to stop. The researchers identify three persistent issues: token explosion, control-flow hallucination, and unreliable completion. They write: "A better prompt or a stronger model cannot guarantee the reliability of the LLM agent."
The fundamental problem, the authors argue, is assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. To solve this, they propose Agentic Programming, a paradigm in which the program governs all control flow, and the LLM is itself part of it—an adaptive component they call LLM-as-Code. The LLM is invoked only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path.
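To make the division of labor concrete, here is a minimal Python sketch of the idea, not the paper's implementation. The names `triage` and `stub_llm` are hypothetical, and the model is replaced by a toy function so the example runs without any API. The loop, the branch, and the stopping condition live in ordinary code; the model only supplies a label for each item.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in type for a model call; a real system would wrap an LLM API here.
LLM = Callable[[str], str]


def stub_llm(prompt: str) -> str:
    """Toy 'model' (assumption): says 'urgent' when the prompt mentions a delay."""
    return "urgent" if "delay" in prompt.lower() else "routine"


def triage(documents: List[str], llm: LLM) -> Dict[str, List[str]]:
    """The program owns sequencing, branching, and termination."""
    result: Dict[str, List[str]] = {"urgent": [], "routine": []}
    for doc in documents:  # deterministic loop: exactly one model call per document
        label = llm(f"Classify as urgent or routine: {doc}").strip().lower()
        if label not in result:  # model output is validated, never used to steer control flow
            label = "routine"
        result[label].append(doc)
    return result  # the loop ends when the list ends, whatever the model says


report = triage(["Shipment delay at port", "Monthly stock report"], stub_llm)
```

Even if the model returns something unexpected, such as an instruction to stop or to call another tool, the program still visits every document once and then returns, which is the property the authors attribute to keeping control flow out of the model's hands.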
With control in the program, the LLM's context is built from the execution history's call tree, which forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by the number of steps executed so far. According to the paper, this keeps context from growing without bound as a task gets longer, which reduces token consumption and improves determinism.
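The difference between depth-bounded and step-accumulated context can be illustrated with a small Python sketch. It assumes, purely for illustration, that a call's context consists of the entries on the path from the root of the call tree to that call; the article does not spell out the paper's exact construction rule, and the `Call` class and task names below are hypothetical.

```python
from typing import List, Optional


class Call:
    """One node in a hypothetical call tree recorded by the program."""

    def __init__(self, name: str, parent: Optional["Call"] = None) -> None:
        self.name = name
        self.parent = parent

    def spawn(self, name: str) -> "Call":
        """Record a nested call made from this one."""
        return Call(name, parent=self)

    def context(self) -> List[str]:
        """Assumed rule: context is the root-to-node path, so its length equals the call depth."""
        path: List[str] = []
        node: Optional["Call"] = self
        while node is not None:
            path.append(node.name)
            node = node.parent
        return list(reversed(path))


root = Call("book_trip")
flat_history = [root.name]  # what a step-accumulating agent would carry forward

step = root
for i in range(50):  # fifty sibling steps under the same root
    step = root.spawn(f"step_{i}")
    flat_history.append(step.name)

leaf = step.spawn("click_confirm")
flat_history.append(leaf.name)

tree_context = leaf.context()
```

Under this assumption the final call sees three entries (`book_trip`, `step_49`, `click_confirm`), while the flat history has grown to 52. Adding more sibling steps lengthens only the flat history.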
| Characteristic | Traditional LLM Agent Frameworks | Agentic Programming (LLM-as-Code) |
|---|---|---|
| Control Flow | LLM decides next action, tool calls, and stop | Program governs all control flow via deterministic code |
| LLM Role | Orchestrator with full autonomy | Adaptive component invoked only for reasoning/generation |
| Context Construction | Accumulates over steps, unbounded growth | Built from call tree (DAG), bounded by call depth |
| Reliability | Prone to token explosion, hallucination, incomplete tasks | Improved stability in long sequences |
The paper presents a case study of computer-use agents—such as those that automate GUI interactions. The authors found that the Agentic Programming design is "practical, not just a theoretical stance," and that it "substantially improve[s] the stability of long visual operation sequences."
For enterprise technology leaders evaluating AI agents for supply chain automation or logistics workflows, the findings suggest that architectural choices matter as much as model selection. By separating control flow from probabilistic reasoning, organizations can build agents that complete multi-step tasks with greater predictability. The LLM-as-Code approach keeps the flexibility of large language models where needed while ensuring that the overall process remains under deterministic governance.
The research was conducted by a team including Qi, Junjia, Fu, Zichuan, Gao, Jingtong, Zhang, Wenlin, Yan, Hanyu, Wu, Zhao, and Xiangyu. The full paper is available on arXiv.
As enterprises seek to deploy AI agents in production environments, from customs documentation to warehouse robotics, the stability improvements reported for Agentic Programming could reduce operational risk. The paper provides a concrete architectural pattern aimed at what its authors identify as the root causes of agent instability, offering a pathway to more trustworthy autonomous systems.