Enterprise AI systems increasingly rely on language models to perform complex reasoning tasks. But when it comes to algorithmic reasoning, a fundamental question remains: Is it better for an AI to reason in natural language or in code? A new arXiv paper by Terry Tong, Yu Feng, Surbhi Goel, and Dan Roth attempts to isolate the factors that contribute to performance gains in tool-augmented language models.
The study addresses a key difficulty in comparing natural-language reasoning with code-execution pipelines: the comparison changes both the intermediate representation (language vs. code) and the execution mechanism (simulated in context vs. deterministic external execution). To separate these factors, the authors designed an intermediate intervention where the model expresses its reasoning as executable code, but a language model simulates that code in context to produce an answer.
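The three conditions being compared can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, and all function names are hypothetical, and the generated program is assumed to print its answer to standard output.

```python
import contextlib
import io
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def natural_language_condition(llm: LLM, task: str) -> str:
    """Language representation, in-context reasoning: the model answers in prose."""
    return llm(f"Solve step by step in plain English, then state the answer.\n\nTask: {task}")


def generate_program(llm: LLM, task: str) -> str:
    """Ask the model to express its reasoning as a program (prompt wording is assumed)."""
    return llm(f"Write a Python program that prints the answer.\n\nTask: {task}")


def simulated_code_condition(llm: LLM, task: str) -> str:
    """Code representation, simulated execution: a model traces the program in context."""
    program = generate_program(llm, task)
    return llm(f"Trace this program line by line and report what it prints.\n\n{program}")


def executed_code_condition(llm: LLM, task: str) -> str:
    """Code representation, deterministic execution: an interpreter runs the program."""
    program = generate_program(llm, task)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # Illustration only: model-generated code should run in a sandbox in practice.
        exec(program, {})
    return buffer.getvalue().strip()
```

Only the last function leaves the model's context window. The second and third share the same generated program, so any accuracy gap between them can be attributed to how the program is executed rather than to how the reasoning is represented.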
Benchmark Results
The researchers evaluated their approach on a 40-task verifiable algorithmic benchmark. The results are striking:
| Condition | Performance vs. Natural-Language Baseline |
|---|---|
| Deterministic code execution | +31.6 percentage points |
| Intermediate intervention (code representation, simulated execution) | +0.15 percentage points |
According to the paper, deterministic code execution outperforms natural-language reasoning by 31.6 percentage points. However, the intermediate intervention — which keeps the code representation but uses simulated rather than deterministic execution — was not meaningfully different from natural-language reasoning, with a difference of only 0.15 percentage points.
Implications for AI Reasoning Design
The results suggest that, in the evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage. Instead, the performance gains appear to depend on reliable external execution. The authors formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in their disentangled trace-generation/execution regime.
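For rough intuition about how execution can come to dominate end-to-end error, consider a toy decomposition. This is an assumption made here for illustration and is not the authors' formal model: the final answer is taken to be correct only if the generated trace is correct and every execution step succeeds, with steps failing independently. The function name, parameters, and example numbers are all hypothetical.

```python
def end_to_end_accuracy(trace_accuracy: float, step_error: float, num_steps: int) -> float:
    """Toy model: P(correct) = P(trace correct) * (1 - per-step error) ** steps.

    Assumes independent step failures; deterministic execution corresponds
    to step_error = 0.0, simulated execution to step_error > 0.0.
    """
    if not 0.0 <= trace_accuracy <= 1.0:
        raise ValueError("trace_accuracy must be in [0, 1]")
    if not 0.0 <= step_error <= 1.0:
        raise ValueError("step_error must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    execution_accuracy = (1.0 - step_error) ** num_steps
    return trace_accuracy * execution_accuracy


if __name__ == "__main__":
    # Made-up inputs for illustration, not measurements from the paper.
    deterministic = end_to_end_accuracy(trace_accuracy=0.9, step_error=0.0, num_steps=50)
    simulated = end_to_end_accuracy(trace_accuracy=0.9, step_error=0.02, num_steps=50)
    print(f"deterministic execution: {deterministic:.3f}")
    print(f"simulated execution:     {simulated:.3f}")
```

Under these assumptions, even a small per-step simulation error compounds over a long trace, so improving the trace (the representation) helps little while execution remains unreliable.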
To further validate their theory, the team performed a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations. This reconstruction recovered performance comparable to the original natural-language reasoning pipeline, reinforcing their conclusion.
Takeaway for Enterprise AI
For enterprise technology leaders evaluating language model architectures for algorithmic tasks, the study provides evidence that embedding code execution capabilities — not just code-like reasoning — may be critical for achieving top performance. The findings underscore the importance of integrating deterministic execution engines rather than relying solely on in-context simulation.
The full paper, titled "Is Code Better Than Language for Algorithmic Reasoning", is available on arXiv with code and data.