Large language models (LLMs) are increasingly deployed for complex tasks that require multi-step planning, yet their ability to navigate real-world knowledge graphs remains poorly understood. A new benchmark, LLM-WikiRace, introduced by researchers including Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, and Ilija Bogunovic, aims to quantify this capability. The benchmark requires models to navigate from a source Wikipedia page to a target page by selecting hyperlinks step by step, testing look-ahead planning and reasoning about real-world concept connections.
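The navigation task described above can be sketched as a simple game loop. This is a minimal illustration, not the benchmark's actual harness: the toy link graph, the `play_game` function, the `max_steps` budget, and the `first_link_policy` stand-in for a model are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy link graph; real games run over actual Wikipedia pages.
TOY_LINKS: Dict[str, List[str]] = {
    "Coffee": ["Brazil", "Caffeine"],
    "Brazil": ["South America", "Coffee"],
    "Caffeine": ["Chemistry", "Coffee"],
    "South America": ["Andes", "Brazil"],
    "Chemistry": ["Caffeine"],
    "Andes": [],
}

# A policy sees the current page, the target, the available links, and the
# path so far, and returns the link to follow. In the benchmark this role is
# played by an LLM; here it is an ordinary function.
Policy = Callable[[str, str, List[str], List[str]], str]


def play_game(
    source: str,
    target: str,
    links: Dict[str, List[str]],
    policy: Policy,
    max_steps: int = 10,  # assumed step budget, not taken from the paper
) -> Tuple[List[str], bool]:
    """Follow links chosen by `policy` until the target is reached,
    a dead end is hit, or the step budget runs out."""
    path = [source]
    while len(path) - 1 < max_steps:
        current = path[-1]
        if current == target:
            return path, True
        options = links.get(current, [])
        if not options:
            return path, False  # dead end: no outgoing links
        choice = policy(current, target, options, path)
        if choice not in options:
            return path, False  # invalid move ends the game
        path.append(choice)
    return path, path[-1] == target


def first_link_policy(
    current: str, target: str, options: List[str], path: List[str]
) -> str:
    """Trivial stand-in for a model: always follow the first link."""
    return options[0]


if __name__ == "__main__":
    print(play_game("Coffee", "Andes", TOY_LINKS, first_link_policy))
```

Swapping `first_link_policy` for a function that queries a model is where look-ahead planning would enter: the policy has to pick the link most likely to lead toward the target several steps later, not just the one that looks closest now.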
Benchmark Design and Model Evaluation
According to the arXiv paper, LLM-WikiRace evaluates a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5. On easy-level games, these frontier models achieved the strongest results, even demonstrating superhuman performance. However, performance on hard difficulty dropped sharply. The best-performing model, Gemini-3, succeeded in only 23% of hard games, highlighting substantial remaining challenges.
| Difficulty | Best Reported Result |
|---|---|
| Easy | Superhuman performance (Gemini-3, GPT-5, Claude Opus 4.5) |
| Hard | 23% success rate (Gemini-3) |
Critical Analysis of Model Behavior
The researchers found that world knowledge is a necessary ingredient for success, but only up to a point. Beyond that threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis revealed that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. The paper notes that LLM-WikiRace is a simple benchmark that reveals clear limitations in current reasoning systems.
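The looping behavior described above can be made concrete with a small trajectory check. This is a hedged sketch and not the paper's analysis method: `find_first_loop` and `revisit_rate` are hypothetical helpers, and the revisit rate is an illustrative metric assumed here for the example.

```python
from typing import Dict, List, Optional, Tuple


def find_first_loop(trajectory: List[str]) -> Optional[Tuple[int, int]]:
    """Return (first_visit_index, revisit_index) for the earliest page
    that is visited a second time, or None if no page repeats."""
    seen: Dict[str, int] = {}
    for i, page in enumerate(trajectory):
        if page in seen:
            return seen[page], i
        seen[page] = i
    return None


def revisit_rate(trajectory: List[str]) -> float:
    """Fraction of steps that land on an already-visited page
    (illustrative metric, assumed for this sketch)."""
    if not trajectory:
        return 0.0
    return 1 - len(set(trajectory)) / len(trajectory)


if __name__ == "__main__":
    stuck = ["Coffee", "Brazil", "Coffee", "Brazil"]
    print(find_first_loop(stuck))
    print(revisit_rate(stuck))
```

A trajectory that never repeats a page returns `None` from `find_first_loop`. A model that bounces between two pages after a wrong turn shows an early revisit and a high revisit rate, which is the failure pattern the trajectory-level analysis points to.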
Implications for Enterprise AI Planning
While LLM-WikiRace is an academic benchmark, its findings are relevant to enterprise use cases that demand multi-step planning over complex knowledge structures, such as supply chain routing, logistics optimization, and trade documentation workflows. The finding that world knowledge alone is insufficient underscores the need for LLMs with robust planning capabilities, especially in high-stakes environments where replanning after errors is critical. The researchers have released their code and leaderboard at https://llmwikirace.github.io, offering an open arena for further progress.
The benchmark's emphasis on replanning after failure is particularly relevant for autonomous systems in logistics and trade, where unexpected disruptions require dynamic rerouting. Models that loop instead of recovering on a Wikipedia link graph still have much to prove on long-horizon planning tasks.