LLM-WikiRace Benchmark Reveals Frontier AI Models Still Struggle with Planning Over Knowledge Graphs

Researchers have introduced LLM-WikiRace, a benchmark that evaluates large language models on planning, reasoning, and world knowledge by having them navigate Wikipedia hyperlinks. Top models such as Gemini-3, GPT-5, and Claude Opus 4.5 achieve superhuman performance on easy games, but their results drop sharply at hard difficulty, where Gemini-3, the best performer, succeeds in only 23% of games. The study finds that world knowledge helps only up to a point; beyond that, planning and long-horizon reasoning are the limiting factors.

iGEN Editorial

June 16, 2026


Large language models (LLMs) are increasingly deployed for complex tasks that require multi-step planning, yet their ability to navigate real-world knowledge graphs remains poorly understood. A new benchmark, LLM-WikiRace, introduced by researchers including Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, and Ilija Bogunovic, aims to quantify this capability. The benchmark requires models to navigate from a source Wikipedia page to a target page by selecting hyperlinks step by step, testing look-ahead planning and reasoning about real-world concept connections.
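The navigation task described above can be sketched on a toy graph. This is a minimal illustration, not the researchers' code: the `LINKS` graph, the `max_steps` budget, and the `first_unvisited` policy (a simple stand-in for an LLM choosing a link) are all hypothetical names and assumptions introduced here.

```python
from collections import deque

# Hypothetical toy link graph: page title -> titles it links to.
# Real Wikipedia pages have far more links; this is for illustration only.
LINKS = {
    "Coffee": ["Ethiopia", "Caffeine"],
    "Ethiopia": ["Africa", "Coffee"],
    "Caffeine": ["Chemistry", "Coffee"],
    "Africa": ["Nile", "Ethiopia"],
    "Chemistry": ["Caffeine"],
    "Nile": ["Africa"],
}

def shortest_path_length(links, source, target):
    """Breadth-first search: minimum number of clicks, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        page, dist = queue.popleft()
        if page == target:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def play_game(links, source, target, choose_link, max_steps=10):
    """Run one game: at each step the policy picks one outgoing link.

    max_steps is an assumed step budget, not a value from the paper.
    """
    path = [source]
    for _ in range(max_steps):
        if path[-1] == target:
            break
        options = links.get(path[-1], [])
        if not options:
            break  # dead end: no outgoing links
        path.append(choose_link(path, options, target))
    return path, path[-1] == target

def first_unvisited(path, options, target):
    """Stand-in policy (not an LLM): take the target if linked, else the first unvisited link."""
    if target in options:
        return target
    for option in options:
        if option not in path:
            return option
    return options[0]

path, success = play_game(LINKS, "Coffee", "Nile", first_unvisited)
optimal = shortest_path_length(LINKS, "Coffee", "Nile")
print(path, success, optimal)
```

In a real evaluation, `choose_link` would wrap a model call that sees the current page's links and the target title. Comparing the length of the played path against the breadth-first optimum is one plausible way to measure how much look-ahead a policy achieves.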

Benchmark Design and Model Evaluation

According to the arXiv paper, LLM-WikiRace evaluates a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5. On easy-level games, these frontier models achieved the strongest results, even demonstrating superhuman performance. However, performance on hard difficulty dropped sharply. The best-performing model, Gemini-3, succeeded in only 23% of hard games, highlighting substantial remaining challenges.

Difficulty	Best Reported Result (Model)
Easy	Superhuman performance (Gemini-3, GPT-5, Claude Opus 4.5)
Hard	23% success rate (Gemini-3)

Critical Analysis of Model Behavior

The researchers found that world knowledge is a necessary ingredient for success, but only up to a point. Beyond that threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis revealed that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. The paper notes that LLM-WikiRace is a simple benchmark that reveals clear limitations in current reasoning systems.
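The looping behavior described above can be illustrated with a small trajectory check. This is a sketch under stated assumptions: `find_revisits` and `loop_fraction` are hypothetical helpers introduced here, the example trajectory is invented, and the article does not say how the researchers measured loops.

```python
def find_revisits(trajectory):
    """Return (page, first_index, revisit_index) for each step that returns to an earlier page."""
    first_seen = {}
    revisits = []
    for i, page in enumerate(trajectory):
        if page in first_seen:
            revisits.append((page, first_seen[page], i))
        else:
            first_seen[page] = i
    return revisits

def loop_fraction(trajectory):
    """Share of moves that land on an already-visited page (0.0 for an empty or one-page path)."""
    moves = len(trajectory) - 1
    if moves <= 0:
        return 0.0
    return len(find_revisits(trajectory)) / moves

# Invented example: a navigator bouncing between two pages instead of recovering.
trajectory = ["Coffee", "Caffeine", "Chemistry", "Caffeine", "Chemistry", "Caffeine"]
print(find_revisits(trajectory))
print(loop_fraction(trajectory))
```

A high revisit share suggests the navigator is cycling rather than replanning. A stricter check could flag only repeated page-to-page transitions, since returning to a hub page once can be a legitimate backtrack.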

Implications for Enterprise AI Planning

While LLM-WikiRace is an academic benchmark, its findings are relevant to enterprise use cases that demand multi-step planning over complex knowledge structures, such as supply chain routing, logistics optimization, and trade documentation workflows. The finding that world knowledge alone is insufficient underscores the need for LLMs with robust planning capabilities, especially in high-stakes environments where replanning after errors is critical. The researchers have released their code and leaderboard at https://llmwikirace.github.io, offering an open arena for further progress.

The benchmark's emphasis on replanning after failure is particularly relevant for autonomous systems in logistics and trade, where unexpected disruptions require dynamic rerouting. Current frontier models, despite their impressive narrow capabilities, still have much to prove in long-horizon planning tasks.

