StateGen Platform Generates Synthetic Training Data for Tool-Augmented LLMs with 9.66/10 Hallucination Score

Researchers introduce StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations for tool-augmented LLMs. The platform uses a four-role LLM loop and an authoritative state manager designed to eliminate the dominant class of tool-call hallucinations, and reports a tool-call hallucination score of 9.66/10 across 64,698 evaluated conversations.

iGEN Editorial

June 16, 2026

Enterprise teams building AI agents that interact with external tools face a chronic shortage of high-quality training data. Manual annotation is expensive, production data carries privacy risks, and public datasets rarely capture multi-turn tool use. According to a paper published on arXiv, researchers have developed StateGen, a synthetic data generation platform designed to address this gap.

StateGen orchestrates a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The architectural core is an authoritative state manager that maintains a structured world-state object across conversation turns. The paper describes this as enforcing a "backend-is-truth" invariant, which by construction eliminates the dominant class of tool-call hallucinations.
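
One way to picture the "backend-is-truth" invariant is a state manager that owns the world state, with every simulated tool response computed from that state rather than generated freely. The sketch below is illustrative only: the `StateManager` class, the `orders` schema, and the two tool functions are hypothetical names chosen for this example, not taken from the paper.

```python
import copy


class StateManager:
    """Hypothetical authoritative store for one conversation's world state."""

    def __init__(self, initial_state):
        self._state = copy.deepcopy(initial_state)

    def read(self, *path):
        """Return a copy of the value at the key path, or None if it is absent."""
        node = self._state
        for key in path:
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        return copy.deepcopy(node)

    def write(self, path, value):
        """Set the value at the key path, creating intermediate dicts as needed."""
        node = self._state
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value


def get_order_status(state, order_id):
    # The response is looked up in the state, so it cannot describe
    # an order that the backend does not know about.
    order = state.read("orders", order_id)
    if order is None:
        return {"ok": False, "error": f"order {order_id} not found"}
    return {"ok": True, "status": order["status"]}


def cancel_order(state, order_id):
    # Mutations go through the state manager, so later turns see them.
    order = state.read("orders", order_id)
    if order is None:
        return {"ok": False, "error": f"order {order_id} not found"}
    if order["status"] == "shipped":
        return {"ok": False, "error": "order already shipped"}
    state.write(("orders", order_id, "status"), "cancelled")
    return {"ok": True, "status": "cancelled"}


state = StateManager({"orders": {"A1": {"status": "pending"}}})
first = get_order_status(state, "A1")
cancelled = cancel_order(state, "A1")
second = get_order_status(state, "A1")
missing = get_order_status(state, "Z9")
```

In this toy setup, a status query after the cancellation reflects the updated state, and a query for an unknown order returns an error instead of a plausible-sounding invention.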

How StateGen Works

The platform produces scored, reasoning-trace-rich training conversations. Each of the four roles has a distinct job:

Persona-conditioned user simulator: Generates diverse user queries based on a 23-dimensional trait vector, enabling persona-driven variation.
Agent under test: The LLM being trained to use tools.
State-grounded tool simulator: Simulates tool responses based on the shared state object, ensuring consistency.
Multi-axis LLM judge: Evaluates the conversation on multiple criteria, providing a score.
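
The four roles above can be wired together roughly as follows. This is a minimal sketch with deterministic stubs standing in for the LLM calls; the function names, the `lookup` tool, and the single `tool_grounding` judge axis are assumptions made for illustration, and only the 23-dimensional persona vector comes from the article.

```python
import random
from dataclasses import dataclass, field

NUM_TRAITS = 23  # the article reports a 23-dimensional persona trait vector


@dataclass
class Conversation:
    persona: list
    turns: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)


def sample_persona(rng):
    # Hypothetical encoding: each trait is a float in [0, 1).
    return [rng.random() for _ in range(NUM_TRAITS)]


def user_simulator(persona, turn_index):
    # Stub: a real simulator would prompt an LLM conditioned on the persona.
    tone = "terse" if persona[0] < 0.5 else "chatty"
    return f"[{tone}] user request {turn_index}"


def agent_under_test(user_message):
    # Stub: always decides to call one hypothetical tool.
    return {"tool": "lookup", "args": {"query": user_message}}


def tool_simulator(state, call):
    # The response is derived from the shared state, not invented.
    state["calls"] += 1
    return {"tool": call["tool"], "call_number": state["calls"]}


def judge(conversation):
    # Stub: one hypothetical axis on a 0-10 scale, checking that tool
    # results line up with the order in which the calls were made.
    grounded = all(
        turn["result"]["call_number"] == i + 1
        for i, turn in enumerate(conversation.turns)
    )
    return {"tool_grounding": 10.0 if grounded else 0.0}


def generate(seed, num_turns=3):
    rng = random.Random(seed)
    state = {"calls": 0}  # shared world state for this conversation
    conversation = Conversation(persona=sample_persona(rng))
    for i in range(num_turns):
        message = user_simulator(conversation.persona, i)
        call = agent_under_test(message)
        result = tool_simulator(state, call)
        conversation.turns.append({"user": message, "call": call, "result": result})
    conversation.scores = judge(conversation)
    return conversation


conversation = generate(seed=0)
```

Seeding the persona sampler keeps a generated conversation reproducible, which would matter when the same scenario has to be replayed against different agents.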

StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing the same state object. This allows the platform to generate data for complex workflows where multiple agents collaborate.
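
A rough sketch of the "sub-agents as tools" idea: a sub-agent is wrapped in a function with the same calling convention as any other tool, and it receives the same state object as its parent. The inventory schema, the tool names, and the fixed policy that stands in for an LLM sub-agent are all hypothetical.

```python
def check_inventory(state, sku):
    return {"sku": sku, "in_stock": state["inventory"].get(sku, 0)}


def reserve_stock(state, sku, qty):
    available = state["inventory"].get(sku, 0)
    if qty > available:
        return {"ok": False, "reason": "insufficient stock"}
    state["inventory"][sku] = available - qty
    return {"ok": True, "reserved": qty}


def make_subagent_tool(tools):
    """Wrap a sub-agent so that a parent agent can call it like a tool."""

    def run(state, sku, qty):
        # Fixed policy standing in for an LLM sub-agent: check, then reserve.
        stock = tools["check_inventory"](state, sku)
        if stock["in_stock"] < qty:
            return {"ok": False, "reason": "insufficient stock"}
        return tools["reserve_stock"](state, sku, qty)

    return run


leaf_tools = {"check_inventory": check_inventory, "reserve_stock": reserve_stock}

# The parent sees the sub-agent as just another entry in its tool table.
parent_tools = {
    "check_inventory": check_inventory,
    "fulfilment_agent": make_subagent_tool(leaf_tools),
}

state = {"inventory": {"widget": 5}}  # one state object shared by all levels
result = parent_tools["fulfilment_agent"](state, sku="widget", qty=3)
after = parent_tools["check_inventory"](state, sku="widget")
too_many = parent_tools["fulfilment_agent"](state, sku="widget", qty=10)
```

Because the parent and the sub-agent operate on the same object, the reservation made by the sub-agent is visible to the parent's next inventory check.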

Performance and Evaluation

The researchers reported results on 64,698 evaluated conversations across three production corpora. Key metrics include:

Metric	Value
Tool-call hallucination score	9.66 / 10
Persona trait vector dimensions	23
Evaluated conversations	64,698
External systems compared	8

The training set and the golden evaluation set were cleanly separated. According to the researchers, a per-criterion gap analysis between the two confirmed that the generated data does not simply reward memorization.
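
The article does not spell out how the per-criterion gap is computed. One plausible reading, assumed here, is the difference between the mean judge score on the training split and on the golden split for each criterion. The criterion names and the scores below are made-up toy values, not results from the paper.

```python
from statistics import mean


def per_criterion_gap(train_scores, golden_scores):
    """Return mean(train) - mean(golden) for each criterion present in both splits."""
    criteria = sorted(set(train_scores[0]) & set(golden_scores[0]))
    gaps = {}
    for criterion in criteria:
        train_mean = mean(row[criterion] for row in train_scores)
        golden_mean = mean(row[criterion] for row in golden_scores)
        gaps[criterion] = train_mean - golden_mean
    return gaps


# Toy judge scores on a 0-10 scale (illustrative only).
train = [
    {"tool_grounding": 9.0, "helpfulness": 8.0},
    {"tool_grounding": 10.0, "helpfulness": 9.0},
]
golden = [
    {"tool_grounding": 9.5, "helpfulness": 8.0},
    {"tool_grounding": 9.5, "helpfulness": 8.0},
]

gaps = per_criterion_gap(train, golden)
```

Under this reading, a gap near zero on every criterion would suggest that scores on held-out conversations track scores on the training split, while a large positive gap would point toward memorization.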

Comparison with Existing Platforms

According to the paper, a comparison with eight external systems found that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring. StateGen, the authors say, unifies all four capabilities in one platform.

Implications for Enterprise AI

For organizations developing tool-augmented LLMs for supply chain, logistics, or trade applications, StateGen offers a way to generate large volumes of realistic training data without exposing sensitive production data. The platform's ability to produce scored conversations with reasoning traces could accelerate the development of AI agents that reliably interact with APIs, databases, and enterprise systems. The 23-dimensional persona vector also allows fine-grained control over user behavior, enabling the simulation of diverse scenarios that reflect real-world usage patterns.
