Enterprise teams building AI agents that interact with external tools face a chronic shortage of high-quality training data. Manual annotation is expensive, production data carries privacy risks, and public datasets rarely capture multi-turn tool use. According to a paper published on arXiv, researchers have developed StateGen, a synthetic data generation platform designed to address this gap.
StateGen orchestrates a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The architectural core is an authoritative state manager that maintains a structured world-state object across conversation turns. The paper describes this as enforcing a "backend-is-truth" invariant: tool responses are derived from the stored state rather than generated freely, which the authors say eliminates the dominant class of tool-call hallucinations by construction.
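To make the "backend-is-truth" idea concrete, here is a minimal Python sketch. It is not taken from the paper: the `StateManager` class, the `simulate_get_order` function, and the order data are all hypothetical names chosen for illustration. The point is only that the tool simulator answers from the stored state, so it cannot report an order that the state does not contain.

```python
from copy import deepcopy


class StateManager:
    """Hypothetical authoritative store for the shared world state."""

    def __init__(self, initial_state):
        self._state = deepcopy(initial_state)

    def read(self, key):
        # Return a copy so callers cannot mutate the state by accident.
        return deepcopy(self._state.get(key))

    def write(self, key, value):
        self._state[key] = deepcopy(value)


def simulate_get_order(state, order_id):
    """Toy tool simulator: answers only from the state, never from free text."""
    orders = state.read("orders") or {}
    if order_id not in orders:
        return {"error": f"order {order_id} not found"}
    return {"order_id": order_id, **orders[order_id]}


# Illustrative data only; the paper's actual state schema is not described here.
state = StateManager({"orders": {"A1": {"status": "shipped"}}})
found = simulate_get_order(state, "A1")
missing = simulate_get_order(state, "ZZ")
```

In this sketch an unknown order ID yields an explicit error instead of a plausible-looking invented record, which is the failure mode the invariant is meant to rule out.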
How StateGen Works
The platform produces scored, reasoning-trace-rich training conversations. The four roles interact as follows:
- Persona-conditioned user simulator: Generates diverse user queries based on a 23-dimensional trait vector, enabling persona-driven variation.
- Agent under test: The LLM being trained to use tools.
- State-grounded tool simulator: Simulates tool responses based on the shared state object, ensuring consistency.
- Multi-axis LLM judge: Evaluates the conversation on multiple criteria, providing a score.
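The interaction between the four roles can be sketched as a simple loop. The Python below is an illustrative assumption, not the platform's implementation: each role is a stub function standing in for an LLM call, and the names `user_simulator`, `agent_under_test`, `tool_simulator`, `judge`, and `run_conversation`, along with the two judge criteria, are hypothetical.

```python
def user_simulator(persona, turn):
    # Stand-in for an LLM call conditioned on a persona.
    return f"[{persona['tone']}] Where is order A1? (turn {turn})"


def agent_under_test(user_message):
    # Stand-in for the agent: it always decides to call one tool.
    return {"tool": "get_order", "args": {"order_id": "A1"}}


def tool_simulator(state, call):
    # Answers only from the shared state object.
    order = state["orders"].get(call["args"]["order_id"])
    return order if order is not None else {"error": "not found"}


def judge(transcript):
    # Stand-in for the multi-axis judge: one score per criterion.
    grounded = all("error" not in t["tool_result"] for t in transcript)
    used_tool = all(t["tool_call"]["tool"] == "get_order" for t in transcript)
    return {
        "grounding": 10.0 if grounded else 0.0,
        "tool_choice": 10.0 if used_tool else 0.0,
    }


def run_conversation(persona, state, max_turns=2):
    transcript = []
    for turn in range(1, max_turns + 1):
        message = user_simulator(persona, turn)
        call = agent_under_test(message)
        result = tool_simulator(state, call)
        transcript.append(
            {"user": message, "tool_call": call, "tool_result": result}
        )
    return transcript, judge(transcript)


transcript, scores = run_conversation(
    persona={"tone": "impatient"},
    state={"orders": {"A1": {"status": "shipped"}}},
)
```

Each run yields a transcript plus a per-criterion score dictionary, which mirrors the "scored conversation" output described above in a much reduced form.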
StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing the same state object. This allows the platform to generate data for complex workflows where multiple agents collaborate.
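One way to picture sub-agents declared as tools is a registry in which every entry receives the same state object. The sketch below is a hypothetical illustration in Python; the `inventory_agent` and `reserve_agent` functions, the `TOOLS` registry, and the stock data are assumptions rather than details from the paper.

```python
def inventory_agent(state, args):
    # Hypothetical sub-agent: reads stock levels from the shared state.
    return {"sku": args["sku"], "in_stock": state["stock"].get(args["sku"], 0)}


def reserve_agent(state, args):
    # Hypothetical sub-agent: updates the same shared state on success.
    available = state["stock"].get(args["sku"], 0)
    if available < args["qty"]:
        return {"reserved": False, "reason": "insufficient stock"}
    state["stock"][args["sku"]] = available - args["qty"]
    return {"reserved": True}


# Sub-agents are exposed to the parent agent exactly like ordinary tools.
TOOLS = {"check_inventory": inventory_agent, "reserve": reserve_agent}


def call_tool(state, name, args):
    return TOOLS[name](state, args)


shared_state = {"stock": {"widget": 5}}
first = call_tool(shared_state, "reserve", {"sku": "widget", "qty": 3})
after = call_tool(shared_state, "check_inventory", {"sku": "widget"})
```

Because both sub-agents operate on one state object, the reservation made by the first call is visible to the second, which is the consistency property the shared state is meant to provide.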
Performance and Evaluation
The researchers reported results on 64,698 evaluated conversations across three production corpora. Key figures from the paper include:
| Metric | Value |
|---|---|
| Tool-call hallucination score | 9.66 / 10 |
| Persona trait vector dimensions | 23 |
| Evaluated conversations | 64,698 |
| External systems compared | 8 |
The researchers also kept the training set cleanly separated from a golden evaluation set. According to the paper, a per-criterion gap analysis across the two sets confirmed that the generated data does not simply reward memorization.
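A per-criterion gap analysis of this kind can be sketched as the difference in mean judge score between the two sets, computed separately for each criterion. The Python below is an assumed formulation, not the paper's exact procedure; the function name `per_criterion_gap`, the criterion names, and all score values are made up for illustration.

```python
from statistics import mean


def per_criterion_gap(train_scores, golden_scores):
    """Mean train score minus mean golden score, for each criterion."""
    criteria = train_scores[0].keys()
    return {
        c: mean(s[c] for s in train_scores) - mean(s[c] for s in golden_scores)
        for c in criteria
    }


# Illustrative numbers only; these are not results from the paper.
train = [
    {"grounding": 9.0, "helpfulness": 8.0},
    {"grounding": 10.0, "helpfulness": 9.0},
]
golden = [
    {"grounding": 9.5, "helpfulness": 8.0},
    {"grounding": 9.5, "helpfulness": 8.0},
]
gaps = per_criterion_gap(train, golden)
```

Under this formulation, a gap near zero on every criterion suggests similar behavior on both sets, while a large positive gap on one criterion would point to overfitting on that axis.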
Comparison with Existing Platforms
According to the paper, a comparison with eight external systems found that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring. StateGen, the authors say, unifies all four capabilities in one platform.
Implications for Enterprise AI
For organizations developing tool-augmented LLMs for supply chain, logistics, or trade applications, StateGen offers a way to generate large volumes of realistic training data without exposing sensitive production data. The platform's ability to produce scored conversations with reasoning traces could accelerate the development of AI agents that reliably interact with APIs, databases, and enterprise systems. The 23-dimensional persona vector also allows fine-grained control over user behavior, enabling the simulation of diverse scenarios that reflect real-world usage patterns.
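As a rough illustration of persona-driven variation, the Python sketch below samples a 23-dimensional trait vector and lets the caller pin individual traits. The trait names (`trait_0` through `trait_22`), the assumption that each trait is a number between 0 and 1, and the `sample_persona` function are all hypothetical, since the summary above does not list the actual dimensions.

```python
import random

NUM_TRAITS = 23  # dimension count reported in the paper
# Placeholder names; the real trait names are not given in this article.
TRAIT_NAMES = [f"trait_{i}" for i in range(NUM_TRAITS)]


def sample_persona(rng, overrides=None):
    """Sample every trait uniformly in [0, 1), then apply fixed overrides."""
    persona = {name: rng.random() for name in TRAIT_NAMES}
    persona.update(overrides or {})
    return persona


rng = random.Random(0)  # seeded so that runs are reproducible
persona = sample_persona(rng, overrides={"trait_0": 1.0})
```

Pinning a few traits while sampling the rest is one simple way to target a scenario, such as a particular kind of difficult user, without giving up diversity on the other dimensions.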