Large language models (LLMs) used in agentic workflows often struggle to reason over long contexts, especially when evidence is scattered across many turns of tool use. Standard supervised fine-tuning (SFT) masks tool responses and only trains turn-level tool selection, creating a supervision blind spot for signals that span distant segments. According to a paper on arXiv, researchers have developed a new method called Agent Context Compilation (ACC) to address this gap.
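To make the blind spot concrete, the sketch below shows one common way a role-based loss mask can be built for a tool-use conversation. It is an illustration under stated assumptions, not the paper's training code: the `(role, text)` message format, the whitespace tokenization, the `loss_mask` helper, and the example conversation are all hypothetical.

```python
# Hypothetical message format: (role, text). Real pipelines use tokenizer
# chat templates; whitespace splitting is used here only for readability.
Message = tuple[str, str]


def loss_mask(messages: list[Message]) -> list[tuple[str, int]]:
    """Label each whitespace token 1 if it contributes to the loss, else 0.

    Assumption: only assistant tokens are trained; user and tool tokens
    are visible as context but masked out of the loss.
    """
    labeled = []
    for role, text in messages:
        trainable = 1 if role == "assistant" else 0
        labeled.extend((tok, trainable) for tok in text.split())
    return labeled


conversation = [
    ("user", "Who wrote the cited report?"),
    ("assistant", "search(report author)"),
    ("tool", "The report was written by Dana Reyes in 2019."),
    ("assistant", "Dana Reyes"),
]

mask = loss_mask(conversation)
trained = [tok for tok, flag in mask if flag]
print(trained)
```

In this toy setup no tool-response token ever receives a gradient of its own. Each training target sits only one turn away from the evidence it depends on, so the model gets little practice linking a question to evidence spread across many distant turns.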
ACC converts agent trajectories, such as those produced in search, software engineering, and database querying, into long-context question-answer pairs. The method combines the original question with the tool responses and environment observations gathered across multiple turns, then trains the model to answer directly, without tool use. This makes the dependencies between the question and its evidence explicit, so long-context reasoning over distant segments can be supervised directly, with no additional annotation. The paper describes ACC as a simple approach that can be combined with any existing long-context extension or training method and that provides scalable supervised fine-tuning data.
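The compilation step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Turn` and `Trajectory` types, the `compile_trajectory` function, the prompt template, and the choice to place the question after the observations are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One step of a hypothetical agent trajectory."""
    tool_call: str     # what the agent asked a tool to do
    observation: str   # tool response or environment observation


@dataclass
class Trajectory:
    question: str
    turns: list[Turn]
    final_answer: str


def compile_trajectory(traj: Trajectory) -> dict[str, str]:
    """Flatten a multi-turn trajectory into one long-context QA pair.

    Observations are concatenated in turn order to form the context.
    Tool calls are dropped, so the training target is the answer alone.
    """
    context = "\n\n".join(
        f"[Observation {i}]\n{turn.observation}"
        for i, turn in enumerate(traj.turns, start=1)
    )
    prompt = (
        f"{context}\n\n"
        f"Question: {traj.question}\n"
        "Answer directly, without calling tools."
    )
    return {"prompt": prompt, "completion": traj.final_answer}


example = Trajectory(
    question="Which module defines the retry limit?",
    turns=[
        Turn("grep -r RETRY_LIMIT .", "config/net.py: RETRY_LIMIT = 5"),
        Turn("cat config/net.py", "RETRY_LIMIT = 5\nTIMEOUT_S = 30"),
    ],
    final_answer="config/net.py",
)

pair = compile_trajectory(example)
print(pair["prompt"])
```

Because the evidence from every turn sits in a single prompt, the loss on the answer depends on all of it at once, including observations that were far apart in the original trajectory.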
The researchers validated ACC on two challenging benchmarks: MRCR (multi-round co-reference resolution) and GraphWalks (graph traversal over extended contexts). They trained Qwen3-30B-A3B, a mixture-of-experts model with 30 billion total parameters and 3 billion active parameters, using ACC. The results are shown in the table below:
| Benchmark | Qwen3-30B-A3B + ACC (gain over baseline) | Qwen3-30B-A3B baseline | Qwen3-235B-A22B (larger model) |
|---|---|---|---|
| MRCR | 68.3 (+18.1) | 50.2 | 72.1 |
| GraphWalks | 77.5 (+7.6) | 69.9 | 79.8 |
With ACC, Qwen3-30B-A3B gained 18.1 points on MRCR and 7.6 points on GraphWalks over its own baseline. That brings it close to Qwen3-235B-A22B (within 3.8 and 2.3 points, respectively), a model with 235 billion total parameters and 22 billion active parameters, or roughly 8 times as many total parameters. At the same time, the ACC-trained model preserved its general capabilities on benchmarks including GPQA, MMLU-Pro, AIME, and IFEval, according to the paper.
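The arithmetic behind these comparisons can be checked with a few lines of Python. The scores are copied from the table. The "fraction of the gap closed" figure is a derived quantity computed here for illustration and is not reported in the source. The `summarize` helper and variable names are hypothetical.

```python
# Scores copied from the results table.
scores = {
    "MRCR": {"acc": 68.3, "baseline": 50.2, "large": 72.1},
    "GraphWalks": {"acc": 77.5, "baseline": 69.9, "large": 79.8},
}


def summarize(s: dict[str, float]) -> tuple[float, float, float]:
    """Return (gain over baseline, remaining gap to the larger model,
    fraction of the baseline-to-larger-model gap that was closed)."""
    gain = s["acc"] - s["baseline"]
    remaining = s["large"] - s["acc"]
    gap_closed = gain / (s["large"] - s["baseline"])
    return round(gain, 1), round(remaining, 1), round(gap_closed, 2)


summary = {name: summarize(s) for name, s in scores.items()}

# Ratio of total parameter counts, in billions, using the nominal sizes.
param_ratio = 235 / 30

for name, (gain, remaining, closed) in summary.items():
    print(f"{name}: +{gain} points, {remaining} behind, {closed:.0%} of gap closed")
print(f"Total-parameter ratio: {param_ratio:.1f}x")
```

Under these numbers, the smaller model recovers most of the distance to the larger one on both benchmarks while remaining a few points behind it.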
Further mechanism analysis revealed that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization. This suggests that training on compiled trajectories encourages the model to allocate attention more effectively across long-range dependencies, a key requirement for enterprise applications that involve processing extended documents, multi-step reasoning, or historical transaction logs.
For enterprise technology leaders evaluating AI for complex workflows, ACC offers a way to improve long-context reasoning without expensive manual curation, since its training data is compiled from existing agent trajectories. The method's compatibility with existing training pipelines means it could be integrated into custom LLM deployments for tasks such as contract analysis, supply chain event resolution, or multi-document intelligence, though the paper itself does not test those domains. The research, authored by Su, Qisheng, Fang, Zhen, Huang, Shiting, Zeng, Yu, Zhao, Yiming, Kou, Zhang, Ziao, Chen, Lin, Zehui, Wu, Lijun, and Feng, is available on arXiv under the title "ACC: Compiling Agent Trajectories for Long-Context Training."