ACC: Compiling Agent Trajectories for Long-Context Training
Pith reviewed 2026-05-22 07:11 UTC · model grok-4.3
The pith
Compiling scattered evidence from agent trajectories into QA pairs enables effective long-context training for LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ACC converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data.
What carries the argument
Agent Context Compilation (ACC): the process of reformatting agent trajectories into direct long-context QA pairs by including all scattered tool responses and observations with the original question.
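To make the compilation step concrete, here is a minimal sketch of how one trajectory might be folded into a single QA pair. It is an illustration only: the `Turn` type, the `compile_trajectory` function, the observation labels, and the choice to place the question after the evidence are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    role: str      # hypothetical roles: "assistant" (tool call) or "tool" (response / observation)
    content: str


def compile_trajectory(question: str, turns: List[Turn], answer: str) -> Dict[str, str]:
    """Fold the tool responses of one trajectory into a single long-context QA pair."""
    # Keep only what the environment returned; the agent's own tool calls are dropped.
    evidence = [t.content for t in turns if t.role == "tool"]
    context = "\n\n".join(
        f"[Observation {i}]\n{text}" for i, text in enumerate(evidence, start=1)
    )
    # Assumed layout: evidence first, then the original question.
    prompt = f"{context}\n\nQuestion: {question}\nAnswer directly without using tools."
    return {"prompt": prompt, "answer": answer}


trajectory = [
    Turn("assistant", "search('capital of Australia')"),
    Turn("tool", "Canberra is the capital city of Australia."),
    Turn("assistant", "search('Canberra founded')"),
    Turn("tool", "Canberra was founded in 1913."),
]
example = compile_trajectory("When was Australia's capital founded?", trajectory, "1913")
print(example["prompt"])
```

In this toy case the answer depends on two observations from different turns, which is the kind of cross-turn dependency the compiled pair is meant to expose.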
If this is right
- Training Qwen3-30B-A3B with ACC yields 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), benchmarks that require integrating information across many turns.
- These scores are comparable to Qwen3-235B-A22B, while results on GPQA, MMLU-Pro, AIME, and IFEval are preserved.
- The method works by creating explicit supervision for long-range dependencies in context.
- Further analysis indicates changes in attention patterns and expert specialization in the model.
Where Pith is reading between the lines
- ACC could be applied to other sources of sequential data to create long-context training examples automatically.
- If the evidence in trajectories is always complete, this approach might generalize to many agent-based tasks beyond the tested ones.
- Combining ACC with other long-context methods might yield even stronger results on extended context lengths.
Load-bearing premise
The scattered tool responses and environment observations in the trajectories always include the complete and unbiased evidence needed to answer the original question.
What would settle it
Training Qwen3-30B-A3B on the same trajectories with standard agent SFT instead of ACC, and finding that ACC brings no improvement over that baseline (or even worse scores) on MRCR and GraphWalks, would falsify the effectiveness of the compilation method.
Original abstract
Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Agent Context Compilation (ACC), which converts multi-turn agent trajectories (from search, software engineering, and database agents) into long-context QA pairs by pairing the original question with scattered tool responses and environment observations. This creates explicit supervision for cross-turn reasoning without tool use. The authors fine-tune Qwen3-30B-A3B on ACC data and report large gains on MRCR (68.3, +18.1) and GraphWalks (77.5, +7.6), reaching performance comparable to Qwen3-235B-A22B, while maintaining scores on GPQA, MMLU-Pro, AIME, and IFEval. Mechanism analysis indicates task-adaptive attention restructuring and expert specialization.
Significance. If the causal link between ACC data and the reported gains holds after proper controls, the method supplies a scalable, annotation-free source of long-context supervision that leverages existing agent interactions. This could meaningfully advance training for long-range dependency tasks and narrow the gap between smaller and larger models on benchmarks requiring integration of distant context segments.
major comments (3)
- §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.
- §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.
- Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.
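The two controls named in the comments above (masking subsets of observations, and shuffling the evidence) could be sketched as follows. This is a hedged illustration, not the paper's procedure: the function names, the `[MASKED]` placeholder, the seed handling, and the 0.3 fraction in the usage lines are all assumptions.

```python
import random
from typing import List


def mask_observations(observations: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Replace a random subset of observations with a placeholder string."""
    rng = random.Random(seed)
    n_mask = round(len(observations) * fraction)
    masked_idx = set(rng.sample(range(len(observations)), n_mask))
    return ["[MASKED]" if i in masked_idx else obs for i, obs in enumerate(observations)]


def shuffle_observations(observations: List[str], seed: int = 0) -> List[str]:
    """Permute the order of observations while keeping every observation's content."""
    rng = random.Random(seed)
    shuffled = list(observations)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled


obs = [f"observation {i}" for i in range(10)]
masked = mask_observations(obs, fraction=0.3)   # illustrative fraction
shuffled = shuffle_observations(obs)
```

Comparing a model trained on the intact compiled context against models trained on the `masked` and `shuffled` variants would indicate how much the gains depend on complete, ordered evidence.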
minor comments (2)
- Abstract and §1: Brief one-sentence definitions or citations for the MRCR and GraphWalks benchmarks would help readers unfamiliar with these new long-context tasks.
- Figure captions (e.g., attention visualizations): Adding explicit call-outs or arrows to highlight the claimed restructuring patterns would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving experimental rigor and validating key assumptions. We address each point below and have made revisions to the manuscript to incorporate additional details, controls, and quantitative analyses where feasible.
- Referee: §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.
  Authors: We agree these details are essential for isolating the effect of the compilation procedure. In the revised manuscript we now explicitly report the total number of ACC training tokens and clarify that the baseline consists of standard long-context SFT performed on an equivalent token budget drawn from general long-context corpora (without the trajectory compilation step). We have also added results from three independent runs with different random seeds, reporting means and standard deviations in the updated tables. These changes allow direct comparison of data volume and formatting effects.
  Revision: yes
- Referee: §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.
  Authors: This is a fair critique of the load-bearing assumption. While agent trajectories are generated by successful task completion (ensuring the necessary evidence is present in the collected observations), we acknowledge the need for explicit verification. The revised manuscript includes two new control experiments: (1) random masking of 30% of tool responses and environment observations, which produces a substantial performance drop on both MRCR and GraphWalks; and (2) a shuffled-evidence baseline in which the order of observations is permuted while preserving content, resulting in near-chance performance. These results support that gains arise from learning to integrate distant context rather than from local cues or regularities introduced during compilation.
  Revision: yes
- Referee: Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.
  Authors: We appreciate the request for quantitative grounding. The revised manuscript now includes direct comparisons against a non-ACC baseline trained on the identical total token count. We report attention entropy over long-range token pairs and per-expert activation frequencies for both models. The ACC-trained model shows statistically lower entropy on cross-turn dependencies and more pronounced specialization in experts handling multi-turn integration, differences that are not observed in the matched-token baseline. These metrics are presented alongside the original qualitative observations.
  Revision: yes
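For readers unfamiliar with the two statistics mentioned in the last exchange, the following is a minimal sketch of how attention entropy and per-expert activation frequency are commonly computed. The function names and toy inputs are hypothetical; the paper's exact definitions (for example, which token pairs count as long-range) are not given here and are not assumed.

```python
import math
from typing import List


def attention_entropy(weights: List[float]) -> float:
    """Shannon entropy (in nats) of one query's attention distribution."""
    total = sum(weights)
    # Normalise defensively and skip zero weights, since 0 * log(0) is taken as 0.
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def expert_activation_frequency(routed_experts: List[int], num_experts: int) -> List[float]:
    """Fraction of tokens routed to each expert in a mixture-of-experts layer."""
    counts = [0] * num_experts
    for expert_id in routed_experts:
        counts[expert_id] += 1
    return [c / len(routed_experts) for c in counts]


uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])   # diffuse attention
peaked = attention_entropy([0.97, 0.01, 0.01, 0.01])    # focused attention
freqs = expert_activation_frequency([0, 2, 2, 1, 2, 2], num_experts=4)
```

Lower entropy means attention mass is concentrated on fewer positions, and a skewed frequency vector means a few experts handle most tokens. Both would need to be compared against a matched-token baseline to support a causal reading.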
Circularity Check
No circularity: ACC is an empirical data-generation procedure validated on external benchmarks
The paper describes a straightforward compilation procedure that turns existing agent trajectories into long-context QA pairs for SFT. Reported gains (MRCR 68.3, GraphWalks 77.5) are measured after training and testing on separate benchmarks; no equations, fitted parameters, or self-citations reduce these outcomes to quantities defined by the method itself. The core assumption that compiled pairs supply complete evidence is an empirical claim, not a definitional identity, and is therefore open to falsification by the reported controls and general-capability checks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Standard agent SFT masks tool responses and trains only turn-level tool selection, creating a supervision blind spot for scattered evidence.
- Domain assumption: The evidence needed to answer the original question is present in the tool responses and environment observations across the trajectory.
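The first assumption is about which tokens receive a training signal. A small sketch of the two loss masks may help. It assumes a simplified segment layout (role plus token count); the role names, segment lengths, and the `loss_mask` helper are hypothetical and only illustrate the contrast described in the abstract.

```python
from typing import List, Set, Tuple

Segment = Tuple[str, int]  # (role, number of tokens); an assumed, simplified layout


def loss_mask(segments: List[Segment], trained_roles: Set[str]) -> List[int]:
    """Return 1 for tokens that contribute to the loss and 0 for masked tokens."""
    mask: List[int] = []
    for role, n_tokens in segments:
        mask.extend([1 if role in trained_roles else 0] * n_tokens)
    return mask


# Standard agent SFT: tool responses are masked out, and only the
# assistant's turn-level outputs are supervised.
agent_trajectory = [
    ("user", 4), ("assistant", 3), ("tool", 6),
    ("assistant", 3), ("tool", 5), ("assistant", 2),
]
agent_mask = loss_mask(agent_trajectory, trained_roles={"assistant"})

# ACC-style example (assumed layout): the question and compiled observations
# form the prompt, and only the final answer tokens are supervised.
acc_example = [("question", 4), ("observations", 11), ("answer", 2)]
acc_mask = loss_mask(acc_example, trained_roles={"answer"})
```

In both cases the tool output itself is not a prediction target. The difference is that the ACC-style answer tokens are conditioned on all observations at once, so the supervision depends on evidence from every turn.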
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: The new training objective is L_ACC = -∑ log P(token_j | q, C, token_<j)
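As a reading aid, the objective quoted in the passage above can be typeset as below. The notation is an assumption made for this sketch: $a = (a_1, \dots, a_{|a|})$ stands for the answer tokens (written token_j in the passage), $q$ for the original question, $C$ for the compiled context, and the parameter subscript $\theta$ and the explicit summation limits are added for clarity rather than taken from the paper.

```latex
\mathcal{L}_{\mathrm{ACC}} = -\sum_{j=1}^{|a|} \log P_{\theta}\left(a_j \mid q, C, a_{<j}\right)
```

Read this way, it is an ordinary next-token negative log-likelihood restricted to the answer, with the question and compiled evidence acting purely as conditioning context.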
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A Agent Trajectory Compilation Examples
Figures 6 and 7 show compiled trajectories for SWE and SQL agents. Both follow the same ACC pipeline as the search agent in Figure 2. Figure 6 shows a compiled ...