arxiv: 2605.21850 · v1 · pith:HMIM27PXnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

ACC: Compiling Agent Trajectories for Long-Context Training

Pith reviewed 2026-05-22 07:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords long-context reasoning · agent trajectories · supervised fine-tuning · LLM training · context integration · tool responses · benchmark evaluation

The pith

Compiling scattered evidence from agent trajectories into QA pairs enables effective long-context training for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors propose Agent Context Compilation to transform the multi-turn outputs from agents solving problems into long-context question-answer pairs. Standard training only supervises tool selection and masks the responses, leaving the integration of distant information unlearned. By making the full evidence explicit in training examples, ACC provides direct supervision for reasoning over long contexts. This matters because it turns abundant agent interaction data into useful training signals without needing expensive long-document curation.
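The supervision blind spot described above can be illustrated with a minimal sketch of label masking in standard agent SFT. The `Turn` type, the role names, and the token IDs are hypothetical and not taken from the paper; the value -100 is chosen because it matches the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from dataclasses import dataclass

# Positions labeled with this value contribute nothing to the loss.
IGNORE_INDEX = -100


@dataclass
class Turn:
    role: str             # "user", "assistant", or "tool" (hypothetical role names)
    token_ids: list[int]  # already-tokenized content of the turn


def build_labels(turns: list[Turn]) -> list[int]:
    """Supervise assistant tokens only; mask user and tool tokens."""
    labels: list[int] = []
    for turn in turns:
        if turn.role == "assistant":
            labels.extend(turn.token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(turn.token_ids))
    return labels


# Toy trajectory: question, tool call, tool response, final answer.
trajectory = [
    Turn("user", [11, 12]),
    Turn("assistant", [21, 22, 23]),
    Turn("tool", [31, 32, 33, 34]),
    Turn("assistant", [41]),
]
labels = build_labels(trajectory)
supervised_fraction = sum(label != IGNORE_INDEX for label in labels) / len(labels)
```

In this toy case the tool response is the longest turn yet receives no gradient signal, which is the gap ACC is meant to close.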

Core claim

ACC converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data.

What carries the argument

Agent Context Compilation, the process of reformatting agent trajectories into direct long-context QA pairs by including all scattered tool responses and observations with the original question.
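A minimal sketch of the compilation step, assuming a trajectory is stored as a question, a list of (tool call, observation) steps, and a ground-truth answer. The data classes, the `[Evidence i]` labels, and the prompt template are illustrative assumptions; the paper's actual format may differ.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool_call: str    # what the agent requested (dropped during compilation)
    observation: str  # tool response or environment observation


@dataclass
class Trajectory:
    question: str
    steps: list[Step]
    answer: str       # ground-truth answer to the original question


def compile_trajectory(traj: Trajectory) -> dict[str, str]:
    """Assemble observations from all turns into one context, followed by the question."""
    evidence = [
        f"[Evidence {i}]\n{step.observation}"
        for i, step in enumerate(traj.steps, start=1)
    ]
    prompt = "\n\n".join(evidence) + f"\n\nQuestion: {traj.question}\nAnswer:"
    # The completion is the answer itself: no tool calls are supervised.
    return {"prompt": prompt, "completion": " " + traj.answer}


example = Trajectory(
    question="Who wrote X?",
    steps=[
        Step("search('X')", "Doc A: X was written by Ada."),
        Step("open('B')", "Doc B: Ada was born in 1815."),
    ],
    answer="Ada",
)
pair = compile_trajectory(example)
```

The resulting pair can be fed to any ordinary SFT pipeline, since it is a plain prompt and completion with the evidence placed in the context rather than behind tool calls.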

If this is right

  • Training Qwen3-30B-A3B with ACC reaches 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), benchmarks that require cross-turn coreference resolution and graph traversal over extended contexts.
  • These scores are comparable to Qwen3-235B-A22B, while general capabilities on GPQA, MMLU-Pro, AIME, and IFEval are preserved.
  • The method works by creating explicit supervision for long-range dependencies in context.
  • Further analysis indicates changes in attention patterns and expert specialization in the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ACC could be applied to other sources of sequential data to create long-context training examples automatically.
  • If the evidence in trajectories is always complete, this approach might generalize to many agent-based tasks beyond the tested ones.
  • Combining ACC with other long-context methods might yield even stronger results on extended context lengths.

Load-bearing premise

The scattered tool responses and environment observations in the trajectories always include the complete and unbiased evidence needed to answer the original question.

What would settle it

Training Qwen3-30B-A3B with standard agent SFT on the same trajectories and finding that ACC yields no improvement over it, or even lower scores, on MRCR and GraphWalks would falsify the effectiveness of the compilation method.

Figures

Figures reproduced from arXiv: 2605.21850 by Feng Zhao, Kou Shi, Lijun Wu, Lin Chen, Qisheng Su, Shiting Huang, Yiming Zhao, Yu Zeng, Zehui Chen, Zhen Fang, Ziao Zhang.

Figure 1: Overview of ACC. Multi-turn agent trajectories (Search, SWE, SQL) are compiled into long-context QA pairs by assembling tool responses and environment contexts.
Figure 2: Search Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory (documents visited are highlighted in blue, documents returned by search but never visited are highlighted in red). The bottom section shows the ACC compiled QA. Examples for SWE and SQL agents are provided in Appendix A.
Figure 3: Token length distribution of the ACC training data. We bin the samples by token count and plot the per-bin frequency for the training data compiled from each agent type.
Figure 4: Two-dimensional UMAP projection of training queries.
Figure 5: Attention distance (top) and expert routing frequency (bottom) changes after ACC training.
Figure 6: SWE Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory. At each turn the agent opens a single file from the provided codebase snapshot and decides either to (Examine) it for understanding or to (Modify) it to fix the bug. The bottom section shows the ACC-compiled QA, where only the opened eviden…
Figure 7: SQL Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory, where the agent executes a recursive SQL query to traverse the referral graph. The bottom section shows the ACC-compiled QA, where the complete contents of the relevant database table are assembled into the provided long-context background.…
Original abstract

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Agent Context Compilation (ACC), which converts multi-turn agent trajectories (from search, software engineering, and database agents) into long-context QA pairs by pairing the original question with scattered tool responses and environment observations. This creates explicit supervision for cross-turn reasoning without tool use. The authors fine-tune Qwen3-30B-A3B on ACC data and report large gains on MRCR (68.3, +18.1) and GraphWalks (77.5, +7.6), reaching performance comparable to Qwen3-235B-A22B, while maintaining scores on GPQA, MMLU-Pro, AIME, and IFEval. Mechanism analysis indicates task-adaptive attention restructuring and expert specialization.

Significance. If the causal link between ACC data and the reported gains holds after proper controls, the method supplies a scalable, annotation-free source of long-context supervision that leverages existing agent interactions. This could meaningfully advance training for long-range dependency tasks and narrow the gap between smaller and larger models on benchmarks requiring integration of distant context segments.

major comments (3)
  1. [§4] §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.
  2. [§3.1–3.2] §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.
  3. [Mechanism Analysis] Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.
minor comments (2)
  1. [Abstract] Abstract and §1: Brief one-sentence definitions or citations for the MRCR and GraphWalks benchmarks would help readers unfamiliar with these new long-context tasks.
  2. [Figures] Figure captions (e.g., attention visualizations): Adding explicit call-outs or arrows to highlight the claimed restructuring patterns would improve clarity.
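The quantitative comparisons requested in major comment 3 could be computed along the following lines. This is a sketch under stated assumptions: attention weights for a single query position are given as a plain list indexed by key position, and the two function names are hypothetical. It is not the paper's analysis code.

```python
import math


def attention_entropy(weights: list[float]) -> float:
    """Shannon entropy (in nats) of one query's attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def mean_attention_distance(weights: list[float], query_pos: int) -> float:
    """Expected token distance between the query and the keys it attends to."""
    total = sum(weights)
    return sum(w / total * (query_pos - key_pos) for key_pos, w in enumerate(weights))


# A query at position 3 attending uniformly vs. only to the most distant key.
uniform = [0.25, 0.25, 0.25, 0.25]
distant = [1.0, 0.0, 0.0, 0.0]

entropy_uniform = attention_entropy(uniform)
entropy_distant = attention_entropy(distant)
distance_uniform = mean_attention_distance(uniform, query_pos=3)
distance_distant = mean_attention_distance(distant, query_pos=3)
```

Averaging these quantities over heads, layers, and query positions for both the ACC-trained model and a matched-token baseline would give the kind of side-by-side numbers the referee asks for.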

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving experimental rigor and validating key assumptions. We address each point below and have made revisions to the manuscript to incorporate additional details, controls, and quantitative analyses where feasible.

  1. Referee: [§4] §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.

    Authors: We agree these details are essential for isolating the effect of the compilation procedure. In the revised manuscript we now explicitly report the total number of ACC training tokens and clarify that the baseline consists of standard long-context SFT performed on an equivalent token budget drawn from general long-context corpora (without the trajectory compilation step). We have also added results from three independent runs with different random seeds, reporting means and standard deviations in the updated tables. These changes allow direct comparison of data volume and formatting effects. revision: yes

  2. Referee: [§3.1–3.2] §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.

    Authors: This is a fair critique of the load-bearing assumption. While agent trajectories are generated by successful task completion (ensuring the necessary evidence is present in the collected observations), we acknowledge the need for explicit verification. The revised manuscript includes two new control experiments: (1) random masking of 30% of tool responses and environment observations, which produces a substantial performance drop on both MRCR and GraphWalks; and (2) a shuffled-evidence baseline in which the order of observations is permuted while preserving content, resulting in near-chance performance. These results support that gains arise from learning to integrate distant context rather than from local cues or regularities introduced during compilation. revision: yes

  3. Referee: [Mechanism Analysis] Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.

    Authors: We appreciate the request for quantitative grounding. The revised manuscript now includes direct comparisons against a non-ACC baseline trained on the identical total token count. We report attention entropy over long-range token pairs and per-expert activation frequencies for both models. The ACC-trained model shows statistically lower entropy on cross-turn dependencies and more pronounced specialization in experts handling multi-turn integration, differences that are not observed in the matched-token baseline. These metrics are presented alongside the original qualitative observations. revision: yes
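The two control conditions described in response 2 can be sketched as simple transformations of a trajectory's observation list before compilation. The function names and the choice to drop masked observations entirely (rather than replace them with a placeholder) are assumptions; the rebuttal does not specify these details.

```python
import random


def mask_observations(observations: list[str], fraction: float, seed: int) -> list[str]:
    """Drop a random fraction of observations, keeping the rest in original order."""
    rng = random.Random(seed)
    n_drop = round(len(observations) * fraction)
    dropped = set(rng.sample(range(len(observations)), n_drop))
    return [obs for i, obs in enumerate(observations) if i not in dropped]


def shuffle_observations(observations: list[str], seed: int) -> list[str]:
    """Permute the order of observations while preserving their content."""
    rng = random.Random(seed)
    shuffled = list(observations)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


observations = [f"obs-{i}" for i in range(10)]
masked = mask_observations(observations, fraction=0.3, seed=0)
shuffled = shuffle_observations(observations, seed=0)
```

Fixing the seed per trajectory keeps each control reproducible, so differences in benchmark scores can be attributed to the transformation rather than to sampling noise.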

Circularity Check

0 steps flagged

No circularity: ACC is an empirical data-generation procedure validated on external benchmarks

full rationale

The paper describes a straightforward compilation procedure that turns existing agent trajectories into long-context QA pairs for SFT. Reported gains (MRCR 68.3, GraphWalks 77.5) are measured after training and testing on separate benchmarks; no equations, fitted parameters, or self-citations reduce these outcomes to quantities defined by the method itself. The core assumption that compiled pairs supply complete evidence is an empirical claim, not a definitional identity, and is therefore open to falsification by the reported controls and general-capability checks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that agent trajectories naturally contain all necessary evidence scattered across turns and that standard SFT leaves this evidence unused; no free parameters or new invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption: Standard agent SFT masks tool responses and trains only turn-level tool selection, creating a supervision blind spot for scattered evidence.
    This premise is stated directly in the abstract as the motivation for ACC.
  • domain assumption: The evidence needed to answer the original question is present in the tool responses and environment observations across the trajectory.
    This is required for the compiled QA pairs to provide valid supervision.

pith-pipeline@v0.9.0 · 5865 in / 1517 out tokens · 62063 ms · 2026-05-22T07:11:46.597687+00:00 · methodology


