ACC: Compiling Agent Trajectories for Long-Context Training

Feng Zhao; Kou Shi; Lijun Wu; Lin Chen; Qisheng Su; Shiting Huang; Yiming Zhao; Yu Zeng; Zehui Chen; Zhen Fang; Ziao Zhang

arxiv: 2605.21850 · v1 · pith:HMIM27PXnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao

Pith reviewed 2026-05-22 07:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords long-context reasoning · agent trajectories · supervised fine-tuning · LLM training · context integration · tool responses · benchmark evaluation

The pith

Compiling scattered evidence from agent trajectories into QA pairs enables effective long-context training for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors propose Agent Context Compilation to transform the multi-turn outputs from agents solving problems into long-context question-answer pairs. Standard training only supervises tool selection and masks the responses, leaving the integration of distant information unlearned. By making the full evidence explicit in training examples, ACC provides direct supervision for reasoning over long contexts. This matters because it turns abundant agent interaction data into useful training signals without needing expensive long-document curation.
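The supervision gap described above can be illustrated with a toy loss mask. This is a minimal sketch, assuming a simplified trajectory format; the `Turn` class, the role names, and the example turns are hypothetical and not taken from the paper. It only shows that a mask restricted to assistant turns leaves every tool response outside the loss.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    role: str  # hypothetical roles: "user", "assistant", or "tool"
    text: str


def agent_sft_loss_mask(turns: List[Turn]) -> List[bool]:
    """Return one flag per turn: True if the turn's tokens contribute to the loss.

    Assumption: standard agent SFT supervises only assistant turns
    (tool selection) and masks user and tool turns.
    """
    return [t.role == "assistant" for t in turns]


# Toy trajectory, invented for illustration only.
trajectory = [
    Turn("user", "Who founded the company that makes the X100?"),
    Turn("assistant", 'search("X100 manufacturer")'),
    Turn("tool", "The X100 is made by Acme Corp."),
    Turn("assistant", 'search("Acme Corp founder")'),
    Turn("tool", "Acme Corp was founded by Jane Doe."),
    Turn("assistant", "Jane Doe"),
]

mask = agent_sft_loss_mask(trajectory)
# Evidence that the model reads but is never trained to integrate.
unsupervised_evidence = [
    t.text for t, m in zip(trajectory, mask) if not m and t.role == "tool"
]
```

In this toy case both tool responses carry the facts needed for the final answer, yet neither receives a training signal under the mask.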

Core claim

ACC converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data.

What carries the argument

Agent Context Compilation, the process of reformatting agent trajectories into direct long-context QA pairs by including all scattered tool responses and observations with the original question.
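The compilation step might look roughly like the following. It is a sketch under stated assumptions, not the authors' pipeline: the `Turn` class, the `[Document i]` labels, and the prompt wording are all invented here, and the paper's actual handling of which observations to keep (for example, visited versus unvisited search results) is not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    role: str  # hypothetical roles: "assistant" or "tool"
    text: str


def compile_trajectory(question: str, turns: List[Turn], answer: str) -> Dict[str, str]:
    """Assemble tool responses into one long context followed by the question.

    Assumption: every tool turn is kept, in trajectory order, and the
    assistant's tool calls are dropped so the model must answer directly.
    """
    evidence = [t.text for t in turns if t.role == "tool"]
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(evidence)
    )
    prompt = f"{context}\n\nQuestion: {question}\nAnswer directly using the documents above."
    return {"prompt": prompt, "answer": answer}


# Toy trajectory, invented for illustration only.
turns = [
    Turn("assistant", 'search("X100 manufacturer")'),
    Turn("tool", "The X100 is made by Acme Corp."),
    Turn("assistant", 'search("Acme Corp founder")'),
    Turn("tool", "Acme Corp was founded by Jane Doe."),
]
pair = compile_trajectory("Who founded the company that makes the X100?", turns, "Jane Doe")
```

The resulting `pair["prompt"]` holds both pieces of evidence and the question in a single context, and `pair["answer"]` is the only supervised target.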

If this is right

Training with ACC leads to substantial gains on MRCR and GraphWalks benchmarks that require integrating information across many turns.
Performance reaches levels comparable to the much larger Qwen3-235B-A22B, while scores on general capability benchmarks (GPQA, MMLU-Pro, AIME, IFEval) are preserved.
The method works by creating explicit supervision for long-range dependencies in context.
Further analysis indicates changes in attention patterns and expert specialization in the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

ACC could be applied to other sources of sequential data to create long-context training examples automatically.
If the evidence in trajectories is always complete, this approach might generalize to many agent-based tasks beyond the tested ones.
Combining ACC with other long-context methods might yield even stronger results on extended context lengths.

Load-bearing premise

The scattered tool responses and environment observations in the trajectories always include the complete and unbiased evidence needed to answer the original question.

What would settle it

Training Qwen3-30B-A3B with standard agent SFT on the same trajectories, instead of ACC, and observing MRCR and GraphWalks gains comparable to those reported for ACC would indicate that the compilation step is not what drives the improvement.

Figures

Figures reproduced from arXiv: 2605.21850 by Feng Zhao, Kou Shi, Lijun Wu, Lin Chen, Qisheng Su, Shiting Huang, Yiming Zhao, Yu Zeng, Zehui Chen, Zhen Fang, Ziao Zhang.

**Figure 1.** Overview of ACC. Multi-turn agent trajectories (Search, SWE, SQL) are compiled into long-context QA pairs by assembling tool responses and environment contexts.

**Figure 2.** Search Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory (documents visited are highlighted in blue, documents returned by search but never visited are highlighted in red). The bottom section shows the ACC compiled QA. Examples for SWE and SQL agents are provided in Appendix A.

**Figure 3.** Token length distribution of the ACC training data. We bin the samples by token count and plot the per-bin frequency for the training data compiled from each agent type.

**Figure 4.** Two-dimensional UMAP projection of training queries.

**Figure 5.** Attention distance (top) and expert routing frequency (bottom) changes after ACC training.

**Figure 6.** SWE Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory. At each turn the agent opens a single file from the provided codebase snapshot and decides either to (Examine) it for understanding or to (Modify) it to fix the bug. The bottom section shows the ACC-compiled QA, where only the opened eviden…

**Figure 7.** SQL Agent Trajectory Compilation Example. The top section shows the original question and ground truth answer. The middle section shows the original agentic trajectory, where the agent executes a recursive SQL query to traverse the referral graph. The bottom section shows the ACC-compiled QA, where the complete contents of the relevant database table are assembled into the provided long-context background.…

Original abstract

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ACC repurposes agent trajectories into long-context QA data and reports solid gains on MRCR and GraphWalks for a mid-size model, but the causal link to true distant-context integration still needs tighter controls.

Full letter

The main thing here is that the authors turn the natural output of agent runs—tool calls and environment observations scattered over many turns—into direct long-context QA pairs. This gives supervision for integrating evidence that standard agent SFT leaves on the table because it only trains tool selection at each step. The result is a simple data recipe that can be layered on top of existing long-context methods without extra human labels.

Referee Report

3 major / 2 minor

Summary. The paper proposes Agent Context Compilation (ACC), which converts multi-turn agent trajectories (from search, software engineering, and database agents) into long-context QA pairs by pairing the original question with scattered tool responses and environment observations. This creates explicit supervision for cross-turn reasoning without tool use. The authors fine-tune Qwen3-30B-A3B on ACC data and report large gains on MRCR (68.3, +18.1) and GraphWalks (77.5, +7.6), reaching performance comparable to Qwen3-235B-A22B, while maintaining scores on GPQA, MMLU-Pro, AIME, and IFEval. Mechanism analysis indicates task-adaptive attention restructuring and expert specialization.

Significance. If the causal link between ACC data and the reported gains holds after proper controls, the method supplies a scalable, annotation-free source of long-context supervision that leverages existing agent interactions. This could meaningfully advance training for long-range dependency tasks and narrow the gap between smaller and larger models on benchmarks requiring integration of distant context segments.

major comments (3)

[§4] §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.
[§3.1–3.2] §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.
[Mechanism Analysis] Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.

minor comments (2)

[Abstract] Abstract and §1: Brief one-sentence definitions or citations for the MRCR and GraphWalks benchmarks would help readers unfamiliar with these new long-context tasks.
[Figures] Figure captions (e.g., attention visualizations): Adding explicit call-outs or arrows to highlight the claimed restructuring patterns would improve clarity.
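The quantitative comparison requested in the third major comment could rest on simple per-query statistics. The sketch below is illustrative only: the function names are invented here, and it assumes a single attention row that is already normalized to sum to 1. It computes the Shannon entropy of the row and the expected query-to-key distance, two quantities of the kind the report asks for.

```python
import math
from typing import Sequence


def attention_entropy(weights: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one query's attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def mean_attention_distance(weights: Sequence[float], query_pos: int) -> float:
    """Expected absolute distance between the query position and attended keys."""
    return sum(w * abs(query_pos - k) for k, w in enumerate(weights))


# Toy attention rows over four key positions, with the query at position 3.
uniform_row = [0.25, 0.25, 0.25, 0.25]
peaked_row = [1.0, 0.0, 0.0, 0.0]  # all mass on the most distant key

uniform_entropy = attention_entropy(uniform_row)
peaked_entropy = attention_entropy(peaked_row)
uniform_distance = mean_attention_distance(uniform_row, query_pos=3)
peaked_distance = mean_attention_distance(peaked_row, query_pos=3)
```

A row that concentrates on one distant key has low entropy and a large mean distance, so reporting both statistics for an ACC-trained model and a matched-token baseline would separate "more focused" from "longer range".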

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving experimental rigor and validating key assumptions. We address each point below and have made revisions to the manuscript to incorporate additional details, controls, and quantitative analyses where feasible.

Point-by-point responses

Referee: [§4] §4 (Experiments and Results): The headline improvements on MRCR and GraphWalks are presented without reporting the total number of ACC training tokens, the exact baseline training regime (e.g., standard long-context SFT with matched token count), or statistical significance across multiple runs. These omissions make it difficult to isolate the contribution of the compilation procedure from simple increases in training data volume or prompt formatting differences.

Authors: We agree these details are essential for isolating the effect of the compilation procedure. In the revised manuscript we now explicitly report the total number of ACC training tokens and clarify that the baseline consists of standard long-context SFT performed on an equivalent token budget drawn from general long-context corpora (without the trajectory compilation step). We have also added results from three independent runs with different random seeds, reporting means and standard deviations in the updated tables. These changes allow direct comparison of data volume and formatting effects.
Revision: yes

Referee: [§3.1–3.2] §3.1–3.2 (ACC Compilation Procedure): The central claim that compiled trajectories supply complete, artifact-free evidence for distant-context integration rests on the untested assumption that every necessary fact appears in the collected tool responses and that the QA-pair conversion introduces no spurious local cues or agent-specific regularities. No ablation or control experiment (e.g., masking subsets of observations or comparing against shuffled evidence) is reported to verify this assumption, which is load-bearing for attributing benchmark gains to genuine long-range reasoning.

Authors: This is a fair critique of the load-bearing assumption. While agent trajectories are generated by successful task completion (ensuring the necessary evidence is present in the collected observations), we acknowledge the need for explicit verification. The revised manuscript includes two new control experiments: (1) random masking of 30% of tool responses and environment observations, which produces a substantial performance drop on both MRCR and GraphWalks; and (2) a shuffled-evidence baseline in which the order of observations is permuted while preserving content, resulting in near-chance performance. These results support that gains arise from learning to integrate distant context rather than from local cues or regularities introduced during compilation.
Revision: yes

Referee: [Mechanism Analysis] Mechanism Analysis section: The reported task-adaptive attention restructuring and expert specialization are described qualitatively but lack quantitative comparisons (e.g., attention entropy or expert activation statistics) against a non-ACC baseline trained on the same total tokens. Without such controls, it remains unclear whether these patterns are caused by ACC or are generic consequences of additional long-context fine-tuning.

Authors: We appreciate the request for quantitative grounding. The revised manuscript now includes direct comparisons against a non-ACC baseline trained on the identical total token count. We report attention entropy over long-range token pairs and per-expert activation frequencies for both models. The ACC-trained model shows statistically lower entropy on cross-turn dependencies and more pronounced specialization in experts handling multi-turn integration, differences that are not observed in the matched-token baseline. These metrics are presented alongside the original qualitative observations.
Revision: yes
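The two controls named in the simulated response to the second comment (random masking of observations and a shuffled-evidence variant) could be constructed along these lines. This is a hypothetical sketch: the function names, the seeding scheme, and the rounding of the masked count are assumptions made here, and nothing below reproduces an experiment from the paper.

```python
import random
from typing import List


def mask_observations(observations: List[str], fraction: float, seed: int = 0) -> List[str]:
    """Drop a random subset of observations, keeping the survivors in order."""
    rng = random.Random(seed)
    n_drop = round(fraction * len(observations))
    dropped = set(rng.sample(range(len(observations)), n_drop))
    return [o for i, o in enumerate(observations) if i not in dropped]


def shuffle_observations(observations: List[str], seed: int = 0) -> List[str]:
    """Permute the order of observations while preserving their content."""
    rng = random.Random(seed)
    shuffled = list(observations)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Ten placeholder observations standing in for tool responses.
observations = [f"observation {i}" for i in range(10)]
masked = mask_observations(observations, fraction=0.3)
shuffled = shuffle_observations(observations)
```

Fixing the seed keeps each control reproducible, so a masked or shuffled training set can be regenerated exactly when comparing runs.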

Circularity Check

0 steps flagged

No circularity: ACC is an empirical data-generation procedure validated on external benchmarks

full rationale

The paper describes a straightforward compilation procedure that turns existing agent trajectories into long-context QA pairs for SFT. Reported gains (MRCR 68.3, GraphWalks 77.5) are measured after training and testing on separate benchmarks; no equations, fitted parameters, or self-citations reduce these outcomes to quantities defined by the method itself. The core assumption that compiled pairs supply complete evidence is an empirical claim, not a definitional identity, and is therefore open to falsification by the reported controls and general-capability checks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that agent trajectories naturally contain all necessary evidence scattered across turns and that standard SFT leaves this evidence unused; no free parameters or new invented entities are introduced in the abstract description.

axioms (2)

domain assumption: Standard agent SFT masks tool responses and trains only turn-level tool selection, creating a supervision blind spot for scattered evidence.
This premise is stated directly in the abstract as the motivation for ACC.

domain assumption: The evidence needed to answer the original question is present in the tool responses and environment observations across the trajectory.
This is required for the compiled QA pairs to provide valid supervision.

pith-pipeline@v0.9.0 · 5865 in / 1517 out tokens · 62063 ms · 2026-05-22T07:11:46.597687+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The new training objective is L_ACC = -∑ log P(token_j | q, C, token_<j)
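The quoted objective is a negative log-likelihood over answer tokens, conditioned on the question q and the compiled context C. A minimal numeric sketch follows, assuming the sum runs over answer positions only and that the per-position logits come from a model that has already read q and C; the function names and toy values are illustrative and not from the paper.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def acc_loss(answer_logits: List[List[float]], answer_ids: List[int]) -> float:
    """Compute -sum_j log P(token_j | q, C, token_<j) over the answer tokens.

    answer_logits[j] is assumed to be the model's output at answer position j,
    already conditioned on q, C, and the preceding answer tokens.
    """
    return -sum(log_softmax(row)[tok] for row, tok in zip(answer_logits, answer_ids))


# Toy case: vocabulary of 4, three answer tokens, uniform predictions.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
uniform_loss = acc_loss(uniform_logits, [2, 0, 1])

# A confident, correct prediction yields a loss near zero.
confident_loss = acc_loss([[10.0, 0.0, 0.0, 0.0]], [0])
```

With uniform predictions each token costs log 4 nats, so the three-token answer costs 3 log 4; context and question tokens add nothing because they are not part of the sum.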

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

[1] AIME. American invitational mathematics examination. https://artofproblemsolving.com/wiki/index.php/AIME

[2] Anthropic. Claude opus 4.6 system card. https://www.anthropic.com/news/claude-opus-4-6, 2026. Accessed: 2026-05-02

[3] Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li. Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks, 2025

[4] G. Chen, X. Li, M. Q. Shieh, and L. Bing. Longpo: Long context self-evolution of large language models through short-to-long preference optimization, 2025

[5] G. Chen, M. Q. Shieh, and L. Bing. Longrlvr: Long-context reinforcement learning requires verifiable context rewards, 2026

[6–7] Google DeepMind. Gemini 3.1 pro. https://deepmind.google/models/gemini/pro/. Accessed: 2026-05-02

[8] C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. Ruler: What's the real context size of your long-context language models?, 2024

[9] J. Jia, X. Wu, C. Gao, Z. Chen, Z. Lin, Z. Li, W. Wang, H. Xu, D. Jin, D. Zhang, and B. Guo. Litelong: Resource-efficient long-context data synthesis for llms, 2025

[10] G. Kamradt. Llmtest (needle in a haystack). https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023. Accessed: 2026-05-02

[11] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The narrativeqa reading comprehension challenge, 2017

[12] A. Lahoti, K. Y. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, and A. Gu. Mamba-3: Improved sequence modeling using state space principles, 2026

[13] X. Liu, Y. Song, Z. Liu, Z. Huang, Q. Guo, Z. Liu, S. Lian, Z. He, and X. Qiu. Beyond real: Imaginary extension of rotary position embeddings for long-context llms, 2025

[14] K. Lv, X. Liu, Q. Guo, H. Yan, C. He, X. Qiu, and D. Lin. Longwanjuan: Towards systematic measurement for long text quality, 2024

[15] OpenAI. Introducing GPT-4.1, 2025. Accessed: 2026-05-02. Introduces the MRCR and GraphWalks benchmarks

[16] OpenAI. Gpt-5.4. https://openai.com/index/introducing-gpt-5-4/, 2026. Accessed: 2026-05-02

[17] Qwen Team. Qwen3.5. https://qwen.ai/blog?id=qwen3.5, 2026. Accessed: 2026-05-02

[18] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof qa benchmark, 2023

[19] W. Shen, Z. Yang, C. Li, Z. Lu, M. Peng, H. Sun, Y. Shi, S. Liao, S. Lai, B. Zhang, D. Liu, F. Huang, J. Zhou, and M. Yan. Qwenlong-l1.5: Post-training recipe for long-context reasoning and memory management, 2025

[20] J. Tang, Y. Zhao, K. Zhu, G. Xiao, B. Kasikci, and S. Han. Quest: Query-aware sparsity for efficient long-context llm inference, 2024

[21] M. L. Team, B. Wang, Bayan, B. Xiao, B. Zhang, B. Rong, B. Chen, C. Wan, C. Zhang, C. Huang, C. Chen, C. Chen, C. Yang, C. Yang, C. Han, D. Peng, D. Ruan, D. Xin, D. Wang, D. Yang, F. Liu, F. Chen, F. Yang, G. Dong, G. Huang, G. Xu, G. Wan, G. Tan, G. Yu, H. Qiu, H. Lu, H. Liu, H. Xiang, J. Wu, J. Yang, J. Liu, J. Huang, J. Wang, J. Ding, J. Jiang, J. Kua...

[22] Q. Tian, W. Zhu, X. Liu, X. Wang, and R. Wang. Mrrope: Mixed-radix rotary position embedding, 2026

[23] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Musique: Multihop questions via single-hop question composition, 2022

[24] S. Wang, G. Zhang, L. L. Zhang, N. Shang, F. Yang, D. Chen, and M. Yang. Loongrl: Reinforcement learning for advanced reasoning over long contexts, 2025

[25] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024

[26] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. ...

[27] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018

[28] H. Yu, T. Chen, J. Feng, J. Chen, W. Dai, Q. Yu, Y.-Q. Zhang, W.-Y. Ma, J. Liu, M. Wang, and H. Zhou. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent, 2025

[29] J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y. X. Wei, L. Wang, Z. Xiao, Y. Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native sparse attention: Hardware-aligned and natively trainable sparse attention, 2025

[30] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models, 2023.

Appendix A. Agent Trajectory Compilation Examples

Figures 6 and 7 show compiled trajectories for SWE and SQL agents. Both follow the same ACC pipeline as the search agent in Figure 2. Figure 6 shows a compiled ...
