Parallel Context Compaction for Long-Horizon LLM Agent Serving

Burak Topcu; Chita Das; Mahmut Taylan Kandemir; Musa Cim

arxiv: 2605.23296 · v1 · pith:4QIDI6R5new · submitted 2026-05-22 · 💻 cs.AI


Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.AI

keywords parallel compaction · context compaction · long-horizon LLM agents · agent serving · summarization · HotpotQA · LoCoMo


The pith

Parallel compaction gives operators fine-grained control over summary volume in LLM agents while reducing wall time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents accumulate conversation histories that exceed context windows, and traditional summarization stalls inference for tens of seconds while producing unpredictable output volumes. The paper introduces parallel compaction to summarize blocks concurrently rather than in one sequential blocking call. Across four backbones from 8B to 120B parameters mixing dense and MoE models on HotpotQA and LoCoMo benchmarks, parallel compaction delivers predictable summary volume control and enables per-block prompt engineering. At matched compaction decode volume it shortens end-to-end wall time and raises throughput over the sequential baseline. A sympathetic reader cares because the method directly tackles the unpredictability and latency that currently limit reliable long-running agents.

Core claim

Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.

What carries the argument

Parallel compaction, which processes multiple summarization blocks concurrently instead of a single sequential blocking call.
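
As a rough illustration of the idea, the sketch below splits a history into fixed-size blocks and summarizes them concurrently with `asyncio`. Everything in it is an assumption made for illustration: `summarize_block` is a hypothetical stand-in for an LLM call (it truncates text and sleeps in proportion to its output length), and the block size and word caps are arbitrary. It is not the authors' implementation.

```python
import asyncio
import time


async def summarize_block(turns: list[str], max_words: int) -> str:
    """Hypothetical stand-in for an LLM summarization call.

    It keeps the first `max_words` words and sleeps in proportion to the
    output length, mimicking decode time that grows with output tokens.
    """
    words = " ".join(turns).split()[:max_words]
    await asyncio.sleep(0.01 * len(words))  # assumed decode-cost model
    return " ".join(words)


def split_into_blocks(history: list[str], block_size: int) -> list[list[str]]:
    """Cut the history into consecutive blocks of at most `block_size` turns."""
    return [history[i:i + block_size] for i in range(0, len(history), block_size)]


async def compact_sequential(history: list[str], max_words: int) -> str:
    """Baseline: one blocking summarization call over the whole history."""
    return await summarize_block(history, max_words)


async def compact_parallel(
    history: list[str], block_size: int, max_words_per_block: int
) -> list[str]:
    """Summarize every block concurrently; results keep block order."""
    blocks = split_into_blocks(history, block_size)
    calls = [summarize_block(block, max_words_per_block) for block in blocks]
    return list(await asyncio.gather(*calls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


history = [f"turn {i}: some message text" for i in range(12)]

# Matched output volume: 15 words in one call vs. 3 blocks x 5 words each.
sequential_summary, sequential_seconds = timed(compact_sequential(history, 15))
parallel_summaries, parallel_seconds = timed(compact_parallel(history, 4, 5))

print(f"sequential: {sequential_seconds:.3f}s, parallel: {parallel_seconds:.3f}s")
```

In this toy model the parallel path finishes sooner because each call decodes a third of the words and the calls overlap. On a real serving stack, concurrent requests share accelerator capacity, so the actual gain depends on batching behavior that this sketch does not model.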

If this is right

Operators obtain fine-grained control over summary volume by writing targeted prompts for each block.
End-to-end wall time decreases without any increase in total decode volume.
Compaction throughput rises across both dense and mixture-of-experts architectures.
Retained knowledge becomes more consistent across independent runs of the same history.
The technique applies equally to reasoning and non-reasoning models on multi-hop QA and long-context dialogue tasks.
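
One way to picture per-block control is a planner that gives each block its own instruction and its own output cap. The abstract does not say how the paper enforces summary volume, so this sketch simply assumes a hard per-block token cap; `BlockRequest`, `INSTRUCTIONS`, and `plan_block_requests` are hypothetical names, and the instruction texts are invented.

```python
from dataclasses import dataclass


@dataclass
class BlockRequest:
    block_id: int
    instruction: str
    max_tokens: int  # assumed hard cap on this block's summary length


# Hypothetical per-kind instructions; a real deployment would tune these.
INSTRUCTIONS = {
    "dialogue": "Summarize who said what and any commitments made.",
    "tool_output": "Keep tool names, arguments, and final results only.",
}


def plan_block_requests(block_kinds: list[str], total_budget: int) -> list[BlockRequest]:
    """Divide a total summary budget across blocks and attach a targeted prompt."""
    base, remainder = divmod(total_budget, len(block_kinds))
    planned = []
    for block_id, kind in enumerate(block_kinds):
        # Hand the leftover tokens to the earliest blocks so the caps sum exactly.
        cap = base + (1 if block_id < remainder else 0)
        planned.append(BlockRequest(block_id, INSTRUCTIONS[kind], cap))
    return planned


block_requests = plan_block_requests(
    ["tool_output", "dialogue", "dialogue"], total_budget=1000
)
total_cap = sum(request.max_tokens for request in block_requests)
print(total_cap, [request.max_tokens for request in block_requests])
```

Under this assumption the total summary volume is bounded by the sum of the per-block caps, which is what makes it predictable regardless of how any single call behaves.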

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Parallel compaction could support dynamic per-block length targets that adapt during an ongoing agent session.
It might combine with selective retention filters to further reduce information loss while keeping latency low.
The same block-wise parallelism could apply to other lossy maintenance steps such as memory pruning in multi-agent systems.
A direct test would compare fact-recall accuracy on held-out questions after parallel versus sequential compaction of the same histories.

Load-bearing premise

Parallel compaction preserves equivalent summarization quality and retained knowledge to the sequential baseline without introducing new inconsistencies or information loss across runs.

What would settle it

An experiment that feeds identical conversation histories to both parallel and sequential compaction, then measures factual coverage and output-token variance across repeated runs to check for degradation or increased inconsistency in the parallel case.
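
The two measurements named above could be computed along these lines. This is a minimal sketch under stated assumptions: variability is taken as the coefficient of variation of output-token counts, and factual coverage as naive case-insensitive substring matching against a list of reference facts. Both function names are hypothetical, and all numbers and strings are made up for illustration; none are results from the paper.

```python
import statistics


def token_cv(output_token_counts: list[int]) -> float:
    """Coefficient of variation (sample stdev / mean) of summary lengths."""
    mean = statistics.mean(output_token_counts)
    return statistics.stdev(output_token_counts) / mean


def fact_coverage(summary: str, facts: list[str]) -> float:
    """Fraction of reference facts found verbatim (case-insensitive) in a summary."""
    text = summary.lower()
    hits = sum(1 for fact in facts if fact.lower() in text)
    return hits / len(facts)


# Made-up token counts for four repeated runs; NOT results from the paper.
sequential_runs = [900, 1500, 600, 1200]
parallel_runs = [1000, 1040, 980, 1020]

sequential_cv = token_cv(sequential_runs)
parallel_cv = token_cv(parallel_runs)

# Toy summary and fact list, also invented for illustration.
coverage = fact_coverage("Alice moved to Paris in 2021.", ["Paris", "2021", "Berlin"])

print(f"CV sequential={sequential_cv:.3f}, parallel={parallel_cv:.3f}, coverage={coverage:.2f}")
```

Substring matching is a deliberately crude proxy; a real study would more likely use downstream QA accuracy or a judged overlap metric.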

Figures

Figures reproduced from arXiv: 2605.23296 by Burak Topcu, Chita Das, Mahmut Taylan Kandemir, Musa Cim.

**Figure 2.** Output tokens vs. input length for three prompt variants and four backbones under Sequential compaction.

**Figure 3.** Output tokens per run across 10 repeated runs at

**Figure 4.** Accuracy per model and configuration on (a) HotpotQA and (b) LoCoMo.

**Figure 5.** GPT-OSS-120B output tokens vs. block size across

Original abstract

Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Parallel compaction is a straightforward systems tweak for agent context, but the abstract provides no numbers or quality checks so the benefit remains unproven.


The paper's main move is to run context compaction in parallel blocks rather than one long sequential call. This is presented as giving operators better control over summary length and cutting wall time at the same decode volume on HotpotQA and LoCoMo. The evaluation runs across four backbones from 8B to 120B, mixing dense and MoE models, which is a reasonable breadth for an engineering claim. That part is useful if you're building agent serving stacks and want to see how the idea behaves on different architectures. The framing of the problem (unpredictable output volume and lossy summarization stalling inference) is also clear and practical.

The soft spot is exactly what the stress-test note flags: the abstract states performance gains at matched volume but gives zero quantitative results, no variance numbers, and no description of how they measured whether parallel summaries keep the same retained knowledge or avoid new inconsistencies. Without downstream accuracy, overlap metrics, or human checks showing equivalence to the sequential baseline, the throughput claim does not yet establish a usable improvement. If the full paper contains those checks and the numbers hold, the work becomes more interesting for production readers. As it stands from the abstract, the central assumption is untested.

This is aimed at people who deploy long-horizon agents and care about serving latency. A systems-oriented reader could pull the parallel idea and the model sweep even if the results need tightening. It deserves peer review because the problem is concrete and the setup is straightforward to evaluate, though the paper would need to add the missing quality data before acceptance.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes parallel compaction as a method for managing growing context in long-horizon LLM agents via LLM-based summarization. It claims that, unlike sequential synchronous compaction, the parallel approach grants operators fine-grained control over summary volume, supports targeted per-block prompt engineering, and—at matched compaction decode volume—reduces end-to-end wall time while increasing compaction throughput. The claims are supported by characterization experiments on HotpotQA and LoCoMo using four backbones (8B–120B, dense and MoE, reasoning and non-reasoning models).

Significance. If the quality-equivalence assumption holds, the technique would address a practical serving bottleneck by making compaction latency more predictable and controllable. The multi-model, multi-benchmark evaluation design is a strength, as is the explicit focus on matched decode volume rather than raw token count.

major comments (2)

[Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.
[Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quality validation and clearer quantitative presentation. We respond to each major comment below, indicating where revisions will be made.


Referee: [Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.

Authors: We acknowledge that the manuscript does not report downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks comparing retained knowledge between parallel and sequential compaction. The work centers on serving characteristics (predictable volume control and wall-time reduction at matched decode volume) rather than end-task quality equivalence. The same summarization model and prompt templates are used in both conditions, but we agree this leaves the equivalence assumption untested. We will revise the manuscript to add an explicit limitations discussion noting the absence of these metrics and the scope of the current evaluation. revision: yes
Referee: [Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.

Authors: Abstracts are concise overviews and conventionally omit tables, variance statistics, and error bars; the full evaluation sections of the manuscript present the multi-model, multi-benchmark results with the relevant quantitative data. We will revise the abstract to reference the key observed improvements (e.g., wall-time reduction at matched decode volume) and direct readers to the detailed results and statistics in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparison to explicit baseline

full rationale

The paper introduces parallel compaction and reports direct wall-time and throughput measurements against a described sequential synchronous baseline on HotpotQA and LoCoMo. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. All central claims are grounded in external benchmark runs rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new entities introduced; the work is an algorithmic and systems proposal relying on standard LLM inference assumptions.



