Parallel Context Compaction for Long-Horizon LLM Agent Serving
Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3
The pith
Parallel compaction gives operators fine-grained control over summary volume in LLM agents while reducing wall time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.
What carries the argument
Parallel compaction, which processes multiple summarization blocks concurrently instead of a single sequential blocking call.
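The contrast between the two modes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `summarize_block`, `split_into_blocks`, `compact_sequential`, and `compact_parallel` are hypothetical names, the LLM call is replaced by a stub that truncates text, and the simulated latency (proportional to the output budget) is an assumption made only so the example runs.

```python
import asyncio


async def summarize_block(block: list[str], max_tokens: int) -> str:
    """Hypothetical stand-in for one LLM summarization request."""
    # Assumption: decode latency grows with the output budget.
    await asyncio.sleep(0.001 * max_tokens)
    words = " ".join(block).split()
    # Crude placeholder "summary": keep at most max_tokens whitespace tokens.
    return " ".join(words[:max_tokens])


def split_into_blocks(history: list[str], num_blocks: int) -> list[list[str]]:
    """Split the history into contiguous blocks of near-equal turn count."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    size = max(1, -(-len(history) // num_blocks))  # ceiling division
    return [history[i:i + size] for i in range(0, len(history), size)]


async def compact_sequential(history: list[str], max_tokens: int) -> str:
    """Baseline: one blocking summarization call over the whole history."""
    return await summarize_block(history, max_tokens)


async def compact_parallel(
    history: list[str], num_blocks: int, max_tokens_per_block: int
) -> str:
    """Summarize each block concurrently, then join summaries in order."""
    blocks = split_into_blocks(history, num_blocks)
    summaries = await asyncio.gather(
        *(summarize_block(b, max_tokens_per_block) for b in blocks)
    )
    return "\n".join(summaries)
```

Calling `asyncio.run(compact_parallel(history, 4, 25))` and `asyncio.run(compact_sequential(history, 100))` requests the same total output budget in both cases, which mirrors the "matched compaction decode volume" framing. Any timing difference in this stub reflects only the simulated latency, not a measurement.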
If this is right
- Operators obtain fine-grained control over summary volume by writing targeted prompts for each block.
- End-to-end wall time decreases without any increase in total decode volume.
- Compaction throughput rises across both dense and mixture-of-experts architectures.
- Retained knowledge becomes more consistent across independent runs of the same history.
- The technique applies equally to reasoning and non-reasoning models on multi-hop QA and long-context dialogue tasks.
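One way an operator might express per-block volume control is to fix a total summary budget and divide it across blocks by weight, for example giving recent blocks a larger share. The sketch below is an assumption about how such control could look, not a description of the paper's mechanism; `allocate_budgets` is a hypothetical helper, and the resulting integers would be passed as per-block output caps.

```python
def allocate_budgets(total_tokens: int, weights: list[float]) -> list[int]:
    """Split total_tokens across blocks in proportion to weights.

    Uses largest-remainder rounding so the integer budgets sum exactly
    to total_tokens.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    if not weights or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")

    scale = total_tokens / sum(weights)
    raw = [w * scale for w in weights]
    budgets = [int(r) for r in raw]  # floor of each proportional share

    # Hand the leftover tokens to the blocks with the largest fractional part.
    leftover = total_tokens - sum(budgets)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Four blocks, oldest first; the most recent block gets half the budget.
print(allocate_budgets(1000, [1, 1, 2, 4]))  # [125, 125, 250, 500]
```

Because the budgets sum to the requested total, the total decode volume stays fixed while the operator changes how it is distributed across blocks.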
Where Pith is reading between the lines
- Parallel compaction could support dynamic per-block length targets that adapt during an ongoing agent session.
- It might combine with selective retention filters to further reduce information loss while keeping latency low.
- The same block-wise parallelism could apply to other lossy maintenance steps such as memory pruning in multi-agent systems.
- A direct test would compare fact-recall accuracy on held-out questions after parallel versus sequential compaction of the same histories.
Load-bearing premise
Parallel compaction preserves equivalent summarization quality and retained knowledge to the sequential baseline without introducing new inconsistencies or information loss across runs.
What would settle it
An experiment that feeds identical conversation histories to both parallel and sequential compaction, then measures factual coverage and output-token variance across repeated runs to check for degradation or increased inconsistency in the parallel case.
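The measurement described above could be prototyped along these lines. This is a rough sketch under stated assumptions: `fact_coverage` and `run_stats` are hypothetical helpers, factual coverage is approximated by case-insensitive substring matching against a hand-written list of gold facts, and whitespace-split words stand in for output tokens. A real study would need a proper tokenizer and a more robust fact-matching method.

```python
import statistics


def fact_coverage(summary: str, gold_facts: list[str]) -> float:
    """Fraction of gold facts that appear verbatim (case-insensitive)."""
    if not gold_facts:
        return 1.0
    text = summary.lower()
    hits = sum(1 for fact in gold_facts if fact.lower() in text)
    return hits / len(gold_facts)


def run_stats(summaries: list[str], gold_facts: list[str]) -> dict[str, float]:
    """Aggregate coverage and length spread over repeated runs."""
    if not summaries:
        raise ValueError("need at least one summary")
    lengths = [len(s.split()) for s in summaries]  # proxy for output tokens
    coverages = [fact_coverage(s, gold_facts) for s in summaries]
    return {
        "mean_coverage": statistics.mean(coverages),
        "coverage_stdev": statistics.pstdev(coverages),
        "mean_length": statistics.mean(lengths),
        "length_stdev": statistics.pstdev(lengths),
    }
```

Running `run_stats` once on repeated sequential summaries and once on repeated parallel summaries of the same history gives two dictionaries to compare: lower `mean_coverage` or higher `length_stdev` in the parallel case would indicate the degradation or inconsistency the test is looking for.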
Original abstract
Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes parallel compaction as a method for managing growing context in long-horizon LLM agents via LLM-based summarization. It claims that, unlike sequential synchronous compaction, the parallel approach grants operators fine-grained control over summary volume, supports targeted per-block prompt engineering, and, at matched compaction decode volume, reduces end-to-end wall time while increasing compaction throughput. The claims are supported by characterization experiments on HotpotQA and LoCoMo using four backbones (8B–120B, dense and MoE, reasoning and non-reasoning models).
Significance. If the quality-equivalence assumption holds, the technique would address a practical serving bottleneck by making compaction latency more predictable and controllable. The multi-model, multi-benchmark evaluation design is a strength, as is the explicit focus on matched decode volume rather than raw token count.
Major comments (2)
- [Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.
- [Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting the need for quality validation and clearer quantitative presentation. We respond to each major comment below, indicating where revisions will be made.
- Referee: [Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.
Authors: We acknowledge that the manuscript does not report downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks comparing retained knowledge between parallel and sequential compaction. The work centers on serving characteristics (predictable volume control and wall-time reduction at matched decode volume) rather than end-task quality equivalence. The same summarization model and prompt templates are used in both conditions, but we agree this leaves the equivalence assumption untested. We will revise the manuscript to add an explicit limitations discussion noting the absence of these metrics and the scope of the current evaluation. revision: yes
- Referee: [Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.
Authors: Abstracts are concise overviews and conventionally omit tables, variance statistics, and error bars; the full evaluation sections of the manuscript present the multi-model, multi-benchmark results with the relevant quantitative data. We will revise the abstract to reference the key observed improvements (e.g., wall-time reduction at matched decode volume) and direct readers to the detailed results and statistics in the body of the paper. revision: yes
Circularity Check
No circularity; empirical comparison to explicit baseline
The paper introduces parallel compaction and reports direct wall-time and throughput measurements against a described sequential synchronous baseline on HotpotQA and LoCoMo. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. All central claims are grounded in external benchmark runs rather than any self-referential reduction.