arxiv: 2605.23296 · v1 · pith:4QIDI6R5new · submitted 2026-05-22 · 💻 cs.AI

Parallel Context Compaction for Long-Horizon LLM Agent Serving

Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.AI
keywords parallel compaction · context compaction · long-horizon LLM agents · agent serving · summarization · HotpotQA · LoCoMo

The pith

Parallel compaction gives operators fine-grained control over summary volume in LLM agents while reducing wall time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long-horizon LLM agents accumulate conversation histories that exceed context windows, and traditional summarization stalls inference for tens of seconds while producing unpredictable output volumes. The paper introduces parallel compaction to summarize blocks concurrently rather than in one sequential blocking call. Across four backbones from 8B to 120B parameters mixing dense and MoE models on HotpotQA and LoCoMo benchmarks, parallel compaction delivers predictable summary volume control and enables per-block prompt engineering. At matched compaction decode volume it shortens end-to-end wall time and raises throughput over the sequential baseline. A sympathetic reader cares because the method directly tackles the unpredictability and latency that currently limit reliable long-running agents.

Core claim

Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.

What carries the argument

Parallel compaction: multiple summarization blocks are processed concurrently instead of in a single sequential blocking call.
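The contrast between the two modes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`split_into_blocks`, `compact_sequential`, `compact_parallel`, `fake_summarize`), the fixed-size blocking rule, and the per-block token cap are all assumptions made for the example, and `fake_summarize` is a stand-in for a real LLM call.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical signature for a summarizer: (text, max_tokens) -> summary.
Summarizer = Callable[[str, int], Awaitable[str]]


def split_into_blocks(messages: List[str], block_size: int) -> List[List[str]]:
    """Partition the history into contiguous blocks of at most block_size messages."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [messages[i:i + block_size] for i in range(0, len(messages), block_size)]


async def compact_sequential(messages: List[str], summarize: Summarizer,
                             max_tokens: int) -> str:
    """Baseline: one blocking summarization call over the whole history."""
    return await summarize("\n".join(messages), max_tokens)


async def compact_parallel(messages: List[str], summarize: Summarizer,
                           block_size: int, max_tokens_per_block: int) -> str:
    """Summarize each block concurrently, then join the summaries in order."""
    blocks = split_into_blocks(messages, block_size)
    # asyncio.gather returns results in the order the awaitables were passed,
    # so the block summaries stay in chronological order.
    summaries = await asyncio.gather(
        *(summarize("\n".join(block), max_tokens_per_block) for block in blocks)
    )
    return "\n".join(summaries)


async def fake_summarize(text: str, max_tokens: int) -> str:
    """Stand-in for an LLM call (assumption): keep the first max_tokens words."""
    await asyncio.sleep(0.01)  # simulate request latency
    return " ".join(text.split()[:max_tokens])
```

With a real backend, `fake_summarize` would be replaced by an async client call, and each block could receive its own prompt. A call such as `asyncio.run(compact_parallel(history, fake_summarize, block_size=2, max_tokens_per_block=3))` returns one summary line per block.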

If this is right

  • Operators obtain fine-grained control over summary volume by writing targeted prompts for each block.
  • End-to-end wall time decreases without any increase in total decode volume.
  • Compaction throughput rises across both dense and mixture-of-experts architectures.
  • Retained knowledge becomes more consistent across independent runs of the same history.
  • The technique applies equally to reasoning and non-reasoning models on multi-hop QA and long-context dialogue tasks.
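The wall-time claim at matched decode volume can be made concrete with a toy latency model. The model below is an assumption introduced for illustration, not a result from the paper: it ignores prefill, treats decode time as linear in output tokens, and folds any batching overhead into a single hypothetical `batch_slowdown` factor.

```python
from typing import Sequence


def sequential_wall_time(decode_tokens: int, sec_per_token: float) -> float:
    """One blocking call decodes every summary token one after another."""
    return decode_tokens * sec_per_token


def parallel_wall_time(block_decode_tokens: Sequence[int], sec_per_token: float,
                       batch_slowdown: float = 1.0) -> float:
    """Blocks decode concurrently, so the longest block bounds the wall time.

    batch_slowdown >= 1.0 is a hypothetical factor for the per-step cost of
    decoding several sequences in one batch; 1.0 is the idealized case.
    """
    if not block_decode_tokens:
        return 0.0
    return max(block_decode_tokens) * sec_per_token * batch_slowdown


def compaction_throughput(total_decode_tokens: int, wall_time: float) -> float:
    """Summary tokens produced per second of wall time."""
    return total_decode_tokens / wall_time
```

Under this model, splitting a fixed decode volume evenly over k blocks cuts wall time by at most a factor of k; uneven block outputs or a batch_slowdown above 1.0 reduce the gain. Measured speedups on a real serving stack would need to be taken from the paper's results.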

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Parallel compaction could support dynamic per-block length targets that adapt during an ongoing agent session.
  • It might combine with selective retention filters to further reduce information loss while keeping latency low.
  • The same block-wise parallelism could apply to other lossy maintenance steps such as memory pruning in multi-agent systems.
  • A direct test would compare fact-recall accuracy on held-out questions after parallel versus sequential compaction of the same histories.

Load-bearing premise

Parallel compaction preserves equivalent summarization quality and retained knowledge to the sequential baseline without introducing new inconsistencies or information loss across runs.

What would settle it

An experiment that feeds identical conversation histories to both parallel and sequential compaction, then measures factual coverage and output-token variance across repeated runs to check for degradation or increased inconsistency in the parallel case.
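A harness for such an experiment could compute two numbers per compaction mode: the spread of output volume across repeated runs and the share of reference facts that survive in the summary. The sketch below is one possible setup under stated assumptions: whitespace tokens stand in for the model tokenizer, case-insensitive substring matching stands in for a proper factual-coverage metric, and the function names are hypothetical.

```python
import statistics
from typing import Dict, Iterable, Sequence


def token_count(text: str) -> int:
    """Whitespace token count, used here as a proxy for tokenizer output length."""
    return len(text.split())


def volume_stats(summaries: Sequence[str]) -> Dict[str, float]:
    """Mean, sample standard deviation, and coefficient of variation of
    summary length across repeated runs on the same history."""
    counts = [token_count(s) for s in summaries]
    mean = statistics.mean(counts)
    stdev = statistics.stdev(counts) if len(counts) > 1 else 0.0
    cv = stdev / mean if mean else 0.0
    return {"mean": mean, "stdev": stdev, "cv": cv}


def fact_coverage(summary: str, facts: Iterable[str]) -> float:
    """Fraction of reference facts found verbatim (case-insensitive) in the summary."""
    fact_list = list(facts)
    if not fact_list:
        return 1.0
    text = summary.lower()
    return sum(fact.lower() in text for fact in fact_list) / len(fact_list)
```

Running both functions over the summaries from repeated sequential and parallel runs of the same histories gives directly comparable variance and coverage figures. A lower coefficient of variation with equal or higher coverage in the parallel case would support the load-bearing premise.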

Figures

Figures reproduced from arXiv: 2605.23296 by Burak Topcu, Chita Das, Mahmut Taylan Kandemir, Musa Cim.

Figure 1: Sequential versus parallel compaction overview.
Figure 2: Output tokens vs. input length for three prompt variants and four backbones under Sequential compaction.
Figure 3: Output tokens per run across 10 repeated runs at
Figure 4: Accuracy per model and configuration on (a) HotpotQA and (b) LoCoMo.
Figure 5: GPT-OSS-120B output tokens vs. block size across
Original abstract

Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes parallel compaction as a method for managing growing context in long-horizon LLM agents via LLM-based summarization. It claims that, unlike sequential synchronous compaction, the parallel approach grants operators fine-grained control over summary volume, supports targeted per-block prompt engineering, and—at matched compaction decode volume—reduces end-to-end wall time while increasing compaction throughput. The claims are supported by characterization experiments on HotpotQA and LoCoMo using four backbones (8B–120B, dense and MoE, reasoning and non-reasoning models).

Significance. If the quality-equivalence assumption holds, the technique would address a practical serving bottleneck by making compaction latency more predictable and controllable. The multi-model, multi-benchmark evaluation design is a strength, as is the explicit focus on matched decode volume rather than raw token count.

major comments (2)
  1. [Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.
  2. [Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for quality validation and clearer quantitative presentation. We respond to each major comment below, indicating where revisions will be made.

  1. Referee: [Abstract and evaluation sections] The central performance claim (lower wall time and higher throughput at matched decode volume) is load-bearing only if parallel summaries preserve equivalent retained knowledge and avoid new cross-block inconsistencies relative to the sequential baseline. No downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks are reported to substantiate this equivalence on HotpotQA or LoCoMo.

    Authors: We acknowledge that the manuscript does not report downstream QA accuracy, information-overlap metrics, human preference scores, or consistency checks comparing retained knowledge between parallel and sequential compaction. The work centers on serving characteristics (predictable volume control and wall-time reduction at matched decode volume) rather than end-task quality equivalence. The same summarization model and prompt templates are used in both conditions, but we agree this leaves the equivalence assumption untested. We will revise the manuscript to add an explicit limitations discussion noting the absence of these metrics and the scope of the current evaluation. revision: yes

  2. Referee: [Abstract] The abstract states that parallel compaction was characterized across the listed models and benchmarks, yet supplies no quantitative tables, variance statistics, or error analysis for either throughput or quality. Without these data the reader cannot assess whether the reported gains are statistically reliable or practically meaningful.

    Authors: Abstracts are concise overviews and conventionally omit tables, variance statistics, and error bars; the full evaluation sections of the manuscript present the multi-model, multi-benchmark results with the relevant quantitative data. We will revise the abstract to reference the key observed improvements (e.g., wall-time reduction at matched decode volume) and direct readers to the detailed results and statistics in the body of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparison to explicit baseline

The paper introduces parallel compaction and reports direct wall-time and throughput measurements against a described sequential synchronous baseline on HotpotQA and LoCoMo. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. All central claims are grounded in external benchmark runs rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, fitted parameters, or new entities introduced; the work is an algorithmic and systems proposal relying on standard LLM inference assumptions.

pith-pipeline@v0.9.0 · 5723 in / 935 out tokens · 41701 ms · 2026-05-25T04:34:19.094786+00:00 · methodology
