
arxiv: 2409.01579 · v2 · submitted 2024-09-03 · 💻 cs.CL · cs.AI

AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

Pith reviewed 2026-05-23 21:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords context compression · retrieval-augmented generation · adaptive compression · RAG efficiency · extractive compression · compression rate predictor · query complexity · retrieval quality

The pith

A predictor trained on minimal sufficient document sets lets RAG systems keep only the documents each query actually needs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieved documents for retrieval-augmented generation often contain noise that raises both inference cost and error rates. The paper argues that the smallest number of top-ranked documents required to answer a query depends on query complexity and retrieval quality, so a fixed compression rate is inefficient. It first labels, for training queries, the fewest top documents that still let the model produce a correct answer. These labels form triplets with the query and documents that train a predictor to output the right compression rate at test time. Experiments on three QA datasets and one conversational multi-document QA dataset show the resulting shorter contexts reduce inference cost while answer accuracy stays nearly the same as the uncompressed baseline.

Core claim

AdaComp first annotates the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate, then constructs triplets of the query, retrieved documents, and its compression rate to train a compression-rate predictor that adaptively determines the compression rate based on both query complexity and retrieval quality.
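
A minimal sketch of the labeling and triplet-construction step described above. It assumes a hypothetical `answers_correctly(query, docs)` callback that stands in for running the RAG system and checking its answer. The names `Triplet`, `min_top_k`, and `build_triplet`, and the choice to return `None` when no prefix of the ranking works, are illustrative assumptions rather than details taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Triplet:
    query: str
    docs: List[str]                  # retrieved documents, best-ranked first
    compression_rate: Optional[int]  # smallest sufficient k, or None if none works


def min_top_k(query: str,
              ranked_docs: Sequence[str],
              answers_correctly: Callable[[str, Sequence[str]], bool]) -> Optional[int]:
    """Return the smallest k for which the top-k documents yield a correct answer."""
    for k in range(1, len(ranked_docs) + 1):
        if answers_correctly(query, ranked_docs[:k]):
            return k
    return None  # assumption of this sketch: no prefix was sufficient


def build_triplet(query: str,
                  ranked_docs: Sequence[str],
                  answers_correctly: Callable[[str, Sequence[str]], bool]) -> Triplet:
    k = min_top_k(query, ranked_docs, answers_correctly)
    return Triplet(query=query, docs=list(ranked_docs), compression_rate=k)


# Toy stand-in for the RAG system: "correct" once the answer passage is in context.
def toy_checker(query: str, docs: Sequence[str]) -> bool:
    return any("Paris" in d for d in docs)


example = build_triplet(
    "What is the capital of France?",
    ["France is in Europe.", "Paris is the capital of France.", "Berlin is in Germany."],
    toy_checker,
)
print(example.compression_rate)  # 2
```

In this toy run the first document alone is not enough, so the label is 2. A real pipeline would replace `toy_checker` with a call to the generator plus an answer-matching metric.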

What carries the argument

A compression-rate predictor trained on triplets of query, retrieved documents, and the minimum number of those documents needed to answer correctly.

If this is right

  • The number of kept documents varies with each query instead of using one fixed rate for every input.
  • Fewer tokens reach the language model, lowering inference time and memory use.
  • Answer quality on standard QA tasks remains nearly identical to the full-context baseline.
  • The same procedure applies to both standard QA and conversational multi-document QA.
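
The consequences listed above can be illustrated with a small sketch of the inference-time step: a per-query `k` selects how many top-ranked documents are kept, and the token saving follows directly. The `predict_k` callback, the `toy_predictor` heuristic, and whitespace token counting are all assumptions made for illustration; the paper's actual predictor is a trained model whose architecture is not given in this text.

```python
from typing import Callable, List, Sequence


def compress_context(query: str,
                     ranked_docs: Sequence[str],
                     predict_k: Callable[[str, Sequence[str]], int]) -> List[str]:
    """Keep only the top-k documents, with k chosen per query by the predictor."""
    k = predict_k(query, ranked_docs)
    k = max(0, min(k, len(ranked_docs)))  # clamp to a valid range
    return list(ranked_docs[:k])


def count_tokens(docs: Sequence[str]) -> int:
    """Crude token count: whitespace-delimited words (an assumption of this sketch)."""
    return sum(len(d.split()) for d in docs)


def token_saving(full: Sequence[str], kept: Sequence[str]) -> float:
    """Fraction of tokens removed from the context by compression."""
    total = count_tokens(full)
    return 0.0 if total == 0 else 1.0 - count_tokens(kept) / total


# Hypothetical predictor: a heuristic that keeps more documents for longer queries.
def toy_predictor(query: str, docs: Sequence[str]) -> int:
    return 1 if len(query.split()) <= 5 else 3


docs = ["a b c d", "e f g h", "i j k l", "m n o p"]
short_kept = compress_context("Who wrote Hamlet?", docs, toy_predictor)
long_kept = compress_context(
    "Which author born in Stratford wrote the play Hamlet?", docs, toy_predictor)
print(len(short_kept), len(long_kept))  # 1 3
print(token_saving(docs, short_kept))   # 0.75
```

The point of the sketch is only that the kept-document count differs between the two queries, which a single fixed rate cannot do.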

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The predictor could be retrained on different retrievers or language models without changing the overall approach.
  • If labeling the minimal k during training proves costly, cheaper proxies such as attention patterns inside the model might substitute for it.
  • Adaptive selection might let systems safely increase the initial retrieval list size without a matching rise in cost.

Load-bearing premise

The smallest number of documents sufficient to answer a query can be reliably identified in advance for training examples so that a predictor trained on those labels will choose an adequate number for new queries.

What would settle it

Apply the trained predictor to a new set of queries and measure whether answer accuracy falls more than a few percentage points below the accuracy obtained when all retrieved documents are kept.
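
A minimal sketch of that check, assuming per-query correctness flags are already available for both the full-context and the compressed runs. The function names and the `max_drop_points` tolerance are hypothetical; the text above says only "a few percentage points", so the threshold here is a placeholder to be chosen by the evaluator.

```python
from typing import Sequence


def accuracy_drop_points(full_correct: Sequence[bool],
                         compressed_correct: Sequence[bool]) -> float:
    """Accuracy drop, in percentage points, from full context to compressed context."""
    if len(full_correct) != len(compressed_correct):
        raise ValueError("both runs must cover the same queries")
    if not full_correct:
        raise ValueError("need at least one query")
    n = len(full_correct)
    return 100.0 * (sum(full_correct) - sum(compressed_correct)) / n


def within_tolerance(full_correct: Sequence[bool],
                     compressed_correct: Sequence[bool],
                     max_drop_points: float = 2.0) -> bool:
    """True if compression costs no more than max_drop_points of accuracy."""
    return accuracy_drop_points(full_correct, compressed_correct) <= max_drop_points


# Toy data: 10 held-out queries, 8 correct with all documents, 7 after compression.
full = [True] * 8 + [False] * 2
compressed = [True] * 7 + [False] * 3
print(accuracy_drop_points(full, compressed))  # 10.0
print(within_tolerance(full, compressed))      # False
```

A real evaluation would also report variance across seeds or a significance test, which this sketch omits.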

Figures

Figures reproduced from arXiv: 2409.01579 by Hainan Zhang, Hongwei Zheng, Liang Pang, Qianchi Zhang, Zhiming Zheng.

Figure 1: An illustration of how retrieval quality affects the …
Figure 2: Overall architecture of AdaComp, which includes a retriever module …
Figure 3: An illustration of how the number of documents affects final RAG performance, generally, in the beginning, as the …
Figure 5: Case Study: answers generated using without …

Original abstract

Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AdaComp, a low-cost extractive context compression technique for RAG. It first annotates the minimum top-k documents required for a query to be answered correctly, forms (query, retrieved docs, compression-rate) triplets, trains a predictor on these triplets to output an adaptive compression rate based on query complexity and retrieval quality, and then applies the predictor at inference time. Experiments on three QA datasets plus one conversational multi-document QA dataset are reported to show substantial inference-cost reduction while preserving performance nearly identical to the uncompressed baseline.

Significance. If the central experimental claim holds, AdaComp would supply a practical, trainable mechanism for variable-rate context compression that avoids both the over-compression risk of fixed extractive methods and the high cost of generative compressors. The explicit modeling of both query difficulty and retrieval quality as inputs to the predictor is a clear conceptual advance over static top-k or information-bottleneck baselines.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): the annotation procedure that produces the ground-truth minimum top-k labels is described only as “annotate the minimum top-k documents necessary for the RAG system to answer.” No concrete protocol (incremental LLM prompting, oracle search, human judgment, stability checks across seeds, or inter-annotator agreement) is supplied. Because the predictor is trained directly on these labels, any noise or query-specific bias in the annotation step directly undermines the claim that the learned predictor will generalize to held-out queries while preserving answer quality.
  2. [§4] §4 (Experiments): the abstract states that AdaComp “significantly reduces inference costs while maintaining performance nearly identical to uncompressed models,” yet supplies neither concrete metrics (exact accuracy deltas, latency or token counts, baselines, error bars, or statistical tests) nor training details for the predictor (architecture, loss, hyper-parameters, dataset sizes). These omissions make the central efficiency–performance trade-off claim impossible to verify from the provided text.
minor comments (1)
  1. [Abstract] Abstract: the phrase “nearly identical to uncompressed models” is vague; a quantitative bound (e.g., <1 % drop on each dataset) would strengthen the claim even in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and verifiability. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): the annotation procedure that produces the ground-truth minimum top-k labels is described only as “annotate the minimum top-k documents necessary for the RAG system to answer.” No concrete protocol (incremental LLM prompting, oracle search, human judgment, stability checks across seeds, or inter-annotator agreement) is supplied. Because the predictor is trained directly on these labels, any noise or query-specific bias in the annotation step directly undermines the claim that the learned predictor will generalize to held-out queries while preserving answer quality.

    Authors: We agree that the current description of the annotation procedure is insufficiently detailed for reproducibility. In the revised version we will expand §3 with a concrete protocol: an automated incremental prompting procedure that begins with the top-ranked document and successively adds the next-highest document until the RAG system produces a correct answer, repeated across three random seeds with a stability threshold. We will also add discussion of potential label noise and a pseudocode listing. revision: yes

  2. Referee: [§4] §4 (Experiments): the abstract states that AdaComp “significantly reduces inference costs while maintaining performance nearly identical to uncompressed models,” yet supplies neither concrete metrics (exact accuracy deltas, latency or token counts, baselines, error bars, or statistical tests) nor training details for the predictor (architecture, loss, hyper-parameters, dataset sizes). These omissions make the central efficiency–performance trade-off claim impossible to verify from the provided text.

    Authors: We acknowledge that the abstract and introductory sections do not contain the numerical results needed for immediate verification. The full experimental section reports these quantities (accuracy deltas, token counts, etc.), but to make the central claim verifiable without reading the entire paper we will insert a concise metrics summary into the abstract and add an explicit subsection in §4 listing the predictor architecture, loss function, hyperparameters, and training set sizes. revision: yes
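
The incremental protocol proposed in the first response could look roughly like the sketch below. The rebuttal does not define the "stability threshold", so this version assumes one plausible reading: run the incremental search once per seed and accept the most common label only if at least `min_votes` seeds agree. The `answers_correctly(query, docs, seed)` callback and all function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

Checker = Callable[[str, Sequence[str], int], bool]


def incremental_min_k(query: str, ranked_docs: Sequence[str],
                      answers_correctly: Checker, seed: int) -> Optional[int]:
    """Add documents in rank order until the answer is correct for this seed."""
    for k in range(1, len(ranked_docs) + 1):
        if answers_correctly(query, ranked_docs[:k], seed):
            return k
    return None


def stable_min_k(query: str, ranked_docs: Sequence[str],
                 answers_correctly: Checker,
                 seeds: Sequence[int] = (0, 1, 2),
                 min_votes: int = 2) -> Optional[int]:
    """Return the majority label across seeds, or None if agreement is too weak."""
    labels = [incremental_min_k(query, ranked_docs, answers_correctly, s) for s in seeds]
    label, votes = Counter(labels).most_common(1)[0]
    if label is None or votes < min_votes:
        return None  # assumption: unstable or unanswerable queries are left unlabeled
    return label


# Toy checker: seeds 0 and 1 succeed with two documents, seed 2 needs three.
def toy_checker(query: str, docs: Sequence[str], seed: int) -> bool:
    needed = 3 if seed == 2 else 2
    return len(docs) >= needed


print(stable_min_k("q", ["d1", "d2", "d3"], toy_checker))  # 2
```

Other aggregation rules, such as taking the maximum k across seeds to err on the side of sufficiency, would be equally consistent with the rebuttal's wording.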

Circularity Check

0 steps flagged

No circularity; method uses external annotation to train independent predictor

The paper's core procedure annotates minimum top-k externally, builds triplets, and trains a separate predictor; reported performance is measured on held-out datasets rather than any fitted quantity or self-referential definition. No equations, self-citations, or renamings appear that would force the compression-rate output to equal its training inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method rests on the existence of a reliable annotation procedure for minimum top-k and on standard supervised-learning assumptions that the predictor will generalize; no free parameters, invented entities, or non-standard axioms are visible in the abstract.


