AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models
Pith reviewed 2026-05-23 21:04 UTC · model grok-4.3
The pith
A predictor trained on minimal sufficient document sets lets RAG systems keep only the documents each query actually needs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaComp first annotates the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate, then constructs triplets of the query, retrieved documents, and its compression rate to train a compression-rate predictor that adaptively determines the compression rate based on both query complexity and retrieval quality.
What carries the argument
A compression-rate predictor trained on triplets of query, retrieved documents, and the minimum number of those documents needed to answer correctly.
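To make the triplet concrete, here is a minimal sketch of one training example. The class name `Triplet`, its field names, the `to_predictor_input` serialization, and the choice to allow a label of 0 are all hypothetical; the source does not say how the predictor encodes its inputs or what label range it uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Triplet:
    """One training example: query, ranked documents, and the label k."""
    query: str
    documents: List[str]  # retrieved documents, best-ranked first
    min_k: int            # smallest top-k prefix that sufficed (the label)

    def __post_init__(self) -> None:
        # The label counts a prefix of the ranked list, so it cannot exceed
        # the list length. Allowing 0 is an assumption of this sketch.
        if not 0 <= self.min_k <= len(self.documents):
            raise ValueError("min_k must lie between 0 and len(documents)")


def to_predictor_input(t: Triplet, sep: str = " [SEP] ") -> str:
    """Flatten query and documents into one string (an assumed input format)."""
    return sep.join([t.query, *t.documents])


example = Triplet(
    query="Who wrote Dune?",
    documents=[
        "Dune is a 1965 novel by Frank Herbert.",
        "Arrakis is a desert planet.",
    ],
    min_k=1,
)
print(to_predictor_input(example))
```

Because both the query and the ranked documents are part of the input, a predictor trained on such examples can in principle react to query complexity and to retrieval quality at once, which is the behaviour the paper claims.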
If this is right
- The number of kept documents varies with each query instead of using one fixed rate for every input.
- Fewer tokens reach the language model, lowering inference time and memory use.
- Answer quality on standard QA tasks remains nearly identical to the full-context baseline.
- The same procedure is reported to work on three QA datasets and on one conversational multi-document QA dataset.
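The consequences listed above follow from one simple inference-time step: truncate the ranked list to the predicted length. The sketch below assumes the predictor has already produced `k`; the function names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import List, Tuple


def compress_context(documents: List[str], k: int) -> List[str]:
    """Keep the k best-ranked documents; k is clamped to the valid range."""
    k = max(0, min(k, len(documents)))
    return documents[:k]


def token_savings(documents: List[str], kept: List[str]) -> Tuple[int, int, float]:
    """Return (tokens_before, tokens_after, fraction_saved).

    Tokens are counted by whitespace splitting, a crude stand-in for the
    language model's own tokenizer.
    """
    before = sum(len(d.split()) for d in documents)
    after = sum(len(d.split()) for d in kept)
    saved = 0.0 if before == 0 else 1.0 - after / before
    return before, after, saved


docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota kappa"]
kept = compress_context(docs, k=1)  # in practice k comes from the predictor
print(token_savings(docs, kept))
```

Since the method only drops whole documents from the tail of the ranking, the compression itself adds almost no cost beyond running the predictor.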
Where Pith is reading between the lines
- The predictor could be retrained on different retrievers or language models without changing the overall approach.
- If labeling the minimal k during training proves costly, cheaper proxies such as attention patterns inside the model might substitute for it.
- Adaptive selection might let systems safely increase the initial retrieval list size without a matching rise in cost.
Load-bearing premise
The smallest number of documents sufficient to answer a query can be reliably identified in advance for training examples so that a predictor trained on those labels will choose an adequate number for new queries.
What would settle it
Apply the trained predictor to a new set of queries and measure whether answer accuracy falls more than a few percentage points below the accuracy obtained when all retrieved documents are kept.
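One way to run this check is sketched below. It assumes per-query correctness has already been judged for both settings; the function names and the default tolerance of 2 percentage points are illustrative choices, not values from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of queries answered correctly."""
    if not correct:
        raise ValueError("need at least one query")
    return sum(correct) / len(correct)


def accuracy_drop(full: Sequence[bool], compressed: Sequence[bool]) -> float:
    """Percentage-point drop from the full-context run to the compressed run."""
    if len(full) != len(compressed):
        raise ValueError("both runs must cover the same queries")
    return 100.0 * (accuracy(full) - accuracy(compressed))


def within_tolerance(
    full: Sequence[bool],
    compressed: Sequence[bool],
    max_drop_points: float = 2.0,  # hypothetical threshold
) -> bool:
    """True if compression costs at most max_drop_points of accuracy."""
    return accuracy_drop(full, compressed) <= max_drop_points


# Toy per-query outcomes, invented purely for illustration.
full_run = [True, True, True, False]
compressed_run = [True, True, False, False]
print(accuracy_drop(full_run, compressed_run))
print(within_tolerance(full_run, compressed_run))
```

A real test would also need enough held-out queries for the difference to be statistically meaningful, which a bare threshold does not establish.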
Original abstract
Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries like multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum top-k documents necessary for the RAG system to answer the current query as the compression rate and then construct triplets of the query, retrieved documents, and its compression rate. Then, we use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational Multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AdaComp, a low-cost extractive context compression technique for RAG. It first annotates the minimum top-k documents required for a query to be answered correctly, forms (query, retrieved docs, compression-rate) triplets, trains a predictor on these triplets to output an adaptive compression rate based on query complexity and retrieval quality, and then applies the predictor at inference time. Experiments on three QA datasets plus one conversational multi-document QA dataset are reported to show substantial inference-cost reduction while preserving performance nearly identical to the uncompressed baseline.
Significance. If the central experimental claim holds, AdaComp would supply a practical, trainable mechanism for variable-rate context compression that avoids both the over-compression risk of fixed extractive methods and the high cost of generative compressors. The explicit modeling of both query difficulty and retrieval quality as inputs to the predictor is a clear conceptual advance over static top-k or information-bottleneck baselines.
major comments (2)
- [Abstract and §3] Abstract and §3 (Method): the annotation procedure that produces the ground-truth minimum top-k labels is described only as “annotate the minimum top-k documents necessary for the RAG system to answer.” No concrete protocol (incremental LLM prompting, oracle search, human judgment, stability checks across seeds, or inter-annotator agreement) is supplied. Because the predictor is trained directly on these labels, any noise or query-specific bias in the annotation step directly undermines the claim that the learned predictor will generalize to held-out queries while preserving answer quality.
- [§4] §4 (Experiments): the abstract states that AdaComp “significantly reduces inference costs while maintaining performance nearly identical to uncompressed models,” yet supplies neither concrete metrics (exact accuracy deltas, latency or token counts, baselines, error bars, or statistical tests) nor training details for the predictor (architecture, loss, hyper-parameters, dataset sizes). These omissions make the central efficiency–performance trade-off claim impossible to verify from the provided text.
minor comments (1)
- [Abstract] Abstract: the phrase “nearly identical to uncompressed models” is vague; a quantitative bound (e.g., a drop of less than 1% on each dataset) would strengthen the claim even in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and verifiability. We address each major comment below and will revise the manuscript accordingly.
- Referee: [Abstract and §3] Abstract and §3 (Method): the annotation procedure that produces the ground-truth minimum top-k labels is described only as “annotate the minimum top-k documents necessary for the RAG system to answer.” No concrete protocol (incremental LLM prompting, oracle search, human judgment, stability checks across seeds, or inter-annotator agreement) is supplied. Because the predictor is trained directly on these labels, any noise or query-specific bias in the annotation step directly undermines the claim that the learned predictor will generalize to held-out queries while preserving answer quality.
Authors: We agree that the current description of the annotation procedure is insufficiently detailed for reproducibility. In the revised version we will expand §3 with a concrete protocol: an automated incremental prompting procedure that begins with the top-ranked document and successively adds the next-highest document until the RAG system produces a correct answer, repeated across three random seeds with a stability threshold. We will also add discussion of potential label noise and a pseudocode listing. revision: yes
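The incremental protocol described in this simulated reply could look roughly like the sketch below. It is not the paper's confirmed procedure: `answer_fn` and `is_correct` are hypothetical callables standing in for the RAG system and the answer checker, and reading the "stability threshold" as a minimum number of agreeing seeds is an assumption.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence

AnswerFn = Callable[[str, List[str], int], str]  # (query, context, seed) -> answer


def min_k_for_seed(
    query: str,
    documents: List[str],
    answer_fn: AnswerFn,
    is_correct: Callable[[str], bool],
    seed: int,
) -> Optional[int]:
    """Grow the top-k prefix one document at a time until the answer is correct."""
    for k in range(1, len(documents) + 1):
        if is_correct(answer_fn(query, documents[:k], seed)):
            return k
    return None  # no prefix of the ranked list was sufficient


def annotate_min_k(
    query: str,
    documents: List[str],
    answer_fn: AnswerFn,
    is_correct: Callable[[str], bool],
    seeds: Sequence[int] = (0, 1, 2),
    min_agree: int = 2,  # assumed form of the stability threshold
) -> Optional[int]:
    """Return a label only when at least min_agree seeds agree on it."""
    if not seeds:
        raise ValueError("at least one seed is required")
    votes = Counter(
        min_k_for_seed(query, documents, answer_fn, is_correct, s) for s in seeds
    )
    label, count = votes.most_common(1)[0]
    return label if label is not None and count >= min_agree else None
```

Returning `None` for unstable or unanswerable queries lets such examples be dropped from the training set instead of contributing noisy labels, which speaks to the referee's concern about label noise.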
- Referee: [§4] §4 (Experiments): the abstract states that AdaComp “significantly reduces inference costs while maintaining performance nearly identical to uncompressed models,” yet supplies neither concrete metrics (exact accuracy deltas, latency or token counts, baselines, error bars, or statistical tests) nor training details for the predictor (architecture, loss, hyper-parameters, dataset sizes). These omissions make the central efficiency–performance trade-off claim impossible to verify from the provided text.
Authors: We acknowledge that the abstract and introductory sections do not contain the numerical results needed for immediate verification. The full experimental section reports these quantities (accuracy deltas, token counts, etc.), but to make the central claim verifiable without reading the entire paper we will insert a concise metrics summary into the abstract and add an explicit subsection in §4 listing the predictor architecture, loss function, hyperparameters, and training set sizes. revision: yes
Circularity Check
No circularity; method uses external annotation to train independent predictor
The paper's core procedure annotates minimum top-k externally, builds triplets, and trains a separate predictor; reported performance is measured on held-out datasets rather than any fitted quantity or self-referential definition. No equations, self-citations, or renamings appear that would force the compression-rate output to equal its training inputs by construction. The derivation chain is therefore self-contained against external benchmarks.