Context Attribution with Multi-Armed Bandit Optimization
Pith reviewed 2026-05-19 07:17 UTC · model grok-4.3
The pith
Formulating context attribution as a combinatorial multi-armed bandit problem cuts model queries by up to 30% while matching existing attribution quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is a framework that casts the problem of determining which context segments influence an LLM's answer as a combinatorial multi-armed bandit optimization task. Linear Thompson Sampling selects which subsets of segments to evaluate, and each subset's reward is computed from the token log-probabilities the model assigns to the original response when given that subset as context. This adaptive selection, driven by posterior estimates of segment relevance, replaces the uniform sampling of prior perturbation methods and is reported to need up to 30% fewer queries while achieving comparable or better attribution quality on QA benchmarks.
What carries the argument
Linear Thompson Sampling over combinatorial arms representing context segment subsets, with rewards given by the log-probability improvement for the original response tokens.
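To make the mechanism concrete, here is a minimal sketch of Linear Thompson Sampling over segment subsets. It is not the paper's implementation (that lives in the linked repository). Several things are assumptions made for illustration: subsets have a fixed size, each subset is encoded as a binary indicator vector, the prior is a unit ridge prior, and the names `lin_ts_attribution`, `noise_scale`, `toy_reward`, and `TRUE_WEIGHTS` are hypothetical. The reward here is a synthetic stand-in for the log-probability signal.

```python
import math
import random

def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_lower(low, b):
    """Solve low @ y = b by forward substitution."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    return y

def solve_upper_t(low, y):
    """Solve low.T @ x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def lin_ts_attribution(reward_fn, n_segments, subset_size, n_queries,
                       noise_scale=0.5, seed=0):
    """Return posterior-mean relevance scores, one per segment (illustrative only)."""
    rng = random.Random(seed)
    d = n_segments
    # Precision matrix A starts at the identity (unit ridge prior); b accumulates r * x.
    a = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for _ in range(n_queries):
        low = cholesky(a)
        mean = solve_upper_t(low, solve_lower(low, b))       # A^{-1} b
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        noise = solve_upper_t(low, z)                        # covariance A^{-1}
        theta = [m + noise_scale * e for m, e in zip(mean, noise)]
        # Super-arm: the subset_size segments that look best under the sampled theta.
        chosen = sorted(range(d), key=lambda i: theta[i], reverse=True)[:subset_size]
        x = [1.0 if i in chosen else 0.0 for i in range(d)]
        r = reward_fn(chosen)                                # one model query
        for i in range(d):
            b[i] += r * x[i]
            for j in range(d):
                a[i][j] += x[i] * x[j]
    low = cholesky(a)
    return solve_upper_t(low, solve_lower(low, b))

# Synthetic ground truth for the demo, not taken from the paper.
TRUE_WEIGHTS = [1.0, 0.0, 0.0, 0.8, 0.0, 0.0]

def toy_reward(chosen):
    return sum(TRUE_WEIGHTS[i] for i in chosen)

scores = lin_ts_attribution(toy_reward, n_segments=6, subset_size=2, n_queries=200)
top2 = sorted(range(6), key=lambda i: scores[i], reverse=True)[:2]
print(sorted(top2))
```

Sampling `theta` from the posterior, rather than always using the posterior mean, is what lets the procedure keep trying uncertain segments while concentrating queries on those that already look relevant.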
If this is right
- Reduces model queries by up to 30% on multiple QA benchmarks.
- Matches or exceeds attribution quality of SHAP and similar methods.
- Works for both open-source and black-box models via the same reward function.
- Adaptively samples based on learned posterior rather than uniform random subsets.
Where Pith is reading between the lines
- The same bandit setup could support incremental attribution when new context arrives without full re-computation.
- If the log-prob reward is robust, it might allow attribution during generation rather than after the fact.
- Extending the reward to include other signals like attention weights could address cases where probabilities alone are insufficient.
Load-bearing premise
The model's token log-probabilities serve as a faithful measure of how supportive a context subset is for the generated response.
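As an illustration of what such a reward could look like, the sketch below computes a length-normalized log-probability improvement for the original response tokens. This is one plausible reading of the summary, not the paper's exact formula; the function names and the per-token numbers are hypothetical, and the choice of an empty-context baseline is an assumption.

```python
import math

def mean_logprob(token_logprobs):
    """Average log-probability per response token (length normalization)."""
    return sum(token_logprobs) / len(token_logprobs)

def logprob_reward(with_subset, without_context):
    """Improvement in mean token log-probability of the original response
    when the model sees the context subset versus no context at all."""
    if len(with_subset) != len(without_context):
        raise ValueError("both lists must score the same response tokens")
    return mean_logprob(with_subset) - mean_logprob(without_context)

# Hypothetical per-token log-probabilities for a 4-token response.
with_subset = [-0.1, -0.2, -0.1, -0.4]
without_context = [-2.0, -1.5, -2.5, -2.0]

reward = logprob_reward(with_subset, without_context)
# exp(reward) is the per-token geometric-mean likelihood ratio.
likelihood_ratio = math.exp(reward)
print(round(reward, 3), round(likelihood_ratio, 3))
```

Because the score depends only on token log-probabilities of a fixed response, it can in principle be computed for any model that exposes them, which is the basis for the black-box applicability claim.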
What would settle it
Rerunning the same benchmarks with only 70 percent of the baseline query budget and finding that attribution quality falls below that of uniform-sampling methods.
Original abstract
Understanding which parts of the retrieved context contribute to a large language model's generated answer is essential for building interpretable and trustworthy retrieval-augmented generation. We propose a novel framework that formulates context attribution as a combinatorial multi-armed bandit problem. We utilize Linear Thompson Sampling to efficiently identify the most influential context segments while minimizing the number of model queries. Our reward function leverages token log-probabilities to measure how well a subset of segments supports the original response, making it applicable to both open-source and black-box API-based models. Unlike SHAP and other perturbation-based methods that sample subsets uniformly, our approach adaptively prioritizes informative subsets based on posterior estimates of segment relevance, reducing computational costs. Experiments on multiple QA benchmarks demonstrate that our method achieves up to 30% reduction in model queries while matching or exceeding the attribution quality of existing approaches. Our code is publicly available at https://github.com/pd90506/camab.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates context attribution for RAG as a combinatorial multi-armed bandit problem and applies Linear Thompson Sampling with a reward derived from token log-probabilities to identify influential context segments. It claims this adaptive approach reduces model queries by up to 30% relative to uniform-sampling baselines such as SHAP while preserving or improving attribution quality on QA benchmarks, and releases code for reproducibility.
Significance. If the central experimental claim holds under rigorous validation, the work would provide a practical efficiency gain for perturbation-based interpretability methods that currently scale poorly with context length, extending applicability to both open-weight and API-based models.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): the central claim of 'up to 30% reduction in model queries while matching or exceeding attribution quality' is presented without reported statistical significance tests, exact baseline configurations, dataset sizes, or enumeration details for the sampled subsets. These omissions make it impossible to assess whether the reported gains are robust or merely artifacts of particular splits.
- [§3.2] §3.2 (Reward Function): the assumption that token log-probabilities constitute a faithful proxy for how well a subset supports the original response is load-bearing for the bandit posterior and the claimed superiority over uniform sampling. No ablation or correlation analysis is provided to rule out confounding by calibration, token frequency, or surface-level generation statistics; if the proxy is misaligned, the adaptive selection reduces to expensive uniform sampling with no attribution benefit.
minor comments (2)
- [§3.1] Clarify the precise mapping from context segments to arms and the linear feature representation used inside Linear Thompson Sampling.
- [§5] Add a short discussion of failure modes when the original response is poorly calibrated or when context segments are highly correlated.
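The first minor comment asks how segments map to arms and features. One plausible reading, shown here only as an assumption since the summary does not specify it, treats each segment as a base arm and each subset as a super-arm encoded by a binary indicator vector, so that a linear model scores a subset as the sum of its segments' weights. The helper names and the naive sentence splitter are hypothetical.

```python
import re

def split_segments(context):
    """Naive sentence split on ., ! or ? followed by whitespace (illustrative)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]

def subset_to_features(subset, n_segments):
    """Binary indicator vector: 1.0 where the segment is in the subset."""
    chosen = set(subset)
    return [1.0 if i in chosen else 0.0 for i in range(n_segments)]

def ablated_context(segments, subset):
    """Context string containing only the chosen segments, in original order."""
    return " ".join(segments[i] for i in sorted(set(subset)))

context = ("Paris is the capital of France. The Seine runs through it. "
           "Bananas are yellow.")
segments = split_segments(context)
features = subset_to_features([0, 2], len(segments))
kept = ablated_context(segments, [2, 0])
print(features)
print(kept)
```

Under this encoding, highly correlated segments produce nearly collinear evidence, which is one way to see why the second minor comment asks about failure modes in that case.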
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of experimental rigor and methodological validation. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the central claim of 'up to 30% reduction in model queries while matching or exceeding attribution quality' is presented without reported statistical significance tests, exact baseline configurations, dataset sizes, or enumeration details for the sampled subsets. These omissions make it impossible to assess whether the reported gains are robust or merely artifacts of particular splits.
  Authors: We agree that additional statistical and configurational details are necessary to substantiate the central claims. In the revised manuscript, we will augment the abstract and §4 with statistical significance tests, including paired t-tests or Wilcoxon signed-rank tests across multiple random seeds, reporting p-values and confidence intervals for the query reduction figures. We will also specify exact baseline configurations (e.g., number of perturbations for SHAP), dataset sizes (including exact example counts per QA benchmark), and subset enumeration details such as maximum subset cardinality and sampling constraints. These changes will allow readers to evaluate robustness directly.
  Revision: yes
- Referee: [§3.2] §3.2 (Reward Function): the assumption that token log-probabilities constitute a faithful proxy for how well a subset supports the original response is load-bearing for the bandit posterior and the claimed superiority over uniform sampling. No ablation or correlation analysis is provided to rule out confounding by calibration, token frequency, or surface-level generation statistics; if the proxy is misaligned, the adaptive selection reduces to expensive uniform sampling with no attribution benefit.
  Authors: The log-probability reward is motivated by its ability to provide a fine-grained, model-internal signal of how well a context subset supports the precise token sequence of the original response, which is directly relevant to attribution in RAG. The empirical superiority over uniform-sampling baselines in our experiments indicates that the proxy captures attribution-relevant information rather than reducing to uniform sampling. Nevertheless, we acknowledge the value of explicit validation. In revision we will add an ablation replacing the log-probability reward with an accuracy-based alternative and a correlation analysis against human faithfulness annotations on a subset of examples, discussing any detected confounding factors.
  Revision: partial
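The significance testing promised in the first response can be sketched with an exact two-sided Wilcoxon signed-rank test over per-seed query counts. The numbers below are invented purely to exercise the code and are not results from the paper; `wilcoxon_exact` is a hypothetical helper that enumerates all sign patterns, so it is only practical for a small number of seeds.

```python
from itertools import product

def signed_ranks(diffs):
    """Ranks of |d| with average ranks for ties; zero differences are dropped."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        while (j + 1 < len(order)
               and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]])):
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks, [d > 0 for d in nonzero]

def wilcoxon_exact(x, y):
    """Return (W+, two-sided exact p-value) for paired samples x and y."""
    ranks, positive = signed_ranks([a - b for a, b in zip(x, y)])
    total = sum(ranks)
    w_plus = sum(r for r, p in zip(ranks, positive) if p)
    observed = abs(w_plus - total / 2)
    hits = 0
    count = 0
    # Under the null, each rank carries a positive sign with probability 1/2.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        count += 1
        if abs(w - total / 2) >= observed - 1e-12:
            hits += 1
    return w_plus, hits / count

# Made-up query counts per random seed, for illustration only.
baseline_queries = [100, 98, 105, 110, 95, 102, 99, 101]
bandit_queries = [72, 75, 70, 80, 71, 69, 77, 70]

w_plus, p_value = wilcoxon_exact(baseline_queries, bandit_queries)
print(w_plus, p_value)
```

With eight seeds and every difference pointing the same way, the smallest attainable two-sided p-value is 2/256, which shows why the number of seeds bounds how strong a significance claim can be.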
Circularity Check
No significant circularity detected in the derivation or experimental claims.
Full rationale
The paper formulates context attribution as a combinatorial multi-armed bandit problem solved via Linear Thompson Sampling, with a reward function defined directly from token log-probabilities of the target model. This constitutes an independent algorithmic procedure rather than a fitted equation or self-referential definition. Experimental results (query reduction and attribution quality on QA benchmarks) are reported as measured outcomes of running the method against baselines such as SHAP, not as quantities that reduce by construction to the same parameters used to define the reward or posterior estimates. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear as load-bearing steps. The derivation chain is self-contained against external benchmarks and falsifiable via the public code.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Token log-probabilities measure how well a subset of context segments supports the original response.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We cast the context attribution problem as a Combinatorial Multi-Armed Bandit (CMAB)... reward function based on normalized token likelihoods
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Combinatorial Thompson Sampling (CTS) maintains a posterior distribution over the importance parameters
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- In-Context Credit Assignment via the Core
  Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.