Context Attribution with Multi-Armed Bandit Optimization

Deng Pan; Keerthiram Murugesan; Nitesh Chawla; Nuno Moniz; Ting Hua

arxiv: 2506.19977 · v2 · submitted 2025-06-24 · 💻 cs.AI

Pith reviewed 2026-05-19 07:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords context attribution · multi-armed bandit · linear thompson sampling · retrieval-augmented generation · model interpretability · query reduction · token log-probability

The pith

Formulating context attribution as a combinatorial multi-armed bandit problem cuts model queries by up to 30% while matching existing attribution quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that context attribution in retrieval-augmented generation can be reframed as a combinatorial multi-armed bandit problem solved efficiently with Linear Thompson Sampling. This matters to a sympathetic reader because current methods require many separate model queries to test different context subsets, making them impractical for routine use. The approach instead uses the change in token log-probabilities as a reward signal to adaptively choose which subsets to test, building a posterior over which segments matter most. If correct, attribution becomes feasible with substantially lower computational overhead for both open and closed models on question answering tasks.

Core claim

The central contribution is a framework that casts the problem of determining which context segments influence an LLM's answer as a combinatorial multi-armed bandit optimization task. Linear Thompson Sampling selects subsets of segments for evaluation, and each subset's reward is computed from the log-probabilities the model assigns to the original response tokens when it is given only that subset. This adaptive selection, driven by posterior relevance estimates, replaces the uniform sampling of prior perturbation methods and yields up to 30% fewer queries while achieving comparable or better attribution on QA benchmarks.

What carries the argument

Linear Thompson Sampling over combinatorial arms representing context segment subsets, with rewards given by the log-probability improvement for the original response tokens.
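
The mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each segment is treated as one coordinate of a binary mask, the reward is assumed to be roughly linear in that mask, the chosen subset is taken to be the `subset_size` segments with the largest sampled weights, and `reward_fn`, `prior_precision`, and `noise_scale` are hypothetical names introduced here.

```python
import numpy as np


def lin_ts_attribution(reward_fn, n_segments, subset_size, n_rounds,
                       prior_precision=1.0, noise_scale=1.0, seed=0):
    """Estimate one weight per context segment with Linear Thompson Sampling.

    reward_fn(mask) -> float is a hypothetical callback: it should query the
    model with only the segments where mask == 1 and return a scalar reward.
    """
    rng = np.random.default_rng(seed)
    A = prior_precision * np.eye(n_segments)  # posterior precision matrix
    b = np.zeros(n_segments)                  # reward-weighted sum of masks

    for _ in range(n_rounds):
        cov = np.linalg.inv(A)
        mean = cov @ b
        # Draw one plausible weight vector from the current posterior.
        theta = rng.multivariate_normal(mean, noise_scale ** 2 * cov)
        # Assumed selection rule: keep the segments with the largest draws.
        chosen = np.argsort(theta)[-subset_size:]
        mask = np.zeros(n_segments)
        mask[chosen] = 1.0
        reward = reward_fn(mask)              # one model query per round
        # Bayesian linear-regression update with the observed reward.
        A += np.outer(mask, mask)
        b += reward * mask

    # The posterior mean serves as the per-segment attribution score.
    return np.linalg.solve(A, b)
```

Each round costs exactly one model query, so `n_rounds` plays the role of the query budget. Segments whose sampled weights stay low are tested less often, which is where any saving over uniform subset sampling would come from.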

If this is right

Reduces model queries by up to 30% on multiple QA benchmarks.
Matches or exceeds attribution quality of SHAP and similar methods.
Works for both open-source and black-box models via the same reward function.
Adaptively samples based on learned posterior rather than uniform random subsets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bandit setup could support incremental attribution when new context arrives without full re-computation.
If the log-prob reward is robust, it might allow attribution during generation rather than after the fact.
Extending the reward to include other signals like attention weights could address cases where probabilities alone are insufficient.

Load-bearing premise

The model's token log-probabilities serve as a faithful measure of how supportive a context subset is for the generated response.
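
To make the premise concrete, here is one plausible reading of such a reward, written as a hedged sketch. It assumes the model (or API) returns one log-probability per token of the original response, and it defines the reward as the gain in mean token log-probability over a run with no context. The length normalisation, the no-context baseline, and the function names are assumptions made for illustration; the paper's exact formula may differ.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised log-likelihood of the response tokens."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def subset_reward(with_subset: Sequence[float],
                  without_context: Sequence[float]) -> float:
    """Gain in mean token log-probability from adding a context subset.

    Both arguments hold per-token log-probabilities of the SAME original
    response: one scored with the subset in the prompt, one with no context.
    A positive value means the subset makes the response more likely.
    """
    if len(with_subset) != len(without_context):
        raise ValueError("both runs must score the same response tokens")
    return mean_logprob(with_subset) - mean_logprob(without_context)
```

Because the reward depends only on the model's own probabilities, anything that shifts those probabilities without reflecting real support, such as poor calibration or very frequent tokens, would shift the reward too. That is the risk this premise carries.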

What would settle it

Measuring attribution quality on the same benchmarks with only 70 percent of the baseline queries and finding that quality falls below that of uniform sampling methods.

Original abstract

Understanding which parts of the retrieved context contribute to a large language model's generated answer is essential for building interpretable and trustworthy retrieval-augmented generation. We propose a novel framework that formulates context attribution as a combinatorial multi-armed bandit problem. We utilize Linear Thompson Sampling to efficiently identify the most influential context segments while minimizing the number of model queries. Our reward function leverages token log-probabilities to measure how well a subset of segments supports the original response, making it applicable to both open-source and black-box API-based models. Unlike SHAP and other perturbation-based methods that sample subsets uniformly, our approach adaptively prioritizes informative subsets based on posterior estimates of segment relevance, reducing computational costs. Experiments on multiple QA benchmarks demonstrate that our method achieves up to 30% reduction in model queries while matching or exceeding the attribution quality of existing approaches. Our code is publicly available at https://github.com/pd90506/camab.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper casts context attribution as a combinatorial bandit problem with Linear Thompson Sampling to cut queries, but the log-prob reward looks like a shaky proxy for actual segment influence.

The main takeaway is that this work reframes context attribution in RAG as a combinatorial multi-armed bandit task. Instead of sampling subsets uniformly as SHAP does, it uses Linear Thompson Sampling to adaptively pick informative context segments based on posterior estimates, which the abstract claims cuts model queries by up to 30% while keeping attribution quality comparable on QA benchmarks. The reward is derived from the token log-probabilities the model produces when given each subset, and the approach is positioned to work on both open-source and black-box models. Code is released, which is a plus for anyone wanting to test it directly.

Referee Report

2 major / 2 minor

Summary. The manuscript formulates context attribution for RAG as a combinatorial multi-armed bandit problem and applies Linear Thompson Sampling with a reward derived from token log-probabilities to identify influential context segments. It claims this adaptive approach reduces model queries by up to 30% relative to uniform-sampling baselines such as SHAP while preserving or improving attribution quality on QA benchmarks, and releases code for reproducibility.

Significance. If the central experimental claim holds under rigorous validation, the work would provide a practical efficiency gain for perturbation-based interpretability methods that currently scale poorly with context length, extending applicability to both open-weight and API-based models.

major comments (2)

Abstract and §4 (Experiments): the central claim of 'up to 30% reduction in model queries while matching or exceeding attribution quality' is presented without reported statistical significance tests, exact baseline configurations, dataset sizes, or enumeration details for the sampled subsets. These omissions make it impossible to assess whether the reported gains are robust or merely artifacts of particular splits.
§3.2 (Reward Function): the assumption that token log-probabilities constitute a faithful proxy for how well a subset supports the original response is load-bearing for the bandit posterior and the claimed superiority over uniform sampling. No ablation or correlation analysis is provided to rule out confounding by calibration, token frequency, or surface-level generation statistics; if the proxy is misaligned, the adaptive selection reduces to expensive uniform sampling with no attribution benefit.

minor comments (2)

[§3.1] Clarify the precise mapping from context segments to arms and the linear feature representation used inside Linear Thompson Sampling.
[§5] Add a short discussion of failure modes when the original response is poorly calibrated or when context segments are highly correlated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of experimental rigor and methodological validation. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: Abstract and §4 (Experiments): the central claim of 'up to 30% reduction in model queries while matching or exceeding attribution quality' is presented without reported statistical significance tests, exact baseline configurations, dataset sizes, or enumeration details for the sampled subsets. These omissions make it impossible to assess whether the reported gains are robust or merely artifacts of particular splits.

Authors: We agree that additional statistical and configurational details are necessary to substantiate the central claims. In the revised manuscript, we will augment the abstract and §4 with statistical significance tests, including paired t-tests or Wilcoxon signed-rank tests across multiple random seeds, reporting p-values and confidence intervals for the query reduction figures. We will also specify exact baseline configurations (e.g., number of perturbations for SHAP), dataset sizes (including exact example counts per QA benchmark), and subset enumeration details such as maximum subset cardinality and sampling constraints. These changes will allow readers to evaluate robustness directly.

Revision: yes
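
As an illustration of the kind of test named in this response, the sketch below computes an exact two-sided Wilcoxon signed-rank test in plain Python. It assumes one paired measurement per random seed (for example, queries used by a baseline and by the bandit method on the same seed). The function name and inputs are hypothetical, and nothing here reflects how the authors would actually run their analysis.

```python
from itertools import product


def wilcoxon_signed_rank_exact(baseline, method):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Zero differences are dropped and tied absolute
    differences share their average rank. All 2**n sign patterns are
    enumerated, so this is only practical for a small number of pairs.
    """
    diffs = [m - b for b, m in zip(baseline, method) if m != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2
    observed_dev = abs(w_plus - expected)

    # Under the null hypothesis each rank carries a + or - sign with prob. 1/2.
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if abs(sum(r for r, s in zip(ranks, signs) if s) - expected)
        >= observed_dev - 1e-12
    )
    return w_plus, extreme / 2 ** n
```

One design consequence is worth noting: with n untied pairs the smallest attainable two-sided p-value is 2 / 2**n, so six seeds can reach at most about 0.031 even when one method wins every pair. The number of seeds therefore bounds how strong the reported significance can be.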
Referee: §3.2 (Reward Function): the assumption that token log-probabilities constitute a faithful proxy for how well a subset supports the original response is load-bearing for the bandit posterior and the claimed superiority over uniform sampling. No ablation or correlation analysis is provided to rule out confounding by calibration, token frequency, or surface-level generation statistics; if the proxy is misaligned, the adaptive selection reduces to expensive uniform sampling with no attribution benefit.

Authors: The log-probability reward is motivated by its ability to provide a fine-grained, model-internal signal of how well a context subset supports the precise token sequence of the original response, which is directly relevant to attribution in RAG. The empirical superiority over uniform-sampling baselines in our experiments indicates that the proxy captures attribution-relevant information rather than reducing to uniform sampling. Nevertheless, we acknowledge the value of explicit validation. In revision we will add an ablation replacing the log-probability reward with an accuracy-based alternative and a correlation analysis against human faithfulness annotations on a subset of examples, discussing any detected confounding factors.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in the derivation or experimental claims.

The paper formulates context attribution as a combinatorial multi-armed bandit problem solved via Linear Thompson Sampling, with a reward function defined directly from token log-probabilities of the target model. This constitutes an independent algorithmic procedure rather than a fitted equation or self-referential definition. Experimental results (query reduction and attribution quality on QA benchmarks) are reported as measured outcomes of running the method against baselines such as SHAP, not as quantities that reduce by construction to the same parameters used to define the reward or posterior estimates. No self-citation chains, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear as load-bearing steps. The derivation chain is self-contained against external benchmarks and falsifiable via the public code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that log-probability changes can serve as a proxy reward for attribution quality; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Token log-probabilities measure how well a subset of context segments supports the original response.
Stated in the reward function description in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We cast the context attribution problem as a Combinatorial Multi-Armed Bandit (CMAB)... reward function based on normalized token likelihoods"

IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Combinatorial Thompson Sampling (CTS) maintains a posterior distribution over the importance parameters"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

In-Context Credit Assignment via the Core
cs.GT 2026-05 unverdicted novelty 7.0

Algorithms based on the least core approximate stable credit assignments for AI-generated content using orders of magnitude fewer LLM calls than alternatives.
