pith. sign in

arxiv: 2506.14067 · v4 · submitted 2025-06-16 · 💻 cs.LG

Online Conformal Abstention for Factuality Control Under Adversarial Bandit Feedback

Pith reviewed 2026-05-19 08:51 UTC · model grok-4.3

classification 💻 cs.LG
keywords conformal abstentionfalse discovery rateonline learningadversarial bandit feedbackpartial feedbacklarge language modelsfactuality controlregret bounds
0
0 comments X

The pith

ExAUL achieves O(√T) FDR control for conformal abstention under partial adversarial feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ExAUL as a framework that lets generative systems decide when to abstain from answering to limit unreliable outputs while learning from limited user signals. Real deployments often give only thumbs-up or thumbs-down on the chosen answer and may face shifting or hostile conditions, so the authors build a conversion lemma that turns any bandit algorithm's regret directly into a finite-sample bound on the false discovery rate. They add feedback unlocking to pull extra learning signals from the partial observations. The resulting regret scales as O(√(T ln |H|)), which yields an O(√T) FDR guarantee that matches what full-information methods can deliver.

Core claim

ExAUL combines a conversion lemma that maps the regret of any bandit algorithm into a finite-sample FDR bound with a feedback unlocking strategy that exploits the structure of conformal abstention; this produces a regret bound of O(√(T ln |H|)) that implies an O(√T) bound on FDR risk control, matching the controllability of full-information settings even when feedback is restricted to the selected response and the environment is adversarial.

What carries the argument

The conversion lemma that maps the regret of any bandit algorithm directly into a finite-sample FDR bound, together with the feedback unlocking strategy that extracts additional learning signals from partial feedback on the selected response.

If this is right

  • Generative models can keep the rate of unreliable answers below a controllable threshold while still answering frequently.
  • Large language models on question-answering tasks maintain FDR control in non-stationary and adversarial environments.
  • Partial user feedback on the chosen answer is sufficient to reach the same risk level as full evaluation of all responses.
  • Answering coverage remains competitive while the FDR stays bounded.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conversion approach could be adapted to control other risk measures such as false negative rate or expected calibration error.
  • Feedback unlocking may extend naturally to multi-turn conversations where later user signals can retroactively inform earlier abstention decisions.
  • Real-world deployment on live user platforms could test whether the O(√T) scaling holds when the adversarial model is replaced by actual non-stationary human behavior.

Load-bearing premise

The conversion lemma continues to translate bandit regret into a finite-sample FDR bound when feedback is limited to the selected response and the environment is adversarial.

What would settle it

An experiment in which the false discovery rate grows faster than O(√T) while ExAUL runs on an adversarial bandit feedback stream for conformal abstention would disprove the claimed guarantee.

Figures

Figures reproduced from arXiv: 2506.14067 by Minjae Lee, Sangdon Park, Yoonjae Jung.

Figure 1
Figure 1. Figure 1: Qualitative examples from interactive dialog simulation. This demonstrates that our [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of selective generation methods under a stochastic environment with [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of selective generation methods under a stochastic environment with GPT-3.5- [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: FDRt and Inefft for ExSUL under an interactive environment with GPT-3.5-Turbo as a question-answering agent (T = 15,000, α = 0.25). A user agent continuously creates questions based on changing context, which simulate distribution shifts. 5.3 Interactive Environment We empirically demonstrate the efficacy of our proposed ExSUL in real-world interactive applications. In particular, we simulate an interactiv… view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of selective generation methods under a stochastic environment with GPT-3.5- [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of selective generation methods under a stochastic environment with [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of selective generation methods under a stochastic environment with GPT [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of selective generation methods under a stochastic environment with [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p020_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Comparison of selective generation methods under a distribution-shift environment with [PITH_FULL_IMAGE:figures/full_fig_p021_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: The violin plots for the ablation study over varying [PITH_FULL_IMAGE:figures/full_fig_p022_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: The violin plots for the ablation study over varying [PITH_FULL_IMAGE:figures/full_fig_p023_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Additional FDRt comparisons of selective generation methods under stochastic envi￾ronments with a longer time horizon. Here, we repeat the datasets to evaluate whether Exp3-SG converges over a long horizon (T = 1, 500K). As shown in the plots, Exp3-SG exhibits bare convergence after a prolonged horizon T. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 t 1e6 0.0 0.1 0.2 0.3 0.4 0.5 FDR ExSUL (ours) Exp3-SG EW-SG No-S… view at source ↗
Figure 24
Figure 24. Figure 24: Additional FDRt comparison of selective generation methods under distribution-shift environments with a longer time horizon. We consider a single distribution shift from 2000K-sized TriviaQA to 2000K-sized NQ, denoted in a dotted vertical line, along with GPT-3.5-turbo as our base generator (T = 4000K, α = 0.1). As can be seen, Exp3-SG eventually satisfies a FDR even under the shift. 24 [PITH_FULL_IMAGE:… view at source ↗
Figure 25
Figure 25. Figure 25: Comparison of ExSUL and ExSUL without the inefficiency loss at (6) for the ablation study, with GPT-3.5-turbo as a generator on TriviaQA (T = 30K, α = 0.25). Removing at in ExSUL clearly leads to convergence to more inefficient solution. This behavior contrasts with the faster convergence of EW-SG compared to ExSUL or Exp3-SG , e.g., EW-SG and ExSUL eventually converges to similar levels of inefficiency i… view at source ↗
read the original abstract

As interactive generative systems are increasingly deployed in real-world applications, their tendency to generate unreliable or false responses raises serious concerns. Conformal abstention mitigates this risk by ensuring that the system answers only when confident. However, real-world deployments typically provide only partial user feedback (e.g., thumbs up/down) on the selected response and often operate in non-stationary or adversarial environments, for which effective learning methods are largely missing. To bridge this gap, we propose ExAUL, a novel online learning framework for conformal abstention with adversarial and partial feedback. Technically, we introduce (i) a novel conversion lemma}that translates the regret of any bandit algorithm into an FDR bound, and (ii) feedback unlocking, a strategy that exploits the structure of conformal abstention to extract additional learning signals from partial feedback. We prove that ExAUL achieves a regret bound of $O(\sqrt{T \ln |{H}|})$, which translates into an ${O}(\sqrt{T})$ bound on FDR risk control, matching the controllability of full-information settings despite receiving only partial feedback. While applicable to general generative tasks, we demonstrate the efficacy of ExAUL for ensuring the reliability of Large Language Models (LLMs) through empirical validation on question-answering tasks across diverse non-stationary and adversarial settings. Our results demonstrate that ExAUL robustly controls the FDR while maintaining competitive answering coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces ExAUL, an online framework for conformal abstention that controls false discovery rate (FDR) in generative models under adversarial bandit feedback. It proposes a conversion lemma mapping the regret of any bandit algorithm to a finite-sample FDR bound and a 'feedback unlocking' technique that extracts extra signals from partial (selected-response) feedback. The central theorem states that ExAUL attains O(√(T ln |H|)) regret, which translates into an O(√T) FDR guarantee, matching full-information rates; the claim is supported by an empirical evaluation on LLM question-answering tasks in non-stationary and adversarial regimes.

Significance. If the conversion lemma is shown to hold without hidden full-information or stationarity assumptions, the result would be significant: it extends conformal prediction to realistic partial-feedback, adversarial online settings while preserving the √T rate that is standard in full-information conformal control. The empirical demonstration across diverse non-stationary environments provides concrete evidence of practical utility for factuality control in LLMs.

major comments (2)
  1. [§3.2, Lemma 3] §3.2, Lemma 3 (Conversion Lemma): The lemma asserts that any bandit algorithm's regret directly yields an O(√T) FDR bound under partial feedback. The provided derivation does not explicitly construct an unbiased or conservatively biased estimator for the full loss vector when the adversary chooses both the loss vector and the feedback mask adaptively; without this step the translation from regret to FDR is not yet load-bearing for the headline claim.
  2. [§4.1, Theorem 1] §4.1, Theorem 1: The O(√(T ln |H|)) regret bound is stated to hold for ExAUL. The proof sketch relies on the conversion lemma plus feedback unlocking, yet the interaction between the two components under adaptive adversarial feedback is not fully expanded; this is the sole bridge to the FDR guarantee and therefore requires a self-contained argument.
minor comments (3)
  1. [Abstract] Notation: the abstract writes |{H}| while the body uses |H|; standardize to |H| throughout.
  2. [Figure 2] Figure 2 caption: the legend labels 'ExAUL (unlocked)' and 'ExAUL (baseline)' but the text does not define the baseline variant; add a one-sentence clarification.
  3. [§2] Related work: the discussion of prior online conformal methods omits the recent bandit-conformal line (e.g., works on contextual bandits with abstention); a brief comparison would clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, providing clarifications and committing to revisions that strengthen the rigor of the proofs without altering the core claims.

read point-by-point responses
  1. Referee: [§3.2, Lemma 3] §3.2, Lemma 3 (Conversion Lemma): The lemma asserts that any bandit algorithm's regret directly yields an O(√T) FDR bound under partial feedback. The provided derivation does not explicitly construct an unbiased or conservatively biased estimator for the full loss vector when the adversary chooses both the loss vector and the feedback mask adaptively; without this step the translation from regret to FDR is not yet load-bearing for the headline claim.

    Authors: We thank the referee for highlighting this point. The conversion lemma in the current manuscript relies on the structure of conformal abstention to ensure that the partial feedback (only on the selected response) yields a conservatively biased estimator for the full loss vector. In the revision, we will explicitly construct this estimator: for the selected action we use the observed loss directly (unbiased), and for non-selected actions we use a conservative upper bound derived from the abstention threshold and the fact that the adversary's mask is applied after the action selection. This construction holds under fully adaptive adversarial choices of both losses and masks, without requiring stationarity or hidden full-information assumptions. We will add this step-by-step derivation to Section 3.2 to make the regret-to-FDR translation fully load-bearing. revision: yes

  2. Referee: [§4.1, Theorem 1] §4.1, Theorem 1: The O(√(T ln |H|)) regret bound is stated to hold for ExAUL. The proof sketch relies on the conversion lemma plus feedback unlocking, yet the interaction between the two components under adaptive adversarial feedback is not fully expanded; this is the sole bridge to the FDR guarantee and therefore requires a self-contained argument.

    Authors: We agree that the interaction between feedback unlocking and the conversion lemma merits a more self-contained argument in the proof of Theorem 1. Feedback unlocking exploits the specific structure of conformal abstention (where the decision to answer or abstain provides an implicit signal even under partial feedback) to extract additional unbiased estimates beyond the bandit observation. In the revised proof, we will first show how unlocking augments the effective feedback to reduce the variance of the loss estimates, then apply the (revised) conversion lemma to translate the resulting O(√(T ln |H|)) regret directly into the O(√T) FDR bound. The argument will be expanded to explicitly track the adaptive adversary's choices at each step, confirming that no additional assumptions are needed. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation reduces to external bandit regret via novel lemma

full rationale

The paper's core result derives an O(√T) FDR bound from the O(√(T ln |H|)) regret of any standard bandit algorithm, using a novel conversion lemma and feedback unlocking to handle partial adversarial feedback. The bandit regret is an external quantity from online learning theory, not fitted or defined within the paper. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided derivation chain; the lemma is explicitly introduced as new and bridges to FDR without assuming full observation by construction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard online-learning assumptions plus two new technical devices whose validity is not independently verified in the abstract.

axioms (1)
  • domain assumption User feedback is adversarial bandit feedback observed only on the selected response.
    Stated in the problem formulation for non-stationary and adversarial environments.

pith-pipeline@v0.9.0 · 5787 in / 1269 out tokens · 32603 ms · 2026-05-19T08:51:25.073483+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    GPT-3.5-turbo model card

    OpenAI. GPT-3.5-turbo model card. https://platform.openai.com/docs/models/ gpt-3-5-turbo , 2023

  2. [2]

    LLaMA 3.1 8B-Instruct model card

    Meta AI. LLaMA 3.1 8B-Instruct model card. https://huggingface.co/meta-llama/ Llama-3.1-8B-Instruct , 2024

  3. [3]

    Selective classification for deep neural networks.Advances in neural information processing systems, 30, 2017

    Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks.Advances in neural information processing systems, 30, 2017

  4. [4]

    Hierarchical selective classification

    Shani Goren, Ido Galil, and Ran El-Yaniv. Hierarchical selective classification. arXiv preprint arXiv:2405.11533, 2024

  5. [5]

    Learning to reject with a fixed predictor: Application to decontextualization

    Christopher Mohri, Daniel Andor, Eunsol Choi, and Michael Collins. Learning to reject with a fixed predictor: Application to decontextualization. arXiv preprint arXiv:2301.09044, 2023

  6. [6]

    Selective generation for con- trollable language models

    Minjae Lee, Kyungmin Kim, Taesoo Kim, and Sangdon Park. Selective generation for con- trollable language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  7. [7]

    The nonstochastic multiarmed bandit problem

    Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002

  8. [8]

    TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics

  9. [9]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

  10. [10]

    Prediction, learning, and games

    Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006

  11. [11]

    Foundations of machine learning

    Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012

  12. [12]

    Semi-bandit feedback: A survey of results

    Karthik Abinav Sankararaman. Semi-bandit feedback: A survey of results. ARXIV CoRR, 2016

  13. [13]

    Foster and Alexander Rakhlin

    Dylan J. Foster and Alexander Rakhlin. Foundations of reinforcement learning and interactive decision making, 2023

  14. [14]

    The weighted majority algorithm

    Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm. Information and computation, 108(2):212–261, 1994

  15. [15]

    Sample mean based index policies by o (log n) regret for the multi-armed bandit problem

    Rajeev Agrawal. Sample mean based index policies by o (log n) regret for the multi-armed bandit problem. Advances in applied probability, 27(4):1054–1078, 1995

  16. [16]

    Eluder dimension and the sample complexity of optimistic exploration

    Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. Advances in Neural Information Processing Systems, 26, 2013. 11

  17. [17]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023

  18. [18]

    A large annotated corpus for learning natural language inference

    Samuel Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. A large annotated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015

  19. [19]

    Trading regret for efficiency: online convex optimization with long term constraints.The Journal of Machine Learning Research, 13(1):2503– 2528, 2012

    Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: online convex optimization with long term constraints.The Journal of Machine Learning Research, 13(1):2503– 2528, 2012

  20. [20]

    An online convex optimization approach to proactive network resource allocation

    Tianyi Chen, Qing Ling, and Georgios B Giannakis. An online convex optimization approach to proactive network resource allocation. IEEE Transactions on Signal Processing, 65(24):6350– 6364, 2017

  21. [21]

    Exponentially weighted algorithm for online network resource allocation with long-term constraints

    Ahmed Sid-Ali, Ioannis Lambadaris, Yiqiang Q Zhao, Gennady Shaikhet, and Amirhossein Asgharnia. Exponentially weighted algorithm for online network resource allocation with long-term constraints. arXiv preprint arXiv:2405.02373, 2024

  22. [22]

    Optimal algorithms for online convex optimization with adversarial constraints

    Abhishek Sinha and Rahul Vaze. Optimal algorithms for online convex optimization with adversarial constraints. Advances in Neural Information Processing Systems, 37:41274–41302, 2024

  23. [23]

    Stochastic online conformal prediction with semi-bandit feedback

    Haosen Ge, Hamsa Bastani, and Osbert Bastani. Stochastic online conformal prediction with semi-bandit feedback. arXiv preprint arXiv:2405.13268, 2024

  24. [24]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

  25. [25]

    When you are done with the current context, say ’SHIFT’ to request a new context

    Harrison Chase. LangChain, October 2022. 12 A Additional Related Work Constrained Online Learning Several works have studied online learning with constraints. [19] first proposes long-term constraints in online convex optimization with time-invariant constraints, which allows to have the constraint function gt to be violated at each time step, i.e., g(xt)...