pith. machine review for the scientific record.

arxiv: 2605.06605 · v1 · submitted 2026-05-07 · 💻 cs.LG

Recognition: unknown

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:09 UTC · model grok-4.3

classification 💻 cs.LG
keywords: dynamic budget allocation · conformal survival analysis · LLM evaluation · jailbreak prediction · multi-turn interactions · finite-sample coverage · censored data · time-to-event bounds

The pith

DAPRO dynamically reallocates limited compute across LLM test cases to produce valid lower bounds on the number of turns until events like jailbreaks occur.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating large language models over multiple conversation turns is costly because events such as successful jailbreaks or task completions can appear only after many interactions and remain unobserved under any fixed budget. Static allocation methods waste turns on quick cases and under-test hard ones, leading to inefficient use of resources and weaker statistical bounds. The paper introduces DAPRO, a framework that adjusts each case's remaining budget on the fly via projected optimization while updating censoring weights. It proves that this dynamic scheme still satisfies the total budget limit and supplies distribution-free finite-sample coverage for the lower predictive bounds on iteration counts, without the conditional independence assumption required by earlier conformal survival techniques. The resulting coverage bound depends on the square root of the average censoring weight rather than the worst-case weight, and the approach also yields low-variance unbiased estimates of aggregate metrics such as overall jailbreak rates.
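The censoring problem described above can be made concrete with a toy simulation. Everything below (the per-turn event probability, the fixed budget, the function names) is an illustrative assumption, not the paper's setup:

```python
import random

def simulate_case(rng, p_event=0.1, budget=10):
    """One multi-turn test case: each turn triggers the event of
    interest (e.g. a jailbreak) with probability p_event; the case
    is censored if the budget runs out before the event appears."""
    for turn in range(1, budget + 1):
        if rng.random() < p_event:
            return turn, True          # event time observed
    return budget, False               # censored at the budget

rng = random.Random(0)
results = [simulate_case(rng) for _ in range(1000)]
rate = sum(1 for _, seen in results if seen) / len(results)
# With p_event = 0.1 and a fixed 10-turn budget, roughly
# 1 - 0.9**10 ≈ 65% of events are ever observed; the remainder
# stay censored, which is exactly what a static budget hides.
```

Under a static budget every case gets the same 10 turns regardless of how hard it is; DAPRO's premise is that reallocating those turns across cases recovers more events per unit of compute.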

Core claim

DAPRO is the first dynamic budget allocation framework for bounding time-to-event in multi-turn LLM interactions. It satisfies the budget constraint and supplies distribution-free, finite-sample coverage guarantees without requiring conditional independence between censoring and event times. A novel coverage bound scales with the square root of the mean censoring weight rather than the worst-case weight, and the method supports unbiased estimation of population-level metrics such as jailbreak rates under constrained computation.
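The unbiased population-level estimation mentioned in the claim is, in spirit, inverse-probability-of-censoring weighting: each observed event is up-weighted by the inverse probability that the budget would have let it be observed. A hedged sketch, assuming independent per-case budgets with a known distribution (neither of which is taken from the paper):

```python
import random

def ipcw_event_rate(samples, p_budget_geq, horizon):
    """IPCW estimate of P(T <= horizon): each observed event time T_i
    contributes 1 / P(C >= T_i), correcting for events that a small
    budget would have hidden. Unbiased when C is independent of T."""
    total = 0.0
    for t, seen in samples:
        if seen and t <= horizon:
            total += 1.0 / p_budget_geq(t)
    return total / len(samples)

rng = random.Random(1)
p = 0.15
samples = []
for _ in range(20000):
    t = 1
    while rng.random() >= p:       # geometric event time, support {1, 2, ...}
        t += 1
    c = rng.randint(3, 12)         # independent per-case budget
    samples.append((min(t, c), t <= c))

def p_budget_geq(t):
    # P(C >= t) for C uniform on {3, ..., 12}
    return min(max((12 - t + 1) / 10, 0.0), 1.0)

est = ipcw_event_rate(samples, p_budget_geq, horizon=12)
true_rate = 1 - (1 - p) ** 12      # ≈ 0.858; est lands close to this
```

Note the worst-case weight here is 10 (events at turn 12); the paper's claimed advance is that its coverage bound depends only on the square root of the *mean* of such weights.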

What carries the argument

DAPRO (Dynamic Allocation via PRojected Optimization), which dynamically adjusts per-sample iteration budgets by solving a projected optimization problem that updates censoring weights on the fly to enforce the global budget while preserving conformal coverage.
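The review does not reproduce DAPRO's actual projection operator, which acts jointly on budgets and censoring weights. As a generic illustration of the kind of subroutine such a "projected optimization" step might use, here is the standard sort-based Euclidean projection onto a total-budget constraint (a sketch under that assumption, not the paper's algorithm):

```python
def project_to_budget(desired, total):
    """Euclidean projection of desired per-case budgets onto
    {b : b_i >= 0, sum_i b_i = total}: subtract a common threshold
    theta from every coordinate and clip at zero (sort-based
    simplex projection, scaled to an arbitrary total)."""
    u = sorted(desired, reverse=True)
    css, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        css += ui
        t = (css - total) / i
        if ui - t > 0:             # largest such i determines theta
            theta = t
    return [max(d - theta, 0.0) for d in desired]

b = project_to_budget([8.0, 3.0, 1.0], total=6.0)
# → [5.5, 0.5, 0.0]: the global budget is met exactly, hard cases
# keep most of their requested turns, easy cases are cut first.
```

The appeal of a projection step is that feasibility (the hard budget constraint) is enforced by construction at every update, which is one half of what the paper's Theorem is claimed to guarantee; preserving conformal coverage under the same updates is the other, harder half.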

Load-bearing premise

The projected optimization step that makes the allocation dynamic continues to preserve the finite-sample coverage properties of the underlying conformal survival bounds when censoring weights are updated online.

What would settle it

Apply DAPRO to a large collection of multi-turn LLM interactions with a known nominal coverage level such as 90 percent and measure the empirical fraction of cases in which the true iteration count lies above the reported lower bound; consistent under-coverage on independent data would falsify the guarantee.
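The falsification test is mechanically simple to run. A minimal sketch (the uniform event times and the fixed hypothetical bound below are invented for illustration):

```python
import random

def empirical_coverage(true_times, lower_bounds):
    """Fraction of cases whose true event time lies at or above the
    reported lower predictive bound; a valid 90% LPB should land at
    or above 0.90 on independent data, up to sampling noise."""
    hits = sum(1 for t, l in zip(true_times, lower_bounds) if t >= l)
    return hits / len(true_times)

rng = random.Random(2)
true_times = [rng.randint(1, 50) for _ in range(5000)]
lpbs = [5] * len(true_times)       # a hypothetical fixed bound
cov = empirical_coverage(true_times, lpbs)
# With T uniform on {1..50}, P(T >= 5) = 46/50 = 0.92, so cov ≈ 0.92.
# Consistent values well below the nominal 0.90 would falsify the
# guarantee; values far above it indicate overly loose bounds.
```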

Figures

Figures reproduced from arXiv: 2605.06605 by Shai Feldman, Yaniv Romano.

Figure 1. Illustration of our framework: (i) collecting data via dynamic budget allocation; (ii) …
Figure 2. Adaptive vs. static budget allocation. By dynamically adjusting the budget over time …
Figure 3. Theoretical coverage lower bounds of Theorem 1 (green) and …
Figure 4. Coverage rate and LPB size of various methods over the RedTeam datasets with Qwen 2.5 …
Figure 5. Toxicity dataset: absolute coverage deviation (left) and average budget utilized (right) across four target LLMs. Target coverage rate: 90%; target budget B̄ = 20 per sample.
Figure 6. AutoIF dataset: coverage deviation (left) and budget utilized (right) over the Qwen 2.5 target model. Target coverage rates: 90% (LPB) and 70% (UPB); B̄ = 20 budget per sample.
Figure 7. Illustration of our framework for LLM utility evaluation: (i) collecting data via dynamic …
Figure 8. Toxicity dataset: coverage rate, LPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 9. Toxicity dataset: number of observed unsafe events, mean inverse-probability weight, coverage variance, and LPB variance by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 10. RedTeam dataset with Qwen 2.5 14B Instruct as a judge: coverage rate, LPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 11. RedTeam dataset with Qwen 2.5 14B Instruct as a judge: number of observed unsafe events, mean inverse-probability weight, coverage variance, and LPB variance by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 12. RedTeam dataset with Llama-Guard as a judge: coverage rate, LPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 10 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 13. RedTeam dataset with Llama-Guard as a judge: number of observed unsafe events, mean inverse-probability weight, coverage variance, and LPB variance by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 10 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 14. Hallucination dataset: coverage rate, LPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 10 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 15. Hallucination dataset: number of observed successful events, mean inverse-probability weight, coverage variance, and LPB variance by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 10 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 16. AutoIF (LPB) dataset: coverage rate, LPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 17. AutoIF (LPB) dataset: number of observed successful events, mean inverse-probability weight, coverage variance, and LPB variance by various methods across four target LLMs. Target coverage rate: 90%; B̄ = 20 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 18. AutoIF (UPB) dataset: coverage rate, UPB size, coverage deviation, and budget utilized by various methods across four target LLMs. Target coverage rate: 70%; B̄ = 30 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 19. AutoIF (UPB) dataset: number of observed successful events, mean inverse-probability weight, coverage variance, and UPB variance by various methods across four target LLMs. Target coverage rate: 70%; B̄ = 30 budget per sample. Performance metrics are taken over 50 random splits of the calibration and test sets.
Figure 20. Population-level safety metric estimation on the Toxicity dataset with Qwen 2.5 14B …
Figures 21–25. Population-level metric estimation on the … (captions truncated in extraction)
Figure 26. Impact of first calibration set split size ( …
Figure 27. Impact of score informativeness (λ) on empirical coverage, budget consumed per sample, and mean weight on the Toxicity dataset. The Qwen 2.5 14B Instruct model serves as both attacker and target, with a first-split set size of N1 = 100. Scores are corrupted by injecting random noise with level λ. Nominal coverage level: 1 − α = 90%; target budget 20. Shaded regions denote semi-deviation …
Figure 28. Impact of the nominal budget per sample on empirical coverage, budget consumed per …
Original abstract

Evaluating and predicting the performance of large language models (LLMs) in multi-turn conversational settings is critical yet computationally expensive; key events -- e.g., jailbreaks or successful task completion by an agent -- often emerge only after repeated interactions. These events might be rare, and under any feasible computational budget, remain unobserved. Recent conformal survival frameworks construct reliable lower predictive bounds (LPBs) on the number of iterations to trigger the event of interest, but rely on static budget allocation that is inefficient in multi-turn setups. To address this, we introduce Dynamic Allocation via PRojected Optimization (DAPRO), the first theoretically valid dynamic budget allocation framework for bounding the time-to-event in multi-turn LLM interactions. We prove that DAPRO satisfies the budget constraint and provides distribution-free, finite-sample coverage guarantees without requiring the conditional independence between censoring and event times assumed by prior conformal survival approaches. A key theoretical contribution is a novel coverage bound that scales with the square root of the mean censoring weight rather than the worst-case weight, yielding provably tighter guarantees than prior work. Furthermore, DAPRO can be employed to obtain unbiased, low-variance estimates of population-level evaluation metrics, such as the jailbreak rate, under limited computing resources. Comprehensive experiments across agentic task success, adversarial jailbreaks, toxic content generation, and RAG hallucinations using LLMs such as Llama 3.1 and Qwen 2.5 demonstrate that DAPRO consistently achieves coverage closer to the nominal level with lower variance than static baselines, while satisfying the budget constraint.
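For orientation, the uncensored split-conformal construction that such LPBs generalize can be sketched in a few lines. This deliberately omits censoring and weighting, which is precisely where the paper's contribution lives; the uniform calibration times below are invented for illustration:

```python
import math
import random

def conformal_lpb(cal_times, alpha=0.1):
    """Uncensored split-conformal lower predictive bound: with
    exchangeable calibration event times, the k-th smallest time,
    k = floor(alpha * (n + 1)), satisfies P(T_new >= LPB) >= 1 - alpha.
    Handling times censored by a (dynamic) budget is the hard part
    that this sketch skips."""
    n = len(cal_times)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        return 0   # too few calibration points: fall back to a trivial bound
    return sorted(cal_times)[k - 1]

rng = random.Random(3)
cal = [rng.randint(1, 100) for _ in range(999)]
lpb = conformal_lpb(cal, alpha=0.1)
test_times = [rng.randint(1, 100) for _ in range(5000)]
cov = sum(t >= lpb for t in test_times) / len(test_times)
# cov lands at or slightly above the nominal 0.90 on this toy data
```

The rank-based guarantee needs no model of the event-time distribution; all the difficulty enters once observed times are `min(T, C)` and `C` depends on a budget policy.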

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DAPRO (Dynamic Allocation via PRojected Optimization), a dynamic budget allocation framework for multi-turn LLM evaluations that constructs lower predictive bounds on time-to-event (e.g., jailbreaks or task success) using conformal survival analysis. It claims to satisfy hard budget constraints while delivering distribution-free finite-sample coverage guarantees without the conditional independence assumption between censoring and event times required by prior work, via a novel bound that scales with the square root of the mean censoring weight rather than the worst-case weight. Experiments on agentic tasks, adversarial jailbreaks, toxicity, and RAG hallucinations with models such as Llama 3.1 and Qwen 2.5 are reported to achieve coverage closer to the nominal level with lower variance than static baselines.

Significance. If the coverage guarantees survive the dynamic projection step, the work would meaningfully advance compute-efficient LLM evaluation for rare conversational events by enabling adaptive allocation and unbiased population-level metric estimates (e.g., jailbreak rates) under fixed budgets. The relaxation of the independence assumption and the tighter sqrt-mean-weight bound represent concrete improvements over existing conformal survival methods, with potential practical impact in safety and capability assessment pipelines.

major comments (2)
  1. [§4] §4 (Theoretical Analysis), the projected optimization step in DAPRO: the central finite-sample coverage claim for the dynamic case rests on the projection operator preserving exchangeability of the nonconformity scores and the validity of the conformal p-values when censoring weights are updated on the fly. The provided proof sketch does not explicitly address whether the dependence introduced by dynamic weight updates (which depend on observed outcomes) violates the conditions needed to carry over the static conformal guarantee; a detailed argument or counter-example analysis is required, as this is load-bearing for the claim of validity without conditional independence.
  2. [§5] §5 (Experiments), quantitative results: the abstract states that DAPRO achieves 'coverage closer to the nominal level with lower variance,' but the manuscript must report explicit numerical comparisons (e.g., empirical coverage rates, variance values, and p-values against static conformal baselines) across all tasks and models, including confidence intervals or standard errors, to substantiate the improvement and allow assessment of effect sizes.
minor comments (2)
  1. [§2] Notation for censoring weights and nonconformity scores should be defined once in §2 or §3 and used consistently; current usage mixes w_i and W_t without a clear mapping to the dynamic update rule.
  2. [§3] The manuscript would benefit from an explicit statement of the computational complexity of the projected optimization step per iteration, as this affects practical deployability under tight budgets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help clarify the presentation of our theoretical results and strengthen the experimental reporting. We address each major comment below and will incorporate revisions to improve the manuscript.

Point-by-point responses
  1. Referee: §4 (Theoretical Analysis), the projected optimization step in DAPRO: the central finite-sample coverage claim for the dynamic case rests on the projection operator preserving exchangeability of the nonconformity scores and the validity of the conformal p-values when censoring weights are updated on the fly. The provided proof sketch does not explicitly address whether the dependence introduced by dynamic weight updates (which depend on observed outcomes) violates the conditions needed to carry over the static conformal guarantee; a detailed argument or counter-example analysis is required, as this is load-bearing for the claim of validity without conditional independence.

    Authors: We appreciate the referee's emphasis on rigorously justifying the dynamic projection step, which is indeed central to our coverage claims. The proof sketch in the manuscript relies on the projection being a deterministic, contractive operator applied to weights that are predictable with respect to the filtration of observed outcomes, thereby preserving the exchangeability of the nonconformity scores and the super-martingale property of the p-values. This allows the finite-sample guarantee to carry over without invoking conditional independence between censoring and event times. To address the concern directly, we will expand §4 in the revision with a dedicated lemma that formally proves preservation of exchangeability under the dynamic updates, including an explicit argument that the dependence structure remains controlled by the square-root mean-weight bound. We do not believe a counter-example exists within our stated assumptions, but welcome any specific counter-example the referee may have in mind so that we can address it. revision: yes

  2. Referee: §5 (Experiments), quantitative results: the abstract states that DAPRO achieves 'coverage closer to the nominal level with lower variance,' but the manuscript must report explicit numerical comparisons (e.g., empirical coverage rates, variance values, and p-values against static conformal baselines) across all tasks and models, including confidence intervals or standard errors, to substantiate the improvement and allow assessment of effect sizes.

    Authors: We agree that explicit numerical values and statistical comparisons will make the experimental claims more transparent and allow readers to better evaluate effect sizes. Although the current manuscript presents these trends via figures across the agentic, jailbreak, toxicity, and RAG tasks with Llama 3.1 and Qwen 2.5, we will add a dedicated table (or expanded results section) in the revision that reports the precise empirical coverage rates, variance values, standard errors, and p-values from paired statistical tests against the static conformal baselines for every task-model combination. This will directly substantiate the abstract statement with the quantitative detail requested. revision: yes

Circularity Check

0 steps flagged

No circularity: DAPRO's guarantees rest on new theoretical extensions of conformal prediction

full rationale

The paper introduces DAPRO as a dynamic allocation method and states that it proves budget satisfaction plus distribution-free finite-sample coverage without the conditional independence assumption of prior conformal survival work. The key novelty is a coverage bound scaling with sqrt(mean censoring weight) together with a projected optimization step. No equation or claim reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the derivation chain is presented as an independent extension whose validity is established inside the paper rather than imported from the authors' prior results. The skeptic concern about the projection operator is a question of proof correctness, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on extensions of conformal prediction theory to dynamic allocation and a novel coverage bound. No explicit free parameters are described in the abstract. The method relaxes the conditional independence assumption of prior work.

axioms (2)
  • standard math Distribution-free finite-sample coverage guarantees from conformal prediction theory
    Invoked to obtain the LPBs and the new coverage result.
  • domain assumption The projected optimization preserves the coverage properties when allocations are made dynamic
    Required for the dynamic framework to inherit the finite-sample guarantees.
invented entity (1)
  • DAPRO (Dynamic Allocation via PRojected Optimization): no independent evidence
    purpose: Dynamic budget allocation procedure for time-to-event bounding
    New algorithmic framework introduced by the paper.

pith-pipeline@v0.9.0 · 5590 in / 1498 out tokens · 74174 ms · 2026-05-08T12:09:18.713124+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

79 extracted references · 25 canonical work pages · 13 internal anchors

  1. [1]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021

  2. [2]

    A Survey of Large Language Models

    Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models.arXiv preprint arXiv:2303.18223, 2023

  3. [3]

    RealToxic- ityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A Smith. RealToxic- ityPrompts: Evaluating Neural Toxic Degeneration in Language Models. InFindings of the Association for Computational Linguistics: EMNLP, pages 3356–3369, 2020

  4. [4]

    Extracting training data from large language models

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Kather- ine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21), pages 2633–2650, 2021

  5. [5]

    Survey of hallucination in natural language generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

  6. [6]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

  7. [7]

    Automating customer service using langchain: Building custom open-source gpt chatbot for organizations.arXiv preprint arXiv:2310.05421, 2023

    Keivalya Pandya and Mehfuza Holia. Automating customer service using langchain: Building custom open-source gpt chatbot for organizations.arXiv preprint arXiv:2310.05421, 2023

  8. [8]

    Watermark in the classroom: A conformal framework for adaptive ai usage detection.arXiv preprint arXiv:2507.23113, 2025

    Yangxinyu Xie, Xuyang Chen, Zhimei Ren, and Weijie J Su. Watermark in the classroom: A conformal framework for adaptive ai usage detection.arXiv preprint arXiv:2507.23113, 2025. 10

  9. [9]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025

  10. [10]

    Red Teaming Language Models with Language Models

    Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

  11. [11]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  12. [12]

    Calibrated predictive lower bounds on time-to-unsafe-sampling in LLMs

    Hen Davidov, Shai Feldman, Gilad Freidkin, and Yaniv Romano. Calibrated predictive lower bounds on time-to-unsafe-sampling in LLMs. InThe 29th International Conference on Artificial Intelligence and Statistics, 2026

  13. [13]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  14. [14]

    Conformalized survival analysis with adaptive cut-offs.Biometrika, 111(2):459–477, 2024

    Yu Gui, Rohan Hore, Zhimei Ren, and Rina Foygel Barber. Conformalized survival analysis with adaptive cut-offs.Biometrika, 111(2):459–477, 2024

  15. [15]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  16. [16]

    Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

    Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

  17. [17]

    Self-play with execution feedback: Improving instruction-following capabilities of large language models

    Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. InThe Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743, 2025

  20. [20]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  21. [21]

    Detoxify

    Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020

  22. [22]

    A survey on llm-as-a-judge.The Innovation, 2024

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

  23. [23]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  24. [24]

    and Bates, Stephen and Cand

    Anastasios N Angelopoulos, Stephen Bates, Emmanuel J Candès, Michael I Jordan, and Lihua Lei. Learn then test: Calibrating predictive algorithms to achieve risk control.arXiv preprint arXiv:2110.01052, 2021. 11

  25. [25]

    Anastasios N Angelopoulos

    Anastasios N Angelopoulos. Conformal risk control for non-monotonic losses.arXiv preprint arXiv:2602.20151, 2026

  26. [26]

    Predictive inference with the jackknife+.The Annals of Statistics, 49(1):486–507, 2021

    Rina Foygel Barber, Emmanuel J Candès, Aaditya Ramdas, and Ryan J Tibshirani. Predictive inference with the jackknife+.The Annals of Statistics, 49(1):486–507, 2021

  27. [27]

    Leave-one-out stable conformal prediction.arXiv preprint arXiv:2504.12189, 2025

    Kiljae Lee and Yuan Zhang. Leave-one-out stable conformal prediction.arXiv preprint arXiv:2504.12189, 2025

  28. [28]

    Nested conformal prediction and quantile out-of-bag ensemble methods.Pattern Recognition, 127:108496, 2022

    Chirag Gupta, Arun K Kuchibhotla, and Aaditya Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods.Pattern Recognition, 127:108496, 2022

  29. [29]

    Improving conditional coverage via orthogo- nal quantile regression.Advances in neural information processing systems, 34:2060–2071, 2021

    Shai Feldman, Stephen Bates, and Yaniv Romano. Improving conditional coverage via orthogo- nal quantile regression.Advances in neural information processing systems, 34:2060–2071, 2021

  30. [30]

    Conformal prediction with condi- tional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 2025

    Isaac Gibbs, John J Cherian, and Emmanuel J Candès. Conformal prediction with condi- tional guarantees.Journal of the Royal Statistical Society Series B: Statistical Methodology, 87(4):1100–1126, 2025

  31. [31]

    Adaptive conformal inference under distribution shift

    Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021

  32. [32]

    Achieving risk control in online learning settings.Transactions on Machine Learning Research, 2023

    Shai Feldman, Liran Ringel, Stephen Bates, and Yaniv Romano. Achieving risk control in online learning settings.Transactions on Machine Learning Research, 2023

[33] Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, and Yaniv Romano. Label noise robustness of conformal prediction. Journal of Machine Learning Research, 25(328):1–66, 2024.

[34] Coby Penso and Jacob Goldberger. A conformal prediction score that is robust to label noise. arXiv preprint arXiv:2405.02648, 2024.

[35] Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A. Hale, and Paul Röttger. SimpleSafetyTests: A test suite for identifying critical safety risks in large language models. arXiv preprint arXiv:2311.08370, 2023.

[36] Traian Rebedea, Razvan Dinu, Makesh Narsimhan Sreedhar, Christopher Parisien, and Jonathan Cohen. NeMo Guardrails: A toolkit for controllable and safe LLM applications with programmable rails. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 431–445, 2023.

[37] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

[38] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023.

[39] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box LLMs automatically. Advances in Neural Information Processing Systems, 37:61065–61105, 2024.

[40] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, and Himabindu Lakkaraju. Certifying LLM safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.

[41] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684, 2023.

[42] Jannik Kossen, Sebastian Farquhar, Yarin Gal, and Tom Rainforth. Active testing: Sample-efficient model evaluation. In International Conference on Machine Learning, pages 5753–5763. PMLR, 2021.

[43] Yang Li, Jie Ma, Miguel Ballesteros, Yassine Benajiba, and Graham Horwood. Active evaluation acquisition for efficient LLM benchmarking. arXiv preprint arXiv:2410.05952, 2024.

[44] Jin Peng Zhou, Christian K. Belardi, Ruihan Wu, Travis Zhang, Carla P. Gomes, Wen Sun, and Kilian Q. Weinberger. On speeding up language model evaluation. In The Thirteenth International Conference on Learning Representations, 2025.

[45] Ganghua Wang, Zhaorun Chen, Bo Li, and Haifeng Xu. Cer-Eval: Certifiable and cost-efficient evaluation framework for LLMs. arXiv preprint arXiv:2505.03814, 2025.

[46] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World, volume 29. Springer, 2005.

[47] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.

[48] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In European Conference on Machine Learning, pages 345–356. Springer, 2002.

[49] Ryan J. Tibshirani, Rina Foygel Barber, Emmanuel Candès, and Aaditya Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32, 2019.

[50] Emmanuel Candès, Lihua Lei, and Zhimei Ren. Conformalized survival analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):24–45, 2023.

[51] Hen Davidov, Shai Feldman, Gil Shamai, Ron Kimmel, and Yaniv Romano. Conformalized survival analysis for general right-censored data. In International Conference on Learning Representations, 2025.

[52] Sangdon Park, Osbert Bastani, Nikolai Matni, and Insup Lee. PAC confidence sets for deep neural networks via calibrated prediction. arXiv preprint arXiv:2001.00106, 2019.

[53] Stephen Bates, Anastasios Angelopoulos, Lihua Lei, Jitendra Malik, and Michael Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM, 68(6):1–34, 2021.

[54] Stephen Bates, Emmanuel Candès, Lihua Lei, Yaniv Romano, and Matteo Sesia. Testing for outliers with conformal p-values. The Annals of Statistics, 51(1):149–178, 2023.

[55] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52, July 2010.

[56] Valerii Vadimovich Fedorov. Theory of Optimal Experiments. Elsevier, 2013.

[57] Tijana Zrnic and Emmanuel Candès. Active statistical inference. In International Conference on Machine Learning, 2024.

[58] Jinyong Hahn, Keisuke Hirano, and Dean Karlan. Adaptive experimental design using the propensity score. Journal of Business & Economic Statistics, 29(1):96–108, 2011.

[59] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

[60] Aleksandrs Slivkins. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.

[61] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, 2002.

[62] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1. JMLR Workshop and Conference Proceedings, 2012.

[63] Anastasios Nikolas Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal risk control. In The Twelfth International Conference on Learning Representations, 2024.

[64] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics.

[65] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[66] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[67] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.

[68] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[69] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024.

[70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[71] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[72] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library.

[73] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

A Additional Related work

Our work sits at th...

By construction, for all $i \in \mathcal{I}_{\mathrm{cal}_2}$:
$(\xi_i, T_i) \perp \{(\xi_j, T_j)\}_{j \in \mathcal{I}_{\mathrm{cal}_2} \setminus \{i\}} \mid \{(X_k, H_k, T_k)\}_{k \in \mathcal{I}_{\mathrm{cal}_2}}, \mathcal{D}_{\mathrm{cal}_1}$.
Therefore, given $\{(X_k, H_k, T_k)\}_{k \in \mathcal{I}_{\mathrm{cal}_2}}$ and $\mathcal{D}_{\mathrm{cal}_1}$, the censoring times $\{C_i\}_{i \in \mathcal{I}_{\mathrm{cal}_2}}$ are mutually independent, since each $C_i$ is a deterministic function of $(\xi_i, T_i, X_i, H_i)$.

The marginal law of $(X_i, T_i, H_i)$ for $i \in \mathcal{I}_{\mathrm{cal}_2}$ is independent of $\mathcal{D}_{\mathrm{cal}_1}$ given $\mathcal{D}_{\mathrm{train}}$, and $\{(X_i, T_i, H_i)\}_{i \in \mathcal{I}_{\mathrm{cal}_2}}$ are mutually independent given $(\mathcal{D}_{\mathrm{cal}_1}, \mathcal{D}_{\mathrm{train}})$. By Proposition 1, the weights used by the algorithm are correct:
$$w_\tau(i) = \mathbb{P}\big[\, C_i \geq \hat{f}_\tau(X_i) \mid X_i, H_i, T_i, \mathcal{D}_{\mathrm{cal}_1}, \mathcal{D}_{\mathrm{train}} \big]^{-1}$$
for all $i \in \mathcal{I}_{\mathrm{cal}}$ such that $C_i = \hat{f}_{\mathrm{prior}}(X_i)$ and $\tau \in \mathcal{T} = [0, \tau_{\mathrm{prior}}]$. Every $i \in \mathcal{I}...$
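The inverse weights above are standard inverse-probability-of-censoring weights: a case is up-weighted by the reciprocal of its probability of remaining uncensored at the candidate bound. A minimal sketch, assuming the censoring survival function $P(C \geq t \mid x)$ is available (the names `ipcw_weights` and `censor_survival` are hypothetical, not from the paper):

```python
import numpy as np

def ipcw_weights(f_tau, censor_survival, X):
    """w_tau(i) = 1 / P(C_i >= f_hat_tau(X_i) | X_i).

    f_tau(x):              candidate lower bound f_hat_tau(x) for case x.
    censor_survival(t, x): P(C >= t | x), assumed known or estimated.
    """
    return np.array([1.0 / censor_survival(f_tau(x), x) for x in X])

# Toy check: censoring uniform on [0, 10], so P(C >= t) = 1 - t / 10.
X = np.array([1.0, 2.0, 3.0])
w = ipcw_weights(lambda x: x, lambda t, x: 1.0 - t / 10.0, X)
# Larger candidate bounds are likelier to be censored, so weights increase.
```

In the toy example the weights are 1/0.9, 1/0.8, 1/0.7, growing with the candidate bound, which mirrors how harder-to-observe cases count for more in the calibration sum.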

(Mutual independence) For all $i \in \mathcal{I}_{\mathrm{cal}_2}$:
$(C_i, T_i) \perp \{(C_j, T_j)\}_{j \in \mathcal{I}_{\mathrm{cal}_2} \setminus \{i\}} \mid \{(X_k, H_k, T_k)\}_{k \in \mathcal{I}_{\mathrm{cal}_2}}, \mathcal{D}_{\mathrm{cal}_1}$.

(Marginal independence of calibration data) The marginal law of $(X_i, T_i, H_i)$ for $i \in \mathcal{I}_{\mathrm{cal}_2}$ is independent of $\mathcal{D}_{\mathrm{cal}_1}$ given $\mathcal{D}_{\mathrm{train}}$, and $\{(X_i, T_i, H_i)\}_{i \in \mathcal{I}_{\mathrm{cal}_2}}$ are mutually independent given $(\mathcal{D}_{\mathrm{cal}_1}, \mathcal{D}_{\mathrm{train}})$.

$$\mathbb{E}\Big[\, \prod_{i \in S} f_i(X_i, T_i, H_i, C_i) \,\Big|\, \mathcal{D}_{\mathrm{cal}_1}, \mathcal{D}_{\mathrm{train}} \Big] = \mathbb{E}[\ldots]$$

(Bounded mean weight) There exists a constant $\bar{w} \geq 1$, which may depend on $\mathcal{D}_{\mathrm{train}}$ but not on $\mathcal{D}_{\mathrm{cal}_1}$, such that almost surely over $\mathcal{D}_{\mathrm{cal}_1}$: $\mathbb{E}[w_\tau(i) \mid \mathcal{D}_{\mathrm{cal}_1}, \mathcal{D}_{\mathrm{train}}] \leq \bar{w}$ for all $i \in \mathcal{I}_{\mathrm{cal}_2}$. Define the estimated miscoverage rate and calibrated quantile level by:
$$\hat{\alpha}(\tau) := \frac{1}{|\mathcal{I}_{\mathrm{cal}}|} \sum_{i \in \mathcal{I}_{\mathrm{cal}}} w_\tau(i)\, \mathbb{I}\big\{ \tilde{T}_i < \hat{f}_\tau(X_i) \leq C_i \big\}, \qquad \hat{\tau} := \sup\Big\{ \tau \in \mathcal{T} : \sup_{\tau' \leq \tau} \hat{\alpha}(\tau') \leq \alpha \Big\}.$$
We remark that ...
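The two displayed definitions can be sketched directly: compute the weighted miscoverage estimate $\hat{\alpha}(\tau)$ on a grid of $\tau$ values, then take the largest $\tau$ whose running supremum of $\hat{\alpha}$ stays below $\alpha$. This is a minimal reconstruction assuming a finite grid and precomputed weights and bounds; the function name and array layout are illustrative, not the paper's implementation:

```python
import numpy as np

def calibrate_tau(taus, weights, lower_bounds, T_obs, C, alpha):
    """tau_hat = sup{tau in grid : sup_{tau' <= tau} alpha_hat(tau') <= alpha}.

    taus:         increasing grid of candidate quantile levels.
    weights:      (len(taus), n) array of censoring weights w_tau(i).
    lower_bounds: (len(taus), n) array of candidate bounds f_hat_tau(X_i).
    T_obs, C:     observed event times T~_i and censoring times C_i.
    """
    n = len(T_obs)
    # alpha_hat(tau): weighted count of miscovered yet uncensored cases.
    miscover = (T_obs[None, :] < lower_bounds) & (lower_bounds <= C[None, :])
    alpha_hat = (weights * miscover).sum(axis=1) / n
    # Running maximum enforces the inner sup over tau' <= tau.
    feasible = np.maximum.accumulate(alpha_hat) <= alpha
    return taus[feasible][-1] if feasible.any() else None
```

The running maximum makes the feasible set downward-closed, so the chosen $\hat{\tau}$ cannot exploit a non-monotone dip in $\hat{\alpha}(\tau)$.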

$$\mathbb{E}\Big[ \sum_{i \in \mathcal{I}_{\mathrm{cal}_1}} b^{\mathrm{exp}}_i(\hat{\lambda}) \,\Big|\, \mathcal{D}_{\mathrm{cal}_1} \Big] + \mathbb{E}\big[ b^{\mathrm{emp}}_j(\hat{\lambda}) \,\big|\, \mathcal{D}_{\mathrm{cal}_1} \big] \leq (N_1 + 1)\, \mathbb{E}\big[ \bar{B}^2 \,\big|\, \mathcal{D}_{\mathrm{cal}_1} \big].$$
Since $\mathbb{E}\big[ b^{\mathrm{emp}}_j(\hat{\lambda}) \mid \mathcal{D}_{\mathrm{cal}_1} \big] = \mathbb{E}\big[ b^{\mathrm{exp}}_j(\hat{\lambda}) \mid \mathcal{D}_{\mathrm{cal}_1} \big]$, we set: $\mathbb{E}...$

2. Multilinearity: $B$ is a polynomial in $\{P(j)\}_{j=1}^{t_{\max}}$ that is linear in each coordinate $P(j)$ when all other coordinates are held fixed.
3. Monotonicity: $B$ is monotonically increasing in each coordinate $P(j)$.

Proof. (i) Each product $\prod_{j=1}^{t} P(j) \in [0, 1]$ and there are $t_{\max}$ terms. (ii) Direct inspection of the polynomial structure. (iii) For any $j_0$, increasing $P(j_0)$...
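The proof's step (i) indicates that $B$ is a sum of $t_{\max}$ terms, each a cumulative product $\prod_{j=1}^{t} P(j) \in [0,1]$. Assuming that form (a reconstruction; the lemma's own definition of $B$ is truncated in this excerpt), the boundedness and monotonicity claims can be checked numerically:

```python
import numpy as np

def B(P):
    # Assumed form: B(P) = sum_{t=1}^{t_max} prod_{j=1}^{t} P(j),
    # i.e. a sum of t_max cumulative products of the P(j).
    return float(np.cumprod(P).sum())

P = np.array([0.9, 0.5, 0.2])
t_max = len(P)
assert 0.0 <= B(P) <= t_max   # (i) each term lies in [0, 1], t_max terms total
P_up = P.copy()
P_up[1] = 0.6                 # raise a single coordinate P(j0)
assert B(P_up) >= B(P)        # (iii) monotone in each coordinate
```

Multilinearity (ii) is visible in the same form: each cumulative product contains every $P(j)$ at most once, so fixing all other coordinates leaves an affine function of the remaining one.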