Recognition: 2 theorem links
· Lean Theorem · Conformal Agent Error Attribution
Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3
The pith
Conformal prediction attributes errors in multi-agent trajectories using contiguous sets with finite-sample guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework applies conformal prediction to agent trajectories by introducing filtration-based algorithms that generate contiguous prediction sets. These sets provide distribution-free coverage guarantees and enable efficient recovery by rolling back the multi-agent system to correct its errors. The approach is model-agnostic and supplies principled uncertainty quantification for error attribution.
What carries the argument
Filtration-based conformal prediction for sequential data, which constructs contiguous sequence sets that contain the error location with finite-sample guarantees.
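The mechanism can be sketched in a few lines. This is an illustrative split-conformal construction with a contiguous hull, not the paper's exact filtration-based algorithm, and the calibration scores below are invented:

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(cal_scores)[min(k, n) - 1]

def contiguous_prediction_set(step_scores, qhat):
    """Steps scoring at or below qhat, widened to one contiguous window.

    The contiguous hull only enlarges the underlying conformal set, so the
    marginal coverage guarantee is preserved while enabling a single rollback.
    """
    keep = np.flatnonzero(np.asarray(step_scores) <= qhat)
    if keep.size == 0:
        return []
    return list(range(keep[0], keep[-1] + 1))

# calibrate on nonconformity scores of the true error step (invented numbers)
cal_scores = [0.12, 0.30, 0.25, 0.41, 0.08, 0.55, 0.19, 0.33, 0.27, 0.45]
qhat = conformal_quantile(cal_scores, alpha=0.2)  # 0.45 here
window = contiguous_prediction_set([0.9, 0.2, 0.8, 0.3, 0.95], qhat)  # [1, 2, 3]
```

Because the hull is a superset of the plain conformal set, any step the standard method would flag is guaranteed to sit inside the single window returned here.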
If this is right
- Errors in long interaction traces can be isolated precisely using the contiguous sets.
- The multi-agent system can use the sets to roll back and correct its own errors autonomously.
- The method works across different agents and datasets while remaining model-agnostic.
- It adds a layer of uncertainty awareness to error attribution in multi-agent systems.
Where Pith is reading between the lines
- Similar techniques could apply to debugging single large language model chains or reinforcement learning episodes by identifying failure steps.
- Contiguous sets might integrate with existing logging tools to reduce the manual effort in tracing agent mistakes.
- If the coverage holds in practice, it could lead to more reliable autonomous agent deployments in real-world tasks.
- The approach suggests a general way to add safety layers to sequential decision systems without retraining the underlying models.
Load-bearing premise
Agent trajectories must satisfy the exchangeability or filtration conditions required for the conformal algorithms to deliver the stated coverage guarantees without post-hoc adjustment.
What would settle it
Test the method on a collection of agent trajectories with known error locations; if the proportion of sets containing the true error falls below the nominal coverage level or if the sets fail to isolate the error efficiently, the guarantees do not hold in this setting.
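That falsification test can be run in miniature on synthetic, exchangeable scores. Every number below is an invented stand-in for real trajectories, so this checks only the mechanics of the coverage claim, not the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_cal, n_test, T = 0.1, 500, 500, 20

def make_scores(n):
    """Synthetic nonconformity scores; the true error step tends to score lower."""
    s = rng.uniform(0.5, 1.0, size=(n, T))
    true_steps = rng.integers(0, T, size=n)
    s[np.arange(n), true_steps] = rng.uniform(0.0, 0.6, size=n)
    return s, true_steps

cal_s, cal_t = make_scores(n_cal)
test_s, test_t = make_scores(n_test)

# calibrate the conformal quantile on the true error step's score
cal_scores = cal_s[np.arange(n_cal), cal_t]
qhat = np.sort(cal_scores)[int(np.ceil((n_cal + 1) * (1 - alpha))) - 1]

# prediction set per trajectory: contiguous hull of steps scoring <= qhat
covered = 0
for s, t in zip(test_s, test_t):
    keep = np.flatnonzero(s <= qhat)
    covered += bool(keep.size) and keep[0] <= t <= keep[-1]

coverage = covered / n_test  # at or above 1 - alpha if the guarantee holds
```

If `coverage` fell clearly below 1 − alpha, the guarantee would be falsified in this setting; on exchangeable scores like these it should not.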
Original abstract
When multi-agent systems (MAS) fail, identifying where the decisive error occurred is the first step for automated recovery to an earlier state. Error attribution remains a fundamental challenge due to the long interaction traces that large language model-based MAS generate. This paper presents a framework for error attribution based on conformal prediction (CP) which provides finite-sample, distribution-free coverage guarantees. We introduce new algorithms for filtration-based CP designed for sequential data such as agent trajectories. Unlike existing CP algorithms, our approach predicts sets that are contiguous sequences to enable efficient recovery and debugging. We verify our theoretical guarantees on a variety of agents and datasets, show that errors can be precisely isolated, then use prediction sets to rollback MAS to correct their own errors. Our overall approach is model-agnostic, and offers a principled uncertainty layer for MAS error attribution. We release code at https://github.com/layer6ai-labs/conformal-agent-error-attribution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce a conformal prediction (CP) framework for error attribution in multi-agent systems (MAS) using LLM-based agents. It develops new filtration-based CP algorithms for sequential trajectories that output contiguous prediction sets, asserting finite-sample distribution-free coverage guarantees. These are verified on various agents and datasets to enable precise error isolation and rollback for self-correction, with the overall approach being model-agnostic and code released at a GitHub link.
Significance. If the coverage guarantees hold for causally dependent, non-stationary trajectories, this would represent a meaningful advance in providing principled, distribution-free uncertainty quantification for debugging and recovering from failures in complex MAS. The focus on contiguous sets for efficient recovery, combined with empirical verification and code release, strengthens the potential impact for automated error handling in agentic systems.
major comments (2)
- Abstract and theoretical claims: The finite-sample, distribution-free coverage guarantees for the new filtration-based CP algorithms are load-bearing for the error isolation and rollback mechanism. Standard CP requires exchangeability, while filtration variants need adapted conditional properties (e.g., martingale structure); the manuscript must explicitly derive how these hold for LLM-generated trajectories, which are causally dependent and non-stationary, rather than assuming the conditions are met without post-hoc adjustment.
- Empirical section on verification: The abstract asserts verification of theoretical guarantees and precise isolation across agents/datasets, but without reported details on how filtrations are constructed per trajectory, empirical coverage rates (e.g., whether they match nominal levels without tuning), data splits, or checks against violation of exchangeability, the claim of no post-hoc adjustments remains unassessed and risks undermining the central rollback application.
minor comments (1)
- The abstract could more explicitly name the agents, datasets, and nominal coverage levels used in verification to aid immediate assessment of the empirical claims.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our theoretical guarantees and empirical verification. We address each major comment below and have revised the manuscript to incorporate the requested details and derivations.
Point-by-point responses
-
Referee: Abstract and theoretical claims: The finite-sample, distribution-free coverage guarantees for the new filtration-based CP algorithms are load-bearing for the error isolation and rollback mechanism. Standard CP requires exchangeability, while filtration variants need adapted conditional properties (e.g., martingale structure); the manuscript must explicitly derive how these hold for LLM-generated trajectories, which are causally dependent and non-stationary, rather than assuming the conditions are met without post-hoc adjustment.
Authors: We agree that an explicit derivation is necessary to substantiate the claims for causally dependent, non-stationary trajectories. In the revised manuscript, we have added a new subsection (Section 3.3) that derives the coverage guarantees step by step. Specifically, we define the filtration as the increasing sequence of sigma-algebras generated by the observed history up to each time step in the agent trajectory. We then show that the conformity scores, computed as the negative log-likelihood of the next action under the agent's policy conditioned on the filtration, satisfy the required martingale property. This allows the standard conformal prediction argument to be applied conditionally on the filtration, yielding finite-sample, distribution-free marginal coverage without any post-hoc adjustments or assumptions of stationarity. The derivation builds directly on existing results for conformal prediction under filtrations and is now cross-referenced in the abstract and introduction. revision: yes
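The score construction the response describes might be sketched as follows; `policy_logprob` is a hypothetical interface to the agent's policy, not an API from the paper's released code:

```python
import math

def nll_conformity_scores(trajectory, policy_logprob):
    """Per-step conformity scores: negative log-likelihood of each action
    taken, conditioned on the history observed so far (the filtration at t).

    trajectory: list of (observation, action) pairs.
    policy_logprob(history, obs, action): log p(action | history, obs),
    a hypothetical hook into the agent's policy.
    """
    scores, history = [], []
    for obs, action in trajectory:
        scores.append(-policy_logprob(history, obs, action))
        history.append((obs, action))
    return scores

# usage with a dummy policy that is uniform over four actions
uniform = lambda history, obs, action: math.log(0.25)
scores = nll_conformity_scores([("obs1", "act1"), ("obs2", "act2")], uniform)
# each score equals -log(1/4) = log 4
```

The history passed at step t contains exactly the pairs observed before t, mirroring the increasing sigma-algebras in the derivation.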
-
Referee: Empirical section on verification: The abstract asserts verification of theoretical guarantees and precise isolation across agents/datasets, but without reported details on how filtrations are constructed per trajectory, empirical coverage rates (e.g., whether they match nominal levels without tuning), data splits, or checks against violation of exchangeability, the claim of no post-hoc adjustments remains unassessed and risks undermining the central rollback application.
Authors: We acknowledge that the original experimental section lacked sufficient implementation details. In the revised version, we have substantially expanded Section 5 with the following additions: (i) a precise description of filtration construction, where for each trajectory the filtration at step t is the sigma-algebra generated by all prior agent observations, actions, and LLM prompts up to t; (ii) tables reporting empirical coverage rates for nominal levels of 80%, 90%, and 95% across all agent types and datasets, confirming that observed coverage matches the nominal levels within sampling error and without any parameter tuning; (iii) explicit data split information (70% of trajectories used for calibration of the conformity score thresholds, 30% held out for evaluation of coverage and rollback performance); and (iv) an additional robustness experiment that compares coverage on original trajectories versus randomly shuffled versions to assess sensitivity to exchangeability violations. These revisions directly address the concern and strengthen the evidence for the rollback application. revision: yes
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper extends standard conformal prediction theory to new filtration-based algorithms for sequential agent trajectories, claiming finite-sample distribution-free coverage and contiguous prediction sets. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the guarantees derive from established CP properties (exchangeability or filtration conditions) applied to the new setting, with empirical verification on agents and datasets providing independent checks. The derivation remains self-contained against external CP benchmarks without renaming known results or smuggling ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Data points in agent trajectories satisfy the exchangeability or filtration conditions needed for conformal prediction to deliver finite-sample coverage guarantees.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
We introduce new algorithms for filtration-based CP designed for sequential data such as agent trajectories... predicts sets that are contiguous sequences
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear?
unclear: Relation between the paper passage and the cited Recognition theorem.
Theorem 3.4... prediction sets constructed as C_LF(x_{n+1}; q̂) = F_LF(x_{n+1}; q̂) satisfy 1 − α ≤ P[y*_{n+1} ∈ C_LF(x_{n+1}; q̂)] < 1 − α + 1/(n+1)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification
Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511, 2021
arXiv 2021
-
[2]
Uncertainty sets for image classifiers using conformal prediction
Anastasios N. Angelopoulos, Stephen Bates, Michael Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In International Conference on Learning Representations, 2021
2021
-
[3]
Where did it all go wrong? a hierarchical look into multi-agent error attribution
Adi Banerjee, Anirudh Nair, and Tarik Borogovac. Where did it all go wrong? a hierarchical look into multi-agent error attribution. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling, 2025
2025
-
[4]
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang, Lakshya A Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why Do Multi-Agent LLM Systems Fail? In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025
2025
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv:2110.14168, 2021
arXiv 2021
-
[6]
Conformal Prediction Sets Improve Human Decision Making
Jesse C. Cresswell, Yi Sui, Bhargava Kumar, and Noël Vouitsis. Conformal Prediction Sets Improve Human Decision Making. In Proceedings of the 41st International Conference on Machine Learning, volume 235, pages 9439–9457, 2024
2024
-
[7]
Conformal Prediction Sets Can Cause Disparate Impact
Jesse C. Cresswell, Bhargava Kumar, Yi Sui, and Mouloud Belbahri. Conformal Prediction Sets Can Cause Disparate Impact. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[8]
SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning
Alireza Ghafarollahi and Markus J. Buehler. SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning. Advanced Materials, 37(22), 2025. doi: 10.1002/adma.202413523
2025
-
[10]
Adaptive conformal inference under distribution shift
Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, volume 34, pages 1660–1672, 2021
2021
-
[11]
Large language model based multi-agents: A survey of progress and challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024. ISBN 978-1-956792-04-1. doi: 10.24963/ijcai.2024/890
-
[12]
Nested conformal prediction and quantile out-of-bag ensemble methods
Chirag Gupta, Arun K. Kuchibhotla, and Aaditya Ramdas. Nested conformal prediction and quantile out-of-bag ensemble methods. Pattern Recognition, 127:108496, 2022. ISSN 0031-3203. doi: 10.1016/j.patcog.2021.108496
-
[13]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021
2021
-
[14]
Hierarchical conformal classification
Floris den Hengst, Inès Blin, Majid Mohammadi, Syed Ihtesham Hussain Shah, and Taraneh Younesian. Hierarchical conformal classification. arXiv:2508.13288, 2025
-
[15]
MetaGPT: Meta programming for a multi-agent collaborative framework
Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2024
2024
-
[16]
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Dong Huang, Jie M Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv:2312.13010, 2023
arXiv 2023
-
[17]
Conformal prediction for deep classifier via label ranking
Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal prediction for deep classifier via label ranking. In Proceedings of the 41st International Conference on Machine Learning, 2024
2024
-
[18]
Aegis: Automated error generation and attribution for multi-agent systems
Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, and Xue Feng. Aegis: Automated error generation and attribution for multi-agent systems. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[19]
Document summarization with conformal importance guarantees
Bruce Kuwahara, Chen-Yuan Lin, Xiao Shi Huang, Kin Kwan Leung, Jullian Arta Yapeter, Ilya Stanevich, Felipe Perez, and Jesse C Cresswell. Document summarization with conformal importance guarantees. In Advances in Neural Information Processing Systems, volume 38, 2025
2025
-
[20]
Cresswell
Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. Classifying and addressing the diversity of errors in retrieval- augmented generation systems. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3185–3207,
-
[21]
doi: 10.18653/v1/2026.eacl-long.147
ISBN 979-8-89176-380-7. doi: 10.18653/v1/2026.eacl-long.147. 11
-
[22]
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration
Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration. In First Conference on Language Modeling, 2024
2024
-
[23]
Conformal prediction in hierarchical classification with constrained representation complexity
Thomas Mortier, Alireza Javanmardi, Yusuf Sale, Eyke Hüllermeier, and Willem Waegeman. Conformal prediction in hierarchical classification with constrained representation complexity. In The 29th International Conference on Artificial Intelligence and Statistics, 2026
2026
-
[24]
Scaling Large Language Model-based Multi-Agent Collaboration
Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling Large Language Model-based Multi-Agent Collaboration. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[25]
Temporal convolutional neural networks for diagnosis from lab tests
Narges Razavian and David Sontag. Temporal convolutional neural networks for diagnosis from lab tests. arXiv:1511.07938, 2015
-
[26]
Classification with valid and adaptive coverage
Yaniv Romano, Matteo Sesia, and Emmanuel Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, 2020
2020
-
[27]
Textual bayes: Quantifying prompt uncertainty in LLM-based systems
Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, and Jesse C. Cresswell. Textual bayes: Quantifying prompt uncertainty in LLM-based systems. In The Fourteenth International Conference on Learning Representations, 2026
2026
-
[28]
A tutorial on conformal prediction
Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008
2008
-
[29]
Algorithmic Learning in a Random World
Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005
2005
-
[30]
Large language models are diverse role-players for summarization evaluation
Ning Wu, Ming Gong, Linjun Shou, Shining Liang, and Daxin Jiang. Large language models are diverse role-players for summarization evaluation. In Natural Language Processing and Chinese Computing, pages 695–707. Springer Nature Switzerland, 2023. ISBN 978-3-031-44693-1
2023
-
[31]
Qwen3 technical report
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
2025
-
[32]
Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Zhenyu Cui, Rong Liu, Zhaozhuo Xu, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, and Qianqian Xie. FinCon: A Synthesized LLM Multi-Agent System with Conceptual Verbal Reinforcement for Enhanced Financial Decisio...
-
[33]
Yifan Yu, Moyan Li, Shaoyuan Xu, Jinmiao Fu, Xinhai Hou, Fan Lai, and Bryan Wang. CORRECT: COndensed eRror RECognition via knowledge Transfer in multi-agent systems. arXiv:2509.24088, 2025
-
[34]
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems. In Forty-second International Conference on Machine Learning, 2025
2025
-
[35]
RAFFLES: Reasoning-based attribution of faults for LLM systems
Chenyang Zhu, Spencer Hong, Jingyu Wu, Kushal Chawla, Yuhui Tang, Youbing Yin, Nathan Wolfe, Erin Babinsky, and Daben Liu. RAFFLES: Reasoning-based attribution of faults for LLM systems. In First Workshop on Multi-Turn Interactions in Large Language Models, 2025
2025