PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

Varun Kotte

arxiv: 2605.18812 · v1 · pith:KS4D4PA6new · submitted 2026-05-12 · 💻 cs.LG · cs.CL · cs.IR

Pith reviewed 2026-05-20 21:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.IR

keywords: conformal prediction, joint coverage, NLP pipelines, LLM systems, uncertainty quantification, multi-stage models, distribution-free guarantees, split conformal

The pith

PASC reduces multi-stage joint coverage in NLP and LLM pipelines to a single split conformal prediction problem on the joint maximum nonconformity score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern NLP and LLM systems consist of chained stages where errors accumulate, yet standard conformal prediction either treats stages separately or applies loose union bounds. The paper presents PASC, which folds the entire pipeline into one scalar nonconformity score by taking the maximum across stages and then runs ordinary split conformal prediction on that score. This produces a finite-sample, distribution-free guarantee that every stage meets its coverage target simultaneously with probability at least 1 minus alpha. Experiments on a three-stage NER-NED-typing pipeline show 96.4 percent end-to-end coverage at the same average set size as the alternatives, and the method continues to hold coverage under distribution shift while independent calibration collapses. The same reduction is claimed to work directly for retrieval-augmented generation and agentic LLM chains.

Core claim

PASC reduces the multi-stage joint coverage problem to a single scalar conformal prediction problem on the joint maximum nonconformity score. It thereby supplies a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha and is nearly tight up to a 1/(n+1) factor.

What carries the argument

The joint maximum nonconformity score, formed by taking the worst-case nonconformity across every stage of the pipeline and feeding that single scalar into ordinary split conformal calibration.
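
The reduction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the per-stage nonconformity scores are assumed to be precomputed and on a comparable scale, and the threshold uses the standard split conformal rank ceil((n+1)(1-alpha)).

```python
import math
from typing import List, Sequence


def joint_max_scores(stage_scores: Sequence[Sequence[float]]) -> List[float]:
    """Collapse each example's K per-stage scores into one scalar (the max)."""
    return [max(row) for row in stage_scores]


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n+1)(1-alpha))-th smallest score.

    If that rank exceeds n, the calibration set is too small for the
    requested level and the threshold is +infinity.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def pasc_threshold(cal_stage_scores: Sequence[Sequence[float]], alpha: float) -> float:
    """Single quantile computation on the joint maximum score."""
    return conformal_quantile(joint_max_scores(cal_stage_scores), alpha)


def all_stages_covered(test_stage_scores: Sequence[float], q_hat: float) -> bool:
    """True when every stage's score is within the shared threshold."""
    return max(test_stage_scores) <= q_hat
```

The design point is that only one sort over n scalars is needed, regardless of the number of stages K.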

If this is right

All K stages receive simultaneous coverage with probability at least 1-alpha.
The method uses only one quantile computation and runs faster than Bonferroni-adjusted conformal prediction.
Coverage remains near target under distribution shift where per-stage calibration fails.
The same reduction scales to at least six stages while independent conformal prediction coverage drops sharply.
The construction applies without change to retrieval-augmented generation and multi-step agent pipelines.
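
To make the contrast between the three calibration strategies concrete, the sketch below computes thresholds for independent per-stage CP, a Bonferroni split of alpha across stages, and a single threshold on the joint maximum. All names are hypothetical, the synthetic scores from `simulate_scores` are placeholders rather than anything from the paper, and the scores are assumed to share a common scale so that taking a maximum is meaningful.

```python
import math
import random
from typing import List, Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """ceil((n+1)(1-alpha))-th smallest score, or +inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def independent_thresholds(cal: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """One quantile per stage, each at level alpha (no joint guarantee)."""
    num_stages = len(cal[0])
    return [conformal_quantile([row[k] for row in cal], alpha) for k in range(num_stages)]


def bonferroni_thresholds(cal: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """One quantile per stage, each at level alpha / K (union bound)."""
    return independent_thresholds(cal, alpha / len(cal[0]))


def pasc_thresholds(cal: Sequence[Sequence[float]], alpha: float) -> List[float]:
    """A single quantile of the per-example maximum, shared by all stages."""
    q_hat = conformal_quantile([max(row) for row in cal], alpha)
    return [q_hat] * len(cal[0])


def joint_coverage(test: Sequence[Sequence[float]], thresholds: Sequence[float]) -> float:
    """Fraction of test examples on which every stage is within its threshold."""
    hits = sum(all(s <= t for s, t in zip(row, thresholds)) for row in test)
    return hits / len(test)


def simulate_scores(n: int, num_stages: int, seed: int) -> List[List[float]]:
    """Synthetic i.i.d. uniform scores, for illustration only."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(num_stages)] for _ in range(n)]
```

Calling `joint_coverage(test, independent_thresholds(cal, 0.1))` on such data shows how per-stage calibration can undershoot a joint target, while the other two strategies use thresholds that are at least as large for every stage.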

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with existing per-stage calibration techniques to produce hybrid guarantees that tighten when some stages are easier to certify.
In practice it suggests auditing the exchangeability of the max-score sequence on held-out data before deployment.
The reduction may extend to other sequential settings such as multi-step planning or tool-use loops where a joint failure probability matters.
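
One possible form of the exchangeability audit mentioned above is a two-sample permutation test on the joint maximum scores. The sketch below is an assumption of this note, not a procedure from the paper: it compares calibration and held-out scores through the absolute difference of their means, so it can only flag mean shifts, and a large p-value does not establish exchangeability.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    cal_scores: Sequence[float],
    heldout_scores: Sequence[float],
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """Permutation p-value for a difference in mean joint-max score.

    Under exchangeability the calibration/held-out labels are arbitrary,
    so reshuffling them should produce differences as large as the
    observed one reasonably often.
    """
    rng = random.Random(seed)
    observed = abs(mean(cal_scores) - mean(heldout_scores))
    pooled = list(cal_scores) + list(heldout_scores)
    n_cal = len(cal_scores)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_cal]) - mean(pooled[n_cal:]))
        at_least_as_extreme += stat >= observed
    # The +1 terms count the observed split itself as one permutation.
    return (1 + at_least_as_extreme) / (1 + n_perm)
```

A small p-value on held-out data would be a reason to distrust the coverage guarantee for that deployment setting.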

Load-bearing premise

The joint maximum nonconformity score must itself be exchangeable so that split conformal prediction applies directly to it.

What would settle it

Empirical coverage falling below 1-alpha on a pipeline whose stage-wise nonconformity scores are constructed from non-exchangeable data, for example by training successive stages on disjoint non-i.i.d. corpora.
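
A toy simulation illustrates what such a failure would look like. Everything here is an assumption for illustration: joint maximum scores are drawn as the max of K uniform variables, and non-exchangeability is modeled crudely by adding a constant `shift` to the test score only.

```python
import math
import random
from typing import Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """ceil((n+1)(1-alpha))-th smallest score, or +inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def marginal_coverage(
    shift: float,
    num_stages: int = 3,
    n_cal: int = 50,
    alpha: float = 0.1,
    trials: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate P(s_test <= q_hat) over fresh calibration/test draws."""
    rng = random.Random(seed)

    def draw_joint_max() -> float:
        return max(rng.random() for _ in range(num_stages))

    covered = 0
    for _ in range(trials):
        cal = [draw_joint_max() for _ in range(n_cal)]
        test_score = draw_joint_max() + shift  # shift = 0 keeps exchangeability
        covered += test_score <= conformal_quantile(cal, alpha)
    return covered / trials
```

With `shift=0.0` the estimate should sit near the target level; increasing `shift` breaks the premise and drives coverage down.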

Original abstract

Modern NLP and LLM systems are pipelines: named entity recognition (NER) -> entity disambiguation (NED) -> entity typing, retrieval-augmented generation (retriever -> reader), and agentic chains of planner -> tool -> critic. Errors compound across stages, but existing uncertainty quantification methods either calibrate each stage independently (no joint coverage) or apply a Bonferroni union bound (joint coverage, but conservative). We present PASC (Pipeline-Aware Split Conformal), which reduces multi-stage joint coverage to a single scalar conformal prediction problem on the joint maximum nonconformity score. PASC provides a finite-sample distribution-free guarantee that all K stages are simultaneously covered with probability at least 1 - alpha, and is nearly tight up to a 1/(n+1) factor. On a three-stage NER -> NED -> entity-typing pipeline over CoNLL-2003, PASC achieves 96.4% end-to-end coverage versus 93.4% for Bonferroni and 86.5% for independent CP, at identical average prediction set size (1.083). Under distribution shift to WNUT-17 Twitter and WikiNEuRal Wikipedia data, PASC empirically maintains the target coverage in the tested shift settings while independent CP collapses to 59%. PASC requires a single quantile computation, runs 1.7x faster than Bonferroni, and scales to K = 6 stages where independent CP drops to 0.53 end-to-end coverage. The same joint-maximum-score reduction applies directly to compound LLM systems and agent pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PASC reduces joint coverage for multi-stage pipelines to standard split conformal on the max nonconformity score, delivering tighter guarantees than Bonferroni with matching empirical results.

The main point is that this paper gives a clean reduction for getting simultaneous coverage across pipeline stages without the usual conservatism. By taking the maximum nonconformity score over all K stages for each calibration point and then running ordinary split conformal on that scalar, you get a finite-sample guarantee that every stage is covered jointly at level 1-alpha. The exchangeability carries over directly because the max operation preserves it under the usual assumptions, and the 1/(n+1) factor is the standard one. That is the actual novelty here, not just another Bonferroni variant.

On the three-stage NER-NED-typing pipeline over CoNLL-2003 it hits 96.4% end-to-end coverage at the same average set size as the baselines, beats Bonferroni by a few points, and stays closer to target under the WNUT and WikiNEuRal shifts while independent per-stage CP falls apart. It also scales to K=6 without the coverage collapsing. Those numbers are concrete and the runtime claim (1.7x faster than Bonferroni) is a nice practical plus. The central argument holds up on the evidence given.

The soft spots are modest. The abstract leaves the precise definition of the per-stage nonconformity functions and how they are computed in the full pipeline a bit implicit, so the full paper needs to show the exact construction and any edge cases for compound LLM setups. The experiments stay within named-entity pipelines; broader agentic or retrieval-augmented cases would strengthen the claim but are not required for a first paper.

This is for people who actually deploy multi-stage NLP or LLM systems and care about joint reliability rather than per-module metrics. A reader working on conformal methods or trustworthy pipelines will find the reduction and the shift experiments useful. It is coherent on its own terms and the math is standard once the max reduction is accepted, so it deserves a serious referee. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper proposes PASC (Pipeline-Aware Split Conformal), a method that reduces joint coverage across K stages of an NLP or LLM pipeline to ordinary split conformal prediction applied to the per-example maximum nonconformity score. It claims a finite-sample, distribution-free guarantee that all stages are simultaneously covered with probability at least 1-alpha, nearly tight up to the usual 1/(n+1) factor. Experiments on a three-stage NER-NED-typing pipeline over CoNLL-2003 report 96.4% end-to-end coverage (vs. 93.4% Bonferroni, 86.5% independent CP) at comparable set size, with maintained coverage under shifts to WNUT-17 and WikiNEuRal while independent CP drops sharply; the approach is also shown to scale to K=6 stages.

Significance. If the reduction is valid, PASC supplies a lightweight, non-conservative route to joint coverage guarantees for compound pipelines, which is practically relevant for reliable multi-stage NLP and agentic LLM systems. The empirical demonstration of improved coverage and robustness under shift, together with the single-quantile computational simplicity, would constitute a useful contribution to conformal prediction for structured prediction pipelines.

major comments (2)

[Abstract / Method] Abstract and method section: the finite-sample guarantee is asserted via reduction to split CP on the joint-maximum score, but the manuscript provides no explicit derivation steps or exchangeability argument showing why s_i = max_k nonconformity_k(i) inherits the required exchangeability from the underlying calibration and test points. This step is load-bearing for the central claim.
[Experiments] Experiments, CoNLL-2003 results: coverage figures (96.4%, 93.4%, 86.5%) are reported without error bars, standard deviations, or number of random splits; this makes it impossible to assess whether the observed gap over Bonferroni is statistically reliable or sensitive to calibration-set choice.

minor comments (2)

[Abstract] Abstract: the nonconformity score definition for each stage (e.g., for NER, NED, typing) should be stated explicitly rather than left implicit.
[Method] The paper should include a short pseudocode or algorithmic box showing the single quantile computation for the max-score conformal set.
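
As a rough idea of what such an algorithmic box might contain, the sketch below builds per-stage prediction sets from one shared threshold. The nonconformity score 1 - p (one minus the model's probability for a label) is an assumption of this note, since the per-stage scores are not specified in the material reviewed here; the function names and label strings are likewise hypothetical.

```python
from typing import Dict, List


def prediction_set(probs: Dict[str, float], q_hat: float) -> List[str]:
    """Labels whose assumed score 1 - p does not exceed the threshold."""
    return sorted(label for label, p in probs.items() if 1.0 - p <= q_hat)


def pipeline_sets(stage_probs: List[Dict[str, float]], q_hat: float) -> List[List[str]]:
    """Apply the same calibrated threshold q_hat to every stage."""
    return [prediction_set(probs, q_hat) for probs in stage_probs]


# Hypothetical two-stage example: entity type, then candidate entity ID.
example = [
    {"PER": 0.90, "ORG": 0.08, "LOC": 0.02},
    {"Q1": 0.60, "Q2": 0.38, "Q3": 0.02},
]
sets_at_half = pipeline_sets(example, q_hat=0.5)
```

Here `q_hat` stands for the single quantile of the calibration max-scores; a larger threshold can only enlarge each stage's set.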

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, positive assessment of the contribution, and recommendation for minor revision. The comments identify two areas where the manuscript can be strengthened: an explicit derivation of the central theoretical reduction and improved reporting of experimental variability. We address each point below and will incorporate the changes in the revised version.

Referee: [Abstract / Method] Abstract and method section: the finite-sample guarantee is asserted via reduction to split CP on the joint-maximum score, but the manuscript provides no explicit derivation steps or exchangeability argument showing why s_i = max_k nonconformity_k(i) inherits the required exchangeability from the underlying calibration and test points. This step is load-bearing for the central claim.

Authors: We agree that the exchangeability argument for the joint-maximum nonconformity score requires an explicit derivation. Let the calibration examples and the test example be exchangeable, and let the per-stage score functions be fixed before the calibration data are used. For each example i the scalar score is defined as s_i = max_{k=1,…,K} s_i^{(k)}, where s_i^{(k)} is the nonconformity score of stage k. The map that first computes the vector of per-stage scores and then takes the maximum over its K coordinates is one fixed function applied identically to every example, and applying the same fixed function to each element of an exchangeable sequence preserves exchangeability. Consequently the sequence s_1, …, s_{n+1} remains exchangeable. Standard split conformal prediction applied to these scalar scores therefore yields P(s_test ≤ q_{1-α}) ≥ 1-α, where q_{1-α} is the ⌈(n+1)(1-α)⌉-th smallest calibration score. Because s_test ≤ q_{1-α} holds if and only if s_test^{(k)} ≤ q_{1-α} for every stage k, this is exactly the event that every stage is covered. We will insert a short lemma (or expanded paragraph) in the Methods section that states this argument step by step, including the preservation of exchangeability under the max operation. revision: yes
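
The argument in this response can be written out as a short derivation. The symbols $Z_i$ (the i-th example) and $\phi$ (the example-to-score map) are introduced here for illustration and are not taken from the paper; the stage score functions $s^{(k)}$ are assumed to be fixed before the calibration data are seen, and the upper bound in the last line additionally assumes that ties among scores occur with probability zero.

```latex
\begin{align*}
s_i &= \phi(Z_i) := \max_{1 \le k \le K} s^{(k)}(Z_i),
  \qquad i = 1, \dots, n+1, \\
(Z_1, \dots, Z_{n+1}) \text{ exchangeable}
  &\;\Longrightarrow\; (s_1, \dots, s_{n+1}) \text{ exchangeable}, \\
q_{1-\alpha} &:= \text{the } \lceil (n+1)(1-\alpha) \rceil\text{-th smallest of } s_1, \dots, s_n
  \quad (+\infty \text{ if this rank exceeds } n), \\
\{\, s_{n+1} \le q_{1-\alpha} \,\}
  &= \bigcap_{k=1}^{K} \bigl\{\, s^{(k)}(Z_{n+1}) \le q_{1-\alpha} \,\bigr\}, \\
1 - \alpha \;\le\; \Pr\bigl( s_{n+1} \le q_{1-\alpha} \bigr)
  &\;\le\; 1 - \alpha + \frac{1}{n+1}.
\end{align*}
```

The fourth line is the step that turns scalar coverage into joint coverage: the maximum is below the threshold exactly when every coordinate is.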
Referee: [Experiments] Experiments, CoNLL-2003 results: coverage figures (96.4%, 93.4%, 86.5%) are reported without error bars, standard deviations, or number of random splits; this makes it impossible to assess whether the observed gap over Bonferroni is statistically reliable or sensitive to calibration-set choice.

Authors: We acknowledge that the reported coverage numbers are presented without measures of variability, which limits assessment of their stability across calibration-set choices. The figures in the current manuscript were obtained from a single random split. In the revision we will repeat the CoNLL-2003 experiments over multiple random splits (10 repetitions), report mean coverage together with standard deviations, and explicitly state the number of splits used. The same protocol will be applied to the distribution-shift experiments. revision: yes
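
The repeated-split protocol described in this response could be sketched as follows. This is an illustrative outline only: the function name, the pooling of all joint maximum scores into one list, and the default of 10 splits are assumptions made here to mirror the stated plan, not details confirmed by the paper.

```python
import math
import random
import statistics
from typing import List, Sequence


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """ceil((n+1)(1-alpha))-th smallest score, or +inf if n is too small."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]


def coverage_over_splits(
    joint_scores: Sequence[float],
    n_cal: int,
    alpha: float,
    n_splits: int = 10,
    seed: int = 0,
) -> List[float]:
    """Empirical joint coverage on each of several random cal/test splits."""
    rng = random.Random(seed)
    coverages = []
    for _ in range(n_splits):
        shuffled = list(joint_scores)
        rng.shuffle(shuffled)
        cal, test = shuffled[:n_cal], shuffled[n_cal:]
        q_hat = conformal_quantile(cal, alpha)
        coverages.append(sum(s <= q_hat for s in test) / len(test))
    return coverages
```

Reporting `statistics.mean(coverages)` together with `statistics.stdev(coverages)` gives the mean and standard deviation across splits that the referee asks for.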

Circularity Check

0 steps flagged

No significant circularity

The paper's derivation consists of a direct reduction of the multi-stage joint coverage problem to ordinary split conformal prediction applied to the scalar s_i = max_k nonconformity_k(i). Because the maximum operator preserves exchangeability when the underlying per-stage scores are exchangeable, the standard split-CP quantile on these scalars yields the claimed simultaneous coverage guarantee as a straightforward consequence of existing theory. No equations or steps reduce the result back to a fitted parameter, a self-citation chain, or an ansatz imported from the authors' prior work; the 1/(n+1) tightness factor is the usual split-CP bound and does not introduce new circularity. The method is therefore self-contained against the standard conformal prediction literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard exchangeability assumption of conformal prediction applied to the derived maximum score; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption: Calibration and test points are exchangeable
Required for the finite-sample guarantee of split conformal prediction on the joint maximum score.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Mitigating LLM hallucinations via conformal abstention, 4 2024

Abbasi-Yadkori, Y., Kuzborskij, I., Stutz, D., György, A., Fisch, A., Doucet, A., Beloshapka, I., Weng, W.-H., Yang, Y.-Y., Szepesvári, C., Cemgil, A. T., and Tomasev, N. Mitigating LLM hallucinations via conformal abstention. arXiv preprint arXiv:2405.01563,
[2]

Theoretical Foundations of Conformal Prediction

Angelopoulos, A. N., Barber, R. F., and Bates, S. Theoretical foundations of conformal prediction. arXiv preprint arXiv:2411.11824, 2024a. Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. In International Conference on Learning Representations (ICLR), 2024b. arXiv:2208.02814. Angelopoulos, A. N., Bates, S., Cand...
[3]

Barber, R

doi: 10.1214/24-AOAS1998. Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA,
[4]

De Cao, N., Izacard, G., Riedel, S., and Petroni, F

arXiv:2405.01976. De Cao, N., Izacard, G., Riedel, S., and Petroni, F. Autoregressive entity retrieval. In International Conference on Learning Representations (ICLR),
[5]

Conformal prediction with conditional guarantees

Gibbs, I., Cherian, J. J., and Candès, E. J. Conformal prediction with conditional guarantees. arXiv preprint arXiv:2305.12616,
[6]

PromptPort: A reliability layer for cross-model structured extraction

Kotte, V. PromptPort: A reliability layer for cross-model structured extraction. arXiv preprint arXiv:2601.06151, 2026a. Kotte, V. UCCI: Uncertainty-calibrated confidence intervals for LLM extraction, 2026b. In preparation. Kotte, V. et al. Multi-stage entity recognition and resolution pipeline for production information extraction,
[7]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692,
[8]

Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E

arXiv:2402.10978. Nickel, M., Murphy, K., Tresp, V., and Gabrilovich, E. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE,
[9]

Pradhan, S., Moschitti, A., Xue, N., Ng, H

arXiv:2106.09848. Pradhan, S., Moschitti, A., Xue, N., Ng, H. T., Björkelund, A., Uryupina, O., Zhang, Y., and Zhong, Z. Towards robust linguistic analysis using OntoNotes. In Computational Natural Language Learning (CoNLL),
[10]

S., Dernoncourt, F., Sultania, D., Bagga, K., Zhang, M., Bui, T., and Kotte, V

Sharma, S., Yoon, D. S., Dernoncourt, F., Sultania, D., Bagga, K., Zhang, M., Bui, T., and Kotte, V. Retrieval augmented generation for domain-specific question answering. In AAAI 2024 Workshop on Scientific Document Understanding,
[11]

S., Dernoncourt, F., Sultania, D., Bagga, K., Zhang, M., Bui, T., and Kotte, V

arXiv:2404.14760. Tedeschi, S., Maiorca, V., Ciccarella, N., Esuli, A., Sebastiani, F., and Navigli, R. WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER. In Findings of the Association for Computational Linguistics (EMNLP),
[12]

Tjong Kim Sang, E. F. and De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Computational Natural Language Learning (CoNLL),
[13]

Permutation mean

A. Sanity Checks and Empirical Validation

A.1. Permutation Test for CP Validity (Experiment E0)

To empirically validate that our CP implementation correctly implements the finite-sample guarantee, we performed a permutation test. We pooled calibration (n = 1000) and test (n = 500) NER nonconformity scores and performed K = 200 random re-splits, recomputing...
[14]

n_cal   Method       E2E Cov.       Avg Set Size
200     Indep CP     0.870±0.017    1.005±0.009
200     Bonferroni   0.964±0.017    1.271±0.091
200     PASC         0.934±0.000    1.122±0.000
500     Indep CP     0.865±0.004    1.000±0.000
500     Bonferroni   0.939±0.008    1.142±0.028
500     PASC         0.934±0.000    1.122±0.000
1000    Indep CP     0.865±0.005    1.083±0.000
1000    Bonferroni   0.934±0.001    1.083±0.000
1000    PASC         0.964±0.000    1.083±0.000

C. Full Num...
