pith. machine review for the scientific record.

arxiv: 2604.18245 · v2 · submitted 2026-04-20 · 💻 cs.LG

Recognition: unknown

Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM protocols · correction rate · corruption rate · error flow · protocol auditing · mixture shift · composable steps · accuracy prediction

The pith

LLM protocol steps can be audited and composed by tracking separate rates for fixing errors and breaking correct answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a paired measurement for any single step in an LLM protocol that records whether an answer was correct before the step and after it. From these two bits it computes a correction rate c, the chance a wrong answer becomes right, and a corruption rate γ, the chance a right answer becomes wrong. These rates directly predict the net accuracy change after the step and remain stable when conditioned on a difficulty proxy to handle shifts in the mix of easy and hard problems. The same rates support a test for whether the step carries enough state to be chained reliably into longer pipelines. When the rates pass these checks the step becomes a reusable module that can be turned on or off based on expected gain.

Core claim

We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit E0 and a post-step correctness bit E1, separating correction from corruption through two rates: c = Pr(E1 = 1 | E0 = 0) and γ = Pr(E1 = 0 | E0 = 1). These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. Under mixture shift the pooled estimates become biased, but conditioning on a difficulty proxy restores stability. Under state insufficiency, a Markov factorization test identifies when composition is valid and where additional state is needed.
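The rate estimates behind this claim can be computed directly from the paired bits; the Jeffreys smoothing mentioned in the paper's figures adds a Beta(1/2, 1/2) pseudo-count to each rate. A minimal, illustrative sketch (function and variable names are ours, not the paper's):

```python
# Sketch: Jeffreys-smoothed estimates of the correction rate c and
# corruption rate gamma from paired correctness bits (E0, E1).
# Names here are illustrative assumptions, not the paper's code.

def estimate_rates(pairs, smoothing=0.5):
    """pairs: iterable of (e0, e1) bits; returns (c_hat, gamma_hat).

    c_hat     ≈ Pr(E1 = 1 | E0 = 0)  (a wrong answer gets fixed)
    gamma_hat ≈ Pr(E1 = 0 | E0 = 1)  (a right answer gets broken)
    Jeffreys smoothing adds a Beta(1/2, 1/2) pseudo-count to each rate.
    """
    n0 = n1 = fixed = broken = 0
    for e0, e1 in pairs:
        if e0 == 0:
            n0 += 1
            fixed += (e1 == 1)
        else:
            n1 += 1
            broken += (e1 == 0)
    c_hat = (fixed + smoothing) / (n0 + 2 * smoothing)
    gamma_hat = (broken + smoothing) / (n1 + 2 * smoothing)
    return c_hat, gamma_hat
```

With no observations in a cell, the estimate falls back to the prior mean 1/2 rather than dividing by zero, which is one practical reason for the smoothing.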

What carries the argument

The paired correctness bits E0 and E1 that produce the correction rate c and corruption rate γ for auditing and composing each protocol step.

If this is right

  • Net accuracy after one step equals initial accuracy plus c times the initial error fraction minus γ times the initial accuracy.
  • A step is activated only when its estimated gain from c and γ is positive.
  • Conditioning the rates on a difficulty proxy removes bias from changes in the proportion of hard versus easy problems.
  • A Markov test on the correctness sequence identifies when additional state is needed for reliable multi-step composition.
  • Steps that pass the diagnostics form auditable modules whose pipeline accuracy is predicted by composing the one-step updates implied by each step's rates.
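The first two bullets are the one-step identity p1 = p0(1 − γ) + (1 − p0)·c and a gain gate on top of it. A minimal sketch (function names are ours, assumed for illustration):

```python
# Sketch: the one-step accuracy identity and a gain-based gate.
# The identity follows from the law of total probability over E0.

def predicted_accuracy(p0, c, gamma):
    """Predicted post-step accuracy: p1 = p0*(1 - gamma) + (1 - p0)*c."""
    return p0 * (1 - gamma) + (1 - p0) * c

def activate_step(p0, c, gamma):
    """Run the step only if its expected net gain is positive:
    gain = c*(1 - p0) - gamma*p0, i.e. corrections outweigh corruptions."""
    return c * (1 - p0) - gamma * p0 > 0
```

Note the gate depends on the baseline accuracy p0: the same (c, γ) pair can warrant running a step on a hard mixture and suppressing it on an easy one.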

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same paired measurement could be applied to non-exact-match tasks by replacing the binary bits with graded correctness scores.
  • Protocol designers could search for new steps that improve the pair (c, γ) on a validation set before deployment.
  • In production the rates could be re-estimated periodically on recent traffic to decide whether to keep or drop a step.

Load-bearing premise

The binary correctness bits before and after a step carry enough history to keep the rates stable and transferable across mixtures and pipeline positions without extra state.

What would settle it

Measure actual accuracy change on a held-out mixture or multi-step pipeline and compare it to the change predicted from the estimated c and γ; any systematic mismatch shows the rates do not transfer.
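Assuming (c, γ) were estimated on a calibration set, this settling test reduces to comparing empirical held-out post-step accuracy against the identity's prediction. An illustrative sketch (names are ours):

```python
# Sketch: transfer check for calibrated (c, gamma) on a held-out set.
# A systematically large gap means the rates do not transfer.

def transfer_gap(holdout_pairs, c, gamma):
    """holdout_pairs: list of (e0, e1) correctness bits on held-out data.
    Returns empirical minus predicted post-step accuracy."""
    n = len(holdout_pairs)
    p0 = sum(e0 for e0, _ in holdout_pairs) / n       # baseline accuracy
    p1_emp = sum(e1 for _, e1 in holdout_pairs) / n   # observed post accuracy
    p1_pred = p0 * (1 - gamma) + (1 - p0) * c         # one-step identity
    return p1_emp - p1_pred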

Figures

Figures reproduced from arXiv: 2604.18245 by Fernando Reitich.

Figure 1: Phase diagram over synthetic slices for alt (Jeffreys-smoothed transition-rate estimates). Each point is one (helper → target, slice) estimate plotted at (γ̂, ĉ), where ĉ = Pr(E1 = 1 | E0 = 0, slice) and γ̂ = Pr(E1 = 0 | E0 = 1, slice) are estimated from paired outcomes with Jeffreys smoothing. The slices comprise the arithmetic depth bins together with the 2×2 synthetic family. Estimates are pooled over …
Figure 2: alt seed-holdout depth profiles under Jeffreys smoothing (train seeds 123/124; test seed 125). Left: llama3.2 → mistral. Right: mistral → llama3.2. For each depth bin d, the solid curve shows empirical post accuracy p̂1(d) on the held-out seed. The dashed curve plots one-step predictions using a pooled (ĉ, γ̂) fit on the training seeds, evaluated at the held-out p̂0(d). The dotted curve plots predictions using …
Figure 3: Mixture-shift stress test for ALT (Jeffreys), shown for one representative seed-holdout fold of the directed pair mistral → llama3.2 (calibration: seeds 123/124; test: seed 125). "Pooled" uses a single (ĉ, γ̂) estimated on the calibration mixture; "depth-conditioned" estimates (ĉ(d), γ̂(d)) within each depth and aggregates predictions under the test mixture. Left: mean absolute error (MAE). Right: mean sig…
Figure 4: VER judge diagnostics (Jeffreys). Each point corresponds to one seed; axes are …
Figure 5: Interface comparison across regimes (Jeffreys), shown in …
Figure 6: Phase diagram over synthetic slices for judge-K on synthetic tasks (K = 2, Jeffreys). Each point is one (solver → judge, slice) estimate plotted at (γ̂, ĉ), inferred from paired pre/post outcomes under judge selection. The slices comprise the arithmetic depth bins together with the 2×2 synthetic family. Overlaid lines show the phase boundary c = (p/(1−p))·γ for representative baseline accuracies p. To make t…
Figure 7: Order-robust convergence on GSM8K under random permutations of the judged set. (a) Rates …
Figure 8: Transfer diagnostics for pooled (ĉ, γ̂) on GSM8K under judge-K. (a) Held-out transfer error versus training fraction, showing MAE and |bias| of p̂1^emp − p̂1^pred over repeated random splits. (b) Residuals by question-length bin with error bars ±1.96 SE for p̂1^emp(s) (binomial), plotting p̂1^emp(s) − p̂1^pred(s). …
Figure 9: Position-bias audit on GSM8K under judge-K (K = 4, 200 items). We compare the standard presentation order (candidate 1 is the baseline attempt) to a deterministic per-item posshuffle that permutes candidate order before judging and maps selections back to original indices. Left: distribution over original chosen candidates; standard ordering concentrates strongly on original candidate 1, whereas posshuffle…
Figure 10: GSM8K judge-K = 4: accuracy by candidate-set stratum on the 200-item audit subset. Items are partitioned into no diversity, diversity without oracle headroom, and oracle headroom. "Anchor" is the baseline candidate, "Baseline judge-K" is standard ordering, "Posshuffle judge-K" is deterministic per-item reshuffling before judging, and "Oracle best-of-K" is the empirical best achievable accuracy on the real…
Figure 11: Phase structure of refinement regimes. Left: Empirical …
Figure 12: Two-step composition seed-holdout depth profile (Jeffreys). Calibration uses training seeds …
read the original abstract

Large language models are increasingly deployed as protocols: structured multi-call procedures that spend additional computation to transform a baseline answer into a final one. These protocols are evaluated only by end-to-end accuracy, giving limited insight into when they help, when they hurt, and whether their behavior transfers under distribution shift or composition. We propose a paired-outcome measurement interface for auditing a single protocol step on exact-match tasks. For each instance, the interface records a baseline correctness bit $E_0\in\{0,1\}$ and a post-step correctness bit $E_1\in\{0,1\}$, separating correction ($E_0=0\to E_1=1$) from corruption ($E_0=1\to E_1=0$) through two rates: $c=\Pr(E_1=1\mid E_0=0)$ and $\gamma=\Pr(E_1=0\mid E_0=1)$. These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines. We identify three failure mechanisms. Under mixture shift, pooled estimates of $(c,\gamma)$ become biased when calibration and deployment mixtures differ; conditioning on a difficulty proxy restores stability without additional model calls. Under presentation contamination, selection protocols alter the interface through stable presentation artifacts when candidate content is fixed. Under state insufficiency, the correctness bit may not carry enough history for multi-step pipelines to compose predictably; a Markov factorization test identifies when composition is valid and where additional state is needed. When a protocol step passes these diagnostics, it becomes an auditable module: gated by estimated gain, conditioned on a difficulty proxy to correct mixture bias, and composed into multi-step pipelines with predictable accuracy. We demonstrate these ideas on synthetic mathematical tasks and on GSM8K, where the calibrated interface correctly predicts when protocol steps should be activated or suppressed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a paired-outcome interface for auditing individual steps in LLM protocols on exact-match tasks. For each instance it records baseline correctness E0 and post-step correctness E1, defines correction rate c = Pr(E1=1 | E0=0) and corruption rate γ = Pr(E1=0 | E0=1), and claims these rates predict accuracy changes, identify three failure modes (mixture shift, presentation contamination, state insufficiency), and enable gated, conditioned, and composable protocol modules. The approach is demonstrated on synthetic mathematical tasks and GSM8K, where the calibrated rates correctly indicate when steps should be activated or suppressed.

Significance. If the rates prove stable under the proposed diagnostics, the work supplies a concrete empirical interface for decomposing protocol behavior beyond end-to-end accuracy. Its practical contributions are the reusable measurement of (c, γ), the difficulty-proxy correction for mixture bias, and the Markov factorization test for composition validity, which together could support modular, auditable LLM pipelines.

major comments (2)
  1. [Abstract and §2] Abstract and §2 (interface definition): the claim that the rates 'predict accuracy changes' is a direct algebraic consequence of the law of total probability, p1 = p0(1-γ) + (1-p0)c. The manuscript should state this identity explicitly and separate it from the independent empirical claims (stability of (c, γ) across seeds and mixtures, transfer to GSM8K).
  2. [State insufficiency section] Section on state insufficiency (high-level description of Markov test): the test is presented without an explicit factorization equation (e.g., the precise condition Pr(Ek | E_{k-1}) = Pr(Ek | E_{k-1}, E_{k-2}, …)) or implementation details such as the number of steps used for validation or the statistical threshold. Because the multi-step composition claim rests on E0/E1 being a sufficient statistic, this omission is load-bearing.
minor comments (2)
  1. [Notation] Notation for c and γ should be introduced once and used uniformly in all equations, tables, and figure captions.
  2. [Experiments] GSM8K experiments would be strengthened by reporting standard errors or bootstrap intervals on the estimated rates to quantify stability across random seeds.
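The factorization condition flagged in major comment 2, Pr(E_k | E_{k−1}, E_{k−2}, …) = Pr(E_k | E_{k−1}), can be checked with a likelihood-ratio statistic over observed bit triples. A hedged sketch, since the paper's exact test construction is not given in the text above:

```python
# Sketch: likelihood-ratio test of the first-order Markov property on
# correctness-bit triples. Names and construction are ours, assumed for
# illustration; the paper's precise test may differ.
import math
from collections import Counter

def markov_lr_stat(triples):
    """triples: iterable of (e_prev2, e_prev, e_cur) correctness bits.
    Returns the likelihood-ratio statistic for
    H0: Pr(E_k | E_{k-1}, E_{k-2}) = Pr(E_k | E_{k-1}).
    Under H0 it is asymptotically chi-square with 2 degrees of freedom,
    so an alpha = 0.05 rejection threshold is roughly 5.99."""
    counts = Counter(triples)

    def loglik(group_key):
        # group_key maps the conditioning bits to a conditioning class
        groups = {}
        for (a, b, e), n in counts.items():
            groups.setdefault(group_key(a, b), Counter())[e] += n
        total = 0.0
        for g in groups.values():
            n = sum(g.values())
            for cnt in g.values():
                if cnt:
                    total += cnt * math.log(cnt / n)
        return total

    ll_full = loglik(lambda a, b: (a, b))  # condition on both history bits
    ll_null = loglik(lambda a, b: b)       # condition on E_{k-1} only
    return 2.0 * (ll_full - ll_null)
```

A near-zero statistic is consistent with the correctness bit carrying enough state for composition; a large value indicates the extra history bit matters and more state is needed.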

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments. We address each major point below, agreeing on the need for explicit separation and additional formalization, and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §2] Abstract and §2 (interface definition): the claim that the rates 'predict accuracy changes' is a direct algebraic consequence of the law of total probability, p1 = p0(1-γ) + (1-p0)c. The manuscript should state this identity explicitly and separate it from the independent empirical claims (stability of (c, γ) across seeds and mixtures, transfer to GSM8K).

    Authors: We agree that the accuracy prediction p₁ = p₀(1 - γ) + (1 - p₀)c follows directly from the law of total probability and is not itself an empirical result. In the revision we will insert this identity explicitly in §2, label it as an algebraic identity, and clearly demarcate it from the subsequent empirical claims concerning stability of (c, γ) across seeds and mixtures as well as transfer to GSM8K. revision: yes

  2. Referee: [State insufficiency section] Section on state insufficiency (high-level description of Markov test): the test is presented without an explicit factorization equation (e.g., the precise condition Pr(Ek | E_{k-1}) = Pr(Ek | E_{k-1}, E_{k-2}, …)) or implementation details such as the number of steps used for validation or the statistical threshold. Because the multi-step composition claim rests on E0/E1 being a sufficient statistic, this omission is load-bearing.

    Authors: We accept that the current high-level description of the Markov factorization test omits the explicit conditional-independence equation and the concrete validation parameters. The revised section will state the precise Markov condition Pr(E_k | E_{k-1}, …, E_0) = Pr(E_k | E_{k-1}), specify that validation is performed on sequences of up to four steps, and report the statistical threshold (likelihood-ratio test at α = 0.05) used to accept or reject the factorization. These additions will make the test fully specified and directly support the composition claims. revision: yes

Circularity Check

1 step flagged

Accuracy 'prediction' via (c, γ) is definitional by total probability

specific steps
  1. fitted input called prediction [Abstract]
    "These rates predict accuracy changes and define a reusable empirical interface testable across seeds, mixtures, and pipelines."

    Let p = Pr(E0=1). Then Pr(E1=1) = c(1-p) + (1-γ)p exactly. Measuring c and γ from the same paired (E0, E1) outcomes therefore determines the accuracy change by algebraic identity; the 'prediction' is a tautological reparameterization rather than an independent forecast.

full rationale

The paper's central claim that the rates predict accuracy changes reduces directly to a re-expression of the observed paired correctness bits. By construction, post-step accuracy equals c(1-p) + (1-γ)p, so the claimed predictive interface adds no independent empirical content beyond the measurements themselves. This matches the fitted-input-called-prediction pattern. The remainder of the work (mixture conditioning, Markov test, gating) builds on this interface but does not escape the definitional core for the accuracy-prediction claim. No self-citations or other load-bearing circular steps appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on empirical rates derived from paired correctness bits with no explicit free parameters fitted beyond the rates themselves; it assumes binary correctness suffices for auditing and that a Markov property holds under sufficient state.

axioms (2)
  • domain assumption Binary correctness bits E0 and E1 are sufficient statistics for a protocol step's effect on accuracy.
    Invoked to treat c and γ as reusable across contexts and mixtures.
  • domain assumption A Markov factorization test can identify when composition is valid without additional state.
    Used to diagnose state insufficiency in multi-step pipelines.
invented entities (1)
  • Correction rate c and corruption rate γ (no independent evidence)
    purpose: Separate beneficial correction from harmful corruption in protocol steps.
    Defined directly from the paired-outcome interface; no independent falsifiable evidence outside the measurement is provided.

pith-pipeline@v0.9.0 · 5636 in / 1469 out tokens · 33798 ms · 2026-05-10T04:52:00.840531+00:00 · methodology

discussion (0)

