
arxiv: 2606.31371 · v1 · pith:MEZRKB2Anew · submitted 2026-06-30 · 💻 cs.LG · cs.AI · cs.CL

Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?

Pith reviewed 2026-07-01 06:01 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LLM agents · preference coupling · probability calibration · evaluator bias · feedback loops · TTRL · LLM-as-judge

The pith

Probability calibration on an LLM evaluator reduces preference coupling in agent feedback loops by 20-49 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether applying probability calibration to an evaluator's pairwise judgments can lessen how much those judgments bias the strategy an LLM agent learns through feedback. In a within-subjects experiment with five runs, standard binary win/loss updates are compared against calibrated probability-weighted updates using one model as executor and another as evaluator. The calibrated approach lowers the measured coupling coefficient and the divergence between resulting strategy distributions. A separate control with symmetric learning rates shows the reduction is not explained by changes in update balance alone. This positions calibration as a direct adjustment to the feedback signal itself.

Core claim

Applying probability calibration to the evaluator's pairwise judgments reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67% compared with standard binary TTRL. The within-subjects design with DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, together with a symmetric-LR control, attributes the reduction to calibration rather than reduced update asymmetry. The study presents the calibrated TTRL protocol as a lightweight mitigation for preference coupling in LLM agent feedback loops.

What carries the argument

Probability calibration applied to the evaluator's pairwise judgments, so that the agent receives probability-weighted updates instead of binary win/loss signals.
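
The difference between the two feedback signals can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released protocol: the logit-based strategy representation, the function names `binary_update` and `calibrated_update`, the learning rate, and the `2p - 1` weighting are all assumptions made for the example.

```python
import math


def softmax(logits):
    """Convert strategy logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_update(logits, winner, loser, lr=0.5):
    """Hypothetical binary update: the winner gains lr, the loser loses lr."""
    out = list(logits)
    out[winner] += lr
    out[loser] -= lr
    return out


def calibrated_update(logits, a, b, p_a_wins, lr=0.5):
    """Hypothetical probability-weighted update (an assumption, not the paper's rule).

    The step is scaled by 2p - 1, so p = 1.0 reproduces the binary update
    and p = 0.5 (an undecided evaluator) leaves the logits unchanged.
    """
    signal = 2.0 * p_a_wins - 1.0
    out = list(logits)
    out[a] += lr * signal
    out[b] -= lr * signal
    return out


logits = [0.0, 0.0, 0.0]  # three candidate strategies, initially equal
hard = binary_update(logits, winner=0, loser=1)
soft = calibrated_update(logits, a=0, b=1, p_a_wins=0.6)

print("binary     ->", [round(p, 3) for p in softmax(hard)])
print("calibrated ->", [round(p, 3) for p in softmax(soft)])
```

Under these assumptions, an evaluator that prefers strategy 0 with probability 0.6 moves the logits one fifth as far as a hard win would, so a weakly held preference leaves a proportionally weaker imprint on the strategy distribution.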

If this is right

  • The coupling coefficient gamma decreases by 20-49% when probability calibration replaces binary judgments.
  • Jensen-Shannon divergence between agent strategy distributions decreases by 45-67% under the same change.
  • The reduction in coupling persists after applying a symmetric learning rate control.
  • The calibrated TTRL protocol can be released and used as a lightweight adjustment in LLM-as-judge pipelines.
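
For readers who want to reproduce the second metric, the following is a rough sketch of Jensen-Shannon divergence and of the percent-reduction arithmetic behind figures such as "45-67%". The summary does not state which two distributions the paper compares, so the uniform reference and both example distributions below are illustrative placeholders, not values from the paper; the function names are likewise assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def percent_reduction(baseline, treated):
    """Relative reduction of `treated` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - treated) / baseline


# Placeholder strategy distributions (NOT data from the paper).
reference = [0.25, 0.25, 0.25, 0.25]
after_binary = [0.70, 0.10, 0.10, 0.10]
after_calibrated = [0.40, 0.20, 0.20, 0.20]

jsd_binary = js_divergence(reference, after_binary)
jsd_calibrated = js_divergence(reference, after_calibrated)
print("JSD (binary):    ", jsd_binary)
print("JSD (calibrated):", jsd_calibrated)
print("reduction (%):   ", percent_reduction(jsd_binary, jsd_calibrated))
```

Using base-2 logarithms keeps the divergence in the range 0 to 1, which makes percent reductions easier to interpret across runs.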

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the reduction holds for additional model pairs, calibration could be added as a default preprocessing step for any LLM judge.
  • The same calibration step might be tested on multi-turn or multi-agent feedback loops where coupling could accumulate over iterations.
  • Measuring whether lower coupling also improves final task performance would link the distribution metric to downstream outcomes.

Load-bearing premise

The symmetric-LR control and the fixed pairing of DeepSeek-V4-Pro as executor with GLM5.2 as evaluator are assumed to isolate the effect of probability calibration from model-specific biases and from the task distribution.

What would settle it

Repeating the within-subjects comparison on a new pair of models or task set and observing no reduction in gamma or Jensen-Shannon divergence would falsify the mitigation claim.

Original abstract

When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent's learned strategy distribution - a phenomenon termed evaluator preference coupling. Prior work has documented this coupling and established a diagnostic framework (EPC) to measure it, but has not investigated whether calibration techniques can mitigate the effect. We present the first study of evaluator calibration as mitigation: applying probability calibration to the evaluator's pairwise judgments to reduce spurious preference propagation. In a controlled within-subjects experiment (N=5) comparing standard binary TTRL (win/loss) with confidence-calibrated TTRL (probability-weighted updates) using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, we find that calibration reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67%. A symmetric-LR control confirms the effect is not due to reduced update asymmetry. We release the calibrated TTRL protocol and recommend it as a lightweight mitigation for LLM-as-judge deployment pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that applying probability calibration to an LLM evaluator's pairwise judgments in a TTRL feedback loop reduces evaluator preference coupling, as measured by the coupling coefficient gamma (20-49% reduction) and Jensen-Shannon divergence (45-67% reduction). This is demonstrated in a within-subjects experiment (N=5) comparing standard binary TTRL against confidence-calibrated TTRL, using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, with a symmetric-LR control to rule out update asymmetry.

Significance. If the quantitative mitigation effect is robustly confirmed, the work offers a lightweight, practical intervention for reducing bias propagation in LLM-as-judge pipelines, extending the EPC diagnostic framework with an actionable calibration protocol. The release of the calibrated TTRL protocol is a positive contribution for reproducibility.

major comments (2)
  1. [Abstract / Experimental Results] Abstract and Experimental Results section: The central quantitative claims rest on effect-size ranges (gamma reduced 20-49%, JSD 45-67%) from only N=5 within-subjects paired trials, yet no per-condition standard deviations, confidence intervals, bootstrap estimates, or hypothesis tests are reported. This directly undermines the reliability of the mitigation percentages as evidence of a stable calibration effect rather than sampling variability or task-specific noise.
  2. [Experimental Setup] Experimental design description: The symmetric-LR control and fixed choice of two models (DeepSeek-V4-Pro executor, GLM5.2 evaluator) are presented as isolating the calibration effect, but with N=5 and no cross-model or cross-task replication, it remains unclear whether the observed reductions generalize beyond these specific model idiosyncrasies and task distribution.
minor comments (1)
  1. [Abstract] The abstract states 'we release the calibrated TTRL protocol' but does not specify the repository URL or license in the provided text; this should be added for immediate usability.
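
The uncertainty estimate requested in major comment 1 could be produced along the following lines. This is a sketch of a paired percentile bootstrap, one common choice rather than a method the paper is known to use; the function name `paired_bootstrap_ci` and the five gamma values per condition are placeholders invented for illustration, not the paper's raw trial data.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, treated, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (baseline - treated)."""
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    rng = random.Random(seed)
    diffs = [b - t for b, t in zip(baseline, treated)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lo, hi


# Placeholder gamma values for N=5 paired trials (NOT data from the paper).
placeholder_binary = [0.40, 0.35, 0.50, 0.45, 0.38]
placeholder_calibrated = [0.30, 0.25, 0.30, 0.33, 0.29]

mean_diff, lo, hi = paired_bootstrap_ci(placeholder_binary, placeholder_calibrated)
print(f"mean reduction in gamma: {mean_diff:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```

With only five pairs the set of distinct resamples is small, so the resulting interval is coarse; it quantifies variability but does not remove the sample-size concern the referee raises.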

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, acknowledging limitations where appropriate while defending the controlled nature of the study.

  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results section: The central quantitative claims rest on effect-size ranges (gamma reduced 20-49%, JSD 45-67%) from only N=5 within-subjects paired trials, yet no per-condition standard deviations, confidence intervals, bootstrap estimates, or hypothesis tests are reported. This directly undermines the reliability of the mitigation percentages as evidence of a stable calibration effect rather than sampling variability or task-specific noise.

    Authors: We agree that the reported ranges would benefit from explicit measures of uncertainty. The within-subjects paired design was selected to control for task and model variance across the N=5 trials, but we acknowledge the small sample precludes strong claims of stability. In the revision we will add bootstrap confidence intervals and per-condition standard deviations computed from the raw trial data to quantify variability.
    Revision: yes

  2. Referee: [Experimental Setup] Experimental design description: The symmetric-LR control and fixed choice of two models (DeepSeek-V4-Pro executor, GLM5.2 evaluator) are presented as isolating the calibration effect, but with N=5 and no cross-model or cross-task replication, it remains unclear whether the observed reductions generalize beyond these specific model idiosyncrasies and task distribution.

    Authors: The symmetric-LR control was introduced precisely to isolate calibration from update asymmetry, and the model pair was fixed to enable a clean within-subjects comparison. We do not claim the reductions hold universally; the experiment demonstrates the mitigation effect under these controlled conditions. We will expand the limitations and future-work sections to note the absence of cross-model or cross-task replication.
    Revision: partial

Circularity Check

0 steps flagged

Empirical experiment reports measured reductions with no derivation chain


The paper describes a within-subjects experiment (N=5) that directly measures the effect of probability calibration on coupling coefficient gamma and Jensen-Shannon divergence. No equations, fitted parameters, or self-citations are used to derive the reported percentage reductions; the values are presented as experimental outcomes. The work is therefore self-contained and contains no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the EPC diagnostic from prior work, the assumption that probability calibration can be applied to pairwise LLM judgments without introducing new artifacts, and the representativeness of the two models and task distribution used in the N=5 experiment.

axioms (2)
  • [domain assumption] The EPC diagnostic framework from prior work accurately quantifies preference coupling in LLM feedback loops.
    The paper uses the EPC metric to measure the mitigation effect without re-deriving or validating it in this study.
  • [domain assumption] Pairwise judgments from the evaluator model can be meaningfully calibrated to probabilities that reduce spurious preference propagation.
    This is the core premise of the mitigation technique tested in the experiment.
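
To make the second axiom concrete, here is a minimal sketch of one standard calibration method, temperature scaling (Guo et al., 2017, listed in the reference graph). The summary does not say which calibration method the paper applies, so treating the evaluator's preference as a single logit, fitting the temperature by grid search, and the toy held-out data are all assumptions for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean negative log-likelihood of binary labels under scaled logits."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))


# Toy held-out set (NOT from the paper): the evaluator always emits a
# preference logit of 4.0 (about 98% confidence) but is right 6 times out of 8.
held_out_logits = [4.0] * 8
held_out_labels = [1, 1, 1, 1, 1, 1, 0, 0]

temperature = fit_temperature(held_out_logits, held_out_labels)
print("raw confidence:       ", sigmoid(4.0))
print("fitted temperature:   ", temperature)
print("calibrated confidence:", sigmoid(4.0 / temperature))
```

A fitted temperature above 1 softens overconfident judgments toward the evaluator's observed accuracy, which is the kind of probability a weighted update could then consume. Whether such calibrated probabilities introduce new artifacts is exactly what this axiom leaves unverified.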

pith-pipeline@v0.9.1-grok · 5712 in / 1559 out tokens · 55461 ms · 2026-07-01T06:01:47.564434+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 7 canonical work pages · 4 internal anchors

  1. Z. Liu. A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics. TMLR submission, 2026
  2. Z. Liu. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems. arXiv:2606.20493, 2026
  3. Z. Liu. Multimodal Evaluator Preference Collapse. arXiv:2606.16682, 2026
  4. Y. Li. Who Drifted: the System or the Judge? arXiv:2606.15474, 2026
  5. L. Zheng, W.-L. Chiang, Y. Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS, 2023
  6. W.-L. Chiang, L. Zheng, et al. Chatbot Arena. ICML, 2024
  7. C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On Calibration of Modern Neural Networks. ICML, 2017
  8. A. Niculescu-Mizil and R. Caruana. Predicting Good Probabilities with Supervised Learning. ICML, 2005
  9. H. Boström. Calibrating Random Forests. ICMLA, 2008
  10. L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on tabular data? NeurIPS, 2022
  11. Z. Li, X. Li, C. Huang, G. Li, et al. Judging with Confidence: Calibrating Autoraters to Preference Distributions. arXiv:2510.00263, 2025
  12. J. Leng, C. Huang, B. Zhu, and J. Huang. Taming Overconfidence in LLMs: Reward Calibration in RLHF. ICLR, 2025. arXiv:2410.09724
  13. D. Singha. UARD: Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking. arXiv:2604.26360, 2026
  14. S. Devarakonda, J. Huang, and P. Liang. Confidence-Gated RAG for Adaptive Retrieval in Sequential Agents. ICLR, 2026
  15. A. Balashankar, S. Chen, and J. Yao. InfAlign: Inference-Aware Language Model Alignment. NeurIPS, 2025
  16. Z. Zuo, Y. Wang, and J. Li. TTRL-CoCoV: Test-Time Reinforcement Learning with Confidence Conditioned Verification. arXiv, 2026
  17. Y. Wang, X. Zhang, and H. Chen. SCOPE: Beyond Majority Voting---Step-wise Confidence Weighting for Test-Time RL. arXiv:2512.15146, 2026