A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3years
2026 3verdicts
UNVERDICTED 3representative citing papers
Probability calibration applied to LLM evaluator judgments reduces preference coupling gamma by 20-49% and Jensen-Shannon divergence by 45-67% in a within-subjects experiment with N=5.
The paper specifies the EPC protocol for measuring evaluator preference coupling and releases a time-bound reference snapshot of measurements across multiple LLM evaluators.
citing papers explorer
-
A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents
A diagnostic framework called EPC reveals that proprietary LLM evaluators can exhibit large preference shifts between versions, as evidenced by a GPT-4o May-to-June drift that inverted study conclusions, rendering single-snapshot evaluations unreliable.
-
Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?
Probability calibration applied to LLM evaluator judgments reduces preference coupling gamma by 20-49% and Jensen-Shannon divergence by 45-67% in a within-subjects experiment with N=5.
-
EPC: A Standardized Protocol for Measuring Evaluator Preference Dynamics in LLM Agent Systems
The paper specifies the EPC protocol for measuring evaluator preference coupling and releases a time-bound reference snapshot of measurements across multiple LLM evaluators.