Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents
Pith reviewed 2026-06-29 05:18 UTC · model grok-4.3
The pith
Cross-modal coupling transfers evaluator preferences between text and visual tasks, inverting optimal strategies in multimodal self-evolving agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using GPT-4o to evaluate DeepSeek-chat outputs, a single strategy absorbs 48.4 percent of all weight in multimodal settings while visual-domain strategies receive only 9.1 percent combined; cross-modal exposure produces strategy inversion, with real-image inputs yielding the strongest directional signal (mean gamma_T->V of 1.145 and gamma_V->T of 0.937, 70 percent T->V) and cross-model evaluation producing JSD of 0.19-0.34, while self-evaluation yields zero coupling in 97 percent of runs with JSD of 0.003.
What carries the argument
Cross-modal coupling, quantified via coupling coefficients gamma and Jensen-Shannon divergence on strategy weight distributions before versus after cross-modal exposure, which measures transfer of evaluator preferences across modalities.
If this is right
- Cross-model evaluation produces strong coupling with JSD values between 0.19 and 0.34.
- Real-image inputs produce the most directionally consistent transfer signal, with 70 percent T->V directionality and Cohen's d of 0.56.
- Self-evaluation yields near-complete immunity, with zero coupling in 97 percent of runs and JSD of 0.003.
- Three methodological ablations and multi-executor validation confirm the coupling is not a structural artifact of the setup.
Where Pith is reading between the lines
- Designs that maintain separate evaluators per modality could limit unintended preference transfer between text and visual reasoning.
- The coupling matrix indexed by evaluator identity suggests testing additional model pairs to identify which architectures minimize drift.
- Extending the four-phase paradigm to tasks with audio or other modalities could test whether coupling generalizes beyond text-visual pairs.
Load-bearing premise
The four-phase isolation training paradigm successfully isolates cross-modal coupling from confounding factors such as model-specific behaviors or task differences between text and visual domains.
What would settle it
Observing no statistically significant shift in strategy weight distributions between pre- and post-cross-modal exposure phases under cross-model evaluation would falsify the coupling claim.
Figures
read the original abstract
When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal coupling: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure coupling coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong coupling (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero coupling (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the coupling matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Evaluator Preference Collapse (EPC) is amplified in multimodal self-evolving agents, with a single strategy (step_by_step) absorbing 48.4% of weight (3.2x text-only collapse) when GPT-4o evaluates DeepSeek-chat across text and visual tasks. It introduces cross-modal coupling, where preferences acquired on one modality transfer to corrupt strategy selection on another, producing strategy inversion. A four-phase isolation training paradigm plus Phase 3 validation (five evaluator configurations, N=80 repetitions, ~35,000 API calls) reports strong coupling (JSD 0.19-0.34), directional consistency for real-image inputs (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and near-complete immunity under self-evaluation (97% of N=30 runs yield JSD=0.003, d=0.07). Three ablations and multi-executor validation are said to rule out structural artifacts; cross-model evaluator architecture is identified as the primary risk factor, and the MM-EPC framework plus code/data are released.
Significance. If the isolation paradigm holds, the work identifies a previously undocumented transfer mechanism that could systematically degrade strategy selection in multimodal agent loops. The provision of coupling coefficients, directional gamma statistics, and reproducible code/data are concrete strengths that allow direct testing of the claimed immunity under self-evaluation and the modality-specific consistency effects.
major comments (2)
- [§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.
- [Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.
minor comments (2)
- [Abstract] Abstract: the reported sample sizes and API-call counts are useful but the exact statistical procedure for computing JSD and Cohen's d should be stated once in the main text for reproducibility.
- [Notation] Notation: the coupling matrix indexed by evaluator identity is introduced late; an early formal definition (e.g., as a 2x2 matrix of gamma coefficients) would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the isolation paradigm and Phase 3 validation. We address each point below with clarifications drawn from the experimental design and agree that explicit statements will improve the manuscript.
read point-by-point responses
-
Referee: [§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.
Authors: The four-phase isolation paradigm holds evaluator identity (GPT-4o), executor model (DeepSeek-chat), and task distribution fixed across phases, varying only the cross-modal exposure history in Phase 2. This control structure is specified in the §3 methods to isolate modality transfer from model-specific or domain effects. We will add an explicit confirmation paragraph in the revised §3 to eliminate any ambiguity regarding these fixed factors. revision: yes
-
Referee: [Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.
Authors: The three ablations control task distribution via balanced, statistically matched text and visual task sets and control input modality statistics through fixed sampling procedures for real-image versus text-proxy inputs. These controls are described in the Phase 3 validation subsection and underpin the directional consistency findings. We agree that more granular reporting of the exact task distribution and modality statistics would strengthen the section against alternative explanations and will expand this description in the revision. revision: yes
Circularity Check
No circularity; results are direct empirical measurements
full rationale
The paper's central claims rest on experimental measurements (JSD values, gamma coefficients, Cohen's d, strategy weights) obtained via API calls across four-phase isolation training, Phase 3 validation (N=80, ~35k calls), and ablations. These quantities are reported as observed statistics from model outputs, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No self-definitional loops, uniqueness theorems, or ansatzes appear in the methodology or results sections. The work is self-contained against external benchmarks via direct replication of the described protocol.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The four-phase isolation training paradigm isolates cross-modal effects without introducing structural artifacts from the specific models or tasks
- standard math Statistical measures (JSD, Cohen's d, gamma coefficients) validly quantify coupling strength across the tested configurations
invented entities (2)
-
cross-modal coupling
no independent evidence
-
strategy inversion
no independent evidence
Forward citations
Cited by 2 Pith papers
-
Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
Contagion Networks framework measures evaluator bias propagation in 3-agent LLM systems using the same base model, reporting gamma values of 0.157-0.352 and a 72.4% reduction in contagion when increasing evaluator com...
-
Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
Introduces Contagion Networks framework and measures preference propagation in 3-agent LLM setups, finding architectural priors dominate prompts, topology affects spread, and larger committees reduce contagion by ~69%.
Reference graph
Works this paper leans on
-
[1]
Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution
Z. Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution. arXiv, 2026
2026
-
[2]
OpenAI.GPT-4o System Card.arXiv:2410.21276, 2024
Pith/arXiv arXiv 2024
-
[3]
Google DeepMind.Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens.arXiv:2403.05530, 2024
Pith/arXiv arXiv 2024
-
[4]
Zheng, W.-L
L. Zheng, W.-L. Chiang, Y. Sheng, et al.Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.NeurIPS, 2023
2023
-
[5]
Verga, H
P. Verga, H. Rashkin, and M. Bansal.Arbiter: A Robust Evaluation Framework for LLM- as-Judge.EMNLP, 2024
2024
-
[6]
Chiang, L
W.-L. Chiang, L. Zheng, Y. Sheng, et al.Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.ICML, 2024
2024
-
[7]
X. Li, T. Zhang, Y. Dubois, et al.AlpacaEval: An Automatic Evaluator of Instruction- following Models.ICLR, 2024
2024
-
[8]
W. Yuan, R. Y. Pang, K. Cho, et al.Self-Rewarding Language Models.ICML, 2024
2024
-
[9]
H. Chen, S. Yao, D. Yu, et al.Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.NeurIPS, 2024
2024
-
[10]
J. Lu, D. Batra, D. Parikh, and S. Lee.ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations.NeurIPS, 2019
2019
-
[11]
A. Aghajanyan, A. Gupta, A. Shrivastava, et al.CM3: A Casual Masked Multimodal Model of the Internet.arXiv:2101.08220, 2021
arXiv 2021
-
[12]
Alayrac, J
J.-B. Alayrac, J. Donahue, P. Luc, et al.Flamingo: A Visual Language Model for Few-Shot Learning.NeurIPS, 2022. 15
2022
-
[13]
T. Yu, R. Zhang, Z. Yang, et al.Reward Hacking in Multimodal RLHF: A Case Study in Vision-Language Models.ICLR, 2024
2024
-
[14]
Z. Sun, Y. Shen, Q. Zhou, et al.Aligning Multimodal Large Language Models with Human Preferences: Challenges and Solutions.arXiv:2405.07656, 2024
arXiv 2024
-
[15]
Shinn, F
N. Shinn, F. Cassano, A. Gopinath, et al.Reflexion: Language Agents with Verbal Rein- forcement Learning.NeurIPS, 2023
2023
-
[16]
NeurIPS, 2023
A.Madaan, N.Tandon, P.Gupta, etal.Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023
2023
-
[17]
S. Yao, J. Zhao, D. Yu, et al.ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023
2023
-
[18]
G. Wang, Y. Xie, Y. Jiang, et al.Voyager: An Open-Ended Embodied Agent with Large Language Models.ICML, 2024
2024
-
[19]
Arora, E
S. Arora, E. Hazan, and S. Kale.The Multiplicative Weights Update Method: A Meta- Algorithm and Applications.Theory of Computing, 2012
2012
-
[20]
DeepSeek-AI.DeepSeek-V3 Technical Report.arXiv:2412.19437, 2024
Pith/arXiv arXiv 2024
-
[21]
L. Gao, J. Schulman, and J. Hilton.Scaling Laws for Reward Model Overoptimization. ICML, 2023
2023
-
[22]
Casper, X
S. Casper, X. Davies, C. Shi, et al.Open Problems and Fundamental Limitations of Rein- forcement Learning from Human Feedback.TMLR, 2023
2023
-
[23]
Sharma, E
M. Sharma, E. Tong, T. Korbak, et al.Towards Understanding Sycophancy in Language Models.ICLR, 2024
2024
-
[24]
Perez, S
E. Perez, S. Ringer, K. Lukoši¯ ut˙ e, et al.Discovering Language Model Behaviors with Model- Written Evaluations.NeurIPS, 2022
2022
-
[25]
Burns, P
C. Burns, P. Izmailov, J. H. Kirchner, et al.Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.ICML, 2024. 16 A Experiment Reproducibility Requirements:Python 3.8+, LLM API access (api2d/DashScope). No GPU. Phase 1:python mm_epc_phase1.py Phase 2:python mm_epc_coupling.py Phase 3:python mm_epc_phase3_significance.py Output...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.