Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

Zewen Liu

arxiv: 2606.16682 · v3 · pith:G5GH7ARRnew · submitted 2026-06-15 · 💻 cs.LG · cs.CL

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

Zewen Liu This is my paper

Pith reviewed 2026-06-29 05:18 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords evaluator preference collapsecross-modal couplingmultimodal agentsself-evaluationstrategy selectionpreference driftJensen-Shannon divergence

0 comments

The pith

Cross-modal coupling transfers evaluator preferences between text and visual tasks, inverting optimal strategies in multimodal self-evolving agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Evaluator Preference Collapse is amplified when agents evaluate outputs across modalities, with one strategy absorbing 48.4 percent of weight in multimodal cases versus lower collapse in text-only runs. It demonstrates cross-modal coupling, where preferences acquired on one modality transfer to and corrupt selection on the other, producing strategy inversion after exposure. Validation across evaluator configurations and task types shows strong coupling under cross-model evaluation but near-zero coupling under self-evaluation. A reader would care because this mechanism can systematically bias how agents choose reasoning strategies when handling mixed text and image inputs in feedback loops.

Core claim

Using GPT-4o to evaluate DeepSeek-chat outputs, a single strategy absorbs 48.4 percent of all weight in multimodal settings while visual-domain strategies receive only 9.1 percent combined; cross-modal exposure produces strategy inversion, with real-image inputs yielding the strongest directional signal (mean gamma_T->V of 1.145 and gamma_V->T of 0.937, 70 percent T->V) and cross-model evaluation producing JSD of 0.19-0.34, while self-evaluation yields zero coupling in 97 percent of runs with JSD of 0.003.

What carries the argument

Cross-modal coupling, quantified via coupling coefficients gamma and Jensen-Shannon divergence on strategy weight distributions before versus after cross-modal exposure, which measures transfer of evaluator preferences across modalities.

If this is right

Cross-model evaluation produces strong coupling with JSD values between 0.19 and 0.34.
Real-image inputs produce the most directionally consistent transfer signal, with 70 percent T->V directionality and Cohen's d of 0.56.
Self-evaluation yields near-complete immunity, with zero coupling in 97 percent of runs and JSD of 0.003.
Three methodological ablations and multi-executor validation confirm the coupling is not a structural artifact of the setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designs that maintain separate evaluators per modality could limit unintended preference transfer between text and visual reasoning.
The coupling matrix indexed by evaluator identity suggests testing additional model pairs to identify which architectures minimize drift.
Extending the four-phase paradigm to tasks with audio or other modalities could test whether coupling generalizes beyond text-visual pairs.

Load-bearing premise

The four-phase isolation training paradigm successfully isolates cross-modal coupling from confounding factors such as model-specific behaviors or task differences between text and visual domains.

What would settle it

Observing no statistically significant shift in strategy weight distributions between pre- and post-cross-modal exposure phases under cross-model evaluation would falsify the coupling claim.

Figures

Figures reproduced from arXiv: 2606.16682 by Zewen Liu.

read the original abstract

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal coupling: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure coupling coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong coupling (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero coupling (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the coupling matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures cross-modal coupling in multimodal self-evaluation with concrete numbers and open code, but the four-phase isolation leaves some room for model and task confounders to explain the shifts.

read the letter

The key takeaway is that this work documents cross-modal coupling and strategy inversion in multimodal agent evaluation, with self-evaluation showing near-zero coupling. They extend text-only EPC findings by running GPT-4o evaluations on DeepSeek-chat outputs across text and visual tasks, reporting that one strategy grabs 48.4% weight and that preferences transfer with JSD values of 0.19-0.34.

What stands out is the four-phase paradigm plus the Phase 3 validation across 80 runs and roughly 35,000 API calls. They report directional gamma coefficients, 70% T-to-V consistency on real images, and 97% of self-evaluation runs showing zero coupling. The ablations and multi-executor checks are a plus, and releasing the MM-EPC framework with code lets others test the numbers directly.

The softer part is the isolation claim. The setup mixes evaluator and executor models while switching task domains, so residual model preferences or text-versus-image differences could produce the observed weight shifts without pure cross-modal transfer. The abstract says the phases and ablations rule this out, but the exact controls on evaluator identity and task distribution are not laid out here. Cohen's d of 0.56 is moderate, so even modest leaks would affect the result.

This is for researchers building feedback loops in multimodal agents or studying preference collapse. Readers who care about evaluator architecture risks will get usable measurements and a starting codebase. The empirical reporting and open resources are solid enough that it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that Evaluator Preference Collapse (EPC) is amplified in multimodal self-evolving agents, with a single strategy (step_by_step) absorbing 48.4% of weight (3.2x text-only collapse) when GPT-4o evaluates DeepSeek-chat across text and visual tasks. It introduces cross-modal coupling, where preferences acquired on one modality transfer to corrupt strategy selection on another, producing strategy inversion. A four-phase isolation training paradigm plus Phase 3 validation (five evaluator configurations, N=80 repetitions, ~35,000 API calls) reports strong coupling (JSD 0.19-0.34), directional consistency for real-image inputs (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and near-complete immunity under self-evaluation (97% of N=30 runs yield JSD=0.003, d=0.07). Three ablations and multi-executor validation are said to rule out structural artifacts; cross-model evaluator architecture is identified as the primary risk factor, and the MM-EPC framework plus code/data are released.

Significance. If the isolation paradigm holds, the work identifies a previously undocumented transfer mechanism that could systematically degrade strategy selection in multimodal agent loops. The provision of coupling coefficients, directional gamma statistics, and reproducible code/data are concrete strengths that allow direct testing of the claimed immunity under self-evaluation and the modality-specific consistency effects.

major comments (2)

[§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.
[Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.

minor comments (2)

[Abstract] Abstract: the reported sample sizes and API-call counts are useful but the exact statistical procedure for computing JSD and Cohen's d should be stated once in the main text for reproducibility.
[Notation] Notation: the coupling matrix indexed by evaluator identity is introduced late; an early formal definition (e.g., as a 2x2 matrix of gamma coefficients) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the isolation paradigm and Phase 3 validation. We address each point below with clarifications drawn from the experimental design and agree that explicit statements will improve the manuscript.

read point-by-point responses

Referee: [§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.

Authors: The four-phase isolation paradigm holds evaluator identity (GPT-4o), executor model (DeepSeek-chat), and task distribution fixed across phases, varying only the cross-modal exposure history in Phase 2. This control structure is specified in the §3 methods to isolate modality transfer from model-specific or domain effects. We will add an explicit confirmation paragraph in the revised §3 to eliminate any ambiguity regarding these fixed factors. revision: yes
Referee: [Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.

Authors: The three ablations control task distribution via balanced, statistically matched text and visual task sets and control input modality statistics through fixed sampling procedures for real-image versus text-proxy inputs. These controls are described in the Phase 3 validation subsection and underpin the directional consistency findings. We agree that more granular reporting of the exact task distribution and modality statistics would strengthen the section against alternative explanations and will expand this description in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements

full rationale

The paper's central claims rest on experimental measurements (JSD values, gamma coefficients, Cohen's d, strategy weights) obtained via API calls across four-phase isolation training, Phase 3 validation (N=80, ~35k calls), and ablations. These quantities are reported as observed statistics from model outputs, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No self-definitional loops, uniqueness theorems, or ansatzes appear in the methodology or results sections. The work is self-contained against external benchmarks via direct replication of the described protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that the staged isolation training successfully separates cross-modal transfer from model-specific or task-specific artifacts, and that strategy preference weights are directly comparable across text and visual modalities.

axioms (2)

domain assumption The four-phase isolation training paradigm isolates cross-modal effects without introducing structural artifacts from the specific models or tasks
Invoked to interpret the measured coupling coefficients and strategy inversion
standard math Statistical measures (JSD, Cohen's d, gamma coefficients) validly quantify coupling strength across the tested configurations
Used in Phase 3 validation across N=80 repetitions

invented entities (2)

cross-modal coupling no independent evidence
purpose: Describes transfer of evaluator preferences from one modality to strategy selection in another
Introduced to explain observed changes in strategy weights after cross-modal exposure
strategy inversion no independent evidence
purpose: Describes reversal of optimal strategy for a modality after cross-modal exposure
Observed phenomenon reported from the isolation training experiments

pith-pipeline@v0.9.1-grok · 5861 in / 1646 out tokens · 33040 ms · 2026-06-29T05:18:08.360975+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
cs.LG 2026-06 unverdicted novelty 7.0

Contagion Networks framework measures evaluator bias propagation in 3-agent LLM systems using the same base model, reporting gamma values of 0.157-0.352 and a 72.4% reduction in contagion when increasing evaluator com...
Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
cs.LG 2026-06 unverdicted novelty 6.0

Introduces Contagion Networks framework and measures preference propagation in 3-agent LLM setups, finding architectural priors dominate prompts, topology affects spread, and larger committees reduce contagion by ~69%.

Reference graph

Works this paper leans on

25 extracted references · 3 linked inside Pith · cited by 1 Pith paper

[1]

Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution

Z. Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution. arXiv, 2026

2026
[2]

OpenAI.GPT-4o System Card.arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[3]

Google DeepMind.Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens.arXiv:2403.05530, 2024

Pith/arXiv arXiv 2024
[4]

Zheng, W.-L

L. Zheng, W.-L. Chiang, Y. Sheng, et al.Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.NeurIPS, 2023

2023
[5]

Verga, H

P. Verga, H. Rashkin, and M. Bansal.Arbiter: A Robust Evaluation Framework for LLM- as-Judge.EMNLP, 2024

2024
[6]

Chiang, L

W.-L. Chiang, L. Zheng, Y. Sheng, et al.Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.ICML, 2024

2024
[7]

X. Li, T. Zhang, Y. Dubois, et al.AlpacaEval: An Automatic Evaluator of Instruction- following Models.ICLR, 2024

2024
[8]

W. Yuan, R. Y. Pang, K. Cho, et al.Self-Rewarding Language Models.ICML, 2024

2024
[9]

H. Chen, S. Yao, D. Yu, et al.Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.NeurIPS, 2024

2024
[10]

J. Lu, D. Batra, D. Parikh, and S. Lee.ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations.NeurIPS, 2019

2019
[11]

Aghajanyan, A

A. Aghajanyan, A. Gupta, A. Shrivastava, et al.CM3: A Casual Masked Multimodal Model of the Internet.arXiv:2101.08220, 2021

arXiv 2021
[12]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, et al.Flamingo: A Visual Language Model for Few-Shot Learning.NeurIPS, 2022. 15

2022
[13]

T. Yu, R. Zhang, Z. Yang, et al.Reward Hacking in Multimodal RLHF: A Case Study in Vision-Language Models.ICLR, 2024

2024
[14]

Z. Sun, Y. Shen, Q. Zhou, et al.Aligning Multimodal Large Language Models with Human Preferences: Challenges and Solutions.arXiv:2405.07656, 2024

arXiv 2024
[15]

Shinn, F

N. Shinn, F. Cassano, A. Gopinath, et al.Reflexion: Language Agents with Verbal Rein- forcement Learning.NeurIPS, 2023

2023
[16]

NeurIPS, 2023

A.Madaan, N.Tandon, P.Gupta, etal.Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023

2023
[17]

S. Yao, J. Zhao, D. Yu, et al.ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023

2023
[18]

G. Wang, Y. Xie, Y. Jiang, et al.Voyager: An Open-Ended Embodied Agent with Large Language Models.ICML, 2024

2024
[19]

Arora, E

S. Arora, E. Hazan, and S. Kale.The Multiplicative Weights Update Method: A Meta- Algorithm and Applications.Theory of Computing, 2012

2012
[20]

DeepSeek-AI.DeepSeek-V3 Technical Report.arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[21]

L. Gao, J. Schulman, and J. Hilton.Scaling Laws for Reward Model Overoptimization. ICML, 2023

2023
[22]

Casper, X

S. Casper, X. Davies, C. Shi, et al.Open Problems and Fundamental Limitations of Rein- forcement Learning from Human Feedback.TMLR, 2023

2023
[23]

Sharma, E

M. Sharma, E. Tong, T. Korbak, et al.Towards Understanding Sycophancy in Language Models.ICLR, 2024

2024
[24]

Perez, S

E. Perez, S. Ringer, K. Lukoši¯ ut˙ e, et al.Discovering Language Model Behaviors with Model- Written Evaluations.NeurIPS, 2022

2022
[25]

Burns, P

C. Burns, P. Izmailov, J. H. Kirchner, et al.Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.ICML, 2024. 16 A Experiment Reproducibility Requirements:Python 3.8+, LLM API access (api2d/DashScope). No GPU. Phase 1:python mm_epc_phase1.py Phase 2:python mm_epc_coupling.py Phase 3:python mm_epc_phase3_significance.py Output...

2024

[1] [1]

Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution

Z. Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution. arXiv, 2026

2026

[2] [2]

OpenAI.GPT-4o System Card.arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[3] [3]

Google DeepMind.Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens.arXiv:2403.05530, 2024

Pith/arXiv arXiv 2024

[4] [4]

Zheng, W.-L

L. Zheng, W.-L. Chiang, Y. Sheng, et al.Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.NeurIPS, 2023

2023

[5] [5]

Verga, H

P. Verga, H. Rashkin, and M. Bansal.Arbiter: A Robust Evaluation Framework for LLM- as-Judge.EMNLP, 2024

2024

[6] [6]

Chiang, L

W.-L. Chiang, L. Zheng, Y. Sheng, et al.Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.ICML, 2024

2024

[7] [7]

X. Li, T. Zhang, Y. Dubois, et al.AlpacaEval: An Automatic Evaluator of Instruction- following Models.ICLR, 2024

2024

[8] [8]

W. Yuan, R. Y. Pang, K. Cho, et al.Self-Rewarding Language Models.ICML, 2024

2024

[9] [9]

H. Chen, S. Yao, D. Yu, et al.Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.NeurIPS, 2024

2024

[10] [10]

J. Lu, D. Batra, D. Parikh, and S. Lee.ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations.NeurIPS, 2019

2019

[11] [11]

Aghajanyan, A

A. Aghajanyan, A. Gupta, A. Shrivastava, et al.CM3: A Casual Masked Multimodal Model of the Internet.arXiv:2101.08220, 2021

arXiv 2021

[12] [12]

Alayrac, J

J.-B. Alayrac, J. Donahue, P. Luc, et al.Flamingo: A Visual Language Model for Few-Shot Learning.NeurIPS, 2022. 15

2022

[13] [13]

T. Yu, R. Zhang, Z. Yang, et al.Reward Hacking in Multimodal RLHF: A Case Study in Vision-Language Models.ICLR, 2024

2024

[14] [14]

Z. Sun, Y. Shen, Q. Zhou, et al.Aligning Multimodal Large Language Models with Human Preferences: Challenges and Solutions.arXiv:2405.07656, 2024

arXiv 2024

[15] [15]

Shinn, F

N. Shinn, F. Cassano, A. Gopinath, et al.Reflexion: Language Agents with Verbal Rein- forcement Learning.NeurIPS, 2023

2023

[16] [16]

NeurIPS, 2023

A.Madaan, N.Tandon, P.Gupta, etal.Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023

2023

[17] [17]

S. Yao, J. Zhao, D. Yu, et al.ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023

2023

[18] [18]

G. Wang, Y. Xie, Y. Jiang, et al.Voyager: An Open-Ended Embodied Agent with Large Language Models.ICML, 2024

2024

[19] [19]

Arora, E

S. Arora, E. Hazan, and S. Kale.The Multiplicative Weights Update Method: A Meta- Algorithm and Applications.Theory of Computing, 2012

2012

[20] [20]

DeepSeek-AI.DeepSeek-V3 Technical Report.arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[21] [21]

L. Gao, J. Schulman, and J. Hilton.Scaling Laws for Reward Model Overoptimization. ICML, 2023

2023

[22] [22]

Casper, X

S. Casper, X. Davies, C. Shi, et al.Open Problems and Fundamental Limitations of Rein- forcement Learning from Human Feedback.TMLR, 2023

2023

[23] [23]

Sharma, E

M. Sharma, E. Tong, T. Korbak, et al.Towards Understanding Sycophancy in Language Models.ICLR, 2024

2024

[24] [24]

Perez, S

E. Perez, S. Ringer, K. Lukoši¯ ut˙ e, et al.Discovering Language Model Behaviors with Model- Written Evaluations.NeurIPS, 2022

2022

[25] [25]

Burns, P

C. Burns, P. Izmailov, J. H. Kirchner, et al.Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.ICML, 2024. 16 A Experiment Reproducibility Requirements:Python 3.8+, LLM API access (api2d/DashScope). No GPU. Phase 1:python mm_epc_phase1.py Phase 2:python mm_epc_coupling.py Phase 3:python mm_epc_phase3_significance.py Output...

2024