pith. sign in

arxiv: 2606.16682 · v3 · pith:G5GH7ARRnew · submitted 2026-06-15 · 💻 cs.LG · cs.CL

Multimodal Evaluator Preference Collapse: Cross-Modal Coupling in Self-Evolving Agents

Pith reviewed 2026-06-29 05:18 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords evaluator preference collapsecross-modal couplingmultimodal agentsself-evaluationstrategy selectionpreference driftJensen-Shannon divergence
0
0 comments X

The pith

Cross-modal coupling transfers evaluator preferences between text and visual tasks, inverting optimal strategies in multimodal self-evolving agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Evaluator Preference Collapse is amplified when agents evaluate outputs across modalities, with one strategy absorbing 48.4 percent of weight in multimodal cases versus lower collapse in text-only runs. It demonstrates cross-modal coupling, where preferences acquired on one modality transfer to and corrupt selection on the other, producing strategy inversion after exposure. Validation across evaluator configurations and task types shows strong coupling under cross-model evaluation but near-zero coupling under self-evaluation. A reader would care because this mechanism can systematically bias how agents choose reasoning strategies when handling mixed text and image inputs in feedback loops.

Core claim

Using GPT-4o to evaluate DeepSeek-chat outputs, a single strategy absorbs 48.4 percent of all weight in multimodal settings while visual-domain strategies receive only 9.1 percent combined; cross-modal exposure produces strategy inversion, with real-image inputs yielding the strongest directional signal (mean gamma_T->V of 1.145 and gamma_V->T of 0.937, 70 percent T->V) and cross-model evaluation producing JSD of 0.19-0.34, while self-evaluation yields zero coupling in 97 percent of runs with JSD of 0.003.

What carries the argument

Cross-modal coupling, quantified via coupling coefficients gamma and Jensen-Shannon divergence on strategy weight distributions before versus after cross-modal exposure, which measures transfer of evaluator preferences across modalities.

If this is right

  • Cross-model evaluation produces strong coupling with JSD values between 0.19 and 0.34.
  • Real-image inputs produce the most directionally consistent transfer signal, with 70 percent T->V directionality and Cohen's d of 0.56.
  • Self-evaluation yields near-complete immunity, with zero coupling in 97 percent of runs and JSD of 0.003.
  • Three methodological ablations and multi-executor validation confirm the coupling is not a structural artifact of the setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designs that maintain separate evaluators per modality could limit unintended preference transfer between text and visual reasoning.
  • The coupling matrix indexed by evaluator identity suggests testing additional model pairs to identify which architectures minimize drift.
  • Extending the four-phase paradigm to tasks with audio or other modalities could test whether coupling generalizes beyond text-visual pairs.

Load-bearing premise

The four-phase isolation training paradigm successfully isolates cross-modal coupling from confounding factors such as model-specific behaviors or task differences between text and visual domains.

What would settle it

Observing no statistically significant shift in strategy weight distributions between pre- and post-cross-modal exposure phases under cross-model evaluation would falsify the coupling claim.

Figures

Figures reproduced from arXiv: 2606.16682 by Zewen Liu.

Figure 1
Figure 1. Figure 1: Four-phase isolation training protocol. All evaluator comparisons use the same fixed [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
read the original abstract

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal coupling: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure coupling coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong coupling (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero coupling (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the coupling matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Evaluator Preference Collapse (EPC) is amplified in multimodal self-evolving agents, with a single strategy (step_by_step) absorbing 48.4% of weight (3.2x text-only collapse) when GPT-4o evaluates DeepSeek-chat across text and visual tasks. It introduces cross-modal coupling, where preferences acquired on one modality transfer to corrupt strategy selection on another, producing strategy inversion. A four-phase isolation training paradigm plus Phase 3 validation (five evaluator configurations, N=80 repetitions, ~35,000 API calls) reports strong coupling (JSD 0.19-0.34), directional consistency for real-image inputs (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and near-complete immunity under self-evaluation (97% of N=30 runs yield JSD=0.003, d=0.07). Three ablations and multi-executor validation are said to rule out structural artifacts; cross-model evaluator architecture is identified as the primary risk factor, and the MM-EPC framework plus code/data are released.

Significance. If the isolation paradigm holds, the work identifies a previously undocumented transfer mechanism that could systematically degrade strategy selection in multimodal agent loops. The provision of coupling coefficients, directional gamma statistics, and reproducible code/data are concrete strengths that allow direct testing of the claimed immunity under self-evaluation and the modality-specific consistency effects.

major comments (2)
  1. [§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.
  2. [Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.
minor comments (2)
  1. [Abstract] Abstract: the reported sample sizes and API-call counts are useful but the exact statistical procedure for computing JSD and Cohen's d should be stated once in the main text for reproducibility.
  2. [Notation] Notation: the coupling matrix indexed by evaluator identity is introduced late; an early formal definition (e.g., as a 2x2 matrix of gamma coefficients) would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the isolation paradigm and Phase 3 validation. We address each point below with clarifications drawn from the experimental design and agree that explicit statements will improve the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Four-phase isolation paradigm): the claim that the paradigm isolates cross-modal coupling requires explicit confirmation that evaluator identity, executor model (GPT-4o vs. DeepSeek-chat), and task distribution remain fixed while only cross-modal exposure history varies; without this, the reported JSD values (0.19-0.34) and gamma coefficients could arise from residual model-specific preference or text-vs-image domain mismatch rather than modality transfer.

    Authors: The four-phase isolation paradigm holds evaluator identity (GPT-4o), executor model (DeepSeek-chat), and task distribution fixed across phases, varying only the cross-modal exposure history in Phase 2. This control structure is specified in the §3 methods to isolate modality transfer from model-specific or domain effects. We will add an explicit confirmation paragraph in the revised §3 to eliminate any ambiguity regarding these fixed factors. revision: yes

  2. Referee: [Phase 3] Phase 3 validation (N=80, five configurations): the three methodological ablations are stated to confirm the effect is not a structural artifact, yet the manuscript does not report the precise controls applied to task distribution or input modality statistics, leaving the directional consistency result (70% T->V, Cohen's d=0.56) vulnerable to alternative explanations.

    Authors: The three ablations control task distribution via balanced, statistically matched text and visual task sets and control input modality statistics through fixed sampling procedures for real-image versus text-proxy inputs. These controls are described in the Phase 3 validation subsection and underpin the directional consistency findings. We agree that more granular reporting of the exact task distribution and modality statistics would strengthen the section against alternative explanations and will expand this description in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements

full rationale

The paper's central claims rest on experimental measurements (JSD values, gamma coefficients, Cohen's d, strategy weights) obtained via API calls across four-phase isolation training, Phase 3 validation (N=80, ~35k calls), and ablations. These quantities are reported as observed statistics from model outputs, not quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No self-definitional loops, uniqueness theorems, or ansatzes appear in the methodology or results sections. The work is self-contained against external benchmarks via direct replication of the described protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that the staged isolation training successfully separates cross-modal transfer from model-specific or task-specific artifacts, and that strategy preference weights are directly comparable across text and visual modalities.

axioms (2)
  • domain assumption The four-phase isolation training paradigm isolates cross-modal effects without introducing structural artifacts from the specific models or tasks
    Invoked to interpret the measured coupling coefficients and strategy inversion
  • standard math Statistical measures (JSD, Cohen's d, gamma coefficients) validly quantify coupling strength across the tested configurations
    Used in Phase 3 validation across N=80 repetitions
invented entities (2)
  • cross-modal coupling no independent evidence
    purpose: Describes transfer of evaluator preferences from one modality to strategy selection in another
    Introduced to explain observed changes in strategy weights after cross-modal exposure
  • strategy inversion no independent evidence
    purpose: Describes reversal of optimal strategy for a modality after cross-modal exposure
    Observed phenomenon reported from the isolation training experiments

pith-pipeline@v0.9.1-grok · 5861 in / 1646 out tokens · 33040 ms · 2026-06-29T05:18:08.360975+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems

    cs.LG 2026-06 unverdicted novelty 7.0

    Contagion Networks framework measures evaluator bias propagation in 3-agent LLM systems using the same base model, reporting gamma values of 0.157-0.352 and a 72.4% reduction in contagion when increasing evaluator com...

  2. Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems

    cs.LG 2026-06 unverdicted novelty 6.0

    Introduces Contagion Networks framework and measures preference propagation in 3-agent LLM setups, finding architectural priors dominate prompts, topology affects spread, and larger committees reduce contagion by ~69%.

Reference graph

Works this paper leans on

25 extracted references · 3 linked inside Pith · cited by 1 Pith paper

  1. [1]

    Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution

    Z. Liu.Evaluator Preference Collapse: Self-Evaluation Bias in Test-Time Agent Evolution. arXiv, 2026

  2. [2]

    OpenAI.GPT-4o System Card.arXiv:2410.21276, 2024

  3. [3]

    Google DeepMind.Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens.arXiv:2403.05530, 2024

  4. [4]

    Zheng, W.-L

    L. Zheng, W.-L. Chiang, Y. Sheng, et al.Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.NeurIPS, 2023

  5. [5]

    Verga, H

    P. Verga, H. Rashkin, and M. Bansal.Arbiter: A Robust Evaluation Framework for LLM- as-Judge.EMNLP, 2024

  6. [6]

    Chiang, L

    W.-L. Chiang, L. Zheng, Y. Sheng, et al.Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.ICML, 2024

  7. [7]

    X. Li, T. Zhang, Y. Dubois, et al.AlpacaEval: An Automatic Evaluator of Instruction- following Models.ICLR, 2024

  8. [8]

    W. Yuan, R. Y. Pang, K. Cho, et al.Self-Rewarding Language Models.ICML, 2024

  9. [9]

    H. Chen, S. Yao, D. Yu, et al.Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models.NeurIPS, 2024

  10. [10]

    J. Lu, D. Batra, D. Parikh, and S. Lee.ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations.NeurIPS, 2019

  11. [11]

    Aghajanyan, A

    A. Aghajanyan, A. Gupta, A. Shrivastava, et al.CM3: A Casual Masked Multimodal Model of the Internet.arXiv:2101.08220, 2021

  12. [12]

    Alayrac, J

    J.-B. Alayrac, J. Donahue, P. Luc, et al.Flamingo: A Visual Language Model for Few-Shot Learning.NeurIPS, 2022. 15

  13. [13]

    T. Yu, R. Zhang, Z. Yang, et al.Reward Hacking in Multimodal RLHF: A Case Study in Vision-Language Models.ICLR, 2024

  14. [14]

    Z. Sun, Y. Shen, Q. Zhou, et al.Aligning Multimodal Large Language Models with Human Preferences: Challenges and Solutions.arXiv:2405.07656, 2024

  15. [15]

    Shinn, F

    N. Shinn, F. Cassano, A. Gopinath, et al.Reflexion: Language Agents with Verbal Rein- forcement Learning.NeurIPS, 2023

  16. [16]

    NeurIPS, 2023

    A.Madaan, N.Tandon, P.Gupta, etal.Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS, 2023

  17. [17]

    S. Yao, J. Zhao, D. Yu, et al.ReAct: Synergizing Reasoning and Acting in Language Models. ICLR, 2023

  18. [18]

    G. Wang, Y. Xie, Y. Jiang, et al.Voyager: An Open-Ended Embodied Agent with Large Language Models.ICML, 2024

  19. [19]

    Arora, E

    S. Arora, E. Hazan, and S. Kale.The Multiplicative Weights Update Method: A Meta- Algorithm and Applications.Theory of Computing, 2012

  20. [20]

    DeepSeek-AI.DeepSeek-V3 Technical Report.arXiv:2412.19437, 2024

  21. [21]

    L. Gao, J. Schulman, and J. Hilton.Scaling Laws for Reward Model Overoptimization. ICML, 2023

  22. [22]

    Casper, X

    S. Casper, X. Davies, C. Shi, et al.Open Problems and Fundamental Limitations of Rein- forcement Learning from Human Feedback.TMLR, 2023

  23. [23]

    Sharma, E

    M. Sharma, E. Tong, T. Korbak, et al.Towards Understanding Sycophancy in Language Models.ICLR, 2024

  24. [24]

    Perez, S

    E. Perez, S. Ringer, K. Lukoši¯ ut˙ e, et al.Discovering Language Model Behaviors with Model- Written Evaluations.NeurIPS, 2022

  25. [25]

    Burns, P

    C. Burns, P. Izmailov, J. H. Kirchner, et al.Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision.ICML, 2024. 16 A Experiment Reproducibility Requirements:Python 3.8+, LLM API access (api2d/DashScope). No GPU. Phase 1:python mm_epc_phase1.py Phase 2:python mm_epc_coupling.py Phase 3:python mm_epc_phase3_significance.py Output...