arxiv: 2606.05874 · v1 · pith:YMO22FR2new · submitted 2026-06-04 · 💻 cs.CL

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

Pith reviewed 2026-06-28 01:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords stochastic collapse · multimodal large language models · RandomBench · randomness index · distributional bias · implicit bias · MLLM evaluation

The pith

Multimodal large language models exhibit stochastic collapse when asked to choose randomly among equivalent options.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that MLLMs do not produce uniform random outputs even when explicitly instructed to do so in situations where options have equal validity. It introduces RandomBench along with metrics for randomness and bias to demonstrate that models heavily favor particular choices, reaching 97% probability on the top option instead of 25%. This behavior, called stochastic collapse, remains consistent across languages and input types. A sympathetic reader would care because it implies reduced diversity in applications requiring neutral selection among valid alternatives.

Core claim

The central finding is that MLLMs fail to maintain uniform randomness under explicit random instructions. Experiments show top-1 probabilities reaching 97% against the ideal 25% (one-quarter) baseline, and RI dropping to 0.068 for Claude Sonnet 4.6. Ablation studies confirm that these deviations persist across languages and representation formats.

What carries the argument

RandomBench, a benchmark for testing distributional neutrality in MLLMs, along with the metrics RI, BCI, and BII that quantify entropy and distributional bias.
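
This page does not give the formulas for RI, BCI, or BII, so the following Python sketch is only an illustration of the kind of quantity an entropy-based score measures. It assumes a four-option task and uses normalized Shannon entropy as a stand-in; the function names and the sample data are hypothetical and should not be read as the paper's definitions.

```python
import math
from collections import Counter

def empirical_distribution(choices, options):
    """Relative frequency of each option among the sampled choices."""
    if not choices:
        raise ValueError("need at least one sampled choice")
    counts = Counter(choices)
    return [counts[o] / len(choices) for o in options]

def top1_probability(probs):
    """Probability mass on the most frequently selected option."""
    return max(probs)

def normalized_entropy(probs):
    """Shannon entropy divided by log(k): 1.0 is uniform, 0.0 is deterministic.

    Assumption: this is a stand-in for an entropy-based randomness score,
    not the paper's RI formula, which is not stated on this page.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two options")
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)

options = ["A", "B", "C", "D"]
collapsed = ["A"] * 97 + ["B", "C", "D"]   # hypothetical collapsed sample
uniform = options * 25                      # hypothetical uniform sample

for name, sample in [("collapsed", collapsed), ("uniform", uniform)]:
    probs = empirical_distribution(sample, options)
    print(name, top1_probability(probs), round(normalized_entropy(probs), 3))
```

Under this stand-in, a perfectly uniform sample scores 1.0 and a fully deterministic one scores 0.0. The score for the collapsed sample need not match the RI value reported in the paper, since the paper's normalization may differ.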

Load-bearing premise

The tasks in RandomBench have options that are truly equivalent with no legitimate reason for the model to prefer one over another.

What would settle it

A model producing selections distributed uniformly at 25% each across many trials on RandomBench tasks would falsify the claim of stochastic collapse.
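
One way to make this criterion operational is a goodness-of-fit test against the uniform distribution. The sketch below is an assumption-laden illustration rather than the paper's procedure: it treats trials as independent, uses Pearson's chi-square statistic for four options, and compares it with 7.815, the standard critical value for 3 degrees of freedom at the 0.05 level. The counts are invented for the example.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic of observed counts against a uniform expectation."""
    n = sum(counts)
    k = len(counts)
    if n == 0 or k < 2:
        raise ValueError("need at least two options and one trial")
    expected = n / k
    return sum((c - expected) ** 2 / expected for c in counts)

# Critical value of the chi-square distribution, 3 degrees of freedom, alpha = 0.05.
CHI2_CRIT_DF3_ALPHA05 = 7.815

def rejects_uniform(counts, critical=CHI2_CRIT_DF3_ALPHA05):
    """True when the counts are inconsistent with uniform sampling at the chosen level."""
    return chi_square_uniform(counts) > critical

# Hypothetical counts over 100 trials of a four-option task.
print(chi_square_uniform([97, 1, 1, 1]), rejects_uniform([97, 1, 1, 1]))
print(chi_square_uniform([28, 22, 26, 24]), rejects_uniform([28, 22, 26, 24]))
```

Failing to reject uniformity is not proof of uniformity; with few trials the test has little power, which is one reason the number of trials per task matters.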

Figures

Figures reproduced from arXiv: 2606.05874 by Boyang Wang, Hongcheng Guo, Houtao Zhang, Huiyuan Zheng, Qingyi Si.

Figure 1: Overview of the RandomBench Framework.
Results reveal a consistent failure mode that we term stochastic collapse: under semantic equivalence, MLLMs often fail to sample from a near-uniform distribution and instead concentrate probability mass on a small subset of choices. This collapse appears in both text-only and vision-language settings, with the latter further exhibiting visual hijacking, wh…
Figure 2: Curation pipeline of the RandomBench Framework.
Figure 3: Statistics of the RandomBench Framework.
Figure 4: Multi-granularity evaluation of stochastic collapse on RandomBench. (a) Radar chart of RI score (EN) across cognitive dimensions and subcategories. (b) Cross-modal distribution of the RI score (EN). (c) Relationship between model capability and randomness consistency, revealing that stronger and more aligned models exhibit more severe stochastic collapse.
Figure 5: Ablation studies on implicit multimodal bias
Figure 6: Image Generation Prompt.
… and 12 (Spanish), and 13 and 14 (Chinese). Spatial geometric preferences and coordinate mappings were also strongly modulated by instructional language, often producing opposite decision patterns within the same model. In the grid-point selection task, English prompts biased selections toward the bottom-left coordinate, whereas Chinese prompts concentrated the probability mass on…
Figure 7: Cross-Modal Comparison of Bias Intensity (…
Figure 8: Radar chart of RI score across cognitive dimensions and subcategories.
… locks on specific geometric tokens, as reflected by the JS divergence distributions in …
Figure 9: JS divergence: FR Photo vs. EN
[Plot: mean JS divergence with bootstrap 95% CI, plus per-model JS panels, for CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, and QWEN-3.6-PLUS.]
Figure 10: JS divergence: FR Text vs. EN
The introduction of Greek alphabet primitives targeted the models' localized text-frequency priors, which are characterized by highly non-uniform distributions across scientific and linguistic pretraining corpora. This regime revealed a striking cross-model resonance, particularly with the γ token, as quantified by …
Figure 11: JS divergence: ES Photo vs. EN
[Plot: mean JS divergence with bootstrap 95% CI, plus per-model JS panels, for CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, and QWEN-3.6-PLUS.]
Figure 12: JS divergence: ES Text vs. EN
… cognitive template within the foundational layers of these models. Conversely, GPT 5.1 and Qwen 3.6 Plus exhibited divergent vocabulary biases, with GPT 5.1 predominantly favoring the β token and Qwen 3.6 Plus demonstrating a significant 40% selection rate for the α token. Such divergence underscores that implicit biases are finely modulated by the unique token-frequency top…
Figure 13: JS divergence: ZH Photo vs. EN
[Plot: mean JS divergence with bootstrap 95% CI, plus per-model JS panels, for CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, and QWEN-3.6-PLUS.]
Figure 14: JS divergence: ZH Text vs. EN
… candidates were rendered entirely uninterpretable and devoid of natural language logic, the models failed to maintain the maximum entropy baseline of a uniform distribution (see …
Figure 15: Data distribution between English and Chinese
Figure 16: JS divergence: Geometry labels
Figure 17: JS divergence: Greek labels
Figure 18: JS divergence: Random labels
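
Several of the figures above report Jensen-Shannon (JS) divergence between selection distributions, with a bootstrap 95% CI on the mean. As a rough guide to what those quantities are, here is a minimal Python sketch. It assumes base-2 logarithms (so JS lies in [0, 1]) and a percentile bootstrap over per-task JS values; neither choice is confirmed by this page, and the sample numbers are invented.

```python
import math
import random

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2, assumed)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]

# Hypothetical selection distributions for one task under two prompt languages.
english = [0.70, 0.10, 0.10, 0.10]
french = [0.10, 0.70, 0.10, 0.10]
print(js_divergence(english, french))

# Hypothetical per-task JS values for one model.
per_task_js = [0.05, 0.10, 0.30, 0.20, 0.15]
print(bootstrap_mean_ci(per_task_js))
```

A JS divergence of 0 means the two conditions produce the same selection distribution; larger values mean the preferred options shift with the language or label set.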
Original abstract

Current evaluations of Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, in which MLLMs fail to maintain uniform randomness under explicit random instructions: top-1 probabilities reach 97% against the ideal 25% (one-quarter) baseline, and RI drops to 0.068 for Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces RandomBench, a benchmark for assessing whether MLLMs maintain distributionally neutral (uniform random) behavior when selecting among options with similar utility under explicit random instructions. It defines three metrics (RI, BCI, BII) to quantify entropy and bias, reports pervasive 'stochastic collapse' with top-1 probabilities reaching 97% (vs. ideal 25% baseline) and RI as low as 0.068 (Claude Sonnet 4.6), and shows the effect persists across languages and input formats via ablation studies.

Significance. If the benchmark options are verifiably equivalent, the work identifies a practically relevant limitation of current MLLMs in logic-neutral decision settings and supplies concrete metrics plus a reusable testbed. The cross-model, cross-language, and cross-format ablations constitute a strength, as does the explicit focus on entropy rather than utility-driven performance.

major comments (1)
  1. [Abstract and benchmark construction section] The central claim that observed high top-1 probabilities and low RI constitute 'stochastic collapse' (rather than learned priors) requires that RandomBench options have verifiably identical utility. The manuscript states options have 'similar utility' but supplies no human equivalence ratings, pairwise utility controls, or ablation that rules out frequency/sequence biases acquired during training; without such validation the reported deviations are ambiguous.
minor comments (2)
  1. [Metrics section] Clarify the exact definition and normalization of RI, BCI, and BII (including any edge-case handling for deterministic outputs) in the metrics section.
  2. [Experiments section] Add the number of trials per task and per model to the experimental setup so that the reported top-1 probabilities and RI values can be assessed for statistical stability.
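
To illustrate why the trial count in the second minor comment matters, the following Python sketch computes a Wilson score interval for an observed top-1 proportion. It is a generic statistical illustration, not something taken from the paper: the trial counts are hypothetical, trials are assumed independent, and z = 1.96 corresponds to an approximate 95% interval.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# The same observed top-1 rate of 97% at two hypothetical trial counts.
for successes, trials in [(97, 100), (970, 1000)]:
    lo, hi = wilson_interval(successes, trials)
    print(trials, round(lo, 3), round(hi, 3))
```

The interval narrows as the number of trials grows, so a reported 97% is much easier to interpret when the per-task and per-model trial counts are stated alongside it.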

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about verifying option equivalence to support the stochastic collapse claim is substantive, and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract and benchmark construction section] The central claim that observed high top-1 probabilities and low RI constitute 'stochastic collapse' (rather than learned priors) requires that RandomBench options have verifiably identical utility. The manuscript states options have 'similar utility' but supplies no human equivalence ratings, pairwise utility controls, or ablation that rules out frequency/sequence biases acquired during training; without such validation the reported deviations are ambiguous.

    Authors: We agree that the manuscript does not supply human equivalence ratings, pairwise utility controls, or a dedicated ablation isolating frequency/sequence biases from training. Options in RandomBench were constructed to be equivalent by design (interchangeable choices under explicit random instructions with no distinguishing utility features), and cross-language/cross-format ablations provide indirect evidence against certain priors. However, these steps do not fully eliminate the ambiguity the referee identifies. We will revise the benchmark construction section to add explicit discussion of this limitation and report a small human study confirming participant-rated equivalence among options. revision: yes
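
One simple control for the sequence bias raised in this exchange is to shuffle option order on every trial and tally selections both by position and by content. The Python sketch below is a hypothetical illustration, not the authors' ablation: `choose` stands in for a model call that returns the index of the selected option, and the two stub choosers are invented to show the two extreme cases.

```python
import random
from collections import Counter

def tally_with_shuffling(options, choose, trials, seed=0):
    """Shuffle option order each trial; count picks by position and by content."""
    rng = random.Random(seed)
    by_content = Counter()
    by_position = Counter()
    for _ in range(trials):
        order = list(options)
        rng.shuffle(order)
        picked = choose(order)          # index into the shuffled order
        by_position[picked] += 1
        by_content[order[picked]] += 1
    return by_content, by_position

def always_first(order):
    """Stub chooser with a pure position preference."""
    return 0

def always_c(order):
    """Stub chooser with a pure content preference."""
    return order.index("C")

options = ["A", "B", "C", "D"]
print(tally_with_shuffling(options, always_first, trials=400))
print(tally_with_shuffling(options, always_c, trials=400))
```

A position-driven chooser concentrates mass in `by_position` while `by_content` spreads out, and a content-driven chooser does the reverse. This separates two sources of skew but does not by itself show that the options have equal utility.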

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential reductions

Full rationale

The paper proposes RandomBench and three metrics (RI, BCI, BII) to measure MLLM behavior under random instructions, then reports experimental results on top-1 probabilities and RI values. No equations, fitted parameters, or derivations are presented that reduce a claimed prediction to its own inputs by construction. The work is a measurement study whose central claims rest on direct observation of model outputs rather than any self-definitional, fitted-input, or self-citation load-bearing step. The assumption that benchmark options have identical utility is a validity concern for interpretation but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No free parameters or invented entities are described. The work rests on the domain assumption that the selected tasks have truly equivalent options.

axioms (1)
  • domain assumption Selected tasks present options with identical utility and no model-intrinsic preference ordering.
    Required for the claim that deviation from uniform is collapse rather than legitimate preference.

pith-pipeline@v0.9.1-grok · 5726 in / 1019 out tokens · 14385 ms · 2026-06-28T01:27:23.250376+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems

    cs.LG 2026-06 unverdicted novelty 7.0

    Contagion Networks framework measures evaluator bias propagation in 3-agent LLM systems using the same base model, reporting gamma values of 0.157-0.352 and a 72.4% reduction in contagion when increasing evaluator com...

  2. Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems

    cs.LG 2026-06 unverdicted novelty 6.0

    Introduces Contagion Networks framework and measures preference propagation in 3-agent LLM setups, finding architectural priors dominate prompts, topology affects spread, and larger committees reduce contagion by ~69%.
