Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

Boyang Wang; Hongcheng Guo; Houtao Zhang; Huiyuan Zheng; Qingyi Si

arxiv: 2606.05874 · v1 · pith:YMO22FR2new · submitted 2026-06-04 · 💻 cs.CL

Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

Pith reviewed 2026-06-28 01:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords stochastic collapse · multimodal large language models · RandomBench · randomness index · distributional bias · implicit bias · MLLM evaluation

The pith

Multimodal large language models exhibit stochastic collapse when asked to choose randomly among equivalent options.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that MLLMs do not produce uniform random outputs even when they are explicitly instructed to do so and the options are equally valid. It introduces RandomBench, together with metrics for randomness and bias, and shows that models heavily favor particular choices: the top option reaches a probability of 97% where a uniform choice among four options would give 25%. This behavior, called stochastic collapse, persists across languages and input types. A sympathetic reader would care because it implies reduced diversity in applications that require neutral selection among valid alternatives.

Core claim

The central discovery is that MLLMs fail to maintain uniform randomness under explicit random instructions. Experiments show top-1 probabilities reaching 97% against an ideal baseline of one quarter, and RI dropping to 0.068 for Claude Sonnet 4.6. Ablation studies confirm that these deviations persist across languages and representation formats.

What carries the argument

RandomBench, a benchmark for testing distributional neutrality in MLLMs, along with the metrics RI, BCI, and BII that quantify entropy and distributional bias.
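As a rough illustration of what an entropy-based score measures, the sketch below computes the top-1 probability and a Shannon entropy normalized by log K for a list of sampled choices. The exact definitions of RI, BCI, and BII are not reproduced on this page, so `normalized_entropy` is an assumed stand-in rather than the authors' RI, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def top1_probability(choices):
    """Share of trials that went to the most frequently chosen option."""
    counts = Counter(choices)
    return max(counts.values()) / len(choices)


def normalized_entropy(choices, options):
    """Shannon entropy of the empirical choice distribution divided by log(K).

    1.0 means perfectly uniform over the K options; 0.0 means one option
    was chosen every time. Assumed stand-in, not the paper's RI formula.
    """
    counts = Counter(choices)
    n = len(choices)
    entropy = 0.0
    for option in options:
        p = counts.get(option, 0) / n
        if p > 0:
            entropy -= p * math.log(p)
    return entropy / math.log(len(options))


options = ["A", "B", "C", "D"]
collapsed = ["A"] * 97 + ["B", "C", "D"]  # hypothetical data: 97% on one option
uniform = ["A", "B", "C", "D"] * 25       # hypothetical data: 25% on each option

print(top1_probability(collapsed), normalized_entropy(collapsed, options))
print(top1_probability(uniform), normalized_entropy(uniform, options))
```

Under this assumed normalization, a 97/1/1/1 split scores about 0.12, far from the 1.0 of a uniform split. The paper's reported RI of 0.068 may come from a different formula or aggregation, so the two numbers should not be compared directly.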

Load-bearing premise

The tasks in RandomBench have options that are truly equivalent with no legitimate reason for the model to prefer one over another.

What would settle it

A model producing selections distributed uniformly at 25% each across many trials on RandomBench tasks would falsify the claim of stochastic collapse.
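One hedged way to make "distributed uniformly" testable is a chi-square goodness-of-fit test against a uniform split. The sketch below is an illustration and not the paper's procedure; the function names and sample data are hypothetical, and the p-value formula is the closed form for exactly four options (three degrees of freedom).

```python
import math
from collections import Counter


def chi_square_vs_uniform(choices, options):
    """Pearson chi-square statistic of observed counts against a uniform split."""
    counts = Counter(choices)
    expected = len(choices) / len(options)
    return sum((counts.get(o, 0) - expected) ** 2 / expected for o in options)


def p_value_four_options(statistic):
    """Upper-tail probability for chi-square with 3 degrees of freedom.

    Closed form valid only for K = 4 options (df = K - 1 = 3).
    """
    half = statistic / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * statistic / math.pi) * math.exp(-half)


options = ["A", "B", "C", "D"]
collapsed = ["A"] * 97 + ["B", "C", "D"]  # hypothetical data
uniform = ["A", "B", "C", "D"] * 25       # hypothetical data

for name, sample in [("collapsed", collapsed), ("uniform", uniform)]:
    stat = chi_square_vs_uniform(sample, options)
    print(name, stat, p_value_four_options(stat))
```

A large p-value means the sample is compatible with uniform choice; it does not prove uniformity, and the power of the test depends on the number of trials.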

Figures

Figures reproduced from arXiv: 2606.05874 by Boyang Wang, Hongcheng Guo, Houtao Zhang, Huiyuan Zheng, Qingyi Si.

**Figure 1.** Overview of the RandomBench Framework. Surrounding text: Results reveal a consistent failure mode that we term stochastic collapse: under semantic equivalence, MLLMs often fail to sample from a near-uniform distribution and instead concentrate probability mass on a small subset of choices. This collapse appears in both text-only and vision-language settings, with the latter further exhibiting visual hijacking, wh…

**Figure 2.** Curation pipeline of the RandomBench Framework.

**Figure 3.** Statistics of the RandomBench Framework.

**Figure 4.** Multi-granularity evaluation of stochastic collapse on RandomBench. (a) Radar chart of RI score (EN) across cognitive dimensions and subcategories. (b) Cross-modal distribution of the RI score (EN). (c) Relationship between model capability and randomness consistency, revealing that stronger and more aligned models exhibit more severe stochastic collapse.

**Figure 5.** Ablation studies on implicit multimodal bias.

**Figure 6.** Image Generation Prompt. Surrounding text: Spatial geometric preferences and coordinate mappings were also strongly modulated by instructional language, often producing opposite decision patterns within the same model. In the grid-point selection task, English prompts biased selections toward the bottom-left coordinate, whereas Chinese prompts concentrated the probability mass on…

**Figure 7.** Cross-Modal Comparison of Bias Intensity (…

**Figure 8.** Radar chart of RI score across cognitive dimensions and subcategories.

**Figure 9.** JS divergence: FR Photo vs. EN. Panel title: Mean JS divergence (bootstrap 95% CI). Models shown: CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, QWEN-3.6-PLUS.

**Figure 10.** JS divergence: FR Text vs. EN. Surrounding text: The introduction of Greek alphabet primitives targeted the models' localized text-frequency priors, which are characterized by highly non-uniform distributions across scientific and linguistic pretraining corpora. This regime revealed a striking cross-model resonance, particularly with the γ token, as quantified by …

**Figure 11.** JS divergence: ES Photo vs. EN. Panel title: Mean JS divergence (bootstrap 95% CI). Models shown: CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, QWEN-3.6-PLUS.

**Figure 12.** JS divergence: ES Text vs. EN. Surrounding text: Conversely, GPT 5.1 and Qwen 3.6 Plus exhibited divergent vocabulary biases, with GPT 5.1 predominantly favoring the β token and Qwen 3.6 Plus demonstrating a significant 40% selection rate for the α token. Such divergence underscores that implicit biases are finely modulated by the unique token-frequency top…

**Figure 13.** JS divergence: ZH Photo vs. EN. Panel title: Mean JS divergence (bootstrap 95% CI). Models shown: CLAUDE-SONNET-4.6, DOUBAO-SEED-1.6, GEMINI-3.1-FLASH-LITE, GPT-5.1, GROK-4-FAST, KIMI-K2.5, QWEN-3.6-PLUS.

**Figure 14.** JS divergence: ZH Text vs. EN. Surrounding text: … candidates were rendered entirely uninterpretable and devoid of natural language logic, the models failed to maintain the maximum entropy baseline of a uniform distribution …

**Figure 15.** Data distribution between English and Chinese.

**Figure 16.** JS divergence: Geometry labels.

**Figure 17.** JS divergence: Greek labels.

**Figure 18.** JS divergence: Random labels.
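Several of the figure captions above report Jensen-Shannon (JS) divergence between choice distributions, for example a translated prompt versus the English prompt. A minimal sketch of that quantity follows; the base-2 logarithm (which bounds the value in [0, 1]) is an assumption, since the paper's convention is not stated on this page, and the example distributions are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    `p` and `q` are equal-length probability lists over the same options.
    Uses log base 2, so 0.0 means identical and 1.0 means disjoint support.
    """
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero mass contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


english = [0.70, 0.10, 0.10, 0.10]  # hypothetical choice shares under an EN prompt
chinese = [0.10, 0.10, 0.10, 0.70]  # hypothetical choice shares under a ZH prompt

print(js_divergence(english, english))
print(js_divergence(english, chinese))
```

JS divergence is symmetric and stays finite even when one distribution puts zero mass on an option, which makes it convenient for comparing collapsed distributions.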

Original abstract

Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper documents MLLMs defaulting to one choice under random instructions in neutral tasks, but the claim hinges on an unverified assumption that options truly have equal utility.

This paper's main point is that MLLMs do not produce uniform random selections even when explicitly told to choose among equivalent options. The authors introduce RandomBench to test this in logic-neutral settings and define three metrics (RI, BCI, BII) to measure deviation from uniform distributions, with results showing top-1 probabilities up to 97% and RI as low as 0.068 (Claude Sonnet 4.6).

What is new is the benchmark and those specific metrics aimed at stochastic behavior rather than utility-driven performance. The experiments cover multiple MLLMs with ablations on languages and formats, which helps establish that the pattern is not superficial. This setup does identify a practical issue for applications like varied recommendations or schedules where coverage of valid options matters.

The soft spot is the assumption that benchmark options have identical utility with no hidden model preferences. The abstract notes "similar utility" but provides no validation such as human equivalence ratings or controls for training-data biases like number or sequence preferences. If those priors exist, the observed collapse is expected behavior rather than a failure of stochasticity, which undercuts the interpretation. The stress-test concern lands here because nothing in the provided details rules it out.

The work is aimed at researchers doing MLLM evaluation, especially around neutrality and generation diversity. A reader focused on benchmarks would find the metrics and setup worth examining. The methods and data construction need referee scrutiny to check the equivalence controls and statistical robustness.

I would send it to peer review so the assumption can be tested directly and the metrics refined if needed.

Referee Report

1 major / 2 minor

Summary. The paper introduces RandomBench, a benchmark for assessing whether MLLMs maintain distributionally neutral (uniform random) behavior when selecting among options with similar utility under explicit random instructions. It defines three metrics (RI, BCI, BII) to quantify entropy and bias, reports pervasive 'stochastic collapse' with top-1 probabilities reaching 97% (vs. ideal 25% baseline) and RI as low as 0.068 (Claude Sonnet 4.6), and shows the effect persists across languages and input formats via ablation studies.

Significance. If the benchmark options are verifiably equivalent, the work identifies a practically relevant limitation of current MLLMs in logic-neutral decision settings and supplies concrete metrics plus a reusable testbed. The cross-model, cross-language, and cross-format ablations constitute a strength, as does the explicit focus on entropy rather than utility-driven performance.

major comments (1)

[Abstract and benchmark construction section] The central claim that observed high top-1 probabilities and low RI constitute 'stochastic collapse' (rather than learned priors) requires that RandomBench options have verifiably identical utility. The manuscript states options have 'similar utility' but supplies no human equivalence ratings, pairwise utility controls, or ablation that rules out frequency/sequence biases acquired during training; without such validation the reported deviations are ambiguous.
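One way an ablation for sequence bias could look is a balanced permutation design: present every ordering of the option labels once and tally selections both by label and by position. The sketch below is a hypothetical illustration, not something the paper describes; `choose` stands in for a model call, and the two toy choosers exist only to show what each kind of bias looks like in the tallies.

```python
from collections import Counter
from itertools import permutations


def tally_choices(labels, choose):
    """Present every ordering of `labels` once and tally what `choose` picks.

    `choose` is a hypothetical stand-in for a model call: it receives the
    presented order (a tuple) and returns the index of the selected slot.
    """
    by_label, by_position = Counter(), Counter()
    for order in permutations(labels):
        index = choose(order)
        by_label[order[index]] += 1
        by_position[index] += 1
    return by_label, by_position


def always_first(order):
    """Toy chooser with a pure position bias: always takes the first slot."""
    return 0


def prefers_gamma(order):
    """Toy chooser with a pure label bias: always takes the 'gamma' label."""
    return order.index("gamma")


labels = ["alpha", "beta", "gamma", "delta"]
print(tally_choices(labels, always_first))
print(tally_choices(labels, prefers_gamma))
```

A pure position bias shows up as a flat label tally with a collapsed position tally, and a pure label bias shows the reverse. Neither pattern settles whether the options have equal utility, which still needs the human ratings the referee asks for.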

minor comments (2)

[Metrics section] Clarify the exact definition and normalization of RI, BCI, and BII (including any edge-case handling for deterministic outputs) in the metrics section.
[Experiments section] Add the number of trials per task and per model to the experimental setup so that the reported top-1 probabilities and RI values can be assessed for statistical stability.
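To see why the trial count matters for the second comment, the sketch below computes a Wilson score interval for an observed top-1 proportion. It is an illustration under assumed trial counts (the paper's actual counts are not given on this page), and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical trial counts with the same observed top-1 share of 97%.
for successes, trials in [(97, 100), (970, 1000)]:
    low, high = wilson_interval(successes, trials)
    print(trials, round(low, 3), round(high, 3))
```

With 97 of 100 trials on one option, the interval runs from roughly 0.92 to 0.99; at 1000 trials it narrows considerably. Even the lower bound is far above 25%, so the qualitative finding would be stable at modest trial counts, but comparisons between models with similar scores would need more trials.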

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The concern about verifying option equivalence to support the stochastic collapse claim is substantive, and we address it directly below.

Referee: [Abstract and benchmark construction section] The central claim that observed high top-1 probabilities and low RI constitute 'stochastic collapse' (rather than learned priors) requires that RandomBench options have verifiably identical utility. The manuscript states options have 'similar utility' but supplies no human equivalence ratings, pairwise utility controls, or ablation that rules out frequency/sequence biases acquired during training; without such validation the reported deviations are ambiguous.

Authors: We agree that the manuscript does not supply human equivalence ratings, pairwise utility controls, or a dedicated ablation isolating frequency/sequence biases from training. Options in RandomBench were constructed to be equivalent by design (interchangeable choices under explicit random instructions with no distinguishing utility features), and cross-language/cross-format ablations provide indirect evidence against certain priors. However, these steps do not fully eliminate the ambiguity the referee identifies. We will revise the benchmark construction section to add explicit discussion of this limitation and report a small human study confirming participant-rated equivalence among options.

Revision: yes

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential reductions

The paper proposes RandomBench and three metrics (RI, BCI, BII) to measure MLLM behavior under random instructions, then reports experimental results on top-1 probabilities and RI values. No equations, fitted parameters, or derivations are presented that reduce a claimed prediction to its own inputs by construction. The work is a measurement study whose central claims rest on direct observation of model outputs rather than any self-definitional, fitted-input, or self-citation load-bearing step. The assumption that benchmark options have identical utility is a validity concern for interpretation but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities are described. The work rests on the domain assumption that the selected tasks have truly equivalent options.

axioms (1)

Domain assumption: Selected tasks present options with identical utility and no model-intrinsic preference ordering.
Required for the claim that deviation from uniform is collapse rather than legitimate preference.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
cs.LG 2026-06 unverdicted novelty 7.0

Contagion Networks framework measures evaluator bias propagation in 3-agent LLM systems using the same base model, reporting gamma values of 0.157-0.352 and a 72.4% reduction in contagion when increasing evaluator com...

Contagion Networks: Evaluator Preference Propagation in Multi-Agent LLM Systems
cs.LG 2026-06 unverdicted novelty 6.0

Introduces Contagion Networks framework and measures preference propagation in 3-agent LLM setups, finding architectural priors dominate prompts, topology affects spread, and larger committees reduce contagion by ~69%.

Reference graph

Works this paper leans on

83 extracted references · 3 canonical work pages · cited by 1 Pith paper

[1]

, author=

Studies of interference in serial verbal reactions. , author=. Journal of experimental psychology , volume=. 1935 , publisher=

1935
[2]

The Role of System 1 and System 2 Semantic Memory Structure in Human and

Abramski, Katherine and Rossetti, Giulio and Stella, Massimo , journal =. The Role of System 1 and System 2 Semantic Memory Structure in Human and
[3]

Proceedings of the National Academy of Sciences , volume =

Explicitly unbiased large language models still form biased associations , author =. Proceedings of the National Academy of Sciences , volume =
[4]

Advances in Neural Information Processing Systems , volume =

Understanding information storage and transfer in multi-modal large language models , author =. Advances in Neural Information Processing Systems , volume =
[5]

Nature Reviews Psychology , volume =

Dual-process theory and decision-making in large language models , author =. Nature Reviews Psychology , volume =
[6]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =

Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective , author =. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages =

2024
[7]

Preprints.org , doi =

A Survey on Selection Bias in Large Language Models , author =. Preprints.org , doi =
[8]

Huang, Jen-tse and Qin, Jiaxu and Zhang, Jing and others , booktitle =
[9]

Li, Yuchen and Fan, Zhen and Chen, Ruizhe and others , booktitle =
[10]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Language model probabilities are not calibrated in numeric contexts , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =
[11]

Advances in Neural Information Processing Systems , volume =

Order-Independence Without Fine Tuning , author =. Advances in Neural Information Processing Systems , volume =
[12]

Locating and editing factual associations in

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle =. Locating and editing factual associations in
[13]

Sivakumar, Ashwin and Zhang, Allen and Hakim, Zaid and others , booktitle =
[14]

Advances in Neural Information Processing Systems , volume =

Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting , author =. Advances in Neural Information Processing Systems , volume =
[15]

2026 , pages =

Wang, Sibo and Cao, Xiangkui and Zhang, Jie and Yuan, Zheng and Shan, Shiguang and Chen, Xilin and Gao, Wen , journal =. 2026 , pages =

2026
[16]

Wang, Jingyi and Li, Ming and Zhang, Hao and others , journal =
[17]

Ye, Wenqian and Liu, Bo and Zheng, Guangtao and others , booktitle =
[18]

When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in

Zhang, Zhuoran and Wang, Tengyue and Gong, Xilin and Shi, Yang and Wang, Haotian and Wang, Di and Hu, Lijie , journal =. When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in
[19]

Proceedings of the 38th International Conference on Machine Learning , pages =

Calibrate Before Use: Improving Few-shot Performance of Language Models , author =. Proceedings of the 38th International Conference on Machine Learning , pages =
[20]

Proceedings of the International Conference on Learning Representations , year =

Large language models are not robust multiple choice selectors , author =. Proceedings of the International Conference on Learning Representations , year =
[21]

Mitigating Selection Bias in Large Language Models via Permutation-Aware

Zheng, Jinquan and Yuan, Jia and Yao, Jiacheng and Gu, Chenyang and Zheng, Pujun and He, Guoxiu , booktitle =. Mitigating Selection Bias in Large Language Models via Permutation-Aware. 2026 , note =

2026
[22]

2025 , howpublished =

2025
[23]

2026 , howpublished =

Claude Sonnet 4.6: Hybrid Reasoning Model , author =. 2026 , howpublished =

2026
[24]

2025 , howpublished =

Gemini 3.1 Flash-Lite: Built for Intelligence at Scale , author =. 2025 , howpublished =

2025
[26]

2026 , howpublished =

2026
[27]

2025 , howpublished =

Grok 4 Fast: Cost-Efficient Reasoning at Scale , author =. 2025 , howpublished =

2025
[28]

2025 , howpublished =

Introduction to Techniques Used in. 2025 , howpublished =

2025
[29]

2020 , eprint=

StereoSet: Measuring stereotypical bias in pretrained language models , author=. 2020 , eprint=

2020
[30]

2024 , eprint=

Measuring Implicit Bias in Explicitly Unbiased Large Language Models , author=. 2024 , eprint=

2024
[31]

2025 , eprint=

Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens , author=. 2025 , eprint=

2025
[32]

2026 , eprint=

Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts , author=. 2026 , eprint=

2026
[33]

2026 , eprint=

When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models , author=. 2026 , eprint=

2026
[37]

2018 , eprint=

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , author=. 2018 , eprint=

2018
[38]

2025 , eprint=

Agent Lightning: Train ANY AI Agents with Reinforcement Learning , author=. 2025 , eprint=

2025
[39]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[40]

2025 , eprint=

Distributionally Robust Graph Out-of-Distribution Recommendation via Diffusion Model , author=. 2025 , eprint=

2025
[41]

2024 , eprint=

Fairness and Diversity in Recommender Systems: A Survey , author=. 2024 , eprint=

2024
[43]

Findings of the association for computational linguistics: ACL 2022 , pages=

Chartqa: A benchmark for question answering about charts with visual and logical reasoning , author=. Findings of the association for computational linguistics: ACL 2022 , pages=

2022
[44]

Shannon, C. E. , journal=. A mathematical theory of communication , year=
[45]

Katherine Abramski, Giulio Rossetti, and Massimo Stella. 2026. The role of system 1 and system 2 semantic memory structure in human and LLM biases. arXiv preprint arXiv:2604.12816

Pith/arXiv arXiv 2026
[46]

Alibaba Cloud . 2026. Qwen 3.6 technical blog. https://qwen.ai/blog?id=qwen3.6. Accessed: 2026-05-24

2026
[47]

Anthropic . 2026. Claude sonnet 4.6: Hybrid reasoning model. https://www.anthropic.com/claude/sonnet. Accessed: 2026-05-24

2026
[48]

Griffiths

Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths. 2024. https://arxiv.org/abs/2402.04105 Measuring implicit bias in explicitly unbiased large language models . Preprint, arXiv:2402.04105

arXiv 2024
[49]

Griffiths

Xuechunzi Bai, Angelina Wang, Ilia Sucholutsky, and Thomas L. Griffiths. 2025. Explicitly unbiased large language models still form biased associations. Proceedings of the National Academy of Sciences, 122(8):e2416228122

2025
[50]

Samyadeep Basu, Michael Grayson, Cecily Morrison, and 1 others. 2024. Understanding information storage and transfer in multi-modal large language models. In Advances in Neural Information Processing Systems, volume 37, pages 7400--7426

2024
[51]

Ward, and David P

Oliver Brady, Paul Nulty, Li Zhang, Tomas E. Ward, and David P. McGovern. 2025. Dual-process theory and decision-making in large language models. Nature Reviews Psychology, 4:777--792

2025
[52]

ByteDance Seed Team . 2025. Introduction to techniques used in Seed1.6 . https://seed.bytedance.com/en/blog/introduction-to-techniques-used-in-seed1-6. Accessed: 2026-05-24

2025
[53]

Meiqi Chen, Yixin Cao, Yan Zhang, and 1 others. 2024. Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16449--16469

2024
[54]

Google DeepMind . 2025. Gemini 3.1 flash-lite: Built for intelligence at scale. https://blog.google/technology/ai/gemini-3-1-flash-lite/. Accessed: 2026-05-24

2025
[55]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. https://arxiv.org/abs/1801.01290 Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor . Preprint, arXiv:1801.01290

Pith/arXiv arXiv 2018
[56]

Guoxiu He, Jinquan Zheng, and Fangqing Han. 2026. https://doi.org/10.20944/preprints202604.2234.v1 A survey on selection bias in large language models . Preprints.org

work page doi:10.20944/preprints202604.2234.v1 2026
[57]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

Pith/arXiv arXiv 2020
[58]

Chenhao Huang, Ziyu Shen, Yicong Ren, Huiyuan Zheng, Jiazheng Zhang, Mingxu Chai, Ming Zhang, Shihan Dou, Fan Mo, Jie Shi, and 1 others. 2025 a . Governance in motion: Co-evolution of constitutions and ai models for scalable safety. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17198--17221

2025
[59]

Jen-tse Huang, Jiaxu Qin, Jing Zhang, and 1 others. 2025 b . VisBias : Measuring explicit and implicit social biases in vision-language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17981--18004

2025
[60]

Kimi Team . 2026. https://arxiv.org/abs/2602.02276 Kimi k2.5: Visual agentic intelligence . arXiv preprint arXiv:2602.02276

Pith/arXiv arXiv 2026
[61]

Yuchen Li, Zhen Fan, Ruizhe Chen, and 1 others. 2025. FairSteer : Inference time debiasing for LLMs with dynamic activation steering. In Findings of the Association for Computational Linguistics: ACL 2025, pages 11293--11312

2025
[62]

Charles Lovering, Michael Krumdick, Viet Dac Lai, Nilesh Reddy, and Greg Durrett. 2025. Language model probabilities are not calibrated in numeric contexts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 29218--29257

2025
[63]

Qiu, and Yuqing Yang

Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, and Yuqing Yang. 2025. https://arxiv.org/abs/2508.03680 Agent lightning: Train any ai agents with reinforcement learning . Preprint, arXiv:2508.03680

arXiv 2025
[64]

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In Findings of the association for computational linguistics: ACL 2022, pages 2263--2279

2022
[65]

Reid McIlroy-Young, Katrina Brown, Conlan Olson, Linjun Zhang, and Cynthia Dwork. 2024. Order-independence without fine tuning. In Advances in Neural Information Processing Systems, volume 37, pages 72818--72839

2024
[66]

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems, volume 35, pages 17359--17372

2022
[67]

Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. https://arxiv.org/abs/2004.09456 Stereoset: Measuring stereotypical bias in pretrained language models . Preprint, arXiv:2004.09456

arXiv 2020
[68]

OpenAI . 2025. GPT -5.1: Next-generation model for developers. https://openai.com/index/gpt-5-1-for-developers/. Accessed: 2026-05-24

2025
[69]

Francesco Ortu, Zhijing Jin, Diego Doimo, and Alberto Cazzaniga. 2026. https://arxiv.org/abs/2507.13868 When seeing overrides knowing: Disentangling knowledge conflicts in vision-language models . Preprint, arXiv:2507.13868

Pith/arXiv arXiv 2026
[70]

C. E. Shannon. 1948. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x A mathematical theory of communication . The Bell System Technical Journal, 27(3):379--423

work page doi:10.1002/j.1538-7305.1948.tb01338.x 1948
[71]

Ashwin Sivakumar, Allen Zhang, Zaid Hakim, and 1 others. 2025. SteerVLM : Robust model control through lightweight activation steering for vision language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 23640--23665

2025
[72]

J Ridley Stroop. 1935. Studies of interference in serial verbal reactions. Journal of experimental psychology, 18(6):643

1935
[73]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems, volume 36, pages 74952--74965

2023
[74]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291

Pith/arXiv arXiv 2023
[75]

Jingyi Wang, Ming Li, Hao Zhang, and 1 others. 2026 a . V-FAT : Benchmarking visual fidelity against text-bias. arXiv preprint arXiv:2601.04897

arXiv 2026
[76]

Sibo Wang, Xiangkui Cao, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen, and Wen Gao. 2026 b . https://doi.org/10.1109/TPAMI.2026.3683747 VLBiasBench : A comprehensive benchmark for evaluating bias in large vision-language model . IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1--14

work page doi:10.1109/tpami.2026.3683747 2026
[77]

xAI . 2025. Grok 4 fast: Cost-efficient reasoning at scale. https://x.ai/news/grok-4-fast. Accessed: 2026-05-24

2025
[78]

Hui Yang, Sifu Yue, and Yunzhong He. 2023. Auto-gpt for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224

arXiv 2023
[79]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

Pith/arXiv arXiv 2022
[80]

Wenqian Ye, Bo Liu, Guangtao Zheng, and 1 others. 2024. MM-SpuBench : Towards better understanding of spurious biases in multimodal LLMs . In Advances in Neural Information Processing Systems, volume 37

2024
[81]

Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, and Lijie Hu. 2025. When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in MLLMs . arXiv preprint arXiv:2511.02243

arXiv 2025
[82]

Chu Zhao, Enneng Yang, Yuliang Liang, Jianzhe Zhao, Guibing Guo, and Xingwei Wang. 2025. https://arxiv.org/abs/2501.15555 Distributionally robust graph out-of-distribution recommendation via diffusion model . Preprint, arXiv:2501.15555

arXiv 2025
[83]

Yuying Zhao, Yu Wang, Yunchao Liu, Xueqi Cheng, Charu Aggarwal, and Tyler Derr. 2024. https://arxiv.org/abs/2307.04644 Fairness and diversity in recommender systems: A survey . Preprint, arXiv:2307.04644

arXiv 2024
[84]

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, pages 12697--12706

[85]

Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2024. Large language models are not robust multiple choice selectors. In Proceedings of the International Conference on Learning Representations

