pith. machine review for the scientific record.

arxiv: 2604.13403 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

Pith reviewed 2026-05-10 13:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal in-context learning · vision-language models · task mapping construction · task mapping transfer · cross-modal alignment · few-shot adaptation · layer-wise analysis

The pith

Multimodal in-context learning matches text-only performance in zero-shot settings but degrades sharply with few-shot demonstrations because models fail to build and transfer aligned task mappings across vision and language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why adding visual inputs to in-context learning does not deliver the same gains seen in text-only models. It finds that performance stays comparable without examples but drops once a handful of image-text demonstrations are provided. The authors break the process into two stages—constructing a task mapping from the examples and then applying that mapping to the query—and track these stages layer by layer. Their analysis shows the bottleneck occurs because visual and textual representations lack alignment at the reasoning level, so the learned mapping does not transfer reliably to new queries. This insight matters for anyone trying to adapt large multimodal models to new tasks using only a few examples at inference time.

Core claim

Using identical task formulations across modalities, multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. The models lack reasoning-level alignment between visual and textual representations and fail to reliably transfer learned task mappings to queries. Guided by this layer-wise decomposition into task mapping construction and task mapping transfer, a simple inference-stage enhancement that reinforces the transfer step improves results.
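
To make the zero-shot versus few-shot protocol concrete, here is a minimal sketch of matched prompt builders for the outlier-detection task. The template wording, message format, and field names are editorial assumptions for illustration, not the paper's released prompts.

```python
# Hypothetical sketch of the "identical task formulation" protocol: the same
# outlier-detection task rendered as a text-only prompt and as an interleaved
# image-text prompt. All template strings here are assumptions.

def text_demo(items: list[str], answer: str) -> str:
    return f"Items: {', '.join(items)}. Outlier: {answer}\n"

def build_text_prompt(demos, query_items, n_shots=0):
    """Zero-shot when n_shots=0; few-shot prepends n_shots solved examples."""
    prompt = "Find the item that does not match the others.\n"
    for items, answer in demos[:n_shots]:
        prompt += text_demo(items, answer)
    prompt += f"Items: {', '.join(query_items)}. Outlier:"
    return prompt

def build_multimodal_prompt(demos, query_image, n_shots=0):
    """Same task, but each example is an image plus its label, returned as an
    interleaved message list in the style most MLLM chat APIs accept."""
    content = [{"type": "text",
                "text": "Find the item in the image that does not match the others.\n"}]
    for image, answer in demos[:n_shots]:
        content.append({"type": "image", "image": image})
        content.append({"type": "text", "text": f"Outlier: {answer}\n"})
    content.append({"type": "image", "image": query_image})
    content.append({"type": "text", "text": "Outlier:"})
    return [{"role": "user", "content": content}]

# The paper's headline comparison is then: accuracy at n_shots=0 is similar for
# both builders, while accuracy at n_shots=4 drops for the multimodal one.
```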

What carries the argument

The decomposition of multimodal ICL into task mapping construction (building the mapping from demonstrations) and task mapping transfer (applying it to the query), tracked across model layers.
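
A sketch of what this layer-wise tracking can look like in practice, assuming a Hugging Face-style MLLM loaded with eager attention so per-layer attention maps are returned; `label_positions` and `image_positions` are hypothetical token-index lists for the prompt segments, and this illustrates the measurement rather than reproducing the authors' released code.

```python
# A minimal sketch of the layer-wise probe illustrated in Figures 4-5: for each
# layer, measure how much attention mass the demonstration label tokens place
# on the image tokens.
import torch

@torch.no_grad()
def label_to_image_attention(model, inputs, label_positions, image_positions):
    """Return one scalar per layer: mean attention from demonstration label
    tokens to image tokens, averaged over heads and label positions."""
    outputs = model(**inputs, output_attentions=True)
    per_layer = []
    for attn in outputs.attentions:        # one (batch, heads, seq, seq) map per layer
        a = attn[0].mean(dim=0)            # average over heads -> (seq, seq)
        mass = a[label_positions][:, image_positions].sum(dim=-1)  # per label token
        per_layer.append(mass.mean().item())
    return per_layer                       # plotted against layer index, as in Figure 5
```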

If this is right

  • The performance gap appears only when demonstrations are supplied and is absent in pure zero-shot use.
  • A lightweight inference-time intervention that strengthens task mapping transfer measurably reduces the degradation without retraining.
  • Current models succeed at zero-shot multimodal adaptation but cannot yet leverage example-based adaptation as effectively as text-only models.
  • Layer-wise inspection reveals that the transfer failure occurs after the construction stage, pinpointing where cross-modal information is lost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future pretraining could add explicit objectives that force visual and textual reasoning representations to occupy a shared space, potentially closing the few-shot gap at the source.
  • The same construction-versus-transfer lens could be applied to other multimodal adaptation regimes such as retrieval-augmented generation or chain-of-thought prompting with images.
  • If the transfer bottleneck is modality-specific, audio-language or video-language models may exhibit analogous few-shot drops that the same analysis would detect.

Load-bearing premise

The proposed split into task mapping construction and task mapping transfer, together with the layer-wise measurements, correctly isolates the true internal cause of the multimodal ICL gap rather than a side effect.

What would settle it

Train or fine-tune a multimodal model with an auxiliary objective that explicitly aligns reasoning-level features between vision and language on the same tasks, then measure whether the few-shot ICL degradation disappears while zero-shot performance stays unchanged.
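
As a hedged illustration of that settling experiment, a CLIP-style contrastive term applied to pooled mid-layer "reasoning" features could serve as the auxiliary alignment objective. The layer choice, pooling, temperature, and loss form are all editorial assumptions; the paper itself proposes no such training objective.

```python
# A sketch of an auxiliary loss that pulls a model's pooled visual and textual
# representations of the *same* task instance together at a chosen mid layer.
import torch
import torch.nn.functional as F

def reasoning_alignment_loss(vis_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over paired visual/textual features of shape
    (batch, dim), as in CLIP-style alignment but applied to mid-layer
    hidden states rather than encoder outputs."""
    v = F.normalize(vis_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    logits = v @ t.T / temperature                  # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Hypothetical training objective: total = lm_loss + lambda_align *
# reasoning_alignment_loss(v_mid, t_mid), with few-shot ICL re-evaluated after.
```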

Figures

Figures reproduced from arXiv: 2604.13403 by Sharon Li, Yu Wang.

Figure 1
Figure 1: (a–b) Illustration of the constructed outlier detection task (Chen et al., 2025a), where (a) shows the zero-shot setup and (b) shows the few-shot setup. In the few-shot scenario (2-shot), the model must infer from the demonstrations whether the query should be solved based on shape or color, and then apply this inferred rule to identify the outlier item in either the image or the sentence. (c) Performance comp…
Figure 2
Figure 2: Qualitative examples of error cases. (a) False task recognition: the model misreads the task mapping in the demonstrations and outputs the wrong attribute (shape star) while the true minority is color gray. (b) Correct task recognition but false answer: although the model identifies the correct task mapping (detecting the OOD sample by shape feature), it still predicts the wrong minority because it atten…
Figure 3
Figure 3: Proportion of error types in text-only ICL.
Figure 4
Figure 4: Layer-wise visualization of attention from demonstration label tokens (or the last token) to image tokens in a multimodal ICL example. The four demonstrations are labeled by the color outlier, but the model exhibits False Task Recognition on the query and incorrectly predicts star instead of gray. Red bounding boxes denote ground-truth evidence regions. Demonstration label tokens form clear object-l…
Figure 5
Figure 5: Layer-wise attention ratios over different image regions for correct (•) vs. incorrect (+) predictions. Relative attention from demonstration label tokens to correct, false, and irrelevant evidence regions. Both models show a strong peak on correct evidence at the mid-layer, indicating stable visual grounding within the demonstrations.
Figure 6
Figure 6: Comparison of Correct and Error Samples in …
Figure 7
Figure 7: An overview of datasets in our paper. For the datasets sourced from TrueMICL (Chen et al., 2025a) (Outlier Detection, Operator Induction, Clock Math), the label for the query requires the model to learn the relationship between images and text in the demos. Meanwhile, for the natural-image datasets (OK-VQA (Marino et al., 2019)), the model can leverage query-relevant demonstrations to enhance inference on …
Figure 8
Figure 8: Layer-wise relative attention per token of demonstration text and image tokens for different MLLM families under the 4-shot setup. Qwen2.5-VL-7B shows a consistently text-dominant pattern across all layers, while Gemma-3-12B displays a modality-switching pattern.
Figure 9
Figure 9: Layer-wise attention head activation patterns on image and text demonstrations under the 4-shot setup. Only a few heads show noticeable attention, while most heads remain inactive.
Figure 10
Figure 10: Additional layer-wise visualization of attention from demonstration label tokens (or the last token) to image tokens in a multimodal ICL example. The four demonstrations are labeled by the shape outlier, yet the model exhibits Correct Task Recognition but False Answer on the query, incorrectly predicting triangle instead of square. Red bounding boxes denote ground-truth evidence regions. Demonstration lab…
read the original abstract

In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that multimodal ICL in MLLMs matches text-only ICL in zero-shot settings but degrades markedly with few-shot demonstrations. It decomposes multimodal ICL into task-mapping construction (from demonstrations) and task-mapping transfer (to queries), then uses layer-wise activation and similarity analyses to attribute the gap to missing reasoning-level cross-modal alignment and unreliable transfer. Guided by these observations, the authors propose a simple inference-stage enhancement to reinforce transfer and release an analysis framework.

Significance. If the empirical patterns and decomposition hold under scrutiny, the work supplies concrete mechanistic hypotheses for why multimodal ICL underperforms and a lightweight practical fix, together with reproducible code. These elements could usefully inform both model architecture choices and future causal studies of cross-modal ICL.

major comments (2)
  1. [Layer-wise analysis and task-mapping decomposition (Sections 4–5)] The central attribution of the few-shot degradation to insufficient reasoning-level alignment rests on layer-wise similarity metrics and activation patterns without causal interventions (e.g., representation editing, targeted layer ablation, or counterfactual prompting). This makes it difficult to distinguish whether the reported alignment/transfer failures are causal mechanisms or downstream correlates of modality-specific tokenization and attention.
  2. [Experimental setup and results (Section 3)] The claim that multimodal ICL 'degrades significantly' under few-shot demonstrations is presented as robust, yet the manuscript provides no details on data splits, number of runs, or statistical tests for the zero-shot vs. few-shot gap; without these, it is unclear whether the observed difference is load-bearing or sensitive to post-hoc choices.
minor comments (2)
  1. The abstract states that 'identical task formulations across modalities' are used, but the precise construction of visual versus textual demonstrations (token counts, prompt templates, image preprocessing) is not summarized in the main text; a short table or paragraph would improve clarity.
  2. Figure captions for the layer-wise plots should explicitly state the similarity metric (e.g., cosine, CCA) and whether shaded regions represent standard error across tasks or seeds.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our experimental details and to clarify the scope of our mechanistic claims.

read point-by-point responses
  1. Referee: [Layer-wise analysis and task-mapping decomposition (Sections 4–5)] The central attribution of the few-shot degradation to insufficient reasoning-level alignment rests on layer-wise similarity metrics and activation patterns without causal interventions (e.g., representation editing, targeted layer ablation, or counterfactual prompting). This makes it difficult to distinguish whether the reported alignment/transfer failures are causal mechanisms or downstream correlates of modality-specific tokenization and attention.

    Authors: We agree that our layer-wise analyses rely on observational metrics (activation patterns and similarity scores) rather than causal interventions, and therefore cannot definitively establish that the observed alignment and transfer failures are the direct causal drivers of the performance gap rather than correlates of tokenization or attention differences. The patterns we report are consistent across multiple models, tasks, and layers, which supports our hypotheses, but we acknowledge the correlational nature of the evidence. In the revised manuscript we have added an explicit limitations paragraph in the discussion section noting this point and outlining how future work could employ representation editing or targeted ablations to test causality (a sketch of one such intervention appears after these responses). We have also softened the language in Sections 4–5 from “attribution” to “consistent evidence supporting the hypothesis that...” to avoid overclaiming. revision: partial

  2. Referee: [Experimental setup and results (Section 3)] The claim that multimodal ICL 'degrades significantly' under few-shot demonstrations is presented as robust, yet the manuscript provides no details on data splits, number of runs, or statistical tests for the zero-shot vs. few-shot gap; without these, it is unclear whether the observed difference is load-bearing or sensitive to post-hoc choices.

    Authors: We appreciate this observation. The original manuscript did not report these experimental controls. We have now expanded Section 3 and the appendix with: (i) explicit descriptions of the data splits and task sampling procedure, (ii) results averaged over five independent runs using different random seeds for demonstration selection, and (iii) paired t-test results showing that the zero-shot to few-shot degradation is statistically significant (p < 0.01) across all evaluated settings. These additions confirm that the reported gap is not sensitive to the particular choices described in the paper (a minimal sketch of the seed-level test appears below). revision: yes
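
Two illustrative sketches for the exchange above, both editorial assumptions rather than anything in the manuscript. First, the kind of causal intervention the referee requests in major comment 1: ablating a single decoder layer's output during the few-shot forward pass via a forward hook. The `model.model.layers` path follows common Hugging Face decoder layouts and is an assumption about the model class.

```python
# Zero out one transformer layer's output and rerun the few-shot query, to test
# whether the task-mapping transfer failure is causally tied to that layer.
import torch

def run_with_layer_ablation(model, inputs, layer_idx):
    layer = model.model.layers[layer_idx]   # assumed HF-style decoder layout
    def zero_output(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        patched = torch.zeros_like(hidden)  # crude ablation; patching in states
                                            # from a clean run is the gentler variant
        return (patched, *output[1:]) if isinstance(output, tuple) else patched
    handle = layer.register_forward_hook(zero_output)
    try:
        with torch.no_grad():
            return model(**inputs)
    finally:
        handle.remove()                     # always restore the unmodified model
```

Second, the seed-level significance check described in response 2, with placeholder accuracies standing in for per-seed results:

```python
# Paired t-test on per-seed accuracies for the zero-shot vs. few-shot
# conditions. The numbers below are placeholders, not the paper's results.
from scipy import stats

zero_shot = [0.81, 0.79, 0.80, 0.82, 0.78]   # accuracy per seed (hypothetical)
few_shot  = [0.64, 0.61, 0.66, 0.63, 0.62]

t_stat, p_value = stats.ttest_rel(zero_shot, few_shot)
print(f"paired t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.01
```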

Circularity Check

0 steps flagged

No circularity: empirical decomposition and layer-wise measurements are independent of fitted inputs or self-referential definitions

full rationale

The paper's core contribution is an empirical analysis of multimodal ICL performance gaps, using identical task formulations to compare zero-shot vs. few-shot behavior, followed by a decomposition into task mapping construction and transfer that is then measured via layer-wise activations and similarity metrics. No equations, parameter fits, or predictions are described that reduce to the inputs by construction. The proposed inference-stage enhancement is presented as a direct consequence of the observed measurements rather than a renamed fit. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical mechanistic study with no new theoretical axioms, free parameters, or invented entities; it relies on standard assumptions of transformer-based multimodal models and existing ICL evaluation protocols.

pith-pipeline@v0.9.0 · 5516 in / 1162 out tokens · 29199 ms · 2026-05-10T13:38:09.330571+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 39 canonical work pages · 3 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025. Qwen2.5-vl technical report. Preprint, arXiv:2502.13923. https://arxiv.org/abs/2502.13923

  2. [2]

    Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, and Benjamin Piwowarski. 2024. What makes multimodal in-context learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539--1550

  3. [4]

    Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, and Jindong Gu. 2024a. Can multimodal large language models truly perform multimodal in-context learning? Preprint, arXiv:2311.18021. https://arxiv.org/abs/2311.18021

  4. [5]

    Shuo Chen, Jianzhe Liu, Zhen Han, Yan Xia, Daniel Cremers, Philip Torr, Volker Tresp, and Jindong Gu. 2025a. True multimodal in-context learning needs attention to the visual context. Preprint, arXiv:2507.15807. https://arxiv.org/abs/2507.15807

  5. [6]

    Song Chen, Xinyu Guo, Yadong Li, Tao Zhang, Mingan Lin, Dongdong Kuang, Youwei Zhang, Lingfeng Ming, Fengyu Zhang, Yuran Wang, and 1 other. 2025b. Ocean-ocr: Towards general ocr application via a vision-language model. arXiv preprint arXiv:2501.15558

  6. [7]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, and 1 other. 2024b. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185--24198

  7. [8]

    Hakaze Cho, Mariko Kato, Yoshihiro Sakai, and Naoya Inoue. 2025. Revisiting in-context learning inference circuit in large language models. Preprint, arXiv:2410.04468. https://arxiv.org/abs/2410.04468

  8. [9]

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models. https://arxiv.org/abs/2212.10559

  9. [10]

    Sivan Doveh, Shaked Perek, M Jehanzeb Mirza, Amit Alfassy, Assaf Arbelle, Shimon Ullman, and Leonid Karlinsky. 2024. Towards multimodal in-context learning for vision & language models. arXiv preprint arXiv:2403.12736

  10. [11]

    Jun Gao, Qian Qiao, Ziqiang Cao, Zili Wang, and Wenjie Li. 2024. Aim: Let any multi-modal large language models embrace efficient in-context learning. Preprint, arXiv:2406.07588. https://arxiv.org/abs/2406.07588

  11. [12]

    Jun Gao, Qian Qiao, Tianxiang Wu, Zili Wang, Ziqiang Cao, and Wenjie Li. 2025. Aim: Let any multimodal large language models embrace efficient in-context learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, number 3, pages 3077--3085

  12. [13]

    Chi Han, Ziqi Wang, Han Zhao, and Heng Ji. 2023a. Explaining emergent in-context learning as kernel regression. arXiv preprint arXiv:2305.12766. https://arxiv.org/abs/2305.12766

  13. [14]

    Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, and Tianlu Wang. 2023b. Understanding in-context learning via supportive pretraining data. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12660--12673. https://arxiv.org/abs/2306.15091

  14. [15]

    Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, and Roei Herzig. 2024. Multimodal task vectors enable many-shot multimodal in-context learning. Advances in Neural Information Processing Systems, 37:22124--22153

  15. [16]

    Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, and Zsolt Kira. 2025. Mimicking or reasoning: Rethinking multi-modal in-context learning in vision-language models. arXiv preprint arXiv:2506.07936

  16. [17]

    Hong Jun Jeon, Jason D Lee, Qi Lei, and Benjamin Van Roy. 2024. An information-theoretic analysis of in-context learning. In Forty-first International Conference on Machine Learning. https://arxiv.org/abs/2401.15530

  17. [18]

    Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, and Shikun Zhang. 2024. Symdpo: Boosting in-context learning of large multimodal models with symbol demonstration direct preference optimization. arXiv preprint arXiv:2411.11909

  18. [19]

    Jannik Kossen, Yarin Gal, and Tom Rainforth. 2024. In-context learning learns label relationships but is not conventional learning. In The Twelfth International Conference on Learning Representations. https://arxiv.org/abs/2307.12375

  19. [20]

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. 2023a. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425

  20. [21]

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. 2023b. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726

  21. [22]

    Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, and Xu Yang. 2023c. How to configure good in-context sequence for visual question answering. arXiv preprint arXiv:2312.01571

  22. [23]

    Yanshu Li. 2025. Advancing multimodal in-context learning in large vision-language models with task-aware demonstrations. Preprint, arXiv:2503.04839. https://arxiv.org/abs/2503.04839

  23. [24]

    Yanshu Li, Jianjiang Yang, Zhennan Shen, Ligong Han, Haoyan Xu, and Ruixiang Tang. 2025a. Catp: Contextually adaptive token pruning for efficient and enhanced multimodal in-context learning. Preprint, arXiv:2508.07871. https://arxiv.org/abs/2508.07871

  24. [25]

    Yanshu Li, Jianjiang Yang, Ziteng Yang, Bozheng Li, Hongyang He, Zhengtao Yao, Ligong Han, Yingjie Victor Chen, Songlin Fei, Dongfang Liu, and Ruixiang Tang. 2025b. Cama: Enhancing multimodal in-context learning with context-aware modulated attention. Preprint, arXiv:2505.17097. https://arxiv.org/abs/2505.17097

  25. [26]

    Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, and Ruixiang Tang. 2025c. Taco: Enhancing multimodal in-context learning via task mapping-guided sequence configuration. Preprint, arXiv:2505.17098. https://arxiv.org/abs/2505.17098

  26. [27]

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. 2024. Llava-next: Improved reasoning, ocr, and world knowledge. https://llava-vl.github.io/blog/2024-01-30-llava-next/

  27. [28]

    Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, and Hanghang Tong. 2025. Seeing but not believing: Probing the disconnect between visual attention and answer correctness in vlms. Preprint, arXiv:2510.17771. https://arxiv.org/abs/2510.17771

  28. [29]

    Yang Luo, Zangwei Zheng, Zirui Zhu, and Yang You. 2024. How does the textual information affect the retrieval of multimodal in-context learning? arXiv preprint arXiv:2404.12866

  29. [30]

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR)

  30. [31]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? Preprint, arXiv:2202.12837. https://arxiv.org/abs/2202.12837

  31. [32]

    Yaniv Nikankin, Dana Arad, Yossi Gandelsman, and Yonatan Belinkov. 2025. Same task, different circuits: Disentangling modality-specific mechanisms in vlms. Preprint, arXiv:2506.09047. https://arxiv.org/abs/2506.09047

  32. [33]

    Jane Pan. 2023. What in-context learning “learns” in-context: Disentangling task recognition and task learning. Master's thesis, Princeton University. https://arxiv.org/abs/2305.09731

  33. [35]

    Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, and Wanxiang Che. 2024b. What factors affect multi-modal in-context learning? An in-depth exploration. arXiv preprint arXiv:2410.20482

  34. [36]

    Aaditya Singh, Stephanie Chan, Ted Moskovitz, Erin Grant, Andrew Saxe, and Felix Hill. 2024a. The transient nature of emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 36. https://arxiv.org/abs/2311.08360

  35. [37]

    Aaditya K Singh, Ted Moskovitz, Felix Hill, Stephanie CY Chan, and Andrew M Saxe. 2024b. What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. In Forty-first International Conference on Machine Learning. https://arxiv.org/abs/2404.07129

  36. [38]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-Bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

  37. [39]

    Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840--9855. https://arxiv.org/abs/2305.14160

  38. [40]

    Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, and Yu Zhou. 2025. An empirical study on configuring in-context learning demonstrations for unleashing mllms' sentimental perception capability. arXiv preprint arXiv:2505.16193

  39. [41]

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080. https://arxiv.org/abs/2111.02080

  40. [42]

    Nan Xu, Fei Wang, Sheng Zhang, Hoifung Poon, and Muhao Chen. 2024. From introspection to best practices: Principled analysis of demonstrations in multimodal in-context learning. arXiv preprint arXiv:2407.00902

  41. [43]

    Xu Yang, Yongliang Wu, Mingzhuo Yang, Haokun Chen, and Xin Geng. 2024. Exploring diverse in-context configurations for image captioning. Advances in Neural Information Processing Systems, 36

  42. [44]

    Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. 2023a. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927. https://arxiv.org/abs/2306.09927

  43. [45]

    Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. 2023b. What makes good examples for visual in-context learning? Advances in Neural Information Processing Systems, 36:17773--17794

  44. [46]

    Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. 2023. Mmicl: Empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915

  45. [47]

    Bowen Zheng, Ming Ma, Zhongqiao Lin, and Tianming Yang. 2024. Distributed rule vectors is a key mechanism in large language models' in-context learning. arXiv preprint arXiv:2406.16007. https://arxiv.org/abs/2406.16007

  46. [48]

    Yuxiang Zhou, Jiazheng Li, Yanzheng Xiang, Hanqi Yan, Lin Gui, and Yulan He. 2024. The mystery of in-context learning: A comprehensive survey on interpretation and analysis. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14365--14378, Miami, Florida, USA. https://doi.org/10.18653/v1/2024.emnlp-main.795

  47. [49]

    Yongshuo Zong, Ondrej Bohdal, and Timothy Hospedales. 2024. Vl-icl bench: The devil in the details of benchmarking multimodal in-context learning. arXiv preprint arXiv:2403.13164


    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...