pith. machine review for the scientific record.

arxiv: 2605.14448 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.CL · cs.IR

Recognition: no theorem link

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR

keywords multimodal embeddings · adaptive reasoning · dual-LoRA · chain-of-thought · retrieval · efficiency · MLLMs · routing gate

The pith

A dual-LoRA architecture with a routing gate lets multimodal embeddings add chain-of-thought reasoning only when it improves results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Think When Needed (TWN), which equips a frozen multimodal backbone with separate reasoning and embedding LoRA adapters plus a learned gate that decides, per input, whether to generate chain-of-thought steps before producing the final embedding. The gate is trained with a self-supervised objective so that reasoning is skipped on simple queries, where it would add cost or even lower quality. Gradients are detached between the two adapters during training to prevent optimization conflicts, and the adapters add only 3-5 percent to the backbone's parameter count. On the 78-task MMEB-V2 benchmark the method reaches state-of-the-art embedding quality while using up to 50 percent fewer reasoning tokens than always-on generative baselines.
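To make the architecture concrete, here is a minimal sketch of how such an adaptive pipeline could be wired. This is an illustration under assumptions, not the authors' code: `encode`, `generate`, and `pool` are hypothetical backbone methods, the adapter handles are placeholders, and the fixed threshold `tau` stands in for whatever decision rule the paper's gate actually implements.

```python
import torch
import torch.nn as nn

class AdaptiveEmbedder(nn.Module):
    """Sketch of a TWN-style pipeline: one frozen backbone, two LoRA
    adapter sets, and a scalar gate that decides whether to reason."""

    def __init__(self, backbone, reasoning_lora, embedding_lora, hidden_dim, tau=0.5):
        super().__init__()
        self.backbone = backbone              # frozen MLLM; never updated
        self.reasoning_lora = reasoning_lora  # adapter active during CoT generation
        self.embedding_lora = embedding_lora  # adapter active when pooling the embedding
        self.gate = nn.Linear(hidden_dim, 1)  # learned routing gate
        self.tau = tau                        # decision threshold (a free parameter)

    @torch.no_grad()
    def embed(self, inputs):
        h = self.backbone.encode(inputs)                   # [B, hidden_dim] pooled states
        p_think = torch.sigmoid(self.gate(h)).squeeze(-1)  # per-input reasoning probability
        out = []
        for i, x in enumerate(inputs):
            if p_think[i] > self.tau:
                # Hard input: generate a CoT trace with the reasoning adapter,
                # then embed the input together with its trace.
                cot = self.backbone.generate(x, adapter=self.reasoning_lora)
                out.append(self.backbone.pool(x, cot, adapter=self.embedding_lora))
            else:
                # Easy input: skip reasoning and embed directly, saving tokens.
                out.append(self.backbone.pool(x, None, adapter=self.embedding_lora))
        return torch.stack(out)
```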

Core claim

TWN attaches a reasoning LoRA and an embedding LoRA to a shared frozen multimodal large language model backbone, detaches gradients at their interface to avoid optimization conflicts, and routes each input through a self-supervised gate that triggers chain-of-thought generation only when it improves the downstream embedding; reasoning quality is further refined by embedding-guided reinforcement learning.
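At training time, the detachment the claim leans on reduces to a stop-gradient where reasoning states feed the embedding head. A generic PyTorch pattern, assuming (the paper's exact interface is not reproduced here) that the contrastive loss is computed on states pooled downstream of the reasoning adapter:

```python
import torch

def embed_from_reasoning(reasoning_states: torch.Tensor, embedding_head) -> torch.Tensor:
    """Stop-gradient at the adapter interface (generic pattern, not the
    authors' code): the contrastive embedding loss updates only the
    embedding adapter and never back-propagates into the reasoning adapter,
    which trains separately on its own next-token objective."""
    h = reasoning_states.detach()  # cut the autograd graph here
    return embedding_head(h)       # only the embedding LoRA receives gradients
```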

What carries the argument

Dual-LoRA architecture with self-supervised routing gate that selectively activates reasoning before producing the embedding.

If this is right

  • Retrieval quality rises because unnecessary reasoning is withheld from inputs where it degrades performance.
  • Inference token count falls by up to half compared with full generative pipelines (a back-of-envelope sketch of the arithmetic follows this list).
  • Total model size remains close to the original backbone because only two small LoRAs are added.
  • Joint training of reasoning and embedding stays stable through gradient detachment at the adapter interface.
  • Reinforcement learning guided by embedding quality further improves the generated reasoning chains.
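The token arithmetic behind that bullet is easy to sanity-check. The trigger rate and mean CoT length below are illustrative assumptions, not numbers from the paper:

```python
# Back-of-envelope token accounting (illustrative values, not the paper's).
avg_cot_tokens = 300       # assumed mean CoT length when reasoning fires
trigger_rate = 0.5         # assumed fraction of inputs routed through CoT

always_on = avg_cot_tokens                # full generative mode: every input reasons
adaptive = trigger_rate * avg_cot_tokens  # adaptive mode: only gated inputs reason

print(f"reasoning tokens per input: {adaptive:.0f} vs {always_on}")
print(f"savings: {1 - adaptive / always_on:.0%}")  # 50% at a 0.5 trigger rate
```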

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same selective-routing idea could be tested on language-only or vision-only backbones to balance depth and efficiency.
  • Gradient detachment between specialized adapters offers a general recipe for training multiple task heads on one frozen model.
  • Dynamic decision modules like the gate may reduce error propagation in downstream applications such as retrieval-augmented generation.
  • The approach implies that future embedding systems could treat reasoning as an optional compute budget allocated per example.

Load-bearing premise

The routing gate reliably identifies which inputs benefit from reasoning and which are harmed by it, while gradient detachment fully resolves adapter conflicts without introducing new biases.

What would settle it

Replace the learned routing gate with random decisions or forced always-on reasoning and measure whether MMEB-V2 embedding quality drops below the adaptive version.
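That ablation is straightforward to harness. A hedged sketch, in which `embed_fn`, `evaluate`, and `gate_fn` are hypothetical stand-ins for the real embedding pipeline, the MMEB-V2 scorer, and the learned gate:

```python
import random

def routing_policy_ablation(queries, embed_fn, evaluate, gate_fn, seed=0):
    """Compare the learned gate against the control policies proposed above
    (hypothetical harness; none of these hooks come from the paper)."""
    rng = random.Random(seed)
    policies = {
        "adaptive":  lambda q: gate_fn(q),          # learned routing gate
        "random":    lambda q: rng.random() < 0.5,  # coin-flip control
        "always_on": lambda q: True,                # forced full generative mode
        "never":     lambda q: False,               # pure discriminative mode
    }
    results = {}
    for name, decide in policies.items():
        embeddings = [embed_fn(q, think=decide(q)) for q in queries]
        results[name] = evaluate(embeddings)  # e.g., MMEB-V2 retrieval score
    return results
```

If the adaptive row does not beat both controls, the gate is not earning its parameters.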

Figures

Figures reproduced from arXiv: 2605.14448 by Guanghao Zhang, Hao Jiang, Longxiang Zhang, Pipei Huang, Weilong Dai.

Figure 1. (a) Comparison of multimodal embedding frameworks: (i) discriminative, (ii) unified gen…
Figure 2. Overview of TWN. (a) Stage 1: Supervised fine-tuning jointly trains the reasoning and…
Figure 3. Per-category CoT trigger rate (%) of the adaptive routing strategy. Blue bars show query…
Figure 4. Base template for CoT generation prompts. The Think step produces step-by-step reasoning,…
Figure 5. Strict verification prompt for CoT quality judgment. Evaluates both reasoning quality and…
Figure 6. Hallucination-only verification prompt for CoT quality judgment. Evaluates only reasoning…
Figure 7. Training sample from the composed image retrieval task (CIRR). The query combines a…
Figure 8. Training sample from the video retrieval task. The query provides a detailed textual…
Figure 9. Training sample from the visual document question answering task (DocVQA). The…
Figure 10. Training loss curves during Stage 1 (SFT) for TWN-4B and TWN-8B. (a) Next-token…
Figure 11. Training dynamics during Stage 2 (RL) for TWN-4B and TWN-8B. (a) Gap reward…
Figure 12. Positive case 1 (A-OKVQA): CoT reasons about the indoor domestic setting to correctly…
Figure 13. Positive case 2 (ChartQA): CoT reads specific bar values and computes the numerical…
Figure 14. Positive case 3 (EgoSchema): CoT analyzes the hand motion pattern across video frames…
Figure 15. Negative case 1 (Visual7W-Pointing): A simple visual grounding task where CoT’s…
Figure 16. Negative case 2 (MSCOCO): CoT generates 348 tokens of overthinking that confuses…
Figure 17. Negative case 3 (MVBench): An 843-token CoT trace with extensive self-correction…
Original abstract

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Think When Needed (TWN), a unified multimodal embedding framework built on a frozen backbone with a dual-LoRA architecture consisting of separate reasoning and embedding adapters. A self-supervised routing gate adaptively decides per input whether to generate chain-of-thought reasoning before producing the final embedding, with gradients detached at the LoRA interface to avoid optimization conflicts; an embedding-guided RL stage further refines CoT quality. On the 78 tasks of MMEB-V2 the method reports state-of-the-art embedding quality while adding only 3-5% additional parameters and using up to 50% fewer reasoning tokens than full generative baselines.

Significance. If the routing gate reliably identifies inputs where reasoning improves versus harms embedding quality, the work would constitute a meaningful advance in efficient multimodal representation learning. The dual-LoRA design with gradient detachment provides a lightweight way to combine generative reasoning with discriminative embedding objectives, and the adaptive mechanism directly addresses the overhead of indiscriminate CoT generation in prior generative embedding methods.

major comments (3)
  1. [Adaptive Think Mechanism / Routing Gate] The self-supervised routing gate is trained on the same data used for final evaluation, creating a risk of circularity in which the gate simply reproduces patterns already present in the supervised CoT data rather than learning genuine task-difficulty signals. The manuscript should add an analysis (e.g., correlation of gate decisions with per-task performance deltas or difficulty proxies) to substantiate that the reported token savings and SOTA quality are not artifacts of this training setup. A sketch of such an audit appears after the minor comments below.
  2. [Dual-LoRA Architecture] Gradient detachment at the LoRA interface is asserted to eliminate optimization conflicts, yet no ablation compares joint versus detached training or measures residual bias in the resulting adapters. This mechanism is load-bearing for the claim that both reasoning and embedding modes maintain high quality without new biases.
  3. [Experiments on MMEB-V2] The experimental claims on MMEB-V2 require explicit confirmation that the routing threshold/temperature was not tuned post-hoc on the test set and that all baselines received equivalent hyperparameter search budgets; without these details the comparative efficiency and quality numbers cannot be fully interpreted.
minor comments (2)
  1. [Notation] The notation distinguishing the routing gate's output probability from embedding similarity scores should be clarified in the method section to avoid reader confusion.
  2. [Figures] The figure illustrating token savings would be strengthened by reporting variance across runs or by statistical tests supporting the 'up to 50%' reduction claim.
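The audit requested in major comment 1 could take a simple form; `gate_fn` and `score_fn` below are hypothetical hooks into the pipeline, not anything the manuscript defines:

```python
from statistics import correlation  # Python 3.10+

def gate_decision_audit(examples, gate_fn, score_fn):
    """Correlate gate decisions with the per-example benefit of reasoning
    (hedged sketch of the analysis asked for in major comment 1)."""
    decisions, deltas = [], []
    for ex in examples:
        # Benefit of reasoning on this example: score with CoT minus without.
        deltas.append(score_fn(ex, think=True) - score_fn(ex, think=False))
        decisions.append(1.0 if gate_fn(ex) else 0.0)
    # Positive correlation means the gate tends to fire exactly where
    # reasoning actually helps, rather than echoing training-set patterns.
    return correlation(decisions, deltas)
```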

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the referee's focus on key methodological and experimental aspects of TWN. Below we address each major comment point by point, providing clarifications based on the manuscript and committing to revisions where they strengthen the work without misrepresenting our contributions.

Point-by-point responses
  1. Referee: [Adaptive Think Mechanism / Routing Gate] The self-supervised routing gate is trained on the same data used for final evaluation, creating a risk of circularity in which the gate simply reproduces patterns already present in the supervised CoT data rather than learning genuine task-difficulty signals. The manuscript should add an analysis (e.g., correlation of gate decisions with per-task performance deltas or difficulty proxies) to substantiate that the reported token savings and SOTA quality are not artifacts of this training setup.

    Authors: The routing gate is trained via a self-supervised contrastive objective that directly compares embedding quality (measured by retrieval metrics on the same batch) with and without the reasoning path, using no direct supervision from the CoT labels themselves. This setup is designed to capture genuine task-difficulty signals from the data distribution. Nevertheless, to rigorously address the circularity concern, we will add a dedicated analysis subsection that reports correlations between gate decisions and per-task performance deltas as well as difficulty proxies (e.g., input length and semantic complexity metrics) on held-out validation data. A schematic reading of this labeling objective is sketched after these responses. revision: yes

  2. Referee: [Dual-LoRA Architecture] Gradient detachment at the LoRA interface is asserted to eliminate optimization conflicts, yet no ablation compares joint versus detached training or measures residual bias in the resulting adapters. This mechanism is load-bearing for the claim that both reasoning and embedding modes maintain high quality without new biases.

    Authors: Gradient detachment is introduced precisely to isolate the generative and discriminative objectives and thereby avoid the conflicts that arise in joint optimization. While the manuscript provides the theoretical rationale and reports strong empirical results under the detached regime, we agree that an explicit ablation would further substantiate the claim. We will therefore add an ablation table comparing joint (non-detached) training against our detached dual-LoRA setup, including quantitative measures of residual bias such as isolated reasoning accuracy and embedding retrieval performance. revision: yes

  3. Referee: [Experiments on MMEB-V2] The experimental claims on MMEB-V2 require explicit confirmation that the routing threshold/temperature was not tuned post-hoc on the test set and that all baselines received equivalent hyperparameter search budgets; without these details the comparative efficiency and quality numbers cannot be fully interpreted.

    Authors: All routing hyperparameters, including the decision threshold and temperature, were selected solely on a held-out validation split drawn from the MMEB-V2 training distribution; the test set was never accessed during tuning. Baselines were re-evaluated under an identical hyperparameter search budget and protocol as described in our experimental setup and supplementary material. We will insert explicit statements confirming these practices in the revised experimental section to ensure the comparative results are fully interpretable. revision: yes
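Response 1 describes gate supervision derived by comparing embedding quality with and without the reasoning path. One plausible reading of that objective as code; the margin rule and the similarity inputs are assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def gate_targets(sim_with_cot: torch.Tensor,
                 sim_without_cot: torch.Tensor,
                 margin: float = 0.0) -> torch.Tensor:
    """Self-supervised gate labels (assumed rule): 1 where the reasoning
    path raises the query-positive similarity by more than `margin`."""
    return (sim_with_cot - sim_without_cot > margin).float()

def gate_loss(gate_logits: torch.Tensor,
              sim_with_cot: torch.Tensor,
              sim_without_cot: torch.Tensor) -> torch.Tensor:
    # Binary cross-entropy against the derived labels; no CoT text labels
    # are used, matching the rebuttal's description of the objective.
    targets = gate_targets(sim_with_cot, sim_without_cot)
    return F.binary_cross_entropy_with_logits(gate_logits, targets)
```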

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on empirical benchmark results.

full rationale

The paper's core derivation introduces a dual-LoRA architecture with gradient detachment and a self-supervised routing gate for adaptive CoT decisions. These are presented as design choices whose benefits are validated through evaluation on the external MMEB-V2 benchmark (78 tasks), not by reducing performance metrics to the training inputs or gate decisions by construction. No equations or steps equate predictions directly to fitted parameters, and no load-bearing self-citations or uniqueness theorems are invoked to force the architecture. The efficiency and SOTA claims therefore rest on external benchmark evidence rather than on a tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The method introduces a new routing gate whose training objective is defined in terms of embedding quality, plus the assumption that LoRA adapters can be optimized independently once gradients are detached. No new physical entities are postulated.

free parameters (2)
  • LoRA rank and alpha for reasoning and embedding adapters
    Standard LoRA hyperparameters that control adapter capacity and must be chosen or tuned.
  • Routing gate threshold or temperature
    Controls the decision boundary for invoking CoT and is fitted during self-supervised training (these free parameters are collected in a config sketch after this ledger).
axioms (2)
  • domain assumption: Gradient detachment at the LoRA interface prevents destructive interference between reasoning and embedding objectives.
    Invoked to justify the dual-adapter training procedure.
  • domain assumption: Self-supervised signals derived from embedding similarity are sufficient to train a reliable routing gate.
    Underpins the adaptive think mechanism.
invented entities (1)
  • Self-supervised routing gate (no independent evidence)
    purpose: Decides per input whether to generate CoT reasoning before embedding.
    New component whose output controls token usage and final embedding quality.
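For reference, the ledger's free parameters collected into one config sketch; the defaults shown are illustrative, not the paper's reported settings:

```python
from dataclasses import dataclass

@dataclass
class TWNConfig:
    """Free parameters from the ledger above (illustrative defaults)."""
    lora_rank: int = 16            # adapter capacity, reasoning and embedding LoRAs
    lora_alpha: float = 32.0       # LoRA scaling factor
    gate_threshold: float = 0.5    # decision boundary for invoking CoT
    gate_temperature: float = 1.0  # sharpness of the routing probability
```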

pith-pipeline@v0.9.0 · 5569 in / 1599 out tokens · 20425 ms · 2026-05-15T02:48:12.288585+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 6 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Llm2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024

  3. [3]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  4. [4]

    Think then embed: Generative context improves multimodal embedding

    Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Yonghuan Yang, Jun Xiao, et al. Think then embed: Generative context improves multimodal embedding. arXiv preprint arXiv:2510.05014, 2025

  5. [5]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations, 2024

  6. [6]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI, Daya Guo, Dejian Yang, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  7. [7]

    Colpali: Efficient document retrieval with vision language models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. In International Conference on Learning Representations, 2025

  8. [8]

    Scaling deep contrastive learning batch size under memory limited setup

    Luyu Gao and Jamie Callan. Scaling deep contrastive learning batch size under memory limited setup. In Workshop on Representation Learning for NLP, 2021

  9. [9]

    Breaking the modality barrier: Universal embedding learning with multimodal llms

    Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, and Jiankang Deng. Breaking the modality barrier: Universal embedding learning with multimodal llms. In ACM International Conference on Multimedia, 2025

  10. [10]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  11. [11]

    Cumulated gain-based evaluation of IR techniques

    Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002

  12. [12]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 2021

  13. [13]

    E5-V: Universal embeddings with multimodal large language models

    Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024

  14. [14]

    Vlm2vec: Training vision-language models for massive multimodal embedding tasks

    Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks. In International Conference on Learning Representations, 2025

  15. [15]

    Large language models are zero-shot reasoners

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022

  16. [16]

    Llave: Large language and vision embedding models with hardness-weighted contrastive learning

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Llave: Large language and vision embedding models with hardness-weighted contrastive learning. In Conference on Empirical Methods in Natural Language Processing, 2025

  17. [17]

    Ume-r1: Exploring reasoning-driven generative multimodal embeddings

    Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, and Jinsong Su. Ume-r1: Exploring reasoning-driven generative multimodal embeddings. arXiv preprint arXiv:2511.00405, 2025

  18. [18]

    Nv-embed: Improved techniques for training llms as generalist embedding models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-embed: Improved techniques for training llms as generalist embedding models. In International Conference on Learning Representations, 2025

  19. [19]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, 2020

  20. [20]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, 2023

  21. [21]

    Mm-embed: Universal multimodal retrieval with multimodal llms

    Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal llms. In International Conference on Learning Representations, 2025

  22. [22]

    Reasoning guided embeddings: Leveraging mllm reasoning for improved multimodal retrieval

    Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, and Limin Wang. Reasoning guided embeddings: Leveraging mllm reasoning for improved multimodal retrieval. arXiv preprint arXiv:2511.16150, 2025

  23. [23]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023

  24. [24]

    Lamra: Large multimodal model as your advanced retrieval assistant

    Yikun Liu, Yajie Zhang, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4015–4025, 2025

  25. [25]

    Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents

    Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents. Transactions on Machine Learning Research, 2025

  26. [26]

    Mteb: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. In Conference of the European Chapter of the Association for Computational Linguistics, 2023

  27. [27]

    Vladva: Discriminative fine-tuning of lvlms

    Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Brais Martinez, and Georgios Tzimiropoulos. Vladva: Discriminative fine-tuning of lvlms. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  28. [28]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

  29. [29]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020

  30. [30]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  31. [31]

    One embedder, any task: Instruction-finetuned text embeddings

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. In Annual Meeting of the Association for Computational Linguistics, 2023

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  33. [33]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

  34. [34]

    Improving text embeddings with large language models

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models. In Annual Meeting of the Association for Computational Linguistics, 2024

  35. [35]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  36. [36]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Fei Xia, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  37. [37]

    Cafe: Unifying representation and generation with contrastive-autoregressive finetuning

    Hao Yu, Zhuokai Zhao, Shen Yan, Lukasz Korycki, Jianyu Wang, Baosheng He, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, and Hanchao Yu. Cafe: Unifying representation and generation with contrastive-autoregressive finetuning. In IEEE/CVF International Conference on Computer Vision Workshops, 2025

  38. [38]

    Visrag: Vision-based retrieval-augmented generation on multi-modality documents

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenwen Liu, Shuo Wang, et al. Visrag: Vision-based retrieval-augmented generation on multi-modality documents. In International Conference on Learning Representations, 2025

  39. [39]

    Gradient surgery for multi-task learning

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems, 2020

  40. [40]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, 2023

  41. [41]

    Direct preference optimization of video large multimodal models from language model reward

    Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander G. Hauptmann, Yonatan Bisk, and Yiming Yang. Direct preference optimization of video large multimodal models from language model reward. In Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2025

  42. [42]

    Bridging modalities: Improving universal multimodal retrieval by multimodal large language models

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Bridging modalities: Improving universal multimodal retrieval by multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 9274–9285, 2025

  43. [43]

    Megapairs: Massive data synthesis for universal multimodal retrieval

    Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, and Yongping Xiong. Megapairs: Massive data synthesis for universal multimodal retrieval. In Annual Meeting of the Association for Computational Linguistics, 2025