S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
Pith reviewed 2026-05-10 05:33 UTC · model grok-4.3
The pith
A prompt-driven simple-to-hard progression of preference pairs closes the multi-image reasoning gap in vision-language models without harming single-image performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim: systematically constructing multi-image preference data across three hierarchical reasoning levels (single-image localized reasoning, multi-image localized comparison, and global visual search) via prompt-driven complexity yields chosen/rejected pairs that improve VLMs' multi-image reasoning while preserving their single-image capabilities, outperforming prior model-specific alignment methods.
What carries the argument
The Simple-to-Hard (S2H) learning framework that generates multi-image preference pairs across three increasing capability levels using prompt complexity rather than model-specific heuristics.
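The paper's actual templates and pair-construction procedure are not given in this summary, so the following is only a hypothetical sketch of how prompt complexity could index the three levels; the template wording, `PreferencePair` fields, and `build_pair` helper are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response deemed preferable for this prompt
    rejected: str  # weaker response paired against it

# Hypothetical prompt templates for the three S2H levels; complexity grows
# from index-specified lookup to autonomous global search.
LEVEL_TEMPLATES = {
    1: "Look at Image {i}. {question}",                              # single-image localized
    2: "Compare Image {i} and Image {j}. {question}",                # multi-image localized
    3: "Search across all images for {target}. {question}",          # global visual search
}

def build_pair(level: int, question: str, chosen: str, rejected: str,
               **slots) -> PreferencePair:
    """Instantiate a level-specific prompt and attach its chosen/rejected pair."""
    prompt = LEVEL_TEMPLATES[level].format(question=question, **slots)
    return PreferencePair(prompt, chosen, rejected)
```

Under this sketch, a level-2 pair would read `build_pair(2, "Which image has more people?", chosen="Image 2", rejected="Image 1", i=1, j=2)`; only the prompt template, not any model-specific signal, determines the level.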
If this is right
- VLMs trained with S2H data achieve significant gains over baseline alignment methods on multi-image reasoning benchmarks.
- The method maintains strong single-image reasoning performance while building multi-image comparison skills.
- Because pairs rely on prompt complexity instead of model-specific attributes, the data transfers across architectures such as LLaVA and Qwen-VL.
- The three-level hierarchy supplies a structured way to advance holistic visual preference alignment.
Where Pith is reading between the lines
- Similar staged prompt-based data construction could be tested for building long-horizon reasoning in text-only models.
- If the levels prove cumulative, the same structure might shorten training time in other preference optimization settings by ordering tasks by required capability.
- The approach opens the possibility of automatically generating alignment data for any visual task where difficulty can be controlled through prompt wording.
Load-bearing premise
That prompt complexity alone reliably creates chosen/rejected pairs whose quality and difficulty hierarchy do not depend on the particular model being aligned.
What would settle it
Train the same base VLM on S2H data and on random or non-hierarchical preference data: if S2H shows no gain on a global visual search benchmark while its single-image scores drop, the central claim is falsified.
Original abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S2H-DPO, a Simple-to-Hard learning framework for vision-language models. It systematically constructs multi-image preference data across three hierarchical reasoning levels (single-image localized reasoning, multi-image localized comparison, and global visual search) using prompt-driven complexity to generate chosen/rejected pairs that are intended to be model-agnostic, unlike prior methods relying on model-specific attributes. Evaluations on LLaVA and Qwen-VL are claimed to show significant gains in multi-image reasoning while preserving single-image performance.
Significance. If the empirical results and ablations hold, the work could advance VLM alignment by offering a scalable, prompt-based pipeline for complex multi-image tasks that emphasizes global search and cross-image comparison rather than localized reasoning. This addresses a noted capability gap and could influence preference optimization practices for holistic visual understanding.
Major comments (3)
- [§4 Experiments] The central claim of 'significant improvements over baseline methods across benchmarks' and 'diverse multi-image reasoning data significantly enhances multi-image reasoning performance' rests on an unverified empirical assertion, as the abstract (and provided context) supplies no specific benchmark scores, error bars, ablation details, or comparisons to baselines such as standard DPO.
- [§3 Method, hierarchical levels] The assumption that the three levels require progressively harder capabilities (rather than merely different surface forms) is load-bearing for the S2H framework but unsupported by evidence such as performance ablations showing degradation when higher levels are withheld; this directly affects the hardness-aware claim.
- [§3.1 Data Generation] The claim that prompt-driven complexity produces chosen/rejected pairs whose quality is independent of the specific model being aligned lacks cross-model validation experiments (e.g., data generated from one VLM used to align another), which is necessary to substantiate the model-agnostic property.
Minor comments (1)
- [Abstract] Including at least one quantitative result (e.g., a benchmark delta) would strengthen the summary of contributions.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we outline planned revisions to address the concerns raised.
Point-by-point responses
-
Referee: [§4 Experiments] The central claim of 'significant improvements over baseline methods across benchmarks' and 'diverse multi-image reasoning data significantly enhances multi-image reasoning performance' rests on an unverified empirical assertion, as the abstract (and provided context) supplies no specific benchmark scores, error bars, ablation details, or comparisons to baselines such as standard DPO.
Authors: We thank the referee for pointing this out. While the abstract summarizes the results, the full paper in Section 4 includes comprehensive experimental results with specific benchmark scores on multi-image reasoning tasks, comparisons to standard DPO and other baselines, ablation studies, and error bars from repeated experiments on both LLaVA and Qwen-VL models. To address the concern about visibility, we will include a concise summary of the key quantitative improvements in the revised abstract and introduction. revision: yes
-
Referee: [§3 Method, hierarchical levels] The assumption that the three levels require progressively harder capabilities (rather than merely different surface forms) is load-bearing for the S2H framework but unsupported by evidence such as performance ablations showing degradation when higher levels are withheld; this directly affects the hardness-aware claim.
Authors: The three levels are designed to progressively build capabilities from localized single-image reasoning to multi-image comparison and finally to global visual search, which we argue requires increasing levels of visual understanding and reasoning. We acknowledge that explicit ablations isolating the contribution of each level by withholding higher levels are not presented in the current version. In the revision, we will add such ablation experiments to empirically validate the progressive hardness and the benefits of the S2H curriculum. revision: yes
-
Referee: [§3.1 Data Generation] The claim that prompt-driven complexity produces chosen/rejected pairs whose quality is independent of the specific model being aligned lacks cross-model validation experiments (e.g., data generated from one VLM used to align another), which is necessary to substantiate the model-agnostic property.
Authors: Our data generation relies on prompt-driven complexity rather than model-specific attributes like hallucinations, making it intended to be model-agnostic. We demonstrate its effectiveness by applying the same data generation pipeline to align two distinct models, LLaVA and Qwen-VL, with positive results on both. However, we agree that explicit cross-model transfer experiments—generating data using one VLM's responses and using it to align a different VLM—would provide stronger evidence. We will include such experiments in the revised manuscript. revision: yes
Circularity Check
No significant circularity in the derivation chain
Full rationale
The S2H-DPO paper describes an empirical data-generation pipeline that constructs preference pairs via prompt-driven complexity across three fixed hierarchical levels of multi-image reasoning. All central claims rest on external benchmark evaluations (LLaVA, Qwen-VL) and comparisons to baseline methods rather than any internal derivation, fitted parameter, or self-referential equation. No load-bearing step reduces by construction to a quantity defined inside the paper itself; the hierarchy and model-agnostic claim are presented as design choices whose validity is tested empirically outside the generation procedure.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the standard DPO loss and preference-optimization assumptions hold for the generated pairs.
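For reference, the DPO objective this axiom assumes can be written per preference pair as a minimal scalar sketch (the β value and the log-probability inputs are standard DPO quantities, not numbers taken from the paper):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of the full response
    under the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): shrinks as the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the S2H pairs enter only through which response is labeled chosen versus rejected.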