S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models
Pith reviewed 2026-05-10 05:33 UTC · model grok-4.3
The pith
A prompt-driven simple-to-hard progression of preference pairs closes the multi-image reasoning gap in vision-language models without harming single-image performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim: systematically constructing multi-image preference data across three hierarchical reasoning levels (single-image localized reasoning, multi-image localized comparison, and global visual search) via prompt-driven complexity yields chosen/rejected pairs that improve VLMs' multi-image reasoning while preserving their single-image capabilities, outperforming prior model-specific alignment methods.
What carries the argument
The Simple-to-Hard (S2H) learning framework that generates multi-image preference pairs across three increasing capability levels using prompt complexity rather than model-specific heuristics.
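The paper's actual templates and pair-construction procedure are not given in this summary, so the following is only a hypothetical sketch of how prompt complexity could index the three levels; the template wording, `PreferencePair` fields, and `build_pair` helper are illustrative assumptions, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response deemed preferable for this prompt
    rejected: str  # weaker response paired against it

# Hypothetical prompt templates for the three S2H levels; complexity grows
# from index-specified lookup to autonomous global search.
LEVEL_TEMPLATES = {
    1: "Look at Image {i}. {question}",                              # single-image localized
    2: "Compare Image {i} and Image {j}. {question}",                # multi-image localized
    3: "Search across all images for {target}. {question}",          # global visual search
}

def build_pair(level: int, question: str, chosen: str, rejected: str,
               **slots) -> PreferencePair:
    """Instantiate a level-specific prompt and attach its chosen/rejected pair."""
    prompt = LEVEL_TEMPLATES[level].format(question=question, **slots)
    return PreferencePair(prompt, chosen, rejected)
```

Under this sketch, a level-2 pair would read `build_pair(2, "Which image has more people?", chosen="Image 2", rejected="Image 1", i=1, j=2)`; only the prompt template, not any model-specific signal, determines the level.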
If this is right
- VLMs trained with S2H data achieve significant gains over baseline alignment methods on multi-image reasoning benchmarks.
- The method maintains strong single-image reasoning performance while building multi-image comparison skills.
- Because pairs rely on prompt complexity instead of model-specific attributes, the data transfers across architectures such as LLaVA and Qwen-VL.
- The three-level hierarchy supplies a structured way to advance holistic visual preference alignment.
Where Pith is reading between the lines
- Similar staged prompt-based data construction could be tested for building long-horizon reasoning in text-only models.
- If the levels prove cumulative, the same structure might shorten training time in other preference optimization settings by ordering tasks by required capability.
- The approach opens the possibility of automatically generating alignment data for any visual task where difficulty can be controlled through prompt wording.
Load-bearing premise
That prompt complexity alone reliably creates chosen/rejected pairs whose quality and difficulty hierarchy do not depend on the particular model being aligned.
What would settle it
Train the same base VLM on S2H data and on random or non-hierarchical preference data: if S2H shows no gain on a global visual search benchmark while its single-image scores drop, the central claim is falsified.
Original abstract
Vision-Language Models (VLMs) have demonstrated remarkable progress in single-image understanding, yet effective reasoning across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices (``Look at Image 3 and...''), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels requiring an increasing level of capabilities: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach leverages prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data significantly enhances multi-image reasoning performance, yielding significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while simultaneously strengthening multi-image understanding capabilities, thus advancing the state of the art for holistic visual preference alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces S2H-DPO, a Simple-to-Hard learning framework for vision-language models. It systematically constructs multi-image preference data across three hierarchical reasoning levels (single-image localized reasoning, multi-image localized comparison, and global visual search) using prompt-driven complexity to generate chosen/rejected pairs that are intended to be model-agnostic, unlike prior methods relying on model-specific attributes. Evaluations on LLaVA and Qwen-VL are claimed to show significant gains in multi-image reasoning while preserving single-image performance.
Significance. If the empirical results and ablations hold, the work could advance VLM alignment by offering a scalable, prompt-based pipeline for complex multi-image tasks that emphasizes global search and cross-image comparison rather than localized reasoning. This addresses a noted capability gap and could influence preference optimization practices for holistic visual understanding.
Major comments (3)
- [§4 Experiments] The central claim of 'significant improvements over baseline methods across benchmarks' and 'diverse multi-image reasoning data significantly enhances multi-image reasoning performance' rests on an unverified empirical assertion, as the abstract (and provided context) supplies no specific benchmark scores, error bars, ablation details, or comparisons to baselines such as standard DPO.
- [§3 Method, hierarchical levels] The assumption that the three levels require progressively harder capabilities (rather than merely different surface forms) is load-bearing for the S2H framework but unsupported by evidence such as performance ablations showing degradation when higher levels are withheld; this directly affects the hardness-aware claim.
- [§3.1 Data Generation] The claim that prompt-driven complexity produces chosen/rejected pairs whose quality is independent of the specific model being aligned lacks cross-model validation experiments (e.g., data generated from one VLM used to align another), which is necessary to substantiate the model-agnostic property.
Minor comments (1)
- [Abstract] Including at least one quantitative result (e.g., a benchmark delta) would strengthen the summary of contributions.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we outline planned revisions to address the concerns raised.
Point-by-point responses
-
Referee: [§4 Experiments] The central claim of 'significant improvements over baseline methods across benchmarks' and 'diverse multi-image reasoning data significantly enhances multi-image reasoning performance' rests on an unverified empirical assertion, as the abstract (and provided context) supplies no specific benchmark scores, error bars, ablation details, or comparisons to baselines such as standard DPO.
Authors: We thank the referee for pointing this out. While the abstract summarizes the results, the full paper in Section 4 includes comprehensive experimental results with specific benchmark scores on multi-image reasoning tasks, comparisons to standard DPO and other baselines, ablation studies, and error bars from repeated experiments on both LLaVA and Qwen-VL models. To address the concern about visibility, we will include a concise summary of the key quantitative improvements in the revised abstract and introduction. revision: yes
-
Referee: [§3 Method, hierarchical levels] The assumption that the three levels require progressively harder capabilities (rather than merely different surface forms) is load-bearing for the S2H framework but unsupported by evidence such as performance ablations showing degradation when higher levels are withheld; this directly affects the hardness-aware claim.
Authors: The three levels are designed to progressively build capabilities from localized single-image reasoning to multi-image comparison and finally to global visual search, which we argue requires increasing levels of visual understanding and reasoning. We acknowledge that explicit ablations isolating the contribution of each level by withholding higher levels are not presented in the current version. In the revision, we will add such ablation experiments to empirically validate the progressive hardness and the benefits of the S2H curriculum. revision: yes
-
Referee: [§3.1 Data Generation] The claim that prompt-driven complexity produces chosen/rejected pairs whose quality is independent of the specific model being aligned lacks cross-model validation experiments (e.g., data generated from one VLM used to align another), which is necessary to substantiate the model-agnostic property.
Authors: Our data generation relies on prompt-driven complexity rather than model-specific attributes like hallucinations, making it intended to be model-agnostic. We demonstrate its effectiveness by applying the same data generation pipeline to align two distinct models, LLaVA and Qwen-VL, with positive results on both. However, we agree that explicit cross-model transfer experiments—generating data using one VLM's responses and using it to align a different VLM—would provide stronger evidence. We will include such experiments in the revised manuscript. revision: yes
Circularity Check
No significant circularity in the derivation chain
Full rationale
The S2H-DPO paper describes an empirical data-generation pipeline that constructs preference pairs via prompt-driven complexity across three fixed hierarchical levels of multi-image reasoning. All central claims rest on external benchmark evaluations (LLaVA, Qwen-VL) and comparisons to baseline methods rather than any internal derivation, fitted parameter, or self-referential equation. No load-bearing step reduces by construction to a quantity defined inside the paper itself; the hierarchy and model-agnostic claim are presented as design choices whose validity is tested empirically outside the generation procedure.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the standard DPO loss and preference-optimization assumptions hold for the generated pairs.
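For reference, the DPO objective this axiom assumes can be written per preference pair as a minimal scalar sketch (the β value and the log-probability inputs are standard DPO quantities, not numbers taken from the paper):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of the full response
    under the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log sigmoid(margin): shrinks as the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree exactly, the margin is zero and the loss is log 2; the S2H pairs enter only through which response is labeled chosen versus rejected.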