A Dataset for Dynamic Human Preferences for Vision Language Models

Dylan Hadfield-Menell (Massachusetts Institute of Technology); Hannah Gao (Massachusetts Institute of Technology); Rachel Ma (Massachusetts Institute of Technology)

arxiv: 2606.07653 · v1 · pith:6X5MM3TYnew · submitted 2026-06-02 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-28 10:17 UTC · model grok-4.3

keywords: vision language models · dynamic preferences · benchmark dataset · in-context adaptation · automated pipeline · human-AI interaction · multi-modal evaluation

The pith

A new benchmark and dataset test vision language models on adapting to dynamic human preferences given in context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to assess whether vision language models can follow preferences that users supply during use, as opposed to static ones absorbed from training data. It supplies an automated pipeline to generate varied test cases that differ in image dependence and other factors, plus the resulting dataset and model evaluations. A sympathetic reader would care because interactive applications require models to respond to changing, user-specific instructions on the fly. The work focuses on measuring this adaptation capability directly through controlled, multi-modal examples.

Core claim

The authors contribute a dynamic multi-modal human-preference dataset, generated by an automated pipeline that introduces controlled variations in image dependence. The dataset supports evaluating whether vision language models can interpret and act on preferences supplied in context at inference time, and the paper reports results for current state-of-the-art models on the benchmark.

What carries the argument

The automated pipeline that generates the benchmark dataset with systematic variations on image dependence to create test cases for in-context dynamic preferences.

If this is right

State-of-the-art vision language models can now be scored on their ability to adapt to preferences supplied at inference time.
The benchmark separates dynamic, in-context preference following from static capabilities learned during training.
Performance gaps identified by the dataset can guide development of prompting or fine-tuning methods for better real-time adaptation.
The automated generation process enables scalable creation of additional test variations without manual labeling.
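
One way to picture the separation of in-context preference following from static capability is to score the same queries with and without the preference in the prompt and compare. The sketch below is illustrative only: the `Trial` fields, the `adaptation_gain` metric, and the toy data are assumptions made for this example, not the paper's schema, metric, or results.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # All fields are hypothetical; the paper's actual data format is not described here.
    image_dependent: bool        # does answering require looking at the image?
    correct_with_pref: bool      # response judged correct with the preference in context
    correct_without_pref: bool   # same query, preference omitted from the prompt


def accuracy(flags):
    """Fraction of True values; NaN for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


def adaptation_gain(trials):
    """Accuracy with the preference in context minus accuracy without it."""
    with_pref = accuracy(t.correct_with_pref for t in trials)
    without_pref = accuracy(t.correct_without_pref for t in trials)
    return with_pref - without_pref


# Toy data, invented purely to exercise the functions.
trials = [
    Trial(True, True, False),
    Trial(True, False, False),
    Trial(True, False, False),
    Trial(True, False, False),
    Trial(False, True, False),
    Trial(False, True, False),
    Trial(False, True, True),
    Trial(False, True, False),
]

overall = adaptation_gain(trials)
by_dependence = {
    dep: adaptation_gain([t for t in trials if t.image_dependent == dep])
    for dep in (True, False)
}
print(overall, by_dependence)
```

Splitting the gain by image dependence mirrors the variation the pipeline is said to control; a gain that shrinks on image-dependent cases would suggest the model follows stated preferences less reliably when it must also ground them in the image.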

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of the benchmark might surface the need for models that maintain user-specific context across multiple turns of interaction.
The dataset could be extended to measure preference conflicts or changes over longer sessions, revealing limits not tested in the initial version.
If models improve on this benchmark, applications such as personalized image assistants or editing tools could become more reliable for individual users.

Load-bearing premise

The automated pipeline produces test cases that accurately represent genuine dynamic human preferences users would express in real interactive settings.

What would settle it

Human raters could review a sample of the generated test cases and report whether they match preferences they would actually state in similar image-based interaction scenarios; low agreement would undermine the benchmark.
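
The agreement check described above could be quantified with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch for two raters; the "yes"/"no" labels and the sample judgements are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgements: "would a real user state this preference?"
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Kappa near 1 indicates agreement well beyond chance, while values near 0 mean the raters agree about as often as random labelling would; what threshold counts as "low agreement" for this benchmark is a judgment the paper would need to justify.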

Figures

Figures reproduced from arXiv: 2606.07653 by Dylan Hadfield-Menell (Massachusetts Institute of Technology), Hannah Gao (Massachusetts Institute of Technology), Rachel Ma (Massachusetts Institute of Technology).

Original abstract

Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well these models can adapt to real-time preferences for different users. While an increasing number of vision-language benchmarks have recently been introduced, they focus largely on evaluating static capabilities and generally-held preferences learned from extensive training data. This work introduces a new benchmark for evaluating the ability of VLMs to understand dynamic human-preferences, i.e. preferences that are passed in-context at inference time. We provide an automated pipeline for generating this benchmark with variations on image dependence, a dynamic multi-modal human-preference dataset, and evaluations of state-of-the-art models on the novel benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a benchmark for VLMs adapting to in-context dynamic preferences but provides no details on validating the automated generation pipeline.

The main point is that this work targets a gap in VLM evaluation by focusing on preferences supplied in context at inference time, rather than static ones from training data. It supplies an automated pipeline that varies image dependence, a generated dataset, and evaluations of current models.

What is new is the explicit framing around dynamic, user-specific preferences passed during interaction. Existing benchmarks are mostly static capability or preference tests, so this tries to create test cases that change with the input and the user's stated preference.

The paper does a clear job of stating the motivation for interactive settings and outlining the pipeline structure at a high level.

The soft spot is the lack of any reported checks on whether the pipeline produces realistic preferences. The abstract does not describe human validation, inter-annotator agreement, comparison to real user data, or tests for bias or artifacts. Without that, it is hard to know if the benchmark measures what it claims. The model evaluations are also mentioned without numbers or methods, so their strength cannot be assessed.

This is relevant for groups working on interactive VLMs or benchmark design. A reader interested in new datasets for preference adaptation might want to see the full pipeline and data once released.

It deserves peer review so referees can examine the generation process and data quality directly. The core idea is worth checking even if the current description leaves the validation open.

Referee Report

1 major / 0 minor

Summary. The paper introduces a new benchmark for evaluating Vision Language Models (VLMs) on their ability to understand dynamic human preferences provided in-context at inference time, rather than static capabilities or generally-held preferences. It describes an automated pipeline for generating the benchmark with variations on image dependence, releases a dynamic multi-modal human-preference dataset, and reports evaluations of state-of-the-art models on the benchmark.

Significance. If the pipeline and dataset are shown to produce valid representations of real dynamic preferences, the work would address a meaningful gap in VLM evaluation for interactive settings. The automated pipeline and public dataset are strengths that could support reproducible research on in-context preference adaptation.

major comments (1)

[Abstract and pipeline description] The automated pipeline is presented as generating valid test cases for dynamic in-context preferences, but the manuscript provides no description of validation steps such as human evaluation, inter-annotator agreement, or comparison against real user behavior data. This directly undermines the central claim that the benchmark measures understanding of dynamic human preferences rather than artificial constructs (see abstract and pipeline section).

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for validation of the automated pipeline. We address the single major comment below and will incorporate changes in the revision.

Referee: [Abstract and pipeline description] The automated pipeline is presented as generating valid test cases for dynamic in-context preferences, but the manuscript provides no description of validation steps such as human evaluation, inter-annotator agreement, or comparison against real user behavior data. This directly undermines the central claim that the benchmark measures understanding of dynamic human preferences rather than artificial constructs (see abstract and pipeline section).

Authors: We agree that the manuscript lacks any description of validation for the generated preferences. The pipeline relies on automated, rule-based modifications to create variations in image dependence and in-context preference statements, but without human evaluation or comparison to real user data, the claim that these represent dynamic human preferences is not yet substantiated. In the revised version we will add a dedicated subsection on validation, including a human study on a sampled subset with inter-annotator agreement statistics and qualitative comparison to real preference elicitation scenarios.

Revision: yes

Circularity Check

0 steps flagged

No circularity: dataset/benchmark creation with no derivations or fitted predictions

The paper introduces a benchmark, automated pipeline, and dataset for dynamic in-context human preferences in VLMs, followed by model evaluations. No equations, derivations, parameters, or predictions are present that could reduce to inputs by construction. The contribution is empirical resource creation rather than a claimed derivation chain; the automated pipeline is presented as a generation method without any self-referential fitting or uniqueness theorems. This matches the default case of a self-contained dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset and benchmark creation paper; no free parameters, axioms, or invented entities are identifiable from the abstract.
