LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Bowen Zhou; Bo Zhang; David Clifton; Guoyou Li; Jiajun Zhang; Jirui Huang; Luc Van Gool; Peng Xu; Ruilin Yao; Shengwu Xiong

arxiv: 2505.15616 · v2 · pith:V3SDGG3Inew · submitted 2025-05-21 · 💻 cs.CV

Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool

Pith reviewed 2026-05-22 13:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords: multimodal large language models · benchmark · reasoning evaluation · perception to reasoning · multi-level tasks · MLLM assessment · real-world images · compositional reasoning

The pith

No frontier multimodal model exceeds 60 percent accuracy on reasoning tasks when perception and reasoning are tested on identical images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lens, a benchmark built around 3.4K contemporary images that each carry human annotations for every task level, from basic perception through understanding to compositional reasoning. This design keeps the visual input fixed while varying only the question type, allowing direct measurement of whether low-level perceptual skills support higher-order inference. Evaluations cover more than 15 frontier models released after December 2024, including Qwen2.5-VL-72B, InternVL3-78B, GPT-4o, and the reasoning models QVQ-72B-preview and Kimi-VL. None exceeds 60 percent accuracy on the reasoning tier, whose questions span 12 daily scenarios on images collected from social media. The result suggests current architectures still lack reliable mechanisms for chaining perceptual observations into complex conclusions on real-world content.

Core claim

Lens supplies 3.4K images with rich annotations for eight tasks organized into three progressive tiers—perception, understanding, and reasoning—while ensuring every image supports all tiers without distribution shift. By using image-invariant prompts, the benchmark isolates the contribution of lower-level visual capabilities to higher-order reasoning performance. When 15+ frontier MLLMs are tested, accuracy remains below 60 percent on the reasoning tier even for the largest and most recent systems, indicating that scaling alone has not closed the gap on compositional inference in everyday scenes.
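The fixed-image design can be made concrete by scoring every tier over the same image IDs and then conditioning reasoning accuracy on the perception outcome for the same image. The sketch below is illustrative only: the record layout `(image_id, tier, is_correct)` and both function names are assumptions, not the paper's released evaluation code.

```python
from collections import defaultdict

TIERS = ("perception", "understanding", "reasoning")


def tier_accuracy(records):
    """Accuracy per tier. `records` is a list of (image_id, tier, is_correct)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _image_id, tier, is_correct in records:
        totals[tier] += 1
        hits[tier] += int(is_correct)
    return {t: hits[t] / totals[t] for t in TIERS if totals[t]}


def reasoning_given_perception(records):
    """Reasoning accuracy split by whether every perception question
    on the same image was answered correctly (hypothetical analysis)."""
    perception_ok = defaultdict(lambda: True)
    for image_id, tier, is_correct in records:
        if tier == "perception":
            perception_ok[image_id] = perception_ok[image_id] and bool(is_correct)

    # buckets[flag] = [number correct, number of reasoning questions]
    buckets = {True: [0, 0], False: [0, 0]}
    for image_id, tier, is_correct in records:
        if tier == "reasoning" and image_id in perception_ok:
            bucket = buckets[perception_ok[image_id]]
            bucket[0] += int(is_correct)
            bucket[1] += 1
    return {flag: (h / n if n else None) for flag, (h, n) in buckets.items()}


if __name__ == "__main__":
    # Toy records, invented for illustration.
    toy = [
        ("img1", "perception", True), ("img1", "reasoning", True),
        ("img2", "perception", False), ("img2", "reasoning", False),
    ]
    print(tier_accuracy(toy))
    print(reasoning_given_perception(toy))
```

Because all tiers share the same images, a gap between the two conditional accuracies can be attributed to perception outcomes rather than to a change in image distribution.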

What carries the argument

Image-invariant prompt structure across three progressive task tiers, where the same image and annotations support evaluation from basic perception to compositional reasoning.

If this is right

MLLM training must explicitly target the chaining of perceptual facts into compositional inferences rather than relying on scale alone.
Benchmark construction should favor fixed-image, multi-tier designs to remove confounding distribution shifts between task levels.
Applications involving social-media or real-time visual analysis will continue to require human oversight until reasoning accuracy improves substantially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the performance gap persists, hybrid systems that combine MLLMs with external reasoning modules may become necessary for reliable deployment.
The same fixed-image tiered design could be adapted to test whether similar limitations exist in video or 3D reasoning benchmarks.

Load-bearing premise

The human-authored questions and rich annotations for all tasks on each image are assumed to be consistent, unbiased, and accurately capture the intended progression from perception to compositional reasoning without introducing annotation artifacts or distribution shifts.

What would settle it

A single frontier MLLM scoring above 60 percent on the reasoning tier of Lens while preserving the expected accuracy ordering from perception through understanding would directly contradict the reported performance ceiling.
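The settling condition above can be written as a small predicate. This is a sketch under one stated assumption: the "expected accuracy ordering" is read as perception ≥ understanding ≥ reasoning, which the page does not spell out. The function name and the dictionary keys are hypothetical.

```python
def contradicts_ceiling(tier_acc, threshold=0.60):
    """Return True if a model's per-tier accuracies would contradict the
    reported ceiling: reasoning above `threshold` while the assumed
    ordering perception >= understanding >= reasoning still holds."""
    ordered = (
        tier_acc["perception"]
        >= tier_acc["understanding"]
        >= tier_acc["reasoning"]
    )
    return ordered and tier_acc["reasoning"] > threshold
```

A model that clears 60 percent on reasoning but scores lower on perception would not count under this reading, since the ordering check fails.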

Figures

Figures reproduced from arXiv: 2505.15616 by Bowen Zhou, Bo Zhang, David Clifton, Guoyou Li, Jiajun Zhang, Jirui Huang, Luc Van Gool, Peng Xu, Ruilin Yao, Shengwu Xiong, Shichao Su, Shilan Zhang, Tianyu Zou, Wenxi Zeng, Xinwei Long, Yaxiong Chen, Yifang Zhang, Yifan Xu, Yufei Wu, Zhaoyu Yang, Zichan Li.

**Figure 1.** Illustration of the task split in Lens. More recent benchmarks have begun to shift toward open-world evaluation and multimodal reasoning tasks [11, 12]. While this represents progress, current benchmarks do not adequately assess the nuanced performance necessary to evaluate MLLMs’ progression towards human-like intelligence in real-world settings. They require largely primary visual comprehension and fa…

**Figure 2.** Three core themes, “Education”, “City”, and “Home”, along with their word clouds of the …

**Figure 3.** Lens consists of eight sub-tasks at three levels. Perception tasks focus on recognizing object attributes and counting. Understanding tasks emphasize localization and inter-object relationships, requiring an integration of fine-grained visual context. Reasoning tasks demand the use of external knowledge beyond the visual input and involve multi-step, complex reasoning processes to arrive at the correct answ…

**Figure 4.** Lens covers a wide range of images and annotations, from fine-grained recognition and spatial localization to complex reasoning over extended thought processes. Notably, each image is annotated with labels corresponding to all subtasks concurrently, enabling comprehensive evaluation. more realistic in emphasizing spatial location understanding under real-world scenarios as well as 2D images acquired by cam…

**Figure 5.** Statistical analysis of our dataset. We visualize the temporal distribution of the images …

**Figure 6.** The normalized probability distributions of low-level attributes from different scenes.

**Figure 7.** Statistical analysis of model accuracy and synergies between different tasks.

Original abstract

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manully collected from the social media, in which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LENS gives a clean shared-image setup for testing perception-to-reasoning progression in MLLMs and reports that even recent frontier models stay under 60% on the top tier, but the annotation quality controls are not shown in enough detail to fully back the performance claims.

The paper's main contribution is a benchmark that puts the same 3.4K recent social-media images through three linked tiers: perception, understanding, and compositional reasoning. Each image carries full annotations across all tasks, so the evaluation stays on one data distribution instead of jumping between unrelated sources. The shared-image design directly tackles the distribution mismatch problem they flag in earlier multimodal benchmarks, and it is a practical step forward for measuring how lower-level visual skills feed into harder reasoning. They also report that 53% of the images were published after January 2025, which keeps the content current.

Referee Report

3 major / 2 minor

Summary. The paper introduces LENS, a multi-level benchmark for MLLMs consisting of 3.4K recent social-media images and 60K+ human-authored questions spanning eight tasks and twelve daily scenarios. Questions are organized into three progressive tiers (perception, understanding, reasoning) with the key design that every image receives rich annotations for all tasks, enabling controlled evaluation of how lower-level capabilities support higher-order reasoning on identical images. The authors evaluate 15+ frontier MLLMs released after December 2024 (including Qwen2.5-VL-72B, InternVL3-78B, GPT-4o, QVQ-72B-preview, and Kimi-VL) and report that none exceed 60% accuracy on the reasoning tier.

Significance. If the annotations reliably isolate compositional reasoning without systematic artifacts, the benchmark offers a useful advance over task-oriented datasets by keeping the image distribution fixed across tiers. The finding that current frontier models remain below 60% on reasoning tasks would then constitute a clear, falsifiable signal of remaining limitations in multimodal compositional reasoning and could usefully inform future model development.

major comments (3)

[§3] §3 (Dataset Construction): No inter-annotator agreement statistics, explicit annotation guidelines distinguishing the three tiers, or quality-control procedures for the 60K questions are reported. This is load-bearing for the central claim because the headline result (no model >60% on reasoning) presupposes that the human-authored questions and per-image annotations genuinely measure compositional reasoning rather than annotation noise or unintended cues.
[§4] §4 (Experiments and Results): The paper provides no statistical significance tests, confidence intervals, or error analysis for the accuracy differences across models on the reasoning tier. Without these, it is difficult to determine whether the uniform sub-60% performance reflects a genuine capability ceiling or variability in evaluation.
[§3] §3: The manuscript does not describe checks for question ambiguity, distribution shift between tiers, or potential social-media image biases that could affect all models uniformly, leaving open the possibility that measured reasoning accuracies partly reflect annotation artifacts rather than model limitations.
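The confidence intervals and significance tests requested in the §4 comment could be computed along the following lines. This is a minimal sketch, assuming a percentile bootstrap over per-question 0/1 outcomes and an exact two-sided McNemar test on paired outcomes from two models. Neither choice is specified by the paper, and the function names are hypothetical.

```python
import random
from math import comb


def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.
    `correct` is a list of 0/1 outcomes, one per question."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar p-value for two models scored on the
    same questions (paired 0/1 outcomes)."""
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With paired outcomes on the reasoning tier, `bootstrap_ci` indicates whether a model's interval sits clearly below 60%, and `mcnemar_exact` indicates whether two models differ beyond what their disagreements alone would explain.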

minor comments (2)

[Abstract] Abstract: 'manully' is a typo and should read 'manually'.
The paper would benefit from a table summarizing per-tier question counts and example questions to make the progressive structure more concrete for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript introducing the LENS benchmark. We address each major comment point by point below, providing honest responses and indicating where revisions will be made to improve the paper's rigor and clarity.

Referee: [§3] §3 (Dataset Construction): No inter-annotator agreement statistics, explicit annotation guidelines distinguishing the three tiers, or quality-control procedures for the 60K questions are reported. This is load-bearing for the central claim because the headline result (no model >60% on reasoning) presupposes that the human-authored questions and per-image annotations genuinely measure compositional reasoning rather than annotation noise or unintended cues.

Authors: We agree that these details are essential to substantiate the benchmark's validity. The original submission omitted a comprehensive description of the annotation protocol primarily due to space limitations. In the revised manuscript, we will expand §3 with a dedicated subsection that includes: explicit tier-distinguishing guidelines provided to annotators, multi-stage quality control procedures involving expert review and filtering, and inter-annotator agreement statistics (e.g., Fleiss' kappa) calculated on a held-out sample of annotations. These additions will directly bolster confidence that the reasoning-tier results reflect genuine model limitations rather than annotation artifacts. revision: yes
Referee: [§4] §4 (Experiments and Results): The paper provides no statistical significance tests, confidence intervals, or error analysis for the accuracy differences across models on the reasoning tier. Without these, it is difficult to determine whether the uniform sub-60% performance reflects a genuine capability ceiling or variability in evaluation.

Authors: We acknowledge the value of statistical rigor for interpreting the results. We will revise §4 to report 95% bootstrap confidence intervals for all accuracy figures on the reasoning tier and include pairwise statistical significance tests (such as McNemar's test) between models. We will also add a concise error analysis subsection categorizing common failure modes (e.g., compositional errors vs. perceptual errors) across the evaluated MLLMs. These changes will help distinguish a potential capability ceiling from evaluation variability. revision: yes
Referee: [§3] §3: The manuscript does not describe checks for question ambiguity, distribution shift between tiers, or potential social-media image biases that could affect all models uniformly, leaving open the possibility that measured reasoning accuracies partly reflect annotation artifacts rather than model limitations.

Authors: We performed internal reviews to resolve ambiguities and ensured tier consistency by annotating all levels on identical images, which by design eliminates distribution shift across tiers. These steps were not fully documented in the original text. In the revision, we will add explicit descriptions of the ambiguity-checking process and tier-consistency verification in §3, along with a preliminary analysis of social-media image characteristics and their potential uniform impact. While exhaustive bias quantification would require additional experiments beyond the current scope, the added details will clarify our mitigation efforts. revision: partial
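The first response names Fleiss' kappa as the planned agreement statistic. For readers unfamiliar with it, the standard formula can be sketched as follows. The input layout (a count matrix with one row per annotated item and one column per category, for example the three tiers) is an assumption for illustration, not a description of the authors' annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of
    raters who assigned item i to category j. Every item must have the
    same number of ratings (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance from the category shares.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance. Which threshold counts as acceptable for tier labels is a judgment the revised §3 would need to state.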

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential reductions.

The paper constructs and releases the LENS dataset (3.4K images, 60K+ questions across three progressive tiers) and reports direct accuracy measurements on 15+ MLLMs, with the headline result that none exceed 60% on the reasoning tier. No equations, fitted parameters, predictions derived from subsets of the same data, or load-bearing self-citations appear in the provided text. The claims rest on external model inference against the new human-authored annotations rather than any internal derivation that reduces to its own inputs by construction, rendering the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on human annotation quality and the assumption that social-media images form a representative distribution for daily scenarios; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Human-authored questions and annotations for all tasks on each image are consistent and free of systematic bias.
Invoked in the description of rich annotations supporting image-invariable prompts from perception to reasoning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Lens encompasses eight tasks, systematically organized into three hierarchical tiers with eight subtasks... perception, understanding, and reasoning"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "none of them achieve an accuracy greater than 60% in the reasoning tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 15 internal anchors

[1]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Yuchen Duan, Hao Tian, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, et al. Deepseek-vl2: Mixture-of-experts vision- language models for advanced multimodal understanding.arXiv preprint arXiv:2412.10302, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88:303–338, 2010

work page 2010
[7]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InComputer vision–ECCV 2014: 13th European conference, zurich, Switzerland, September 6-12, 2014, proceedings, part v 13, pages 740–755. Springer, 2014

work page 2014
[8]

Modeling context in referring expressions

Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. InEuropean Conference on Computer Vision, pages 69–85. Springer, 2016

work page 2016
[9]

Mme-survey: A comprehensive survey on evaluation of multimodal llms

Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. Mme-survey: A comprehensive survey on evaluation of multimodal llms.arXiv preprint arXiv:2411.15296, 2024

work page arXiv 2024
[10]

A survey on multimodal benchmarks: In the era of large ai models

Lin Li, Guikun Chen, Hanrong Shi, Jun Xiao, and Long Chen. A survey on multimodal benchmarks: In the era of large ai models.arXiv preprint arXiv:2409.18142, 2024

work page arXiv 2024
[11]

V?: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V?: Guided visual search as a core mechanism in multimodal llms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13084–13094, 2024

work page 2024
[12]

Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, 2025

Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models.International Journal of Computer Vision, 133(2):825–843, 2025

work page 2025
[13]

Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Erran Li Li, Ruohan Zhang, et al. Embodied agent interface: Benchmarking llms for embodied decision making.Advances in Neural Information Processing Systems, 37:100428–100534, 2024. 11

work page 2024
[14]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

work page 2024
[15]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

work page 2024
[16]

Scaling language-free visual representation learning.arXiv preprint arXiv:2504.01017,

David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representa- tion learning.arXiv preprint arXiv:2504.01017, 2025

work page arXiv 2025
[17]

Octopus: A multi-modal llm with parallel recognition and sequential understanding.Advances in Neural Information Processing Systems, 37:90009–90029, 2024

Chuyang Zhao, YuXin Song, Junru Chen, Kang Rong, Haocheng Feng, Gang Zhang, Shufan Ji, Jingdong Wang, Errui Ding, and Yifan Sun. Octopus: A multi-modal llm with parallel recognition and sequential understanding.Advances in Neural Information Processing Systems, 37:90009–90029, 2024

work page 2024
[18]

Q-bench: A benchmark for multi-modal foundation models on low-level vision from single images to pairs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

Zicheng Zhang, Haoning Wu, Erli Zhang, Guangtao Zhai, and Weisi Lin. Q-bench: A benchmark for multi-modal foundation models on low-level vision from single images to pairs.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[19]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

work page 2024
[20]

Finecops-ref: A new dataset and task for fine-grained compositional referring expression comprehension.arXiv preprint arXiv:2409.14750, 2024

Junzhuo Liu, Xuzheng Yang, Weiwei Li, and Peng Wang. Finecops-ref: A new dataset and task for fine-grained compositional referring expression comprehension.arXiv preprint arXiv:2409.14750, 2024

work page arXiv 2024
[21]

A large-scale human- centric benchmark for referring expression comprehension in the lmm era.Advances in Neural Information Processing Systems, 37:69566–69587, 2024

Fangyun Wei, Jinjing Zhao, Kun Yan, Hongyang Zhang, and Chang Xu. A large-scale human- centric benchmark for referring expression comprehension in the lmm era.Advances in Neural Information Processing Systems, 37:69566–69587, 2024

work page 2024
[22]

Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123:32–73, 2017

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations.International journal of computer vision, 123:32–73, 2017

work page 2017
[23]

Emotion-llama: Multimodal emotion recognition and reason- ing with instruction tuning.Advances in Neural Information Processing Systems, 37:110805– 110853, 2024

Zebang Cheng, Zhi-Qi Cheng, Jun-Yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, and Alexander Hauptmann. Emotion-llama: Multimodal emotion recognition and reason- ing with instruction tuning.Advances in Neural Information Processing Systems, 37:110805– 110853, 2024

work page 2024
[24]

Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings.arXiv preprint arXiv:2309.08591, 2023

Chen Cecilia Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych. Are multilingual llms culturally-diverse reasoners? an investigation into multicultural proverbs and sayings.arXiv preprint arXiv:2309.08591, 2023

work page arXiv 2023
[25]

Context- aware chatbot using mllms for cultural heritage

Pavan Kartheek Rachabatuni, Filippo Principi, Paolo Mazzanti, and Marco Bertini. Context- aware chatbot using mllms for cultural heritage. InProceedings of the 15th ACM Multimedia Systems Conference, pages 459–463, 2024

work page 2024
[26]

A Survey on In-context Learning

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al. A survey on in-context learning.arXiv preprint arXiv:2301.00234, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[28]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022. 12

work page 2022
[29]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic.arXiv preprint arXiv:2306.15195, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity.arXiv preprint arXiv:2310.07704, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Towards visual grounding: A survey.arXiv preprint arXiv:2412.20206, 2024

Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, and Changsheng Xu. Towards visual grounding: A survey.arXiv preprint arXiv:2412.20206, 2024

work page arXiv 2024
[32]

Visual grounding with multi-modal conditional adaptation

Ruilin Yao, Shengwu Xiong, Yichen Zhao, and Yi Rong. Visual grounding with multi-modal conditional adaptation. InProceedings of the 32nd ACM International Conference on Multime- dia, pages 3877–3886, 2024

work page 2024
[33]

Marvel: Multidimensional abstraction and reasoning through visual evaluation and learning.Advances in Neural Information Processing Systems, 37:46567–46592, 2024

Yifan Jiang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara, et al. Marvel: Multidimensional abstraction and reasoning through visual evaluation and learning.Advances in Neural Information Processing Systems, 37:46567–46592, 2024

work page 2024
[34]

Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024

Gabriel Sarch, Lawrence Jang, Michael Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought.Advances in Neural Information Processing Systems, 37:75942–75985, 2024

work page 2024
[35]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

[39]

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023.

[40]

Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, and Zuxuan Wu. Synthesize diagnose and optimize: Towards fine-grained vision-language understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13279–13288, 2024.

[41]

Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[42]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.

[43]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022.

[44]

Zhecan Wang, Garrett Bingham, Adams Wei Yu, Quoc V Le, Thang Luong, and Golnaz Ghiasi. Haloquest: A visual hallucination dataset for advancing multimodal reasoning. In European Conference on Computer Vision, pages 288–304. Springer, 2024.

[45]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020.

[46]

Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, et al. The all-seeing project v2: Towards general relation comprehension of the open world. In European Conference on Computer Vision, pages 471–490. Springer, 2024.

[47]

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1209–1218, 2018.

[48]

Hao Shao, Shengju Qian, Han Xiao, Guanglu Song, Zhuofan Zong, Letian Wang, Yu Liu, and Hongsheng Li. Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. Advances in Neural Information Processing Systems, 37:8612–8642, 2024.

[49]

Yunqiu Xu, Linchao Zhu, and Yi Yang. Mc-bench: A benchmark for multi-context visual grounding in the era of mllms. arXiv preprint arXiv:2410.12332, 2024.

[50]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics, 2:67–78, 2014.

[51]

Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.

[52]

Yunjie Tian, Tianren Ma, Lingxi Xie, and Qixiang Ye. Chatterbox: Multimodal referring and grounding with chain-of-questions. Proceedings of the AAAI Conference on Artificial Intelligence, 39(7):7401–7409, Apr. 2025.

[53]

Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.

[54]

Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023.

[55]

Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651, 2023.

[56]

Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, Shujun Li, and Dietmar Saupe. The konstanz natural video database (konvid-1k). In 2017 Ninth international conference on quality of multimedia experience (QoMEX), pages 1–6. IEEE, 2017.

[57]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[58]

Fan Liu, Wenshuo Chao, Naiqiang Tan, and Hao Liu. Bag of tricks for inference-time computation of llm reasoning. arXiv preprint arXiv:2502.07191, 2025.

[59]

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023.

[60]

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Dan Zhang, Diego Rojas, Guanyu Feng, Hanlin Zhao, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024.
