arxiv: 2604.05117 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.AI· cs.CL

Recognition: no theorem link

Watch Before You Answer: Learning from Visually Grounded Post-Training

Yuxuan Zhang , EunJeong Hwang , Huaisong Zhang , Penghui Du , Yiming Jia , Dongfu Jiang , Xuan He , Shenhui Zhang

show 3 more authors

Ping Nie Peter West Kelsey R. Allen

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:56 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords vision-language modelsvideo understandingpost-trainingdata curationvisual groundingreinforcement learningbenchmark qualitymultimodal models

0 comments

The pith

Filtering post-training data to only visually grounded questions improves VLM video understanding by up to 6.2 points using 69 percent of the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many video understanding benchmarks and post-training datasets contain 40-60 percent of questions answerable from text alone, without needing visual input. The paper shows that restricting post-training to only the visually grounded subset of questions addresses this issue. When combined with RL-based post-training algorithms, this curation delivers higher performance than the full original dataset while using less data. It also outperforms several more complex post-training methods, indicating that ensuring questions require visual grounding is a major factor in advancing VLM capabilities.

Core claim

Commonly reported long video understanding benchmarks contain 40-60 percent of questions that can be answered using text cues alone, and similar issues pervade post-training datasets. By introducing VidGround to select only the actual visually grounded questions without linguistic biases for post-training, and pairing this with RL-based algorithms, performance improves by up to 6.2 points while using only 69.1 percent of the original post-training data. Data curation with a simple post-training algorithm outperforms several more complex post-training techniques, showing that data quality is a major bottleneck for video understanding in VLMs.

What carries the argument

VidGround, the selection of only visually grounded questions for post-training, which removes text-only solvable examples to force reliance on visual and temporal cues.

If this is right

Video understanding in VLMs improves without needing more data or compute resources.
Simple data filtering proves more effective than elaborate post-training algorithms.
Benchmarks must be redesigned to exclude questions solvable by text alone.
Post-training becomes more efficient by discarding non-visually grounded examples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar filtering could be applied during pre-training to reduce shortcut learning across multimodal tasks.
Current VLM video understanding scores may be inflated by benchmarks that allow text-only solutions.
Automated detection of visual grounding requirements could become standard in dataset preparation pipelines.

Load-bearing premise

The method for identifying visually grounded questions accurately isolates cases that require visual input without introducing selection bias or removing useful training signal.

What would settle it

If a random subset of the same size as the visually grounded portion produces comparable performance gains, the claim that improvements stem specifically from visual grounding would be falsified.

Figures

Figures reproduced from arXiv: 2604.05117 by Dongfu Jiang, EunJeong Hwang, Huaisong Zhang, Kelsey R. Allen, Penghui Du, Peter West, Ping Nie, Shenhui Zhang, Xuan He, Yiming Jia, Yuxuan Zhang.

**Figure 2.** Figure 2: Analysis of text-only answerable (TA) questions in video understanding [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Accuracy on visually grounded (VG) questions—questions that cannot be [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of reasoning paths on a VideoMMMU art analy [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Multi-model agreement analysis on Video-R1-260K [ [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Multi-model data curation pipelines for VidGround-M1 and VidGround-M2 variants (GPT: GPT-5-mini [37], Qwen: Qwen2.5-VL-7B [7], Gemini: Gemini-3.1-Pro [23]). Both pipelines evaluate all ∼263K samples from VideoR1-260K [20] in text-only mode. GPT uses single-pass evaluation, Qwen uses circular evaluation with option permutation for MCQ + Pass@10 for openended, and Gemini uses 3-permutation circular evalu… view at source ↗

**Figure 7.** Figure 7: Additional examples of text-only answerable (TA) questions from [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

**Figure 8.** Figure 8: Pass@10 distribution for video questions in Video-R1-260K [20]. Pass@10 measures the fraction of questions where the model produces at least one correct answer across 10 independent text-only samples. Qwen2.5-VL-7B [7] achieves Pass@10 > 0 on 74.5% of video questions without video input [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗

**Figure 10.** Figure 10: Overall Pass@10 distribution for Video-R1-260K [ [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗

**Figure 11.** Figure 11: Default prompt template. Red text indicates the instructions added for text-only evaluation. Enhanced prompt template for text-only evaluation Which tool is not necessary to make a snow globe? A. Distilled water B. Scissors C. Glitter D. Super glue. Based on the question and options provided, use your knowledge and common sense to determine the most likely correct answer. The video and subtitles are not a… view at source ↗

**Figure 13.** Figure 13: Art elements analysis question comparing [PITH_FULL_IMAGE:figures/full_fig_p034_13.png] view at source ↗

**Figure 14.** Figure 14: Psychology question. Same comparison setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p035_14.png] view at source ↗

**Figure 15.** Figure 15: Medical science question. Same comparison setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p036_15.png] view at source ↗

**Figure 16.** Figure 16: Clinical medical imaging interpretation question. Same comparison setup [PITH_FULL_IMAGE:figures/full_fig_p037_16.png] view at source ↗

**Figure 17.** Figure 17: Public health question. Same comparison setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p038_17.png] view at source ↗

**Figure 18.** Figure 18: Diagnostics and laboratory medicine question. Same comparison setup [PITH_FULL_IMAGE:figures/full_fig_p039_18.png] view at source ↗

**Figure 19.** Figure 19: Chemistry question. Same comparison setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p040_19.png] view at source ↗

**Figure 20.** Figure 20: Structural engineering question. Same comparison setup as Figure [PITH_FULL_IMAGE:figures/full_fig_p041_20.png] view at source ↗

read the original abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows many video QA benchmarks and post-training sets have 40-60% text-only solvable questions, and filtering to visually grounded subsets improves RL post-training by up to 6.2 points on less data, but the gains may not be specifically due to visual grounding without better controls.

read the letter

The main thing to know is that this work quantifies how many questions in video understanding benchmarks and post-training data can be answered from text alone, then shows that restricting RL post-training to the visually grounded remainder gives a clear lift while using only 69% of the data. They also claim this curation beats some more involved post-training methods, which points to data quality as a bigger lever than algorithm tweaks in this setting. That observation is useful and worth checking against your own setups if you work with VLMs on video tasks. The measurement of the text-only fraction and the VidGround filtering approach are the concrete new pieces here, and the empirical gains are reported directly enough to be actionable. The paper does a decent job highlighting a practical issue that affects how we interpret progress in video VLMs. The soft spot is the attribution. The headline result ties the improvement to visual grounding, but without ablations against random subsets of the same size or subsets matched on length, difficulty, or noise, it's possible the lift comes from incidental curation effects rather than the visual requirement itself. The method for identifying text-only solvable questions also needs to be spelled out with controls so readers can reproduce and trust the 40-60% figure. If the full paper has those details and checks, the claims hold up better; otherwise the central story stays suggestive. This is for researchers doing post-training or evaluation on video VLMs. Readers who care about data curation for multimodal models will find the numbers and the simple fix relevant. It deserves a serious referee because the problem it flags is real and widespread, and the results are testable even if they need tighter controls to pin down the mechanism.

Referee Report

3 major / 1 minor

Summary. The paper reports that 40-60% of questions in common long-video understanding benchmarks and post-training datasets are solvable from text alone. It introduces VidGround, a filter that retains only the visually grounded subset (69.1% of the original post-training data) and shows that, when paired with standard RL post-training, this subset yields up to +6.2 points over training on the full unfiltered set. The work further claims that this simple curation step outperforms several more elaborate post-training algorithms, positioning data quality as the primary bottleneck for video VLM performance.

Significance. If the attribution to visual grounding is substantiated, the result would be significant: it supplies concrete evidence that a large fraction of existing video benchmarks and training corpora do not actually require multimodal reasoning, and demonstrates that inexpensive filtering can produce larger gains than current algorithmic post-training innovations. This would redirect attention toward systematic dataset auditing and curation rather than solely toward new RL objectives or architectures.

major comments (3)

[Abstract / §3] Abstract and §3 (filter construction): the paper states that 40-60% of questions are text-only solvable and that the VidGround filter removes precisely the non-grounded remainder, yet provides no description of the procedure used to label questions as text-only (model, prompt template, temperature, decision threshold, or human validation set). Without these details the claimed 69.1% retention rate and the subsequent performance numbers cannot be reproduced or stress-tested.
[§4 / Table 2] §4 (main results) and Table 2: the headline +6.2 point gain is reported only for the VidGround subset versus the full dataset. No control experiment is shown for a random 69.1% subset, a length-matched subset, or a difficulty-matched subset. Consequently the observed lift cannot yet be attributed to removal of text-only questions rather than to incidental effects of reduced volume or altered answer distribution.
[§4.2] §4.2 (RL post-training details): the manuscript asserts that “data curation with a simple post-training algorithm outperforms several more complex post-training techniques,” but does not list the exact RL algorithms, reward models, hyper-parameters, or training budgets used in the comparison. This omission prevents verification that the reported superiority is driven by the data filter rather than by differences in optimization setup.

minor comments (1)

[Abstract] The project page is referenced but no supplementary material, code, or filtered dataset splits are linked in the main text, which would materially aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has identified important opportunities to improve the clarity, reproducibility, and evidential strength of our work. We respond to each major comment below and will incorporate revisions to address the concerns raised.

read point-by-point responses

Referee: [Abstract / §3] Abstract and §3 (filter construction): the paper states that 40-60% of questions are text-only solvable and that the VidGround filter removes precisely the non-grounded remainder, yet provides no description of the procedure used to label questions as text-only (model, prompt template, temperature, decision threshold, or human validation set). Without these details the claimed 69.1% retention rate and the subsequent performance numbers cannot be reproduced or stress-tested.

Authors: We agree that the absence of these implementation details limits reproducibility. In the revised manuscript we will expand §3 with a full specification of the text-only solvability labeling procedure, including the exact text-only model and version used, the complete prompt template, the temperature setting, the decision threshold or matching criterion for classifying a question as text-only solvable, and the protocol and results of any human validation performed on the automatic labels. These additions will allow independent reproduction of the 69.1% retention rate and the VidGround filter. revision: yes
Referee: [§4 / Table 2] §4 (main results) and Table 2: the headline +6.2 point gain is reported only for the VidGround subset versus the full dataset. No control experiment is shown for a random 69.1% subset, a length-matched subset, or a difficulty-matched subset. Consequently the observed lift cannot yet be attributed to removal of text-only questions rather than to incidental effects of reduced volume or altered answer distribution.

Authors: We acknowledge that the current results lack these controls and that they are necessary to isolate the contribution of visual grounding. In the revised version we will add an experiment using a randomly sampled 69.1% subset of the original post-training data to demonstrate that simply reducing data volume does not produce comparable gains. We will also include or discuss length-matched and difficulty-matched controls to the extent they can be constructed, along with analysis of any shifts in answer distribution, to strengthen the attribution of the observed improvements to the removal of non-grounded questions. revision: yes
Referee: [§4.2] §4.2 (RL post-training details): the manuscript asserts that “data curation with a simple post-training algorithm outperforms several more complex post-training techniques,” but does not list the exact RL algorithms, reward models, hyper-parameters, or training budgets used in the comparison. This omission prevents verification that the reported superiority is driven by the data filter rather than by differences in optimization setup.

Authors: We thank the referee for noting this omission. Although the comparisons were performed under a consistent post-training framework, the specific configurations were not enumerated. We will revise §4.2 to include a detailed description or table specifying the RL algorithms, reward models, all hyper-parameters, and training budgets (steps, batch sizes, learning rates, etc.) used both for the VidGround-filtered data and for the more complex post-training baselines. This will confirm that performance differences are attributable to the data curation rather than to variations in the optimization setup. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical data filtering and RL evaluation

full rationale

The paper is an empirical study that identifies text-only answerable questions in benchmarks and post-training datasets via measurement, filters to a visually grounded subset, and applies standard RL post-training algorithms to report measured performance gains. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain; the central claims rest on external dataset measurements and experimental outcomes rather than reducing to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is empirical and relies on standard assumptions in machine learning evaluation rather than new theoretical constructs.

axioms (1)

domain assumption Benchmark performance scores reflect genuine improvements in visual and temporal understanding when text-only shortcuts are removed.
The central claims rest on interpreting higher benchmark scores after filtering as evidence of better video understanding.

pith-pipeline@v0.9.0 · 5579 in / 1289 out tokens · 69635 ms · 2026-05-10T19:56:05.974352+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RewardHarness: Self-Evolving Agentic Post-Training
cs.AI 2026-05 unverdicted novelty 7.0

RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

Reference graph

Works this paper leans on

114 extracted references · 25 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

gpt-oss-120b & gpt-oss-20b Model Card

Agarwal, S., Ahmad, L., Ai, J., Altman, S., Applebaum, A., Arbus, E., Arora, R.K., Bai, Y., Baker, B., Bao, H., et al.: gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Alhamoud, K., Alshammari, S., Tian, Y., Li, G., Torr, P.H., Kim, Y., Ghassemi, M.: Vision-language models do not understand negation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29612–29622 (2025)

2025
[4]

Anthropic: Introducing claude sonnet 4.5 (2025),https://www.anthropic.com/ news/claude-sonnet-4-5

2025
[5]

Anthropic: Introducing claude opus 4.6 (2026),https://www.anthropic.com/ news/claude-opus-4-6

2026
[6]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bahl, S., Mendonca, R., Chen, L., Jain, U., Pathak, D.: Affordances from human videos as a versatile representation for robotics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13778–13790 (2023)

2023
[7]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

arXiv preprint arXiv:2402.17510 (2024)

Bleeker, M., Hendriksen, M., Yates, A., de Rijke, M.: Demonstrating and reducing shortcuts in vision-language representation learning. arXiv preprint arXiv:2402.17510 (2024)

work page arXiv 2024
[9]

Advances in neural information processing systems32(2019)

Cadene, R., Dancette, C., Cord, M., Parikh, D., et al.: Rubi: Reducing unimodal biases for visual question answering. Advances in neural information processing systems32(2019)

2019
[10]

Advances in Neural Information Processing Systems37, 113436–113460 (2024)

Campbell, D., Rane, S., Giallanza, T., De Sabbata, C.N., Ghods, K., Joshi, A., Ku, A., Frankland, S., Griffiths, T., Cohen, J.D., et al.: Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems37, 113436–113460 (2024)

2024
[11]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Chandhok, S., Fan, W.C., Shwartz, V., Balasubramanian, V.N., Sigal, L.: Response wide shut? surprising observations in basic vision language model capabilities. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 25530–25545 (2025)

2025
[12]

Sft or rl? an early investigation into training r1-like reasoning large vision-language models

Chen, H., Tu, H., Wang, F., Liu, H., Tang, X., Du, X., Zhou, Y., Xie, C.: Sft or rl? an early investigation into training r1-like reasoning large vision-language models. arXiv preprint arXiv:2504.11468 (2025)

work page arXiv 2025
[13]

Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

Chen, H., Razin, N., Narasimhan, K., Chen, D.: Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874 (2025)

work page arXiv 2025
[14]

Scaling rl to long videos.arXiv preprint arXiv:2507.07966, 2025

Chen, Y., Huang, W., Shi, B., Hu, Q., Ye, H., Zhu, L., Liu, Z., Molchanov, P., Kautz, J., Qi, X., et al.: Scaling rl to long videos. arXiv preprint arXiv:2507.07966 (2025)

work page arXiv 2025
[15]

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Chu, T., Zhai, Y., Yang, J., Tong, S., Xie, S., Schuurmans, D., Le, Q.V., Levine, S., Ma, Y.: Sft memorizes, rl generalizes: A comparative study of foundation model post-training. arXiv preprint arXiv:2501.17161 (2025)

work page internal anchor Pith review arXiv 2025
[16]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

Reinforcing video reasoning with focused thinking

Dang, J., Wu, J., Wang, T., Lin, X., Zhu, N., Chen, H., Zheng, W.S., Wang, M., Chua, T.S.: Reinforcing video reasoning with focused thinking. arXiv preprint arXiv:2505.24718 (2025)

work page arXiv 2025
[18]

Electronics14(7), 1282 (2025)

Elhenawy, M., Ashqar, H.I., Rakotonirainy, A., Alhadidi, T.I., Jaber, A., Tami, M.A.: Vision-language models for autonomous driving: Clip-based dynamic scene understanding. Electronics14(7), 1282 (2025)

2025
[19]

Advances in Neural Information Processing Systems37, 89098–89124 (2024)

Fang, X., Mao, K., Duan, H., Zhao, X., Li, Y., Lin, D., Chen, K.: Mmbench-video: A long-form multi-shot benchmark for holistic video understanding. Advances in Neural Information Processing Systems37, 89098–89124 (2024)

2024
[20]

Video-R1: Reinforcing Video Reasoning in MLLMs

Feng, K., Gong, K., Li, B., Guo, Z., Wang, Y., Peng, T., Wu, J., Zhang, X., Wang, B., Yue, X.: Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776 (2025)

work page internal anchor Pith review arXiv 2025
[21]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24108–24118 (2025)

2025
[22]

Hidden in plain sight: Vlms overlook their visual representations.arXiv preprint arXiv:2506.08008,

Fu, S., Bonnen, T., Guillory, D., Darrell, T.: Hidden in plain sight: Vlms overlook their visual representations. arXiv preprint arXiv:2506.08008 (2025)

work page arXiv 2025
[23]

Google: Gemini 3.1 pro: Best for complex tasks and bringing creative concepts to life (2026),https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/

2026
[24]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)

2017
[25]

Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., Liu, Z.: Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826 (2025)

work page internal anchor Pith review arXiv 2025
[26]

arXiv preprint arXiv:2502.11492 (2025)

Huang, K.H., Qin, C., Qiu, H., Laban, P., Joty, S., Xiong, C., Wu, C.S.: Why vision language models struggle with visual arithmetic? towards enhanced chart and geometry understanding. arXiv preprint arXiv:2502.11492 (2025)

work page arXiv 2025
[27]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-univi: Unified visual rep- resentation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13700–13710 (2024)

2024
[29]

In: Findings of the Association for Computational Linguistics: NAACL 2025

Lee, K.i., Kim, M., Yoon, S., Kim, M., Lee, D., Koh, H., Jung, K.: Vlind-bench: Measuring language priors in large vision-language models. In: Findings of the Association for Computational Linguistics: NAACL 2025. pp. 4129–4144 (2025)

2025
[30]

In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval

Liang, Z., Hu, H., Zhu, J.: Lpf: A language-prior feedback objective function for de- biased visual question answering. In: Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. pp. 1955– 1959 (2021)

1955
[31]

In Al-Onaizan, Y ., Bansal, M

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Al- Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Confer- ence on Empirical Methods in Natural Language Processing. pp. 5971–5984. Association for Computational Linguistics, Miami, Flor...

work page doi:10.18653/v1/2024.emnlp-main.342 2024
[32]

In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 5971–5984 (2024)

2024
[33]

DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., et al.: Deepseek-v3. 2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

arXiv preprint arXiv:2501.00569 (2024)

Luo, T., Cao, A., Lee, G., Johnson, J., Lee, H.: Probing visual language priors in vlms. arXiv preprint arXiv:2501.00569 (2024)

work page arXiv 2024
[35]

In: European Conference on Computer Vision

Miller, D., S¨ underhauf, N., Kenna, A., Mason, K.: Open-set recognition in the age of vision-language models. In: European Conference on Computer Vision. pp. 1–18. Springer (2024)

2024
[36]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Niu, Y., Tang, K., Zhang, H., Lu, Z., Hua, X.S., Wen, J.R.: Counterfactual vqa: A cause-effect look at language bias. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12700–12710 (2021)

2021
[37]

OpenAI: Gpt-5.https://openai.com(2025)

2025
[38]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Parashar, S., Lin, Z., Liu, T., Dong, X., Li, Y., Ramanan, D., Caverlee, J., Kong, S.: The neglected tails in vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12988–12997 (2024)

2024
[39]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Park, J., Jang, K.J., Alasaly, B., Mopidevi, S., Zolensky, A., Eaton, E., Lee, I., Johnson, K.: Assessing modality bias in video question answering benchmarks with multimodal large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 19821–19829 (2025)

2025
[40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Peng, W., Xie, S., You, Z., Lan, S., Wu, Z.: Synthesize diagnose and opti- mize: Towards fine-grained vision-language understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13279– 13288 (2024)

2024
[41]

In: Proceedings of the Asian Conference on Computer Vision

Rahmanzadehgervi, P., Bolton, L., Taesiri, M.R., Nguyen, A.T.: Vision language models are blind. In: Proceedings of the Asian Conference on Computer Vision. pp. 18–34 (2024)

2024
[42]

Advances in neural information processing systems31(2018)

Ramakrishnan, S., Agrawal, A., Lee, S.: Overcoming language priors in visual question answering with adversarial regularization. Advances in neural information processing systems31(2018)

2018
[43]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ranasinghe, K., Shukla, S.N., Poursaeed, O., Ryoo, M.S., Lin, T.Y.: Learning to localize objects improves spatial reasoning in visual-llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12977– 12987 (2024)

2024
[44]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

IEEE Transactions on Circuits and Systems for Video Technology pp

Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., Xu, C.: Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology pp. 1–1 (2025).https://doi.org/10.1109/...

work page doi:10.1109/tcsvt.2025.3566695 2025
[46]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9568–9578 (2024)

2024
[47]

Vision Language Models are Biased

Vo, A., Nguyen, K.N., Taesiri, M.R., Dang, V.T., Nguyen, A.T., Kim, D.: Vision language models are biased. arXiv preprint arXiv:2505.23941 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Llava-critic-r1: Your critic model is secretly a strong policy model.CoRR, abs/2509.00676, 2025

Wang, X., Li, C., Yang, J., Zhang, K., Liu, B., Xiong, T., Huang, F.: Llava-critic-r1: Your critic model is secretly a strong policy model. arXiv preprint arXiv:2509.00676 (2025)

work page arXiv 2025
[49]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Wang, Z., Yoon, J., Yu, S., Islam, M.M., Bertasius, G., Bansal, M.: Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 28114–28128 (2025)

2025
[50]

von Werra, L., Belkada, Y., Tunstall, L., Beeching, E., Thrush, T., Lambert, N., Huang, S., Rasul, K., Gallou´ edec, Q.: TRL: Transformer Reinforcement Learning, https://github.com/huggingface/trl
[51]

When language overrules: Modality imbalance in vlms.arXiv preprint arXiv:2508.10552, 2025

Wu, H., Tang, M., Zheng, X., Jiang, H.: When language overrules: Revealing text dominance in multimodal large language models. arXiv preprint arXiv:2508.10552 (2025)

work page arXiv 2025
[52]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Xu, Y., Zhu, L., Yang, Y.: Mc-bench: A benchmark for multi-context visual ground- ing in the era of mllms. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17675–17687 (2025)

2025
[53]

In: The Eleventh In- ternational Conference on Learning Representations (2023),https://openreview

Xue, H., Sun, Y., Liu, B., Fu, J., Song, R., Li, H., Luo, J.: CLIP-vip: Adapting pre-trained image-text model to video-language alignment. In: The Eleventh In- ternational Conference on Learning Representations (2023),https://openreview. net/forum?id=GNjzMAgawq

2023
[54]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

Zhang, G., Zhang, Y., Zhang, K., Tresp, V.: Can vision-language models be a good guesser? exploring vlms for times and location reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 636–645 (2024)

2024
[56]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

work page Pith review arXiv 2024
[57]

Zhang, Y., Unell, A., Wang, X., Ghosh, D., Su, Y., Schmidt, L., Yeung-Levy, S.: Why are visually-grounded language models bad at image classification? Advances in Neural Information Processing Systems37, 51727–51753 (2024)

2024
[58]

Gemini-1.5-Pro achieves less than 15% accuracy in the text-only setup

Zhao, Y., Zhang, H., Xie, L., Hu, T., Gan, G., Long, Y., Hu, Z., Chen, W., Li, C., Xu, Z., et al.: Mmvu: Measuring expert-level multi-discipline video understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 8475–8489 (2025) Appendix Appendix Overview AMulti-model agreement analysis BCross-task generalization: image QA...

2025
[59]

The lines are varied in direction and style, contributing significantly to the overall composition

Line: There are numerous lines used in the drawings, especially in the abstract and geometric shapes. The lines are varied in direction and style, contributing significantly to the overall composition
[60]

The use of shapes to create patterns and structures is evident

Shape: Shapes are also prominent, with both geometric and organic forms present. The use of shapes to create patterns and structures is evident
[61]

The emphasis is more on two- dimensional elements rather than creating a sense of volume through form

Form: While there are some three-dimensional elements like the cube and the sphere, they are not the main focus. The emphasis is more on two- dimensional elements rather than creating a sense of volume through form
[62]

The colors are bright and flat, without much shading

Value: The use of value (light and shadow) is minimal compared to other elements. The colors are bright and flat, without much shading
[63]

The colors are used to fill shapes and add contrast, but they do not create a strong color palette

Color: Color is present, but it is not the most dominant element. The colors are used to fill shapes and add contrast, but they do not create a strong color palette
[64]

The surfaces are smooth and flat

Texture: There is no significant texture in the artwork. The surfaces are smooth and flat
[65]

The arrangement of elements is somewhat balanced, but it is not the primary focus

Balance: The balance is somewhat achieved, but it is not the most striking feature. The arrangement of elements is somewhat balanced, but it is not the primary focus
[66]

The artwork does not explore depth or negative space extensively

Space: The concept of space is present, but it is not the most prominent. The artwork does not explore depth or negative space extensively
[67]

The colors and lines provide some contrast, but it is not the main focus

Contrast: There is some contrast, but it is not the most dominant element. The colors and lines provide some contrast, but it is not the main focus
[68]

Given the analysis, the most prominently featured elements are line and shape

Proportion: The proportions are somewhat consistent, but they are not the most striking feature. Given the analysis, the most prominently featured elements are line and shape. However, since the question asks for the most prominent, and considering the variety of lines and shapes, the answer leans towards: B. Shape Final answer: B. Shape To determine whic...
[69]

This suggests that line is a significant element in the artwork

Line: The video shows a hand drawing various types of lines such as horizontal, diagonal, curved, and dotted. This suggests that line is a significant element in the artwork
[70]

Shape: The video also shows the drawing of diﬀerent shapes like triangles, circles, and squares, indicating that shape is another prominent element
[71]

Form: The video briefly shows a 3D form, which suggests that form is also an element being discussed
[72]

Value: The video mentions value, which refers to the lightness or darkness of a color, and shows examples of how value can be used in art
[73]

Color: The video includes a color wheel and discusses primary colors, indicating that color is a key element
[74]

Texture: The video briefly shows a textured surface, suggesting that texture is also an element being discussed
[75]

Balance: The video does not focus on balance as a primary element
[76]

Given the content of the video, the elements that are most prominently featured are likely to be those that are discussed and demonstrated in the video

Space: The video mentions space, including positive and negative space, indicating that space is a significant element. Given the content of the video, the elements that are most prominently featured are likely to be those that are discussed and demonstrated in the video. Based on the video content, the most prominent elements are: Line, Shape, Color, Tex...
[77]

The structure introduced at 1:18 is damaged
[78]

The structure introduced at 0:32 is damaged
[79]

The introduced at 1:10 is severed or damaged
[80]

The dendrites did not receive enough neurotransmitter signals to properly initiate the process

Showing first 80 references.