pith. machine review for the scientific record.

arxiv: 2408.16500 · v1 · submitted 2024-08-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CogVLM2: Visual Language Models for Image and Video Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual language models · image understanding · video understanding · multimodal models · benchmark performance · temporal grounding · high resolution inputs

The pith

The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the CogVLM2 family of visual language models, which includes versions for image understanding and a dedicated video model. It extends prior designs by supporting inputs up to 1344 by 1344 pixels for images and by adding timestamped multi-frame processing with automated temporal grounding data for videos. A reader would care because these updates produce leading scores on established tests for visual question answering and video tasks, pointing to practical gains in how models combine sight and language. The work also releases the models publicly.

Core claim

CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 by 1344 pixels. CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. The family achieves state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.

What carries the argument

Visual expert architecture with enhanced pre-training and post-training recipes for images, plus multi-frame inputs with timestamps and automated temporal grounding data construction for videos.
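
To make the load-bearing mechanism concrete, here is a minimal sketch of the visual-expert idea inherited from CogVLM: image tokens are projected through their own QKV weights while attending jointly with text tokens over one sequence. Module names, dimensions, and structure are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    """Sketch of visual-expert attention (assumed structure): text and image
    tokens attend jointly, but image tokens use a separate, trainable set of
    QKV projections so fusion deepens without touching the text path."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)   # the language model's own weights
        self.qkv_image = nn.Linear(dim, 3 * dim)  # the added "visual expert" weights
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (batch, seq) bool, True at image tokens.
        qkv = torch.where(image_mask.unsqueeze(-1), self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(t: torch.Tensor) -> torch.Tensor:
            b, s, _ = t.shape
            return t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        b, h, s, d = attn.shape
        return self.out(attn.transpose(1, 2).reshape(b, s, h * d))
```

The design choice this illustrates: per-modality parameters let the model specialize for visual inputs while leaving text-only behavior largely intact.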

If this is right

  • Higher resolution inputs allow models to capture finer visual details in image tasks.
  • Timestamp integration and automated data construction improve handling of time-based events in videos (a prompt-level sketch follows this list).
  • The same family of models can address both static image and dynamic video understanding without separate systems.
  • Public release of the models enables direct use and further adaptation on new tasks.
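
For the timestamp point above, a sketch of what timestamped multi-frame input could look like at the prompt level. The abstract does not specify the serialization, so the tags and format below are assumptions, not the paper's actual template.

```python
def build_timestamped_prompt(frame_times, question):
    """Hypothetical serialization: each sampled frame is preceded by its
    timestamp so the model can localize events in seconds rather than only
    in frame order. `<frame_i>` stands in for that frame's visual tokens."""
    parts = [f"[{t:.1f}s] <frame_{i}>" for i, t in enumerate(frame_times)]
    return " ".join(parts) + f"\nQuestion: {question}"

# Example: 8 frames sampled uniformly from a 24-second clip.
times = [i * 24 / 8 for i in range(8)]
print(build_timestamped_prompt(times, "When does the person open the door?"))
```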

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated temporal grounding methods may lower the cost of preparing video training data across other multimodal projects.
  • Unified image-video architectures could support applications that mix still frames and motion, such as instructional video analysis.
  • Refinements in vision-language fusion might transfer to related areas like visual reasoning in robotics.

Load-bearing premise

The reported benchmark gains come mainly from the described architecture and training changes rather than from larger model sizes, more data, or extra compute that are not disclosed.

What would settle it

An ablation experiment that holds model size, data volume, and total compute fixed while applying only the new training recipes and input handling, then measures whether the benchmark scores still match the claimed state-of-the-art levels.
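
To pin down the shape of that experiment, a hypothetical ablation grid. The factor names and budget placeholders are invented for illustration; the paper defines no such protocol. The 490-pixel setting reflects the original CogVLM input ceiling, against CogVLM2's 1344.

```python
from itertools import product

# Held fixed across every condition (placeholders, not disclosed values).
FIXED = {"params": "matched", "train_tokens": "matched", "compute": "matched"}

# Toggled factors: the inherited visual expert, the higher input resolution,
# and the revised pre-/post-training recipes.
FACTORS = {
    "visual_expert": [False, True],
    "resolution": [490, 1344],
    "new_recipe": [False, True],
}

for combo in product(*FACTORS.values()):
    condition = dict(zip(FACTORS, combo))
    # Each condition would be trained under the FIXED budgets, then scored on
    # MMBench, MM-Vet, TextVQA, MVBench, and VCGBench.
    print({**FIXED, **condition})
```

If the full-feature condition still matches the reported numbers under matched budgets, the attribution holds; if not, scale is doing the work.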

Original abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the CogVLM2 family of visual language models, comprising CogVLM2 for image understanding and CogVLM2-Video for video understanding, along with GLM-4V. Building on the visual-expert architecture from prior CogVLM work, the models incorporate higher-resolution input support (up to 1344×1344), revised pre-training and post-training recipes, multi-frame video inputs with timestamps, and an automated temporal grounding data construction method. The manuscript reports state-of-the-art results on benchmarks including MMBench, MM-Vet, TextVQA, MVBench, and VCGBench, and releases the models as open source.

Significance. If the performance gains can be attributed to the described architecture and training changes rather than scale increases, the work would deliver open, reproducible models that advance multimodal fusion for both static images and temporal video reasoning, providing a useful baseline for the community.

major comments (2)
  1. [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.
  2. [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.
minor comments (1)
  1. [Abstract] The abstract states that models 'inherit the visual expert architecture with improved training recipes' without enumerating the concrete recipe changes; a brief enumeration would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing the strongest honest defense of our contributions while acknowledging areas where additional details or clarifications are warranted. We will incorporate revisions to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.

    Authors: We acknowledge that the manuscript does not contain explicit scale-matched ablations or controlled baselines that fully isolate the visual-expert architecture, higher-resolution support, and revised training recipes from potential effects of increased model scale or data volume. CogVLM2 extends the visual-expert design introduced in our prior CogVLM work, with targeted enhancements in resolution (up to 1344×1344) and training procedures; the reported SOTA results on MMBench, MM-Vet, TextVQA, MVBench, and VCGBench reflect the cumulative outcome of these changes. In the revised manuscript we will add explicit parameter counts for CogVLM2, CogVLM2-Video, and GLM-4V, along with available details on pre-training and post-training data volumes. Full isolation experiments would require substantial additional compute resources beyond the scope of this submission; however, the open-source release of models and code enables independent verification and further ablation studies by the community. revision: partial

  2. Referee: [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.

    Authors: We agree that the current description of the automated temporal grounding pipeline is high-level and limits reproducibility. In the revised manuscript we will expand this section with concrete implementation details, including the specific LLM prompts used to generate temporal grounding annotations, the filtering and quality-control steps, the format of timestamp integration with multi-frame inputs, and representative examples of the constructed training data. These additions will allow readers to better assess the method's contribution to the observed gains on MVBench and VCGBench. revision: yes
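
Since the rebuttal promises implementation details, here is a hypothetical skeleton of an automated temporal-grounding pipeline of the kind described. The prompt wording, JSON schema, and filter are all assumptions; `llm` stands for any callable that maps a prompt string to a JSON string.

```python
import json

def construct_grounding_record(clip_captions, llm):
    """clip_captions: [{"start": s, "end": e, "caption": str}, ...] for one
    video; llm: callable mapping a prompt string to a JSON string."""
    prompt = (
        "Given per-segment captions with start/end times, write one question "
        "about WHEN an event happens, answered by its time span, as JSON "
        '{"question": ..., "answer": ..., "start": ..., "end": ...}.\n'
        + json.dumps(clip_captions)
    )
    record = json.loads(llm(prompt))
    # Quality control (assumed): keep only non-degenerate spans that lie
    # inside the video; a real pipeline would add consistency checks.
    video_end = max(c["end"] for c in clip_captions)
    if 0 <= record["start"] < record["end"] <= video_end:
        return record
    return None
```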

standing simulated objections (1 unresolved)
  • Full scale-matched ablation studies that rigorously isolate the contributions of architecture, resolution, and training recipes from raw scaling effects, as these controlled experiments were not performed in the original work.

Circularity Check

0 steps flagged

No circularity: purely empirical model description and benchmark reporting

full rationale

The manuscript describes the CogVLM2 family as an empirical extension of prior CogVLM/GLM-4V work, with inherited visual-expert architecture, higher-resolution support, revised training recipes, multi-frame video input, and automated data construction. All central claims consist of measured performance numbers on external public benchmarks (MMBench, MM-Vet, TextVQA, MVBench, VCGBench). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the work contains no mathematical chain that could reduce to its own inputs by construction. Attribution questions (scale vs. architecture) are matters of experimental design, not circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard large-scale vision-language model training assumptions and empirical scaling; no new physical or mathematical axioms are introduced.

free parameters (1)
  • training recipe hyperparameters
    Improved pre-training and post-training recipes imply multiple tuned values for learning rates, batch sizes, and data mixtures that are not enumerated in the abstract; an illustrative enumeration follows the ledger.
axioms (1)
  • domain assumption: Higher input resolution and multi-frame video inputs improve vision-language fusion performance
    Invoked implicitly when claiming gains from the 1344×1344 resolution and timestamped multi-frame architecture.
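
To make the flagged free parameter concrete, an illustrative enumeration of the kind of recipe values the abstract leaves unstated. Every number below is invented for illustration; the paper discloses none of them.

```python
# Hypothetical recipe ledger: placeholder values only, not from the paper.
training_recipe = {
    "pretrain": {"lr": 1e-4, "batch_size": 2048, "image_text_mix": 0.8},
    "posttrain": {"lr": 2e-5, "batch_size": 256, "instruction_mix": 0.6},
}
```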

pith-pipeline@v0.9.0 · 5581 in / 1226 out tokens · 34863 ms · 2026-05-16T20:07:23.935615+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  3. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  4. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  7. VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  8. MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

  9. VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

    cs.CV 2026-01 unverdicted novelty 6.0

    VideoGPA distills geometry priors via self-supervised DPO to enhance 3D consistency, temporal stability, and motion coherence in video diffusion models.

  10. High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.

  11. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  12. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  13. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    cs.CV 2024-12 unverdicted novelty 6.0

    VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% high...

  14. Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    cs.RO 2024-12 unverdicted novelty 6.0

    Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...

  15. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  16. Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

    cs.CV 2026-05 conditional novelty 5.0

    A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.

  17. KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    KD-CVG uses an Advertising Creative Knowledge Base plus Semantic-Aware Retrieval and Multimodal Knowledge Reference modules to improve semantic alignment and motion realism in text-to-video generation for advertising.

  18. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  19. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  20. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
