pith. machine review for the scientific record.

arxiv: 2408.16500 · v1 · submitted 2024-08-29 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

CogVLM2: Visual Language Models for Image and Video Understanding

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual language models · image understanding · video understanding · multimodal models · benchmark performance · temporal grounding · high resolution inputs

The pith

The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the CogVLM2 family of visual language models, which includes versions for image understanding and a dedicated video model. It extends prior designs by supporting inputs up to 1344 by 1344 pixels for images and by adding timestamped multi-frame processing with automated temporal grounding data for videos. A reader would care because these updates produce leading scores on established tests for visual question answering and video tasks, pointing to practical gains in how models combine sight and language. The work also releases the models publicly.

Core claim

CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 by 1344 pixels. CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. The family achieves state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench.

What carries the argument

Visual expert architecture with enhanced pre-training and post-training recipes for images, plus multi-frame inputs with timestamps and automated temporal grounding data construction for videos.
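
To make the load-bearing mechanism concrete, here is a minimal sketch of the visual-expert idea inherited from CogVLM: image tokens are projected through their own QKV weights while attending jointly with text tokens over one sequence. Module names, dimensions, and structure are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    """Sketch of visual-expert attention (assumed structure): text and image
    tokens attend jointly, but image tokens use a separate, trainable set of
    QKV projections so fusion deepens without touching the text path."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv_text = nn.Linear(dim, 3 * dim)   # the language model's own weights
        self.qkv_image = nn.Linear(dim, 3 * dim)  # the added "visual expert" weights
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); image_mask: (batch, seq) bool, True at image tokens.
        qkv = torch.where(image_mask.unsqueeze(-1), self.qkv_image(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)

        def heads(t: torch.Tensor) -> torch.Tensor:
            b, s, _ = t.shape
            return t.view(b, s, self.n_heads, self.head_dim).transpose(1, 2)

        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
        b, h, s, d = attn.shape
        return self.out(attn.transpose(1, 2).reshape(b, s, h * d))
```

The design choice this illustrates: per-modality parameters let the model specialize for visual inputs while leaving text-only behavior largely intact.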

If this is right

  • Higher resolution inputs allow models to capture finer visual details in image tasks.
  • Timestamp integration and automated data construction improve handling of time-based events in videos (a prompt-level sketch follows this list).
  • The same family of models can address both static image and dynamic video understanding without separate systems.
  • Public release of the models enables direct use and further adaptation on new tasks.
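
For the timestamp point above, a sketch of what timestamped multi-frame input could look like at the prompt level. The abstract does not specify the serialization, so the tags and format below are assumptions, not the paper's actual template.

```python
def build_timestamped_prompt(frame_times, question):
    """Hypothetical serialization: each sampled frame is preceded by its
    timestamp so the model can localize events in seconds rather than only
    in frame order. `<frame_i>` stands in for that frame's visual tokens."""
    parts = [f"[{t:.1f}s] <frame_{i}>" for i, t in enumerate(frame_times)]
    return " ".join(parts) + f"\nQuestion: {question}"

# Example: 8 frames sampled uniformly from a 24-second clip.
times = [i * 24 / 8 for i in range(8)]
print(build_timestamped_prompt(times, "When does the person open the door?"))
```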

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated temporal grounding methods may lower the cost of preparing video training data across other multimodal projects.
  • Unified image-video architectures could support applications that mix still frames and motion, such as instructional video analysis.
  • Refinements in vision-language fusion might transfer to related areas like visual reasoning in robotics.

Load-bearing premise

The reported benchmark gains come mainly from the described architecture and training changes rather than from larger model sizes, more data, or extra compute that are not disclosed.

What would settle it

An ablation experiment that holds model size, data volume, and total compute fixed while applying only the new training recipes and input handling, then measures whether the benchmark scores still match the claimed state-of-the-art levels.
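
To pin down the shape of that experiment, a hypothetical ablation grid. The factor names and budget placeholders are invented for illustration; the paper defines no such protocol. The 490-pixel setting reflects the original CogVLM input ceiling, against CogVLM2's 1344.

```python
from itertools import product

# Held fixed across every condition (placeholders, not disclosed values).
FIXED = {"params": "matched", "train_tokens": "matched", "compute": "matched"}

# Toggled factors: the inherited visual expert, the higher input resolution,
# and the revised pre-/post-training recipes.
FACTORS = {
    "visual_expert": [False, True],
    "resolution": [490, 1344],
    "new_recipe": [False, True],
}

for combo in product(*FACTORS.values()):
    condition = dict(zip(FACTORS, combo))
    # Each condition would be trained under the FIXED budgets, then scored on
    # MMBench, MM-Vet, TextVQA, MVBench, and VCGBench.
    print({**FIXED, **condition})
```

If the full-feature condition still matches the reported numbers under matched budgets, the attribution holds; if not, scale is doing the work.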

Original abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the CogVLM2 family of visual language models, comprising CogVLM2 for image understanding and CogVLM2-Video for video understanding, along with GLM-4V. Building on the visual-expert architecture from prior CogVLM work, the models incorporate higher-resolution input support (up to 1344×1344), revised pre-training and post-training recipes, multi-frame video inputs with timestamps, and an automated temporal grounding data construction method. The manuscript reports state-of-the-art results on benchmarks including MMBench, MM-Vet, TextVQA, MVBench, and VCGBench, and releases the models as open source.

Significance. If the performance gains can be attributed to the described architecture and training changes rather than scale increases, the work would deliver open, reproducible models that advance multimodal fusion for both static images and temporal video reasoning, providing a useful baseline for the community.

major comments (2)
  1. [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.
  2. [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.
minor comments (1)
  1. [Abstract] The abstract states that models 'inherit the visual expert architecture with improved training recipes' without enumerating the concrete recipe changes; a brief enumeration would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing the strongest honest defense of our contributions while acknowledging areas where additional details or clarifications are warranted. We will incorporate revisions to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.

    Authors: We acknowledge that the manuscript does not contain explicit scale-matched ablations or controlled baselines that fully isolate the visual-expert architecture, higher-resolution support, and revised training recipes from potential effects of increased model scale or data volume. CogVLM2 extends the visual-expert design introduced in our prior CogVLM work, with targeted enhancements in resolution (up to 1344×1344) and training procedures; the reported SOTA results on MMBench, MM-Vet, TextVQA, MVBench, and VCGBench reflect the cumulative outcome of these changes. In the revised manuscript we will add explicit parameter counts for CogVLM2, CogVLM2-Video, and GLM-4V, along with available details on pre-training and post-training data volumes. Full isolation experiments would require substantial additional compute resources beyond the scope of this submission; however, the open-source release of models and code enables independent verification and further ablation studies by the community. revision: partial

  2. Referee: [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.

    Authors: We agree that the current description of the automated temporal grounding pipeline is high-level and limits reproducibility. In the revised manuscript we will expand this section with concrete implementation details, including the specific LLM prompts used to generate temporal grounding annotations, the filtering and quality-control steps, the format of timestamp integration with multi-frame inputs, and representative examples of the constructed training data. These additions will allow readers to better assess the method's contribution to the observed gains on MVBench and VCGBench. revision: yes
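
Since the rebuttal promises implementation details, here is a hypothetical skeleton of an automated temporal-grounding pipeline of the kind described. The prompt wording, JSON schema, and filter are all assumptions; `llm` stands for any callable that maps a prompt string to a JSON string.

```python
import json

def construct_grounding_record(clip_captions, llm):
    """clip_captions: [{"start": s, "end": e, "caption": str}, ...] for one
    video; llm: callable mapping a prompt string to a JSON string."""
    prompt = (
        "Given per-segment captions with start/end times, write one question "
        "about WHEN an event happens, answered by its time span, as JSON "
        '{"question": ..., "answer": ..., "start": ..., "end": ...}.\n'
        + json.dumps(clip_captions)
    )
    record = json.loads(llm(prompt))
    # Quality control (assumed): keep only non-degenerate spans that lie
    # inside the video; a real pipeline would add consistency checks.
    video_end = max(c["end"] for c in clip_captions)
    if 0 <= record["start"] < record["end"] <= video_end:
        return record
    return None
```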

standing simulated objections (1 unresolved)
  • Full scale-matched ablation studies that rigorously isolate the contributions of architecture, resolution, and training recipes from raw scaling effects, as these controlled experiments were not performed in the original work.

Circularity Check

0 steps flagged

No circularity: purely empirical model description and benchmark reporting

full rationale

The manuscript describes the CogVLM2 family as an empirical extension of prior CogVLM/GLM-4V work, with inherited visual-expert architecture, higher-resolution support, revised training recipes, multi-frame video input, and automated data construction. All central claims consist of measured performance numbers on external public benchmarks (MMBench, MM-Vet, TextVQA, MVBench, VCGBench). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the work contains no mathematical chain that could reduce to its own inputs by construction. Attribution questions (scale vs. architecture) are matters of experimental design, not circularity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard large-scale vision-language model training assumptions and empirical scaling; no new physical or mathematical axioms are introduced.

free parameters (1)
  • training recipe hyperparameters
    Improved pre-training and post-training recipes imply multiple tuned values for learning rates, batch sizes, and data mixtures that are not enumerated in the abstract; an illustrative enumeration follows the ledger.
axioms (1)
  • domain assumption: Higher input resolution and multi-frame video inputs improve vision-language fusion performance
    Invoked implicitly when claiming gains from the 1344×1344 resolution and timestamped multi-frame architecture.
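
To make the flagged free parameter concrete, an illustrative enumeration of the kind of recipe values the abstract leaves unstated. Every number below is invented for illustration; the paper discloses none of them.

```python
# Hypothetical recipe ledger: placeholder values only, not from the paper.
training_recipe = {
    "pretrain": {"lr": 1e-4, "batch_size": 2048, "image_text_mix": 0.8},
    "posttrain": {"lr": 2e-5, "batch_size": 256, "instruction_mix": 0.6},
}
```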

pith-pipeline@v0.9.0 · 5581 in / 1226 out tokens · 34863 ms · 2026-05-16T20:07:23.935615+00:00 · methodology


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  2. Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.

  3. Towards Unconstrained Human-Object Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

  4. VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

    cs.CV 2026-04 unverdicted novelty 7.0

    VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...

  5. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  6. CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

    cs.CV 2026-05 unverdicted novelty 6.0

    CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...

  7. VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

    cs.CV 2026-04 unverdicted novelty 6.0

    VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

  8. MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

  9. VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

    cs.CV 2026-01 unverdicted novelty 6.0

    VideoGPA distills geometry priors via self-supervised DPO to enhance 3D consistency, temporal stability, and motion coherence in video diffusion models.

  10. High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models

    cs.CV 2025-12 unverdicted novelty 6.0

    High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.

  11. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  12. Improving Video Generation with Human Feedback

    cs.CV 2025-01 unverdicted novelty 6.0

    A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

  13. VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

    cs.CV 2024-12 unverdicted novelty 6.0

    VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% high...

  14. Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

    cs.RO 2024-12 unverdicted novelty 6.0

    Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...

  15. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    cs.CV 2024-08 unverdicted novelty 6.0

    CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

  16. Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

    cs.CV 2026-05 conditional novelty 5.0

    A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.

  17. KD-CVG: A Knowledge-Driven Approach for Creative Video Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    KD-CVG uses an Advertising Creative Knowledge Base plus Semantic-Aware Retrieval and Multimodal Knowledge Reference modules to improve semantic alignment and motion realism in text-to-video generation for advertising.

  18. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.

  19. ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

    cs.AI 2026-04 unverdicted novelty 5.0

    ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.

  20. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
