CogVLM2: Visual Language Models for Image and Video Understanding
Pith reviewed 2026-05-16 20:07 UTC · model grok-4.3
The pith
The CogVLM2 family reaches state-of-the-art results on image and video benchmarks by refining visual expert architectures and training recipes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344×1344 pixels. CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. The family achieves state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench.
What carries the argument
Visual expert architecture with enhanced pre-training and post-training recipes for images, plus multi-frame inputs with timestamps and automated temporal grounding data construction for videos.
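The timestamped multi-frame input can be pictured with a small sketch. The frame-placeholder markers, uniform sampling policy, and function names below are illustrative assumptions for exposition, not the paper's actual input format:

```python
# Hypothetical sketch of multi-frame video input with timestamps, in the
# spirit of CogVLM2-Video. Marker syntax and sampling are assumptions.

def sample_frame_times(duration_s, num_frames=24):
    """Uniformly sample timestamps (in seconds) across the clip."""
    if num_frames == 1:
        return [round(duration_s / 2, 2)]
    step = duration_s / (num_frames - 1)
    return [round(i * step, 2) for i in range(num_frames)]

def build_timestamped_prompt(frame_times, question):
    """Interleave a timestamp marker before each frame placeholder,
    so the model can associate visual content with clock time."""
    parts = [f"[{t:.1f}s]<frame_{i}>" for i, t in enumerate(frame_times)]
    parts.append(question)
    return " ".join(parts)

times = sample_frame_times(12.0, num_frames=4)
prompt = build_timestamped_prompt(times, "When does the dog jump?")
# e.g. "[0.0s]<frame_0> [4.0s]<frame_1> ... When does the dog jump?"
```

The point of the interleaving is that temporal grounding questions ("when does X happen?") become answerable in the same token stream as ordinary VQA, without a separate temporal head.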
If this is right
- Higher resolution inputs allow models to capture finer visual details in image tasks.
- Timestamp integration and automated data construction improve handling of time-based events in videos.
- The same family of models can address both static image and dynamic video understanding without separate systems.
- Public release of the models enables direct use and further adaptation on new tasks.
Where Pith is reading between the lines
- Automated temporal grounding methods may lower the cost of preparing video training data across other multimodal projects.
- Unified image-video architectures could support applications that mix still frames and motion, such as instructional video analysis.
- Refinements in vision-language fusion might transfer to related areas like visual reasoning in robotics.
Load-bearing premise
The reported benchmark gains come mainly from the described architecture and training changes rather than from larger model sizes, more data, or extra compute that are not disclosed.
What would settle it
An ablation experiment that holds model size, data volume, and total compute fixed while applying only the new training recipes and input handling, then measures whether the benchmark scores still match the claimed state-of-the-art levels.
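The settling experiment described above amounts to a two-factor grid with the scale budget pinned. A minimal sketch, where the factor names and budget figures are illustrative assumptions rather than numbers from the paper:

```python
# Sketch of a scale-matched ablation grid: hold parameter count, data
# volume, and compute fixed; toggle only the proposed changes.
# All names and values here are illustrative assumptions.
from itertools import product

FIXED = {"params_b": 19, "train_tokens_b": 1500, "gpu_hours": 100_000}

FACTORS = {
    "high_res_1344": [False, True],  # 1344x1344 input vs. baseline resolution
    "new_recipe": [False, True],     # revised pre/post-training recipe vs. old
}

def ablation_grid():
    """Enumerate every factor combination under the same fixed budget."""
    runs = []
    for values in product(*FACTORS.values()):
        cfg = dict(FIXED)
        cfg.update(zip(FACTORS.keys(), values))
        runs.append(cfg)
    return runs

runs = ablation_grid()  # 4 runs: off/off, off/on, on/off, on/on
```

Comparing benchmark scores across the four runs would attribute the deltas to resolution, recipe, their interaction, or neither, which is exactly what the current results cannot do.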
Original abstract
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the CogVLM2 family of visual language models, comprising CogVLM2 for image understanding and CogVLM2-Video for video understanding, along with GLM-4V. Building on the visual-expert architecture from prior CogVLM work, the models incorporate higher-resolution input support (up to 1344×1344), revised pre-training and post-training recipes, multi-frame video inputs with timestamps, and an automated temporal grounding data construction method. The manuscript reports state-of-the-art results on benchmarks including MMBench, MM-Vet, TextVQA, MVBench, and VCGBench, and releases the models as open source.
Significance. If the performance gains can be attributed to the described architecture and training changes rather than scale increases, the work would deliver open, reproducible models that advance multimodal fusion for both static images and temporal video reasoning, providing a useful baseline for the community.
Major comments (2)
- [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.
- [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.
Minor comments (1)
- [Abstract] The abstract states that models 'inherit the visual expert architecture with improved training recipes' without enumerating the concrete recipe changes; a brief enumeration would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, providing the strongest honest defense of our contributions while acknowledging areas where additional details or clarifications are warranted. We will incorporate revisions to improve clarity and reproducibility.
Point-by-point responses
Referee: [Experiments / Results] The central attribution of SOTA results to the inherited visual-expert architecture plus higher-resolution support and revised training recipes is not supported by any scale-matched ablations or baseline comparisons. No parameter counts, data-volume details, or controlled experiments isolating these factors from raw scaling are provided, leaving the link between the proposed changes and the reported deltas unverified.
Authors: We acknowledge that the manuscript does not contain explicit scale-matched ablations or controlled baselines that fully isolate the visual-expert architecture, higher-resolution support, and revised training recipes from potential effects of increased model scale or data volume. CogVLM2 extends the visual-expert design introduced in our prior CogVLM work, with targeted enhancements in resolution (up to 1344×1344) and training procedures; the reported SOTA results on MMBench, MM-Vet, TextVQA, MVBench, and VCGBench reflect the cumulative outcome of these changes. In the revised manuscript we will add explicit parameter counts for CogVLM2, CogVLM2-Video, and GLM-4V, along with available details on pre-training and post-training data volumes. Full isolation experiments would require substantial additional compute resources beyond the scope of this submission; however, the open-source release of models and code enables independent verification and further ablation studies by the community. revision: partial
Referee: [Model Architecture / Video Understanding] The description of the automated temporal grounding data construction for CogVLM2-Video is too high-level to allow reproduction or independent verification of its contribution to MVBench and VCGBench gains.
Authors: We agree that the current description of the automated temporal grounding pipeline is high-level and limits reproducibility. In the revised manuscript we will expand this section with concrete implementation details, including the specific LLM prompts used to generate temporal grounding annotations, the filtering and quality-control steps, the format of timestamp integration with multi-frame inputs, and representative examples of the constructed training data. These additions will allow readers to better assess the method's contribution to the observed gains on MVBench and VCGBench. revision: yes
- Not addressed in the revision: full scale-matched ablation studies that rigorously isolate the contributions of architecture, resolution, and training recipes from raw scaling effects, as these controlled experiments were not performed in the original work.
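A pipeline of the kind the rebuttal promises to document (generate candidate time spans with captions, filter them for quality, then format timestamp-supervised QA pairs) might look like the sketch below. Every function, threshold, and output format here is an assumption for illustration, not the paper's implementation:

```python
# Illustrative temporal-grounding data construction: filter candidate
# (start, end, caption) spans, then emit QA pairs with timestamp answers.
# Thresholds and formats are assumptions, not the paper's actual choices.

def filter_spans(spans, min_len=0.5, max_len=60.0):
    """Drop degenerate or overlong candidate spans before training use."""
    return [s for s in spans
            if s["end"] > s["start"]
            and min_len <= s["end"] - s["start"] <= max_len]

def to_qa(span):
    """Format one grounded span as a timestamp-supervised example."""
    return {
        "question": f"When does this happen: {span['caption']}?",
        "answer": f"From {span['start']:.1f}s to {span['end']:.1f}s.",
    }

candidates = [
    {"start": 2.0, "end": 5.5, "caption": "a person opens the door"},
    {"start": 9.0, "end": 9.0, "caption": "zero-length span"},  # filtered out
]
dataset = [to_qa(s) for s in filter_spans(candidates)]
```

Publishing even this level of detail (plus the LLM prompts and quality checks the authors mention) would let readers rebuild the training data and test its contribution to MVBench and VCGBench directly.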
Circularity Check
No circularity: purely empirical model description and benchmark reporting
Full rationale
The manuscript describes the CogVLM2 family as an empirical extension of prior CogVLM/GLM-4V work, with inherited visual-expert architecture, higher-resolution support, revised training recipes, multi-frame video input, and automated data construction. All central claims consist of measured performance numbers on external public benchmarks (MMBench, MM-Vet, TextVQA, MVBench, VCGBench). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear; the work contains no mathematical chain that could reduce to its own inputs by construction. Attribution questions (scale vs. architecture) are matters of experimental design, not circularity.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Training recipe hyperparameters
Axioms (1)
- Domain assumption: higher input resolution and multi-frame video inputs improve vision-language fusion performance
Forward citations
Cited by 20 Pith papers
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- Culture-Aware Humorous Captioning: Multimodal Humor Generation across Cultural Contexts · Introduces culture-aware humorous captioning task and staged alignment framework that improves contextual fit and balances image relevance with humor in multimodal LLMs.
- Towards Unconstrained Human-Object Interaction · Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.
- VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models · VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can...
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? · Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
- CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering · CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
- VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis · VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.
- MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing · MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.
- VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation · VideoGPA distills geometry priors via self-supervised DPO to enhance 3D consistency, temporal stability, and motion coherence in video diffusion models.
- High-Entropy Tokens as Multimodal Failure Points in Vision-Language Models · High-entropy tokens act as concentrated multimodal failure points in VLMs, enabling sparse Entropy-Guided Attacks that achieve 93-95% success and 30-38% harmful rates with cross-model transfer.
- GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning · GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
- Improving Video Generation with Human Feedback · A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.
- VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation · VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% high...
- Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks · Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer · CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
- Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models · A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
- KD-CVG: A Knowledge-Driven Approach for Creative Video Generation · KD-CVG uses an Advertising Creative Knowledge Base plus Semantic-Aware Retrieval and Multimodal Knowledge Reference modules to improve semantic alignment and motion realism in text-to-video generation for advertising.
- ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures · ReCAPA uses multi-level predictive correction and semantic alignment modules to reduce cascading failures in VLA systems, with new metrics for tracking error propagation and recovery on embodied benchmarks.
- ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures · ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding · VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.