pith. sign in

hub Canonical reference

CogVLM2: Visual Language Models for Image and Video Understanding

Canonical reference. 83% of citing Pith papers cite this work as background.

32 Pith papers citing it
Background 83% of classified citations
abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to $1344 \times 1344$ pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced in https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.

hub tools

citation-role summary

background 6

citation-polarity summary

roles

background 6

polarities

background 5 unclear 1

representative citing papers

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

cs.CV · 2026-05-17 · conditional · novelty 7.0

Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.

Towards Unconstrained Human-Object Interaction

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

Introduces the U-HOI task and shows MLLMs plus a language-to-graph pipeline can handle human-object interactions without any predefined vocabulary at training or inference time.

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

VSAS-Bench offers temporally dense annotations and synchronous/asynchronous protocols to evaluate streaming VLMs on timeliness, consistency, accuracy, and latency trade-offs, showing that adapted conventional VLMs can outperform specialized streaming models.

EgoWalk: A Multimodal Dataset for Robot Navigation in the Wild

cs.RO · 2025-05-27 · conditional · novelty 7.0

EgoWalk supplies 50 hours of real-world multimodal human navigation data in varied indoor/outdoor settings together with open pipelines that auto-generate language goal annotations and traversability masks.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

MIRAGE: Benchmarking and Aligning Multi-Instance Image Editing

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

MIRAGE introduces a benchmark for multi-instance image editing and a training-free framework that uses vision-language parsing and parallel regional denoising to achieve precise edits without altering backgrounds.

Improving Video Generation with Human Feedback

cs.CV · 2025-01-23 · unverdicted · novelty 6.0

A human preference dataset and VideoReward model enable Flow-DPO and Flow-NRG to produce smoother, better-aligned videos from text prompts in flow-based generators.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

cs.CV · 2024-08-12 · unverdicted · novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.

citing papers explorer

Showing 32 of 32 citing papers.