Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Khan · 2024

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

browse 3 citing papers

representative citing papers

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

cs.CV · 2025-11-25 · unverdicted · novelty 7.0

Video LMMs name objects and actions reliably but fail to detect the precise frames and locations of contact and release events, revealing shortcut learning instead of physical grounding.

citing papers explorer

Showing 3 of 3 citing papers.

CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering cs.CV · 2026-05-22 · unverdicted · none · ref 25
Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning cs.CV · 2026-04-18 · unverdicted · none · ref 31
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection cs.CV · 2025-11-25 · unverdicted · none · ref 15
Video LMMs name objects and actions reliably but fail to detect the precise frames and locations of contact and release events, revealing shortcut learning instead of physical grounding.

Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

fields

years

verdicts

representative citing papers

citing papers explorer