TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.
Quo vadis, action recognition? a new model and the kinetics dataset
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
method 1polarities
use method 1representative citing papers
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.
VAGNet anticipates accidents in dashcam videos using global features from VideoMAE-V2 combined with transformers and graphs, reporting higher average precision and mean time-to-accident on four benchmarks while running more efficiently than prior methods.
citing papers explorer
-
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.
-
Video Diffusion Models
A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
-
ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines
ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.
-
VAGNet: Vision-based Accident Anticipation with Global Features
VAGNet anticipates accidents in dashcam videos using global features from VideoMAE-V2 combined with transformers and graphs, reporting higher average precision and mean time-to-accident on four benchmarks while running more efficiently than prior methods.