Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: method (1)

citation-polarity summary

fields: cs.CV (4), cs.RO (1)

years: 2026 (4), 2022 (1)

verdicts: unverdicted (5)

roles: method (1)

polarities: use method (1)

representative citing papers

Video Diffusion Models

cs.CV · 2022-04-07 · unverdicted · novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models fail catastrophically at dynamic embodied reasoning because of lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

VAGNet: Vision-based Accident Anticipation with Global Features

cs.CV · 2026-04-10 · unverdicted · novelty 4.0

VAGNet anticipates accidents in dashcam videos using global features from VideoMAE-V2 combined with transformers and graphs, reporting higher average precision and mean time-to-accident on four benchmarks while running more efficiently than prior methods.

citing papers explorer

Showing 5 of 5 citing papers.

  • TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions · cs.CV · 2026-04-30 · unverdicted · none · ref 10

    TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.

  • Video Diffusion Models · cs.CV · 2022-04-07 · unverdicted · none · ref 8

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering the first large-scale text-to-video results and state-of-the-art performance on video prediction and unconditional generation benchmarks.

  • Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models · cs.RO · 2026-04-20 · unverdicted · none · ref 1

    State-of-the-art vision-language-action models fail catastrophically at dynamic embodied reasoning because of lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

  • ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines · cs.CV · 2026-04-13 · unverdicted · none · ref 31

    ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.

  • VAGNet: Vision-based Accident Anticipation with Global Features · cs.CV · 2026-04-10 · unverdicted · none · ref 30

    VAGNet anticipates accidents in dashcam videos using global features from VideoMAE-V2 combined with transformers and graphs, reporting higher average precision and mean time-to-accident on four benchmarks while running more efficiently than prior methods.