ViViT: A Video Vision Transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.

VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation

cs.CV · 2026-04-11 · unverdicted · novelty 6.0

VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.

Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification

cs.CV · 2026-01-10 · unverdicted · novelty 6.0

A three-stage pipeline uses few-shot VLM action parsing, sliding-window segmentation, and LLM sequence classification with peer context to measure student engagement from classroom videos.

citing papers explorer

Showing 3 of 3 citing papers.

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos cs.CV · 2026-04-13 · unverdicted · none · ref 2
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation cs.CV · 2026-04-11 · unverdicted · none · ref 1
Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification cs.CV · 2026-01-10 · unverdicted · none · ref 6
