ViViT: A Video Vision Transformer
3 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (3)
Years: 2026 (3)
Verdicts: UNVERDICTED (3)
citing papers explorer
- V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos
V-Nutri fuses final-dish features with cooking-process keyframes from egocentric videos to improve dish-level calorie and macronutrient estimation over single-image baselines.
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
- Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification
A three-stage pipeline uses few-shot VLM action parsing, sliding-window segmentation, and LLM sequence classification with peer context to measure student engagement from classroom videos.