A Short Note about Kinetics-600

Andras Banki-Horvath; Andrew Zisserman; Chloe Hillier; Eric Noland; Joao Carreira

arxiv: 1808.01340 · v1 · pith:UOVSYDGJnew · submitted 2018-08-03 · 💻 cs.CV

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

classification 💻 cs.CV

keywords dataset · classes · clips · least · video · action · architecture · baseline

We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips. To scale up the dataset, we changed the data collection process so that it uses multiple queries per class, some of them in a language other than English (Portuguese). This paper details the changes between the two versions of the dataset and includes a comprehensive set of statistics for the new version, as well as baseline results using the I3D neural network architecture. The paper is a companion to the release of the ground truth labels for the public test set.

This paper has not been read by Pith yet.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

USV: Towards Understanding the User-generated Short-form Videos
cs.CV 2026-05 unverdicted novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations
cs.CV 2026-05 unverdicted novelty 7.0

LIMSSR reformulates incomplete multimodal learning as LLM-driven sequence-to-score reasoning with prompt-guided imputation and mask-aware aggregation, outperforming baselines on action quality assessment without compl...
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
cs.CV 2023-10 unverdicted novelty 7.0

A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
Video Diffusion Models
cs.CV 2022-04 unverdicted novelty 7.0

A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
Cambrian-P: Pose-Grounded Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
EAST: Early Action Prediction Sampling Strategy with Token Masking
cs.CV 2026-04 unverdicted novelty 6.0

EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU...
From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage
cs.CV 2026-04 unverdicted novelty 6.0

HELIX is the first end-to-end neural codec jointly optimizing video compression and DNA encoding via tokens, achieving 1.91 bits per nucleotide with Kronecker mixing and FSM mapping.
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
cs.CV 2026-04 unverdicted novelty 6.0

ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis
cs.MM 2024-11 unverdicted novelty 6.0

Experiments with a video-text-to-speech transformer show co-temporal positional indexing enables synchronization without timestamps, text and video supply complementary signals, and modality ordering creates a trade-o...
VideoPoet: A Large Language Model for Zero-Shot Video Generation
cs.CV 2023-12 unverdicted novelty 6.0

VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
EVA-CLIP: Improved Training Techniques for CLIP at Scale
cs.CV 2023-03 conditional novelty 6.0

EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
PIQA: Reasoning about Physical Commonsense in Natural Language
cs.CL 2019-11 accept novelty 6.0

PIQA is a new benchmark showing that current AI models achieve 77% on physical commonsense questions versus humans at 95%.
Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition
cs.CV 2026-05 unverdicted novelty 5.0

SimVA constructs a 4D similarity volume over video tokens and action classes then applies spatial, motion-aware, and Mamba-based temporal aggregation to achieve competitive zero-shot and few-shot performance on open-v...
A Comprehensive Survey of Action Quality Assessment: Method and Benchmark
cs.CV 2024-12 unverdicted novelty 5.0

This survey proposes a modality-driven hierarchical taxonomy for AQA methods, establishes a unified benchmark for video-based approaches across datasets, and outlines research trends and challenges.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
cs.CV 2022-12 unverdicted novelty 5.0

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
cs.CV 2022-05 unverdicted novelty 5.0

CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.
Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
cs.CV 2026-04 unverdicted novelty 4.0

Motion separation modules plus negative prompts improve CLIP-based zero-shot video action recognition on standard benchmarks.
Video-Language Understanding: A Survey from Model Architecture, Model Training, and Data Perspectives
cs.CL 2024-06 unverdicted novelty 2.0

Survey summarizing video-language understanding tasks, challenges, and methods from architecture, training, and data perspectives, including performance comparisons and future directions.