UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3
The pith
UCF101 supplies a dataset of 101 human action classes drawn from over 13,000 unconstrained video clips.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UCF101 is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered background. Baseline action recognition results on this new dataset, obtained with a standard bag-of-words approach, reach an overall performance of 44.5 percent. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
What carries the argument
The UCF101 dataset itself, organized into 101 action categories from web videos, together with the bag-of-words baseline that measures initial recognition performance.
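A bag-of-words baseline of this kind typically quantizes local video descriptors against a learned codebook and represents each clip as a histogram of codeword counts, which a classifier then consumes. The sketch below shows only that encoding step. The codebook, the descriptor values, and the function names are illustrative assumptions; this page does not specify the descriptors, codebook size, or classifier behind the 44.5 percent figure.

```python
import math
from collections import Counter


def nearest_codeword(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    return min(
        range(len(codebook)),
        key=lambda i: math.dist(descriptor, codebook[i]),
    )


def bag_of_words_histogram(descriptors, codebook):
    """Encode one clip as an L1-normalized histogram of codeword counts."""
    total = len(descriptors)
    if total == 0:
        return [0.0] * len(codebook)
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    return [counts[i] / total for i in range(len(codebook))]


# Toy 2-D codebook and descriptors (hypothetical values, not from the paper).
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
clip_descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
histogram = bag_of_words_histogram(clip_descriptors, codebook)
print(histogram)
```

In a full pipeline the codebook would usually come from clustering training descriptors, and the histograms would be fed to a classifier; both steps are omitted here.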
If this is right
- Action recognition algorithms can now be evaluated on a larger number of classes and clips than in earlier datasets.
- Methods must handle camera motion and background clutter to exceed the reported baseline.
- Future comparisons of recognition systems can use the 44.5 percent figure as a reference point for this scale of data.
Where Pith is reading between the lines
- Subsequent datasets would need to surpass 101 classes or 13k clips to claim greater difficulty on the same criteria.
- The resource could support development of systems for video search or surveillance that operate on uncontrolled footage.
Load-bearing premise
The collected videos sufficiently represent the variability and challenges of unconstrained real-world human actions.
What would settle it
A demonstration that a much larger or more varied collection of action videos exists or that the 44.5 percent baseline understates the dataset difficulty because of evaluation choices.
Original abstract
We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UCF101 as the largest dataset of human actions, containing 101 classes, over 13,000 video clips, and 27 hours of data from realistic, unconstrained YouTube videos that include camera motion and cluttered backgrounds. It reports a baseline action recognition accuracy of 44.5% using a standard bag-of-words approach and claims that UCF101 is the most challenging action dataset due to its scale and unconstrained nature.
Significance. The release of a large-scale action recognition dataset with realistic video conditions would provide a valuable benchmark for the computer vision community if the data collection and baseline are fully documented. The 101-class scale extends prior work, but the significance of the 'most challenging' positioning depends on whether the low baseline accuracy is shown to stem from the added variability rather than pipeline specifics.
major comments (2)
- [Abstract] Abstract: The assertion that UCF101 is 'currently the most challenging dataset of actions' rests on its descriptive attributes (101 classes, >13k clips, unconstrained YouTube videos) together with the 44.5% bag-of-words baseline, yet no equivalent bag-of-words numbers are provided on prior datasets such as UCF50 or HMDB51. Without these anchors the difficulty ranking remains an untested assertion.
- [Baseline results] Baseline evaluation: The manuscript states an overall performance of 44.5% but supplies no details on the train/test splits, evaluation protocol (e.g., cross-validation folds or leave-one-out), or any measure of variance. This omission prevents assessment of whether the reported accuracy fairly demonstrates the dataset's difficulty.
minor comments (1)
- [Abstract] The abstract and introduction should explicitly list the exact number of videos per class and any class-balance statistics to allow readers to judge diversity.
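The class-balance statistics requested in the minor comment could be computed along the following lines. This is a minimal sketch that assumes clip file names follow a `v_<Class>_g<group>_c<clip>.avi` pattern; that pattern, the helper names, and the toy file list are assumptions for illustration and are not taken from the manuscript.

```python
from collections import Counter
from statistics import mean


def class_from_filename(filename):
    """Extract the class label, assuming 'v_<Class>_g<group>_c<clip>.avi'."""
    parts = filename.split("_")
    if len(parts) < 4 or parts[0] != "v":
        raise ValueError(f"unexpected file name: {filename}")
    return parts[1]


def class_balance(filenames):
    """Return per-class clip counts plus min, max, and mean clips per class."""
    counts = Counter(class_from_filename(name) for name in filenames)
    values = list(counts.values())
    return {
        "per_class": dict(counts),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
    }


# Toy file list (hypothetical, for illustration only).
names = [
    "v_Archery_g01_c01.avi",
    "v_Archery_g01_c02.avi",
    "v_Archery_g02_c01.avi",
    "v_Diving_g01_c01.avi",
]
stats = class_balance(names)
print(stats)
```

Reporting the minimum, maximum, and mean clips per class alongside the full per-class table would let readers judge how evenly the 101 classes are covered.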
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript introducing the UCF101 dataset. We address each major comment below and will revise the paper to improve clarity and support for our claims.
Point-by-point responses
-
Referee: [Abstract] Abstract: The assertion that UCF101 is 'currently the most challenging dataset of actions' rests on its descriptive attributes (101 classes, >13k clips, unconstrained YouTube videos) together with the 44.5% bag-of-words baseline, yet no equivalent bag-of-words numbers are provided on prior datasets such as UCF50 or HMDB51. Without these anchors the difficulty ranking remains an untested assertion.
Authors: We agree that including bag-of-words baseline results on UCF50 and HMDB51 would provide stronger quantitative support for the relative difficulty claim. Our positioning of UCF101 as the most challenging is grounded in its objectively larger scale and the realistic, unconstrained video conditions (camera motion, cluttered backgrounds) that exceed those in prior datasets. In the revised manuscript, we will add a comparison table with baseline accuracies obtained using the identical bag-of-words pipeline on UCF50 and HMDB51 to enable direct assessment.
revision: yes
-
Referee: [Baseline results] Baseline evaluation: The manuscript states an overall performance of 44.5% but supplies no details on the train/test splits, evaluation protocol (e.g., cross-validation folds or leave-one-out), or any measure of variance. This omission prevents assessment of whether the reported accuracy fairly demonstrates the dataset's difficulty.
Authors: We apologize for the insufficient detail in the baseline description. The reported 44.5% accuracy is the mean over the three standard train/test splits released with UCF101. We will expand the experimental section in the revised manuscript to explicitly describe the evaluation protocol, including the use of the three splits, the averaging procedure, and the standard deviation across splits to quantify variance.
revision: yes
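The averaging procedure described in the second response can be sketched as follows. The function names and the toy labels are assumptions made for illustration; they are not the authors' code, and the numbers they produce are unrelated to the reported 44.5%.

```python
from statistics import mean, stdev


def accuracy(predicted, actual):
    """Fraction of clips whose predicted label matches the ground truth."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def summarize_splits(split_results):
    """Return (mean, sample standard deviation) of accuracy over the splits."""
    scores = [accuracy(pred, true) for pred, true in split_results]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# Toy (predicted, actual) label pairs for three splits; hypothetical data.
split_results = [
    (["a", "b", "a", "b"], ["a", "b", "b", "b"]),  # 3 of 4 correct
    (["a", "a", "b", "b"], ["a", "b", "b", "a"]),  # 2 of 4 correct
    (["b", "b", "a", "a"], ["b", "b", "a", "b"]),  # 3 of 4 correct
]
mean_acc, std_acc = summarize_splits(split_results)
print(f"mean accuracy {mean_acc:.3f}, std {std_acc:.3f}")
```

With only three splits the sample standard deviation is a coarse estimate, so reporting the per-split accuracies alongside it would make the variance easier to interpret.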
Circularity Check
No circularity: dataset release with direct empirical baseline
Full rationale
The manuscript introduces UCF101 by reporting collection statistics (101 classes, >13k clips, 27 hours) and a single standard bag-of-words baseline result of 44.5%. No equations, fitted parameters, predictions, or derivations exist that could reduce to the inputs by construction. Claims of scale and challenge rest on descriptive counts and qualitative video-source description rather than any self-referential loop or self-citation chain. The baseline is an external standard method applied once; it is not a fitted quantity renamed as a prediction. The work is therefore self-contained as a data release and baseline evaluation.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 60 Pith papers
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark evaluating LMMs on video editing technique recognition and operation simulation using 3.9K videos and 3,080 QA pairs, revealing a large performance gap to humans.
-
USV: Towards Understanding the User-generated Short-form Videos
Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
-
PERL: Parameter Efficient Reasoning in CLIP Latent Space
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance...
-
Neutral-Reference Prompting for Vision-Language Models
NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base...
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR creates a unified discrete representation space for multimodal sequences by aligning modality-specific codebooks through index-level semantic consensus, enabling both specificity and cross-modal generalization.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state...
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multi...
-
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.
-
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to reta...
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Improving Sparse Autoencoder with Dynamic Attention
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
-
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
-
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
-
InstrAct: Towards Action-Centric Understanding in Instructional Videos
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on...
-
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
-
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
-
Adapting MLLMs for Nuanced Video Retrieval
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
-
SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
-
Zero-shot Concept Bottleneck Models
Z-CBMs achieve zero-shot interpretable predictions by retrieving concepts from a million-vocabulary web bank via cross-modal search and regressing labels with sparse linear regression.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
-
NetTailor: Tuning the Architecture, Not Just the Weights
NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for s...
-
The Kinetics Human Action Video Dataset
Kinetics is a new video dataset of 400 human actions with over 160000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
-
TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness ga...
-
TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling
TVRN combines invertible wavelet-based networks with a surrogate gradient approximator and compression-aware asymmetric design to improve frame-rate rescaling quality under real codecs.
-
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
A3B2 introduces an adaptive asymmetric adapter with uncertainty-aware dampening to reduce branch bias in few-shot vision-language image classification and outperforms standard adapter and prompt methods.
-
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
A3B2 adds uncertainty-aware dampening and asymmetric MoE-style adapters to balance image and text branches, outperforming 11 baselines on 11 few-shot datasets.
-
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
-
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improve...
-
VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision
Viewpoint-conditioned feature selection improves thermal vehicle re-identification mAP by 19.7% on RGBNT100 and 12.8% on a new maritime dataset by adapting RGB ViT extractors.
-
SpecPL: Disentangling Spectral Granularity for Prompt Learning
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
-
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
-
Prototype-Based Test-Time Adaptation of Vision-Language Models
PTA adapts VLMs at test time via adaptively weighted class prototypes that accumulate test-sample features, delivering higher accuracy than cache-based TTA while preserving nearly full inference speed.
-
Prototype-Based Test-Time Adaptation of Vision-Language Models
PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
-
EAST: Early Action Prediction Sampling Strategy with Token Masking
EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU...
-
Identifying Ethical Biases in Action Recognition Models
The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.
-
KVNN: Learnable Multi-Kernel Volterra Neural Networks
kVNN uses order-adaptive learnable multi-kernel Volterra layers to efficiently capture higher-order feature interactions in deep networks for vision tasks.
-
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
-
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video und...
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
-
Visual prompting reimagined: The power of the Activation Prompts
Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
-
PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization
Imbalanced multimodal learning that prioritizes the performance-dominant modality via unimodal ranking and asymmetric gradient modulation outperforms balanced approaches.
-
GeoWorld: Geometric World Models
GeoWorld applies hyperbolic geometry to JEPA world models and introduces geometric reinforcement learning, reporting modest success-rate gains of ~3% and ~2% on 3- and 4-step planning tasks versus V-JEPA 2.
-
ComMark: Covert and Robust Black-Box Model Watermarking with Compressed Samples
ComMark embeds covert watermarks in models using frequency-domain compressed samples and simulated attacks, claiming state-of-the-art covertness and robustness across image, speech, text, and video tasks.
-
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
GA2-CLIP uses generic attribute anchors and coupled hard-soft prompts to preserve generalization in prompt-tuned video-language models on base-to-new class tasks.
-
Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal...
-
On the Provable Importance of Gradients for Language-Assisted Image Clustering
GradNorm selects positive nouns via gradient magnitudes from cross-entropy loss, with an error bound proving it subsumes prior CLIP methods and delivers SOTA clustering results.
-
SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
SeMoBridge projects images into the text modality via a semantic bridge to reduce CLIP's intra-modal misalignment and improve few-shot performance.
-
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
-
Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, a...
Reference graph
Works this paper leans on
- [1]
- [2]
- [3]
- [4] G. Johansson, S. Bergstrom, and W. Epstein. Perceiving events and objects, 1994. Lawrence Erlbaum Associates.
- [5]
- [6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [7]
- [8] IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [9] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification, 2010. European Conference on Computer Vision (ECCV).
- [10] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos, 2012. Machine Vision and Applications Journal (MVAP).
- [11] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatiotemporal maximum average correlation height filter for action recognition, 2008. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [12] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach, 2004. International Conference on Pattern Recognition (ICPR).
- [13] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars, 2007. International Conference on Computer Vision (ICCV).