MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.
super hub Baseline reference
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Baseline reference. 64% of citing Pith papers use this work as a benchmark or comparison.
abstract
We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such
authors
co-cited works
representative citing papers
MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.
NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and跨
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
citing papers explorer
-
MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation
MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.
-
A New Multi-Domain Benchmark for Micro-Action Recognition and Detection
MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.
-
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
-
An Attribute-Based Measure of Video Complexity
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
-
USV: Towards Understanding the User-generated Short-form Videos
Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
-
PERL: Parameter Efficient Reasoning in CLIP Latent Space
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.
-
Neutral-Reference Prompting for Vision-Language Models
NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and跨
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
-
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
-
InstrAct: Towards Action-Centric Understanding in Instructional Videos
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
-
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
-
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
-
Forget, Anticipate and Adapt: Test Time Training for Long Videos
FFN performs TTT on multi-hour videos by restricting updates to three frames and using a surprise metric for adaptive window sizing, plus a new EpicTours dataset.
-
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning
AREA stabilizes attribute extraction with principal geodesic analysis on hyperspherical space and aggregation with lightweight task experts plus variational bottleneck and optimal transport routing, outperforming SOTA in CLIP-based CIL.
-
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
-
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
-
Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models
Introduces Closed-Loop Bidirectional Prompting with Semantic Anchor for cross-modal agreement recovery, claiming SOTA adversarial robustness and generalization on 11 datasets.
-
UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition
UAV-OVO benchmark exposes large ID/OOD performance gaps in video action recognition due to low-to-high depression viewpoint shifts, and LATER uses LoRA subspace anchoring for test-time feature re-centering to reduce drift.
-
TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models
TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness gain on 11 datasets.
-
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
A3B2 introduces an adaptive asymmetric adapter with uncertainty-aware dampening to reduce branch bias in few-shot vision-language image classification and outperforms standard adapter and prompt methods.
-
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improved VLM performance.
-
VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision
Viewpoint-conditioned feature selection improves thermal vehicle re-identification mAP by 19.7% on RGBNT100 and 12.8% on a new maritime dataset by adapting RGB ViT extractors.
-
SpecPL: Disentangling Spectral Granularity for Prompt Learning
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
-
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
-
Prototype-Based Test-Time Adaptation of Vision-Language Models
PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
-
EAST: Early Action Prediction Sampling Strategy with Token Masking
EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.
-
Identifying Ethical Biases in Action Recognition Models
The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.
-
KVNN: Learnable Multi-Kernel Volterra Neural Networks
kVNN uses order-adaptive learnable multi-kernel Volterra layers to efficiently capture higher-order feature interactions in deep networks for vision tasks.
-
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
-
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video understanding benchmarks.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
-
LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
-
Visual prompting reimagined: The power of the Activation Prompts
Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
-
PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization
Imbalanced multimodal learning that prioritizes the performance-dominant modality via unimodal ranking and asymmetric gradient modulation outperforms balanced approaches.
-
GeoWorld: Geometric World Models
GeoWorld applies hyperbolic geometry to JEPA world models and introduces geometric reinforcement learning, reporting modest success-rate gains of ~3% and ~2% on 3- and 4-step planning tasks versus V-JEPA 2.
-
Temporal Preservation over Processing: Diagnosing and Designing Spatiotemporal Single-Stage Video Detectors
TemporalLens diagnoses temporal reliance in single-stage video detectors via controlled perturbations, while YOLO-3D shows temporal preservation yields +3.7 pp mAP@50 gains.
-
TACO: Towards Task-Consistent Open-Vocabulary Adaptation in Video Recognition
TACO proposes Relative Structure Distillation and a lightweight specialization projection to mitigate inconsistency between fine-tuning and evaluation objectives in open-vocabulary video recognition, claiming state-of-the-art results on cross-dataset and base-to-novel benchmarks.
-
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
VidPrism introduces a heterogeneous temporal MoE with content-aware multi-rate sampling and bidirectional fusion for image-to-video transfer, claiming SOTA results on video benchmarks.
-
Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition
SimVA constructs a 4D similarity volume over video tokens and action classes then applies spatial, motion-aware, and Mamba-based temporal aggregation to achieve competitive zero-shot and few-shot performance on open-vocabulary action recognition benchmarks.