MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.
UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Baseline reference. 64% of citing Pith papers use this work as a benchmark or comparison.
abstract
We introduce UCF101, which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips, and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered backgrounds. Additionally, we provide baseline action recognition results on this new dataset using a standard bag-of-words approach, with an overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips, and the unconstrained nature of such clips.
representative citing papers
MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.
NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-…
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
citing papers explorer
- Adapting MLLMs for Nuanced Video Retrieval
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
- SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
- GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuning in Video-Language Models
GA2-CLIP uses generic attribute anchors and coupled hard-soft prompts to preserve generalization in prompt-tuned video-language models on base-to-new class tasks.
- Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding
A plug-and-play Anonymizing Adapter Module removes private information from video latent features using self-supervised privacy objectives and consistency losses while retaining utility on action recognition, temporal detection, and anomaly tasks.
- On the Provable Importance of Gradients for Language-Assisted Image Clustering
GradNorm selects positive nouns via gradient magnitudes from cross-entropy loss, with an error bound proving it subsumes prior CLIP methods and delivers SOTA clustering results.
- SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
SeMoBridge projects images into the text modality via a semantic bridge to reduce CLIP's intra-modal misalignment and improve few-shot performance.
- One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
- Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction
FAR baseline plus asymmetric kernels for long short-term context modeling achieves SOTA short and long video generation in autoregressive setups.
- CLIP-SVD: Efficient and Interpretable Vision-Language Adaptation via Singular Values
CLIP-SVD performs parameter-efficient adaptation of CLIP by fine-tuning singular values from SVD of weight matrices, reporting SOTA few-shot accuracy on 21 datasets plus a language-based interpretability analysis.
- Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
ADAPT reframes test-time adaptation as probabilistic Gaussian inference with CLIP-guided regularization, delivering SOTA results without gradients, source data, or full target access in both online and transductive settings.
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.