UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
58 papers indexed on Pith cite this work.
abstract
We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.
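The baseline above quantizes local spatio-temporal descriptors against a learned codebook and classifies the resulting histograms. As a minimal sketch of that quantization step only: the random `codebook` and `clip_descriptors` below are stand-ins for a k-means codebook and extracted cuboid/STIP features (the paper's actual feature extraction, codebook size, and SVM classifier are not reproduced here).

```python
import numpy as np

def quantize(descriptors, codebook):
    # Assign each local descriptor to the index of its nearest codeword
    # (Euclidean distance over all descriptor/codeword pairs).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, codebook):
    # L1-normalized histogram of codeword assignments: the clip's
    # fixed-length bag-of-words vector, ready for an SVM or similar.
    words = quantize(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))           # stand-in for a learned codebook
clip_descriptors = rng.normal(size=(50, 16))  # stand-in for one clip's features
hist = bow_histogram(clip_descriptors, codebook)
```

Each clip, regardless of length, maps to a vector the size of the codebook, which is what makes the representation usable with a standard classifier.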
citing papers explorer
-
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
-
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
-
STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition
STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.
-
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-modal retrieval.
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
-
GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning
GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.
-
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing
VEBench is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
-
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
-
E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting
E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.
-
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
-
Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals
Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
Improving Sparse Autoencoder with Dynamic Attention
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
-
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
-
CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
CLIP-Inspector reconstructs OOD triggers to detect backdoors in prompt-tuned CLIP models with 94% accuracy and higher AUROC than baselines, plus a repair step via fine-tuning.
-
InstrAct: Towards Action-Centric Understanding in Instructional Videos
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
-
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
-
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
-
The Kinetics Human Action Video Dataset
Kinetics is a new video dataset of 400 human actions with over 160,000 ten-second clips collected from YouTube, accompanied by baseline action-classification results from neural networks.
-
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
A3B2 adds uncertainty-aware dampening and asymmetric MoE-style adapters to balance image and text branches, outperforming 11 baselines on 11 few-shot datasets.
-
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
-
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
-
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improved VLM performance.
-
VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision
Viewpoint-conditioned feature selection improves thermal vehicle re-identification mAP by 19.7% on RGBNT100 and 12.8% on a new maritime dataset by adapting RGB ViT extractors.
-
SpecPL: Disentangling Spectral Granularity for Prompt Learning
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
-
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
-
Prototype-Based Test-Time Adaptation of Vision-Language Models
PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
-
EAST: Early Action Prediction Sampling Strategy with Token Masking
EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.
-
Identifying Ethical Biases in Action Recognition Models
The authors create a synthetic video auditing framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.
-
KVNN: Learnable Multi-Kernel Volterra Neural Networks
KVNN uses order-adaptive learnable multi-kernel Volterra layers to efficiently capture higher-order feature interactions in deep networks for vision tasks.
-
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
-
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video understanding benchmarks.
-
Latent-Compressed Variational Autoencoder for Video Diffusion Models
A frequency-based latent compression method for video VAEs yields higher reconstruction quality than channel-reduction baselines at fixed compression ratios.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and text-to-video synthesis.
-
LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video
LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
-
Visual prompting reimagined: The power of the Activation Prompts
Activation prompts on intermediate layers outperform input-level visual prompting and parameter-efficient fine-tuning in accuracy and efficiency across 29 datasets.
-
PDMP: Rethinking Balanced Multimodal Learning via Performance-Dominant Modality Prioritization
Imbalanced multimodal learning that prioritizes the performance-dominant modality via unimodal ranking and asymmetric gradient modulation outperforms balanced approaches.
-
Perception Encoder: The best visual embeddings are not at the output of the network
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results while enabling motion LoRA and multi-view 3D applications.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
EVA-CLIP: Improved Training Techniques for CLIP at Scale
EVA-CLIP delivers improved CLIP training recipes that yield 82.0% zero-shot ImageNet-1K accuracy for a 5B-parameter model after only 9 billion samples.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
-
VideoGPT: Video Generation using VQ-VAE and Transformers
VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.
-
Video Generation with Predictive Latents
PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.
-
CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition
CEZSAR uses contrastive learning to align video and sentence embeddings with automatic negative sampling, claiming state-of-the-art zero-shot action recognition on UCF-101 and Kinetics-400.
-
Physics-Informed Temporal U-Net for High-Fidelity Fluid Interpolation
A Temporal U-Net with perceptual loss and a physics-informed parabolic bridge interpolates sparse fluid observations, cutting MAE to 0.015 from 0.085 while retaining high-frequency turbulent structures.
-
Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition
Micro-DualNet employs dual ST and TS pathways with entity-level adaptive routing and Mutual Action Consistency loss to achieve competitive results on MA-52 and state-of-the-art on iMiGUE for micro-action recognition.
-
Hierarchical Textual Knowledge for Enhanced Image Clustering
KEC constructs hierarchical textual knowledge from LLMs to create knowledge-enhanced image features that improve clustering performance over baselines and zero-shot CLIP on 20 datasets.