UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Amir Roshan Zamir, Khurram Soomro, Mubarak Shah · 2012 · cs.CV · arXiv 1212.0402

Baseline reference. 64% of citing Pith papers use this work as a benchmark or comparison.

147 Pith papers citing it

Baseline: 64% of classified citations

abstract

We introduce UCF101 which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using standard bag of words approach with overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and also unconstrained nature of such clips.

citation-role summary

dataset: 13 · background: 7 · baseline: 2

citation-polarity summary

use dataset: 12 · background: 7 · baseline: 2 · unclear: 1
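
The page does not say how the 64% baseline figure is computed. As a rough sketch, assuming it is the share of classified citations whose polarity is "use dataset" or "baseline" (this grouping is an assumption, not something the page states), the polarity counts above reproduce it:

```python
# Counts copied from the citation-polarity summary above.
polarity_counts = {"use dataset": 12, "background": 7, "baseline": 2, "unclear": 1}

# Assumption (not stated on the page): these polarities are the ones
# counted as "benchmark or comparison" use.
benchmark_like = ("use dataset", "baseline")

total = sum(polarity_counts.values())
used = sum(polarity_counts[key] for key in benchmark_like)
share_percent = round(100 * used / total)

print(f"{used}/{total} classified citations = {share_percent}%")
```

Under that assumption, 14 of 22 classified citations round to 64%, which matches the headline figure. Other groupings may also be what the page intends.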

authors

Amir Roshan Zamir, Khurram Soomro, Mubarak Shah

representative citing papers

MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.

T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

T-VSS is a lightweight test-time defense that steers attacked visual features in VLMs using sample-specific low-rank subspaces and reliability-weighted entropy minimization to improve robustness.

Semantic Robustness Certification for Vision-Language Models

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

Framework certifies VLM robustness under semantic transformations via text prompt proxies, enabling quantitative certification of safe extent intervals without per-variation data.

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.

FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

FS-DVS inserts a learnable spatial filter before DVS event triggering; the filter converges to center-surround kernels that emphasize mid-spatial frequencies and improve downstream detection and recognition.

VidMsg: A Benchmark for Implicit Message Inference in Short Videos

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

VidMsg is a new benchmark dataset and QA/retrieval tasks for implicit message inference in short videos, where current models perform poorly.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

cs.CV · 2026-06-01 · conditional · novelty 7.0

Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.

An Attribute-Based Measure of Video Complexity

cs.CV · 2026-05-30 · unverdicted · novelty 7.0

VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.

USV: Towards Understanding the User-generated Short-form Videos

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.

PERL: Parameter Efficient Reasoning in CLIP Latent Space

cs.CV · 2026-05-18 · unverdicted · novelty 7.0

PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.

Neutral-Reference Prompting for Vision-Language Models

cs.CV · 2026-05-15 · unverdicted · novelty 7.0

NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.

STAR: Semantic-Temporal Adaptive Representation Learning for Few-Shot Action Recognition

cs.CV · 2026-05-13 · conditional · novelty 7.0

STAR improves 1-shot action recognition by up to 8.1% on SSv2-Full through semantic-temporal alignment and Mamba-based prototype refinement.

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-...

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.

VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

E2E-WAVE: End-to-End Learned Waveform Generation for Underwater Video Multicasting

eess.SP · 2026-04-18 · unverdicted · novelty 7.0

E2E-WAVE achieves +5 dB PSNR and real-time 16 FPS 128x128 video over 2.3 kbps underwater channels by learning waveforms that favor semantic similarity on decoding errors.

Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

Pairwise scoring signals in Vision Transformer token reduction are inherently unstable due to high perturbation counts and degrade in deep layers, causing collapse, while unary signals with triage enable CATIS to retain 96.9% accuracy at 63% FLOPs reduction on ViT-Large ImageNet-1K.

Efficient Video Diffusion Models: Advancements and Challenges

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

Improving Sparse Autoencoder with Dynamic Attention

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.

citing papers explorer

Showing 50 of 115 citing papers after filters.

MASS: Motion-Aligned Selective Scan for Refinement in Flow-Based Video Frame Interpolation cs.CV · 2026-06-26 · unverdicted · none · ref 35 · internal anchor
MASS reformulates SSM-based feature scanning in flow-based VFI to follow dynamic motion trajectories via learnable path integration and velocity-aware sampling, claiming SOTA on challenging large-displacement cases.
T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models cs.CV · 2026-06-22 · unverdicted · none · ref 42 · internal anchor
T-VSS is a lightweight test-time defense that steers attacked visual features in VLMs using sample-specific low-rank subspaces and reliability-weighted entropy minimization to improve robustness.
A New Multi-Domain Benchmark for Micro-Action Recognition and Detection cs.CV · 2026-06-12 · unverdicted · none · ref 1 · internal anchor
MMA-82 is a multi-domain benchmark with 82 micro-action categories, 77,856 instances from 454 subjects, and protocols for recognition and multi-label detection tasks including cross-domain and few-shot settings.
FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness cs.CV · 2026-06-05 · unverdicted · none · ref 44 · internal anchor
FS-DVS inserts a learnable spatial filter before DVS event triggering; the filter converges to center-surround kernels that emphasize mid-spatial frequencies and improve downstream detection and recognition.
VidMsg: A Benchmark for Implicit Message Inference in Short Videos cs.CV · 2026-06-02 · unverdicted · none · ref 34 · internal anchor
VidMsg is a new benchmark dataset and QA/retrieval tasks for implicit message inference in short videos, where current models perform poorly.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
An Attribute-Based Measure of Video Complexity cs.CV · 2026-05-30 · unverdicted · none · ref 44 · internal anchor
VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.
USV: Towards Understanding the User-generated Short-form Videos cs.CV · 2026-05-20 · unverdicted · none · ref 66 · internal anchor
Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
PERL: Parameter Efficient Reasoning in CLIP Latent Space cs.CV · 2026-05-18 · unverdicted · none · ref 27 · internal anchor
PERL augments frozen CLIP with a shared recurrent reasoning module of roughly 6K parameters that iteratively refines representations via latent token injection, delivering strong base-to-novel and transfer performance across 15 benchmarks.
Neutral-Reference Prompting for Vision-Language Models cs.CV · 2026-05-15 · unverdicted · none · ref 12 · internal anchor
NeRP corrects asymmetric class confusion in VLMs for unseen classes by combining neutral-prompt priors with sample likelihood to flip predictions on confusable pairs, improving new-class accuracy while preserving base-class performance.
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 44 · internal anchor
TeDiO regularizes temporal diagonals in diffusion transformer attention maps to produce smoother video motion while keeping per-frame quality intact.
Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning cs.CV · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.
Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations cs.CV · 2026-05-12 · unverdicted · none · ref 32 · 2 links · internal anchor
CoDAAR aligns modality-specific codebooks at the index level using Discrete Temporal Alignment and Cascading Semantic Alignment to achieve cross-modal generalization while preserving unique structures, reporting state-of-the-art results on event classification, localization, video segmentation, and cross-...
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning cs.CV · 2026-05-10 · unverdicted · none · ref 59 · internal anchor
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
VEBench: Benchmarking Large Multimodal Models for Real-World Video Editing cs.CV · 2026-05-05 · unverdicted · none · ref 32 · 2 links · internal anchor
VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on video editing technique recognition and operation simulation, revealing a large gap to human performance.
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation cs.CV · 2026-05-02 · unverdicted · none · ref 138 · internal anchor
VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
Inductive Convolution Nuclear Norm Minimization for Tensor Completion with Arbitrary Sampling cs.CV · 2026-04-18 · unverdicted · none · ref 37 · internal anchor
ICNNM reformulates CNNM using pre-learned shared convolution eigenvectors to bypass SVD computations, significantly reducing time while improving recovery performance for tensor completion with arbitrary sampling.
Efficient Video Diffusion Models: Advancements and Challenges cs.CV · 2026-04-17 · unverdicted · none · ref 124 · internal anchor
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation cs.CV · 2026-04-10 · unverdicted · none · ref 25 · internal anchor
LMFT enables state-of-the-art performance in video unsupervised domain adaptation by focusing on motion-rich tokens and reducing computational overhead.
InstrAct: Towards Action-Centric Understanding in Instructional Videos cs.CV · 2026-04-09 · unverdicted · none · ref 27 · internal anchor
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance cs.CV · 2026-04-03 · unverdicted · none · ref 37 · internal anchor
A framework that applies provenance-based guidance to input gradients during synthetic data training to promote learning from target regions only.
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation cs.CV · 2026-03-10 · unverdicted · none · ref 47 · internal anchor
FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.
Adapting MLLMs for Nuanced Video Retrieval cs.CV · 2025-12-15 · unverdicted · none · ref 65 · internal anchor
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation cs.CV · 2024-07-02 · unverdicted · none · ref 6 · internal anchor
OpenVid-1M supplies 1 million high-quality text-video pairs and introduces MVDiT to improve text-to-video generation by better using both visual structure and text semantics.
NetTailor: Tuning the Architecture, Not Just the Weights cs.CV · 2019-06-29 · unverdicted · none · ref 59 · internal anchor
NetTailor adapts CNN architecture for new tasks by assembling pre-trained universal blocks with task-specific layers, trained via activation mimicry and complexity penalties to match accuracy while reducing size for simpler tasks.
Decompose, Compare, and Decide: Multimodal LLMs are Implicit Few-Shot Learners cs.CV · 2026-06-30 · unverdicted · none · ref 40 · internal anchor
DeCoDe decomposes few-shot classification into binary pairwise image comparisons whose affirmative logits serve as similarity scores, enabling strong performance from unmodified MLLMs on twelve datasets.
Forget, Anticipate and Adapt: Test Time Training for Long Videos cs.CV · 2026-06-25 · unverdicted · none · ref 71 · 2 links · internal anchor
FFN performs TTT on multi-hour videos by restricting updates to three frames and using a surprise metric for adaptive window sizing, plus a new EpicTours dataset.
Modality-Aware Out-of-Distribution Detection for Multi-Modal Action Recognition cs.CV · 2026-06-23 · unverdicted · none · ref 36 · internal anchor
A modality-aware post-hoc detector for multi-modal OOD detection in action recognition combines uni-modal prediction relationships with feature-space scores and outperforms prior methods on the MultiOOD benchmark.
Black-Box Continual Learning for Vision-Language Models cs.CV · 2026-06-22 · unverdicted · none · ref 59 · internal anchor
Introduces Black-CL black-box benchmark and BETA textual-prototype method that matches or exceeds white-box continual learning performance on ten datasets using 0.05M parameters.
Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding cs.CV · 2026-06-21 · unverdicted · none · ref 18 · internal anchor
GPS framework adds self-guided reasoning modules to lightweight VLMs for fine-grained action understanding, claiming performance near GPT-4o with better factual accuracy on a custom CAP-based dataset.
TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living cs.CV · 2026-06-18 · unverdicted · none · ref 258 · internal anchor
TimeProVe proposes a propose-then-verify framework using lightweight action-based candidate evidence generation followed by targeted VLM verification for efficient long video temporal reasoning, achieving 7.3% improvement on OTB with 75% fewer VLM calls.
TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization cs.CV · 2026-06-16 · unverdicted · none · ref 111 · internal anchor
TivTok factorizes video clips into reusable time-invariant tokens and frame-specific time-variant tokens via Scope-Induced Factorization and Invariant Broadcasting, achieving 2.91x better compression for 128-frame videos on benchmarks.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers cs.CV · 2026-06-11 · unverdicted · none · ref 24 · internal anchor
RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.
Hybrid Robustness Verification for Spatio-Temporal Neural Networks cs.CV · 2026-06-08 · unverdicted · none · ref 40 · internal anchor
STBP computes exact closed-form bounds for the first convolutional layer of spatio-temporal networks and propagates scalable approximations through the rest to certify robustness under subset-frame or patch perturbations.
Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting cs.CV · 2026-06-04 · unverdicted · none · ref 23 · internal anchor
A parameter-free approach drops redundant video tokens via temporal L1 differences in frozen latent space and reconstructs them with LIT, yielding 31x speedup over ElasticTok-CV on TokenBench and DAVIS.
Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models cs.CV · 2026-06-03 · unverdicted · none · ref 53 · internal anchor
GPUA learns an orthogonal mapping from VFM to VLM feature space to preserve geometry and improve cross-model compatibility for zero-shot recognition and segmentation.
AREA: Attribute Extraction and Aggregation for CLIP-Based Class-Incremental Learning cs.CV · 2026-05-27 · unverdicted · none · ref 5 · internal anchor
AREA stabilizes attribute extraction with principal geodesic analysis on hyperspherical space and aggregation with lightweight task experts plus variational bottleneck and optimal transport routing, outperforming SOTA in CLIP-based CIL.
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification cs.CV · 2026-05-27 · unverdicted · none · ref 47 · internal anchor
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers cs.CV · 2026-05-26 · unverdicted · none · ref 32 · internal anchor
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models cs.CV · 2026-05-25 · unverdicted · none · ref 34 · internal anchor
Introduces Closed-Loop Bidirectional Prompting with Semantic Anchor for cross-modal agreement recovery, claiming SOTA adversarial robustness and generalization on 11 datasets.
UAV-OVO: Out-of-Viewpoint Generalization in UAV Action Recognition cs.CV · 2026-05-25 · unverdicted · none · ref 3 · internal anchor
UAV-OVO benchmark exposes large ID/OOD performance gaps in video action recognition due to low-to-high depression viewpoint shifts, and LATER uses LoRA subspace anchoring for test-time feature re-centering to reduce drift.
TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models cs.CV · 2026-05-17 · unverdicted · none · ref 80 · internal anchor
TAME uses a Mixture-of-Experts prompt bank with input-dependent routing and three unsupervised objectives to adaptively defend CLIP against adversarial attacks at inference time, achieving at least 49.1% robustness gain on 11 datasets.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning cs.CV · 2026-05-13 · unverdicted · none · ref 36 · 2 links · internal anchor
A3B2 introduces an adaptive asymmetric adapter with uncertainty-aware dampening to reduce branch bias in few-shot vision-language image classification and outperforms standard adapter and prompt methods.
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models cs.CV · 2026-05-12 · unverdicted · none · ref 41 · internal anchor
CPT creates cluster-invariant spaces from pre-trained VLM semantics and applies neural collapse losses to boost long-tail performance and unseen-class generalization in prompt tuning.
Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model cs.CV · 2026-05-07 · unverdicted · none · ref 41 · internal anchor
CAKI generates class-specific prompts from few-shot samples of the same class, stores them in a knowledge bank, and uses query-key matching to inject relevant class knowledge into test instance predictions for improved VLM performance.
VC-FeS: Viewpoint-Conditioned Feature Selection for Vehicle Re-identification in Thermal Vision cs.CV · 2026-05-06 · unverdicted · none · ref 40 · internal anchor
Viewpoint-conditioned feature selection improves thermal vehicle re-identification mAP by 19.7% on RGBNT100 and 12.8% on a new maritime dataset by adapting RGB ViT extractors.
SpecPL: Disentangling Spectral Granularity for Prompt Learning cs.CV · 2026-05-06 · unverdicted · none · ref 52 · internal anchor
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning cs.CV · 2026-05-06 · unverdicted · none · ref 15 · internal anchor
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
Prototype-Based Test-Time Adaptation of Vision-Language Models cs.CV · 2026-04-23 · unverdicted · none · ref 25 · internal anchor
PTA adapts VLMs at test time by maintaining and updating class-specific knowledge prototypes from test samples, achieving higher accuracy than cache-based methods with far less speed loss.
EAST: Early Action Prediction Sampling Strategy with Token Masking cs.CV · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.
