{"total":118,"items":[{"citing_arxiv_id":"2606.00891","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MMDG-Bench: A Benchmark for Multimodal Domain Generalization","primary_cat":"cs.CV","submitted_at":"2026-05-30T20:52:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MMDG-Bench provides unified protocols and ten baselines for multimodal domain generalization, showing structured DG-MML combinations often outperform prior methods with insights on framework choice and backbone effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00640","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Attribute-Based Measure of Video Complexity","primary_cat":"cs.CV","submitted_at":"2026-05-30T09:30:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VideoABC estimates video-LLM failure probability via low-dimensional attribute projection, dual quantization (k-means plus lattice), and psychophysics-inspired synthetic data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00439","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Physical Object Understanding with a Physically Controllable World Model","primary_cat":"cs.CV","submitted_at":"2026-05-30T00:10:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Autoregressive probabilistic world models trained on raw videos yield emergent object segmentation, 3D controllability, and physical relationship inference via multi-future motion correlation 
analysis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31529","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-29T16:43:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SVI-Bench is a 35K-hour sports video benchmark with 9 tasks across four cognitive pillars that reveals multimodal models drop from ~73% on action QA to 5% on agentic evidence-gathering tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31108","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams","primary_cat":"cs.CV","submitted_at":"2026-05-29T10:17:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Domain-incremental video learning that permits forgetting through per-domain LoRA adapters and recovers the matching adapter at inference via test-time training on a self-supervised MAE reconstruction head.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30673","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TeachObs: A Human-Validated Benchmark for Multimodal Teaching Observation and Model Evaluation","primary_cat":"cs.CL","submitted_at":"2026-05-29T00:06:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TeachObs is a new human-validated benchmark dataset and evaluation protocol for multimodal AI on classroom teaching observation, showing 
no model dominates across tracks and that models over-rate procedurally clear lessons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30346","ref_index":57,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"YoCausal: How Far is Video Generation from World Model? A Causality Perspective","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28604","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification","primary_cat":"cs.CV","submitted_at":"2026-05-27T15:20:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23288","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spatio-Temporal Similarity Volume Aggregation for Open-Vocabulary Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-22T07:01:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SimVA constructs a 4D similarity volume over video tokens and action classes then applies spatial, motion-aware, and 
Mamba-based temporal aggregation to achieve competitive zero-shot and few-shot performance on open-vocabulary action recognition benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22819","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cambrian-P: Pose-Grounded Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[41] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. [42] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In CVPR, 2023. [43] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 
[44] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias"},{"citing_arxiv_id":"2605.22372","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ASAP: Attention Sink Anchored Pruning","primary_cat":"cs.LG","submitted_at":"2026-05-21T12:04:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ASAP prunes tokens in ViTs by anchoring on attention sinks modeled as lazy random walks, using cumulative transition matrices and radial diffusion clustering to compress redundancy while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21977","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:11:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20838","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"USV: Towards Understanding the User-generated Short-form Videos","primary_cat":"cs.CV","submitted_at":"2026-05-20T07:27:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20645","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Seeing Through Fog: Towards Fog-Invariant Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-20T03:09:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces FogAct paired clean-foggy video dataset and FogNet two-stream CLIP model that learns fog-invariant semantic representations via clean-video guidance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19510","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Return of Frustratingly Easy Unsupervised Video Domain Adaptation","primary_cat":"cs.CV","submitted_at":"2026-05-19T08:07:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MetaTrans improves unsupervised video domain adaptation performance by separating and subtracting spatial and temporal divergences via a dedicated module and a minimal two-term loss objective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18257","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook","primary_cat":"cs.CV","submitted_at":"2026-05-18T11:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CodeBind uses a modality-shared-specific codebook and compositional vector quantization to decouple shared semantic features from modality-unique details, achieving state-of-the-art multimodal classification 
and retrieval across nine modalities without requiring fully paired data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17671","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-17T22:04:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17311","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpecSem-Net: Integrating Spectral and Semantic Features for Robust AI-generated Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpecSem-Net integrates Fourier-based spectral filtering with semantic-guided gated merging to detect AI-generated videos, reporting 87.25% accuracy on a new benchmark of five commercial generators and 95.59% on public datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15477","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoExo-WM: Unlocking Exo Video for Ego World 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-14T23:35:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15342","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T19:12:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14569","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-14T08:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Evaluation Metrics.We evaluate the reconstructed videos at semantic, spatiotemporal, and pixel levels [20, 25]. 
For semantic-level evaluation, we compute N-way top-K accuracy to assess whether the generated videos semantically match the ground-truth (GT) clips, using a VideoMAE [92]-based classifier on 400 video classes from the Kinetics-400 dataset [42], following prior work [20, 25, 98]. For spatiotemporal-level evaluation, we use CLIP temporal consistency (CLIP-pcc) [77] and DINO [72] temporal consistency [20, 35] (DTC) to measure the spatiotemporal coherence of the generated videos. Additionally, we employ the Motion Smoothness (MS) and Dynamic Degree [35] to assess the smoothness and magnitude of movements, along"},{"citing_arxiv_id":"2605.11497","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-12T04:15:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PoseBridge recovers semantic information lost during skeletonization by extracting pose-anchored cues from human pose estimation and transferring them via skeleton-conditioned bridging and semantic prototype adaptation, yielding 13.3-17.4 point gains on the Kinetics PURLS benchmark.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Thus, inference explicitly couples the two sides of our bridge: zb carries pose-anchored semantics recovered from the skeleton extraction process, while ˜t provides skeleton-compatible semantic targets, jointly reducing the skeleton-text semantic gap. 4 Experiments 4.1 Experimental Setup Datasets. We evaluate PoseBridge on NTU-RGB+D 60 [24], NTU-RGB+D 120 [20], PKU-MMD [18], and Kinetics-200/400 [12, 35]. NTU-RGB+D 60/120 and PKU-MMD are controlled RGB-D skeleton action benchmarks, while Kinetics-200/400 provides in-the-wild videos with diverse scenes and action contexts. 
For all datasets, we follow the standard seen/unseen class splits used in prior ZSSAR works [8, 34, 35]. Detailed dataset descriptions are provided in Appendix B. 6 Table 1: Comparison on NTU-RGB+D 60/120 under standard ZSL/GZSL splits."},{"citing_arxiv_id":"2605.09640","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2026-05-10T16:36:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We extend the evaluation to class-incremental video classification, where the model sequentially learns new classes from a stream of video clips. The primary difference from image classification is the modality shift from images to video, which introduces temporal cues across tasks. We conduct experiments on two commonly used video recognition datasets: UCF-101 [59] and Kinetics-200 [60, 61]. Due to the substantial computational cost of video training and limited 8 Table 4: A comparison of Domain-Incremental Learning (DIL) across image classification and object detection. 
Method Image Classification Object Detection DomainNet (6 Domains) OfficeHome (4 Domains) Pascal Series (4 Domains) A ↑ F ↓ A ↑ F ↓ ¯Ab ↑ F b ↓ Joint TrainingSFT 67."},{"citing_arxiv_id":"2605.09422","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs","primary_cat":"cs.CL","submitted_at":"2026-05-10T08:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery. 1 Introduction Large Multimodal Models (LMMs) [1-3] have achieved remarkable progress across a wide range of video understanding tasks such as action recognition [4], visual question answering [5], and scene description [6]. 
Yet in real-world scenarios, video understanding demands more than perceiving what is visible, requiring instead the ability to reason about why observed events occur, such as inferring why a vehicle brakes suddenly in autonomous driving or what triggers an abnormal tissue response in medical video analysis."},{"citing_arxiv_id":"2605.07859","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-08T15:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07568","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-08T10:40:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"representation empowers large language models with image and video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13700-13710, 2024. 
[21] Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, and Mike Rabbat. Interpreting physics in video world models. arXiv preprint arXiv:2602.07050, 2026. [22] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. [23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models."},{"citing_arxiv_id":"2605.06894","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"McNdroid: A Longitudinal Multimodal Benchmark for Robust Drift Detection in Android Malware","primary_cat":"cs.CR","submitted_at":"2026-05-07T19:53:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"McNdroid is a new longitudinal multimodal benchmark showing that Android malware detectors degrade over time but multimodal approaches maintain better performance across long temporal gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06747","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HumanNet: Scaling Human-centric Video Learning to One Million Hours","primary_cat":"cs.CV","submitted_at":"2026-05-07T15:21:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HumanNet is a 1M-hour human-centric video dataset with interaction annotations that enables better vision-language-action model performance than equivalent robot data in a controlled test.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"Egomimic: Scaling imitation 
learning via egocentric video, 2024. URL https://arxiv.org/abs/2410.24221. [19] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The kinetics human action video dataset, 2017. URL https://arxiv.org/abs/1705.06950. [20] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, Ilija Radosavovic, Kaiyuan"},{"citing_arxiv_id":"2605.06351","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition","primary_cat":"cs.HC","submitted_at":"2026-05-07T14:33:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05895","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting AI-Generated Videos with Spiking Neural Networks","primary_cat":"cs.CV","submitted_at":"2026-05-07T09:08:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAST with spiking neural networks achieves 93.14% mean accuracy detecting AI-generated videos from 10 unseen generators by exploiting smoother pixel residuals and compact semantic 
trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03848","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback","primary_cat":"cs.CV","submitted_at":"2026-05-05T15:14:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SkillFormer, PATS, and ProfVLM deliver state-of-the-art multi-view proficiency estimation on Ego-Exo4D with up to 20x fewer parameters by combining selective fusion, dense sampling, and generative feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03820","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal Learning on Low-Quality Data with Conformal Predictive Self-Calibration","primary_cat":"cs.CV","submitted_at":"2026-05-05T14:48:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CPSC uses conformal prediction to decompose and fuse robust unimodal features and recalibrate gradients based on instance reliability, outperforming prior methods on imbalanced and noisy multimodal benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03276","ref_index":17,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","primary_cat":"cs.CV","submitted_at":"2026-05-05T02:05:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VEBENCH is the first benchmark with 3.9K videos and 3,080 human-verified QA pairs that measures LMMs on 
video editing technique recognition and operation simulation, revealing a large gap to human performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"soning, editing serves as a real-world testbed for studying complex multimodal cognition, highly relevant to filmmaking [8, 24], short-form content creation [14, 33], and professional media production [15, 29]. Several recent benchmarks have begun to explore domains tangentially related to video editing and filmmaking [11, 17, 22, 34, 36]. VEU-Bench [17] investigates a range of editing techniques such as cuts and transitions, formulating questions that span from recognition to reasoning. ShotBench [22] emphasizes shot-level analysis and cinematic attributes, including composition and camera movement. While these benchmarks have made valuable contributions toward advancing video understanding in editing-"},{"citing_arxiv_id":"2605.02134","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video Generation with Predictive Latents","primary_cat":"cs.CV","submitted_at":"2026-05-04T01:30:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PV-VAE improves video latent spaces for generation by unifying reconstruction with future-frame prediction, reporting 52% faster convergence and 34.42 FVD gain over Wan2.2 VAE on UCF101.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"learning of structural motion and temporal evolution. The total loss is formulated as follows: $L_{\\text{total}} = \\lambda_{\\text{rec}}(L_{\\text{MSE}} + L_{\\text{Diff}}) + \\lambda_{\\text{lpips}} L_{\\text{LPIPS}} + \\lambda_{\\text{gan}} L_{\\text{GAN}} + \\lambda_{\\text{kl}} L_{\\text{KL}}$, (1) where each $\\lambda$ controls the relative contribution of its corresponding component. 4 Experiments 4.1 Experimental setups Evaluation details. We evaluate PV-VAE on three widely used benchmarks: UCF101 [40], RealEstate10K [69], and Kinetics-400 [21]. 
For video generation, we follow prior work [10, 53] and adopt the Latte architecture [29], a Transformer-based latent diffusion model that supports both unconditional and class-conditional generation. We use UCF101 for class-conditional generation and RealEstate10K for unconditional generation. All videos are converted into 17-frame clips at 256×256 resolution for both training and testing."},{"citing_arxiv_id":"2605.02094","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SignMAE: Segmentation-Driven Self-Supervised Learning for Sign Language Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-03T23:25:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SignMAE uses segmentation-driven masking in a mask-and-reconstruct self-supervised task to learn fine-grained sign representations, achieving state-of-the-art accuracy on WLASL, NMFs-CSL, and Slovo with fewer frames and modalities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"3 Implementation Details Model Initialization and Data Preparation. In our experiments, we use Vision Transformer (large) as the backbone for both the video and keypoint heatmap models. We initialize the parameters of the video model using the checkpoint from VideoMAE [23]. The backbone of the checkpoint is Vision Transformer (large) pretrained on the Kinetics-400 [13] dataset. The keypoint heatmap model is initialized randomly. Since no ground truth keypoint data is available in sign language datasets, we use Sapiens [14] for its high performance to extract 55 keypoints, 13 body keypoints, and 42 hand keypoints. Next, the keypoint coordinates are converted to a keypoint heatmap with 224×224 resolution. 
To improve efficiency, we pretrain separate encoders on the three language"},{"citing_arxiv_id":"2605.01967","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-03T16:53:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01517","ref_index":137,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation","primary_cat":"cs.CV","submitted_at":"2026-05-02T16:10:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26488","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Featurising Pixels from Dynamic 3D Scenes with Linear In-Context Learners","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:51:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LILA learns temporally consistent semantic and geometric pixel features from uncurated videos via linear in-context learning on off-the-shelf depth and motion cues, 
yielding empirical gains on video object segmentation, surface normal estimation, and semantic segmentation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26461","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"$\\text{PKS}^4$:Parallel Kinematic Selective State Space Scanners for Efficient Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-29T09:17:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PKS^4 adds a kinematic-prior-driven parallel state space scanner module to 2D vision backbones for linear-complexity temporal modeling in videos, delivering SOTA action recognition with 10x lower training compute and convergence in 20 epochs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23415","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis","primary_cat":"cs.CV","submitted_at":"2026-04-25T19:15:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"DualStreamHybrid assigns ViT-Tiny to RGB and MobileNetV2 to 20-channel flow, projects features to common space, and finds cross-attention best on UCF11 (98.12%) while weighted fusion is most consistent on UCF50 (96.86%).","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ViViT [3] explored several factorisation strategies for spatiotemporal attention, finding that large-scale pretraining is critical to performance. 
MViT [9] introduced pooled attention that progressively reduces sequence length while increasing channel depth, enabling efficient multi-scale modelling without the full quadratic cost. Video Swin Transformer [23] adapted the shifted-window design of Swin Transformer [22] to 3D video, achieving competitive results with a more structured attention pattern. The applicability of transformer-based encoders to smaller-scale benchmarks such as UCF50 was demonstrated by Hussain et al. [16], who coupled ViT-Base/16 with an LSTM to achieve 96.1%, showing that transformer appearance encoders are effective even outside large-scale pretraining regimes."},{"citing_arxiv_id":"2604.22595","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges","primary_cat":"cs.CV","submitted_at":"2026-04-24T14:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"EV-CLIP introduces mask and context visual prompts to adapt CLIP for improved few-shot video action recognition under visual challenges such as low light and egocentric views, outperforming other efficient methods with backbone-scale-independent efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21011","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Micro-DualNet: Dual-Path Spatio-Temporal Network for Micro-Action Recognition","primary_cat":"cs.CV","submitted_at":"2026-04-22T19:00:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Micro-DualNet employs dual ST and TS pathways with entity-level adaptive routing and Mutual Action Consistency loss to achieve competitive results on MA-52 and state-of-the-art on iMiGUE for micro-action 
recognition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20760","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploring High-Order Self-Similarity for Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-22T16:48:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Order Self-Similarity), that learns distinct representations of STSSs at diverse orders and integrates them into holistic motion features. The proposed module is lightweight and can be easily integrated into existing video architectures, enhancing temporal modeling capabilities across various domains (Fig. 1b). We first evaluate our method on diverse action recognition benchmarks, i.e., Kinetics-400 [31], Something-Something V1 & V2 [21, 54], Diving48 [43], and FineGym [64], demonstrating significant performance improvements, introducing marginal computation and memory overhead. 
We further incorporate MOSS into video multi-modal large language models (MLLMs) and demonstrate that it enhances fine-grained motion understanding, leading to substantial gains on"},{"citing_arxiv_id":"2604.19093","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration","primary_cat":"cs.CV","submitted_at":"2026-04-21T05:18:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A probabilistic Gaussian model with adaptive contrastive asymmetry rectification improves multi-modal test-time adaptation by modeling category distributions and correcting modality asymmetry for better predictions under shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18367","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EAST: Early Action Prediction Sampling Strategy with Token Masking","primary_cat":"cs.CV","submitted_at":"2026-04-20T14:57:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EAST uses randomized time-step sampling and token masking to train a single encoder-only model that generalizes across all observation ratios in early action prediction and reports new state-of-the-art accuracy on NTU60, SSv2, and UCF101.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17971","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Identifying Ethical Biases in Action Recognition Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T08:51:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The authors create a synthetic video auditing 
framework that detects statistically significant skin color biases in popular human action recognition models even when actions are identical.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17074","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Comparison Drives Preference: Reference-Aware Modeling for AI-Generated Video Quality Assessment","primary_cat":"cs.CV","submitted_at":"2026-04-18T17:21:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RefVQA uses a query-centered reference graph and graph-guided difference aggregation to improve AI-generated video quality assessment by incorporating inter-video comparisons.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16240","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting","primary_cat":"cs.CV","submitted_at":"2026-04-17T17:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CollideNet achieves state-of-the-art time-to-collision forecasting on three public datasets by combining multi-scale spatial aggregation with temporal disentanglement of trend and seasonality in a hierarchical transformer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14816","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"NTIRE 2026 Challenge on Video Saliency Prediction: Methods and 
Results","primary_cat":"cs.CV","submitted_at":"2026-04-16T09:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The NTIRE 2026 Challenge released a public dataset of 2,000 videos with crowdsourced saliency maps and reported results from participating teams using standard quality metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14149","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[25] Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In ICCV, 2017. 25 [26] Peng Jin, Ryuichi Takanobu, Wancai Zhang, Xiaochun Cao, and Li Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024. 3 [27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 25 [28] Bruno Korbar, Yongqin Xian, Alessio Tonioni, Andrew Zisserman, and Federico Tombari. 
Text-conditioned resampler for long form video understanding."}],"limit":50,"offset":0}