In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
citing papers explorer
-
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.
-
Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning
FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.
-
Rethinking Continual Anomaly Detection on the Edge: Benchmarking Under Realistic Industrial Conditions
Introduces a unified benchmark for continual anomaly detection with discrete and continuous protocols plus a training-free DINOSaur method that outperforms prior CAD approaches with zero forgetting and sub-100ms edge inference.
-
COSY: Compositional 3DGS Synthesis for Disentangled Human Head Editing
COSY uses independent per-component 3DGS generators plus context tokens to achieve disentangled semantic editing of human heads without masks or classifiers.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs
AdaGScale uses viewpoint-adaptive scaling of Gaussians in 3D-GS by estimating peripheral color contributions to reduce Gaussian-tile pairs, delivering 13.8x geometric mean speedup with ~0.5 dB PSNR loss on city-scale scenes.
-
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
DPC-VQA decouples a frozen MLLM perceptual prior from a lightweight residual calibration branch to adapt video quality assessment to new scenarios with under 2% trainable parameters and 20% of typical MOS labels.
-
Fleet: Few Shots Lead Effective AI-generated Image Detection
Fleet achieves dynamic few-shot adaptation for AIGI detection via avoidance routing in decoupled subspaces, raising accuracy from 20.4% to 73.1% on new generators like Doubao Seedream 4.0 with 10 shots.
-
RBE-Flow: Recurrent Bayesian Estimation on Feature Manifolds for Cross-Modal Registration
RBE-Flow recasts dense cross-modal flow estimation as closed-loop recurrent Bayesian estimation on learned feature manifolds with uncertainty-adaptive updates and achieves SOTA on three registration benchmarks.
-
Functional Equivalence in Attention: A Comprehensive Study with Applications to Linear Mode Connectivity
Rotary positional encodings reduce the symmetry group of functional equivalence in attention compared to sinusoidal encodings, increasing expressivity and altering linear mode connectivity patterns.
-
Harnessing Streaming Video in the Wild
Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
-
Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR
Introduces ShopTrajQA long-context benchmark and an RLVR-trained tool-augmented agent that bypasses LLM context limits by external file storage and code-based retrieval for shopping trajectories.
-
AFUN: Towards an Affordance Foundation Model for Functionality Understanding
AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.
-
MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance
MultiAct is an unpaired inference-time method that adaptively amplifies cross-attention for underrepresented components in composite text prompts to improve semantic coverage in motion generation while preserving realism.
-
Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras
Two event cameras automatically estimate impact time, racket-face location, and shuttlecock speed in badminton smashes, validated against high-speed cameras on 124 trials with small biases and no proportional error.
-
No One Knows the State of the Art in Geospatial Foundation Models
An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, so no one knows the state of the art.
-
Causal Attribution via Activation Patching
CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.
-
SmartWalkCoach: An AI Companion for End-to-End Walking Guidance, Motivation, and Reflection
A three-agent mobile system for end-to-end walking support shows motivational companion dialogue boosts affect and UX in a 12-person in-the-wild crossover study.
-
ACPO: Anchor-Constrained Perceptual Optimization for Diffusion Models with No-Reference Quality Guidance
ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.
-
RareSpot+: A Benchmark, Model, and Active Learning Framework for Small and Rare Wildlife in Aerial Imagery
RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
-
Deep Light Pollution Removal in Night Cityscape Photographs
A deep learning method with an enhanced physical degradation model incorporating anisotropic light spread and hidden skyglow, trained via generative models and synthetic-real coupling, removes light pollution from night cityscape images more effectively than prior restoration techniques.
-
Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
STEP uses dynamic superpatch merging via dCTS and early token exits to cut token count by 2.5x and computational complexity by up to 4x on ViT-Large for high-res segmentation, with at most 2% accuracy drop and 40% tokens halted early.
-
A Theory-grounded Hybrid Neural Network Integrating Complementary Estimation Mechanisms for Stable Visual Object Tracking
Hybrid ANN-CANN network for visual object tracking that operationalizes bias-variance complementarity to outperform baselines on nine benchmarks.
-
From Full Boards to Tiny Defects: Scale-Aware Tile Inference with Topology-Aware Merging for High-Resolution PCB Defect Detection
Tile-based inference with topology-aware merging improves small PCB defect detection by preserving scale and resolving edge artifacts on two datasets.
-
SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection
SPECTRA-Net fuses multi-view tensor representations from vision foundation models, spectral analysis, local anomaly detection, and statistical descriptors to achieve state-of-the-art cross-domain AI-generated image detection with explainable artifact localization.
-
Generalization Under Scrutiny: Cross-Domain Detection Progresses, Pitfalls, and Persistent Challenges
A survey that organizes methods for cross-domain object detection into a taxonomy, analyzes domain shift across detection stages, and outlines persistent challenges.
-
Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions
A literature review that categorizes deep learning approaches for visual hand gesture recognition, summarizes state-of-the-art methods across tasks, reviews datasets and metrics, and identifies challenges and future directions.