AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.
In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
representative citing papers
FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.
Introduces a unified benchmark for continual anomaly detection with discrete and continuous protocols plus a training-free DINOSaur method that outperforms prior CAD approaches with zero forgetting and sub-100ms edge inference.
COSY uses independent per-component 3DGS generators plus context tokens to achieve disentangled semantic editing of human heads without masks or classifiers.
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
AdaGScale uses viewpoint-adaptive scaling of Gaussians in 3D-GS by estimating peripheral color contributions to reduce Gaussian-tile pairs, delivering 13.8x geometric mean speedup with ~0.5 dB PSNR loss on city-scale scenes.
DPC-VQA decouples a frozen MLLM perceptual prior from a lightweight residual calibration branch to adapt video quality assessment to new scenarios with under 2% trainable parameters and 20% of typical MOS labels.
Fleet achieves dynamic few-shot adaptation for AIGI detection via avoidance routing in decoupled subspaces, raising accuracy from 20.4% to 73.1% on new generators like Doubao Seedream 4.0 with 10 shots.
RBE-Flow recasts dense cross-modal flow estimation as closed-loop recurrent Bayesian estimation on learned feature manifolds with uncertainty-adaptive updates and achieves SOTA on three registration benchmarks.
Rotary positional encodings reduce the symmetry group of functional equivalence in attention compared to sinusoidal encodings, increasing expressivity and altering linear mode connectivity patterns.
Presents Streaming-Train-248K dataset, Streaming Harness system, and Streaming-Eval benchmark to enable VLMs for proactive, memory-equipped streaming video understanding.
Introduces ShopTrajQA long-context benchmark and an RLVR-trained tool-augmented agent that bypasses LLM context limits by external file storage and code-based retrieval for shopping trajectories.
AFUN predicts task-conditional functional masks and 3D post-contact motion curves from RGB-D and language, trained via a standardized multi-source data pipeline, and reports large gains over baselines on segmentation, contact prediction, and motion tasks.
MultiAct is an unpaired inference-time method that adaptively amplifies cross-attention for underrepresented components in composite text prompts to improve semantic coverage in motion generation while preserving realism.
Two event cameras automatically estimate impact time, racket-face location, and shuttlecock speed in badminton smashes, validated against high-speed cameras on 124 trials with small biases and no proportional error.
An audit of 152 papers reveals that geospatial foundation models lack standardized evaluations, training controls, and weight releases, leaving the state of the art undetermined.
CAAP produces patch attributions in ViTs by direct activation patching on intermediate layers to measure causal contribution to the target class score.
A three-agent mobile system for end-to-end walking support shows motivational companion dialogue boosts affect and UX in a 12-person in-the-wild crossover study.
ACPO uses anchor-based regularization with NR-IQA guidance to enable stable perceptual quality improvements in diffusion model fine-tuning.
RareSpot+ boosts small-object detection mAP by 0.13 on aerial wildlife data and cuts annotation needs to 1.7% of tiles via consistency losses and spatial priors.
A deep learning method with an enhanced physical degradation model incorporating anisotropic light spread and hidden skyglow, trained via generative models and synthetic-real coupling, removes light pollution from night cityscape images more effectively than prior restoration techniques.
STEP uses dynamic superpatch merging via dCTS and early token exits to cut token count by 2.5x and computational complexity by up to 4x on ViT-Large for high-res segmentation, with at most 2% accuracy drop and 40% tokens halted early.
A hybrid ANN-CANN network for visual object tracking operationalizes bias-variance complementarity to outperform baselines on nine benchmarks.