Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
Canonical reference
In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022
79% of citing Pith papers cite this work as background.
representative citing papers
A deep learning framework represents phase on the unit circle with a geodesic loss for improved ptychographic amplitude and phase reconstruction.
VGIA certifies exact recovery of individual records from aggregated gradients in federated learning using a subspace verification test on ReLU hyperplanes.
Diffusion models reconstruct high-resolution 3D cardiac ultrasound volumes from heavily undersampled elevation planes and outperform traditional interpolation and supervised deep learning baselines.
QUEST measures uncertainty via the Lebesgue volume of highest-density regions of a distribution's support, evaluated at robustness parameter alpha, and claims to satisfy UQ axioms while outperforming variance and differential entropy on selective prediction tasks.
DiffPC reformulates projector photometric compensation as a diffusion-based denoising task guided by photometry and image content to achieve better results in unseen environments.
GRASP maps natural language to bounding-box goals via VLM for neuro-symbolic planning and reports 73.3% success in 90 real-robot trials without task-specific training.
MoRE integrates a sparsely activated MoE module with unsupervised routing into a variational network for stable multimodal MRI reconstruction on fastMRI brain and knee data at 8x undersampling.
IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.
A new 839K-image plant disease dataset paired with an agentic visual reasoning system that uses source-grounded symptoms raises diagnosis accuracy by 16.2 points on average and generalizes to unseen crops without retraining.
Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.
citing papers explorer
-
PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
Continuous Expert Assembly: Instance-Conditioned Low-Rank Residuals for All-in-One Image Restoration
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
-
Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification
The C-Score quantifies intra-class explanation consistency for CAM methods via confidence-weighted pairwise soft IoU and detects AUC-consistency dissociation as an early warning for model instability on chest X-ray classification.
-
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
-
OSOR: One-Step Diffusion Inpainting for Effect-Aware Object Removal
OSOR is a one-step diffusion inpainting method using an occupancy-guided discriminator, alpha head, and semantic-anchored verification pipeline to achieve effect-aware object removal, outperforming multi-step baselines in quality at 4-30x speed.
-
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus, using unbalanced optimal transport and targeted interventions, substantially outperforms them across multiple models and settings.
-
LiFT: Lifted Inter-slice Feature Trajectories for 3D Image Generation from 2D Generators
LiFT factorizes 3D medical volume synthesis into per-slice 2D generation and inter-slice trajectory learning, using a tri-planar drifting loss for unconditional coherence and a z-context mixer for paired translation tasks.
-
Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting
SCOUP decouples 2D sparse code learning from 3D Gaussian optimization to deliver up to 400x training speedup and 3x better memory efficiency while matching accuracy on open-vocabulary 3D queries.
-
MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition
MAG-VLAQ fuses multi-modal ground and aerial data via ODE-conditioned vector-of-locally-aggregated-queries to nearly double recall@1 on aerial-ground place recognition benchmarks.
-
DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation
Frozen DINOv3 features with multi-view MLP probes, entropy-weighted fusion, and spatial regularization achieve 0.895 Dice on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR, recovering 98.4% of full-data performance with only five annotated patients.
-
LARGO: Low-Rank Hypernetwork for Handling Missing Modalities
LARGO uses a low-rank hypernetwork with CP decomposition to unify 2^N-1 missing-modality models into one, ranking first in 47 of 52 configurations on BraTS and ISLES with small Dice gains over baselines.
-
Amodal SAM: A Unified Amodal Segmentation Framework with Generalization
Amodal SAM extends SAM with a Spatial Completion Adapter, Target-Aware Occlusion Synthesis for data, and consistency losses to reach SOTA amodal segmentation with strong generalization to new objects and scenes.
-
From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation
Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.
-
Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection
RPC is a post-hoc calibration technique that augments flow-based anomaly scores with nearest-prototype deviation in the frozen latent space, gated by keypoint confidence, yielding consistent AUROC gains on video anomaly detection tasks.
-
Image Quality Assessment of Identity Cards Using Measures from Open Face Image Quality
OFIQ quality measures applied to preprocessed ID card images show correlation with improved presentation attack detection performance across four datasets containing both real and printed mock cards.
-
Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
CIR benchmarks contain many unimodal shortcuts and noisy queries, leading to overestimation of models' multimodal composition capabilities.
-
Beyond Masks: The Case for Medical Image Parsing
Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationships, and closure.
-
Weighted Knowledge Distillation for Semi-Supervised Segmentation of Maxillary Sinus in Panoramic X-ray Images
A semi-supervised framework using weighted knowledge distillation and SinusCycle-GAN refinement achieves 96.35% Dice score for maxillary sinus segmentation in panoramic X-rays from 2,511 patients.
-
Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.
-
3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography
A 3D self-supervised foundation model trained on over 360k head CT scans improves downstream disease classification on limited-label internal and external datasets versus scratch-trained and prior models.
-
A Heterogeneous Two-Stream Framework for Video Action Recognition with Comparative Fusion Analysis
DualStreamHybrid assigns ViT-Tiny to RGB and MobileNetV2 to 20-channel flow, projects features to common space, and finds cross-attention best on UCF11 (98.12%) while weighted fusion is most consistent on UCF50 (96.86%).
-
Variational Latent Entropy Estimation Disentanglement: Controlled Attribute Leakage for Face Recognition
VLEED uses variational latent entropy estimation to separate categorical attributes from identity in face embeddings, achieving wider privacy-utility tradeoffs and bias reduction than prior methods on IJB-C, RFW, and VGGFace2.