Neural reconstruction losses in VAEs reduce latent information content and produce more isotropic latent geometries with even uncertainty distribution.
Canonical reference
In: IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2022, Waikoloa, HI, USA, January 3-8, 2022
79% of citing Pith papers cite this work as background.
representative citing papers
PanoPlane achieves up to 17.8% PSNR gains in sparse-view indoor novel view synthesis by using training-free plane-aware panoramic completion to supervise 3D Gaussian Splatting.
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
CEA assembles per-token low-rank residual updates via dense affinities over hyper-adapter-generated components to improve all-in-one image restoration on spatially non-uniform degradations.
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
A deep learning framework represents phase on the unit circle with a geodesic loss for improved ptychographic amplitude and phase reconstruction.
VGIA certifies exact recovery of individual records from aggregated gradients in federated learning using a subspace verification test on ReLU hyperplanes.
The C-Score quantifies intra-class explanation consistency for CAM methods via confidence-weighted pairwise soft IoU and detects AUC-consistency dissociation as an early warning for model instability on chest X-ray classification.
Diffusion models reconstruct high-resolution 3D cardiac ultrasound volumes from heavily undersampled elevation planes and outperform traditional interpolation and supervised deep learning baselines.
OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
OSOR is a one-step diffusion inpainting method using an occupancy-guided discriminator, alpha head, and semantic-anchored verification pipeline to achieve effect-aware object removal, outperforming multi-step baselines in quality at 4-30x speed.
QUEST measures uncertainty via the Lebesgue volume of highest-density regions of a distribution's support, evaluated at robustness parameter alpha, and claims to satisfy UQ axioms while outperforming variance and differential entropy on selective prediction tasks.
DiffPC reformulates projector photometric compensation as a diffusion-based denoising task guided by photometry and image content to achieve better results in unseen environments.
GRASP maps natural language to bounding-box goals via VLM for neuro-symbolic planning and reports 73.3% success in 90 real-robot trials without task-specific training.
MoRE integrates a sparsely activated MoE module with unsupervised routing into a variational network for stable multimodal MRI reconstruction on fastMRI brain and knee data at 8x undersampling.
IPO-Mine releases a toolkit and large multimodal dataset for structured analysis of IPO filings and shows state-of-the-art models diverge from human judgments on chart quality and misleadingness.
Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus, which uses unbalanced optimal transport and targeted interventions, substantially outperforms them across multiple models and settings.
LiFT factorizes 3D medical volume synthesis into per-slice 2D generation and inter-slice trajectory learning, using a tri-planar drifting loss for unconditional coherence and a z-context mixer for paired translation tasks.
SCOUP decouples 2D sparse code learning from 3D Gaussian optimization to deliver up to 400x training speedup and 3x better memory efficiency while matching accuracy on open-vocabulary 3D queries.
A new 839K-image plant disease dataset paired with an agentic visual reasoning system that uses source-grounded symptoms raises diagnosis accuracy by 16.2 points on average and generalizes to unseen crops without retraining.
MAG-VLAQ fuses multi-modal ground and aerial data via ODE-conditioned vector-of-locally-aggregated-queries to nearly double recall@1 on aerial-ground place recognition benchmarks.
Lexical acoustic coding lets LLMs transmit audio waveforms as editable natural-language sentences that another LLM can parse and reconstruct into sound.
Frozen DINOv3 features with multi-view MLP probes, entropy-weighted fusion, and spatial regularization achieve 0.895 Dice on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR, recovering 98.4% of full-data performance with only five annotated patients.
LARGO uses a low-rank hypernetwork with CP decomposition to unify 2^N-1 missing-modality models into one, ranking first in 47 of 52 configurations on BraTS and ISLES with small Dice gains over baselines.
citing papers explorer
IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents