Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

Claire Kim; Hyunhee Park; Jaeeun Lee; Jaewon Min; Jihye Park; Jin Hyeon Kim; Kyoungjin Oh; MinKyu Park; Paul Hyunbin Cho; Seungryong Kim; Yeji Choi

arxiv: 2605.26230 · v1 · pith:NIVODSZInew · submitted 2026-05-25 · 💻 cs.CV

Pith reviewed 2026-06-29 23:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords: multi-view 3D reconstruction, diffusion denoising, feature space restoration, geometry-aware features, robust reconstruction, image restoration, feed-forward models

The pith

GARD restores accurate 3D scene geometry by applying diffusion denoising directly inside the feature space of a feed-forward reconstructor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Geometry-Aware Representation Denoising (GARD) to improve multi-view 3D reconstruction when input images contain real-world degradations. It performs diffusion-based restoration in the feature space of an existing feed-forward 3D model rather than in pixel space. This leverages the geometry-aware properties already present in those features to recover scene structure without modeling degradations explicitly. An added decoder further allows the same refined features to produce high-quality RGB images. The approach is evaluated on the Depth Anything 3 benchmark for degraded conditions.

Core claim

GARD performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. An additional RGB image decoder enables simultaneous recovery of high-quality imagery from the refined representations.

What carries the argument

Diffusion process operating on geometry-aware features extracted by the base 3D reconstructor, which guides restoration without explicit degradation modeling.
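As a rough illustration of where the denoiser sits, the sketch below follows the data flow described in the Figure 3 caption: encode with the frozen layers up to layer K, restore the intermediate features, run the remaining frozen layers, then decode geometry and RGB from the same restored features. Every name here (`run_feature_space_restoration`, `double`, `clamp`, the lambda decoders) is a hypothetical stand-in, and the toy callables only demonstrate the ordering of the stages, not the paper's actual networks.

```python
from typing import Callable, List, Sequence, Tuple

Features = List[List[float]]  # toy stand-in: one feature vector per view


def run_feature_space_restoration(
    views: Sequence[Sequence[float]],
    encoder_layers: Sequence[Callable[[Features], Features]],
    k: int,
    denoiser: Callable[[Features], Features],
    geometry_decoder: Callable[[Features], object],
    rgb_decoder: Callable[[Features], object],
) -> Tuple[object, object]:
    """Encode up to layer k, denoise there, finish encoding, then decode twice."""
    if not 0 <= k <= len(encoder_layers):
        raise ValueError("k must index a position inside the encoder")
    z: Features = [list(v) for v in views]
    for layer in encoder_layers[:k]:   # frozen layers before the denoiser
        z = layer(z)
    z = denoiser(z)                    # degraded features -> restored features
    for layer in encoder_layers[k:]:   # remaining frozen layers
        z = layer(z)
    return geometry_decoder(z), rgb_decoder(z)


# Hypothetical toy stages, chosen only so the flow can be traced by hand.
def double(z: Features) -> Features:
    return [[2.0 * x for x in v] for v in z]


def clamp(z: Features) -> Features:
    return [[min(x, 4.0) for x in v] for v in z]


geometry, rgb = run_feature_space_restoration(
    views=[[0.0, 1.0], [2.0, 3.0]],
    encoder_layers=[double, double],
    k=1,
    denoiser=clamp,
    geometry_decoder=lambda z: [sum(v) for v in z],
    rgb_decoder=lambda z: z,
)
```

Passing all views through each stage together mirrors the multi-view setting, where the encoder and denoiser can exchange information across views; a per-view loop would lose that property.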

If this is right

Degraded multi-view inputs can yield accurate scene geometry after feature-space diffusion.
The same refined features can produce restored high-quality RGB images via a decoder.
No separate degradation model is required for the restoration step.
The method improves robustness on benchmarks, such as DA3, when the input images contain real-world degradations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feature-space strategy might transfer to other feed-forward 3D models that produce geometry-aware representations.
Operating in feature space could prove more efficient than pixel-space diffusion for tasks where geometry is the primary output.
Downstream applications such as robotics or augmented reality that rely on multi-view reconstruction would gain practical robustness.

Load-bearing premise

The geometry-aware features already produced by the base 3D reconstruction model contain enough information to steer diffusion-based recovery of scene geometry under arbitrary real-world degradations.

What would settle it

Running the base 3D reconstructor with and without GARD on the same set of degraded multi-view images and finding no measurable improvement in reconstruction accuracy metrics would falsify the central claim.
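One way to make "no measurable improvement" concrete is a paired comparison over the same scenes. The sketch below is an assumption about how such a check could be set up, not a procedure taken from the paper: it computes per-scene error differences and a percentile bootstrap interval for their mean. The error values are placeholders chosen to show the null case and are not results from the paper.

```python
import random
from statistics import mean


def paired_improvement(errors_base, errors_gard):
    """Per-scene error reduction (positive means the error went down with GARD)."""
    if len(errors_base) != len(errors_gard):
        raise ValueError("both runs must cover the same scenes")
    return [b - g for b, g in zip(errors_base, errors_gard)]


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder numbers for illustration only; NOT results from the paper.
errors_base = [0.30, 0.25, 0.40, 0.35, 0.28]
errors_gard = [0.30, 0.25, 0.40, 0.35, 0.28]  # identical: the "no improvement" case

diffs = paired_improvement(errors_base, errors_gard)
lower, upper = bootstrap_ci(diffs)
no_measurable_gain = lower <= 0.0 <= upper
```

If the interval for the mean difference excludes zero on the positive side, the claim survives this particular test; an interval that contains zero, as in the placeholder case, is the outcome that would count against it.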

Figures

Figures reproduced from arXiv: 2605.26230 by Claire Kim, Hyunhee Park, Jaeeun Lee, Jaewon Min, Jihye Park, Jin Hyeon Kim, Kyoungjin Oh, MinKyu Park, Paul Hyunbin Cho, Seungryong Kim, Yeji Choi.

**Figure 1.** Geometry-Aware Representation Denoising (GARD) framework. Given degraded multiview input images, our approach performs denoising on geometry-aware representations, thereby enabling simultaneous recovery of accurate 3D scene geometry and high-quality multi-view imagery.

**Figure 2.** Comparison of restoration denoising spaces. (a) A restore-then-reconstruct pipeline first performs pixel-space restoration prior to 3D reconstruction. However, performing restoration in a single-view setting [69, 6, 5] or within a heavily compressed VAE-based latent space [34, 23] fails to preserve cross-view consistency and fine-grained geometric details, which often results in suboptimal geometric recons…

**Figure 3.** Overview of the GARD framework. (a) The GARD denoiser S_θ(·) is learned within the representation space of a frozen multi-view encoder [29] to restore degraded intermediate representations z^K_deg into restored representations z^K_res before they are propagated through the remaining encoder layers. The restored representations Z_res are then decoded by their respective decoders to produce geometry prediction…

**Figure 4.** Geometry-aware feature analysis conducted on ETH3D [51]. We evaluate the PCK accuracy of three feature cost volumes [23, 40, 29] under two experimental settings to validate the effectiveness of our proposed denoising space. (a) PCK performance on high-quality (HQ) input images. (b) PCK performance under progressively increasing levels of degradation (mild, moderate, and heavy), demonstrating robustness to …

**Figure 5.** Qualitative results for camera trajectory prediction. We visualize the top-down camera trajectories for degraded multi-view inputs. Compared to the baselines, the proposed GARD produces more accurate and geometrically consistent camera pose trajectories. The black dot indicates the starting camera point. Please zoom in for clearer visualization.

**Figure 6.** Qualitative 3D reconstruction results. We visualize the reconstructed 3D point clouds from degraded multi-view inputs. Compared with baseline approaches, the proposed GARD produces more accurate and geometrically consistent 3D reconstructions. Please zoom in for clearer visualization.

**Figure 7.** Qualitative image restoration results. We visualize restored RGB images from degraded multi-view inputs. Compared with baseline approaches, the proposed GARD effectively recovers high-fidelity multi-view images while preserving fine-grained details. Please zoom in for clearer visualization.

**Figure 8.** Feature similarity analysis across layers. We evaluate the cosine similarity between the restored feature representations and the corresponding clean HQ representations across the multiview encoder layers of the feed-forward 3D reconstruction model [29]. The GARD denoiser is applied at layer K = 18. Across all DA3 benchmark [29] datasets, the feature similarity of the degraded LQ representations (red) pro…

**Figure 9.** Correspondence visualization of feature cost volumes. Cross-view correspondence visualization of feature cost volumes constructed from VAE [23], DINOv2 [40], and DA3 [29] features. Please zoom in for clearer visualization.

**Figure 10.** While SIR-Diff can produce reasonable restorations under mild degradation settings, its …

**Figure 11.** Visualization of target correspondence maps. We visualize the effect of attention alignment training which augments the learning of the attention of the GARD denoiser. Panel labels: (a) Before attention alignment; (b) After attention alignment; Reference (query), View 1, View 2, View 3; Dec. L.9, Dec. L.11, Dec. L.13; Target correspondence map (HQ point cloud).

**Figure 12.** Visualization of attention alignment. We visualize the effect of attention alignment training which augments the learning of the attention of the GARD denoiser.

**Figure 13.** Qualitative camera pose estimation on the DA3 benchmark [29]. We visualize the top-down camera trajectory results for ten input views. The black dot indicates the starting camera point. Please zoom in for clearer visualization. Columns: Input Views, GARD (Ours), Restormer, VAEMVD, HI-Diff, InstructIR, MoCE-IR, VRT, FMA-Net.

**Figure 14.** Qualitative 3D reconstruction results on the DA3 benchmark [29]. We visualize the 3D reconstruction point cloud results for ten input views. Please zoom in for clearer visualization.

**Figure 15.** Qualitative image restoration results on the DA3 benchmark [29]. We visualize three selected views out of ten input views for each dataset. Please zoom in for clearer visualization.

**Figure 16.** Qualitative depth estimation results on the DA3 benchmark [29]. We visualize three selected views out of ten input views for each dataset. Please zoom in for clearer visualization.
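The Figure 4 caption reports PCK (percentage of correct keypoints) for correspondences read out of feature cost volumes. As a minimal sketch of that metric, assuming the common convention that a match counts as correct when it lands within `alpha * max(H, W)` pixels of the ground truth (the paper's exact threshold is not given in the text above), one could write:

```python
import math


def pck(pred_points, gt_points, image_hw, alpha=0.1):
    """Fraction of predicted points within alpha * max(H, W) pixels of ground truth."""
    if len(pred_points) != len(gt_points) or not gt_points:
        raise ValueError("need equally sized, non-empty point lists")
    threshold = alpha * max(image_hw)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt_points)


# Hypothetical points on a 100 x 200 image; the threshold is 0.1 * 200 = 20 px.
pred = [(10, 10), (50, 50), (100, 100), (0, 0)]
gt = [(10, 12), (80, 50), (100, 119), (30, 40)]
score = pck(pred, gt, image_hw=(100, 200), alpha=0.1)  # distances: 2, 30, 19, 50
```

Here `alpha`, `image_hw`, and the point lists are illustrative names and values. Two of the four matches fall within the threshold, so the score is 0.5.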

Original abstract

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GARD puts diffusion inside a 3D reconstructor's features to handle degraded multi-view inputs, but the approach depends on an assumption that the features stay useful under real degradations.

The paper's main move is to run diffusion-based denoising directly in the feature space of a feed-forward 3D reconstruction model rather than in pixel space. It adds an RGB decoder so the same refined features can also produce cleaned images. That is the concrete thing the work contributes.

The design choice makes sense on paper: the reconstructor already encodes geometry, so denoising there should help recover scene structure without needing an explicit degradation model. The abstract states the method clearly and points to experiments on the DA3 benchmark as evidence.

The soft spot is exactly the one in the stress-test note. The feature extractor was trained on clean data, and nothing in the description adds robustness training or an explicit degradation module to it. If those features lose geometric signal on degraded inputs, the diffusion step has nothing reliable to work with. The abstract claims the framework works but supplies no numbers, ablations, or feature analysis that would let a reader check whether the assumption holds. That leaves the central claim unsupported from what is visible.

This is for people already working on feed-forward multi-view reconstruction who need to deal with real-world images. A reader looking for new robustness tricks might pick up the feature-space idea. It is worth sending to peer review because the framing is coherent and the practical problem is real, even though the current evidence is thin and the key precondition needs direct testing.

Referee Report

2 major / 0 minor

Summary. The paper proposes Geometry-Aware Representation Denoising (GARD), a framework that applies diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. It claims this exploits geometry-aware features to recover accurate scene geometry under real-world degradations, and that an additional RGB decoder enables simultaneous high-quality image restoration. The abstract states that comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the framework's effectiveness.

Significance. If the central claim holds and is supported by rigorous quantitative evidence, the approach could offer a practical route to robust multi-view 3D reconstruction by avoiding explicit degradation modeling. The design choice to operate in the reconstructor's feature space is conceptually interesting and could generalize to other feed-forward models, but the current manuscript provides no data to evaluate whether the claimed gains materialize.

major comments (2)

[Abstract] Abstract: the claim that 'comprehensive experiments on the DA3 benchmark demonstrate the effectiveness' is unsupported because the abstract (and visible manuscript) supplies no quantitative results, baselines, ablation studies, or error analysis. This absence makes the central claim unevaluable from the provided text.
[Abstract / Method] Method description (implicit in abstract): the design assumes the base feed-forward 3D model's feature extractor, trained only on clean data, still yields usable geometry-aware representations on arbitrarily degraded inputs. No robustness training, explicit degradation model, or analysis of feature collapse under degradation is described, leaving the precondition for the diffusion process unsupported.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our manuscript. We address each major comment below and note the corresponding revisions.

Referee: [Abstract] Abstract: the claim that 'comprehensive experiments on the DA3 benchmark demonstrate the effectiveness' is unsupported because the abstract (and visible manuscript) supplies no quantitative results, baselines, ablation studies, or error analysis. This absence makes the central claim unevaluable from the provided text.

Authors: We agree that the abstract would benefit from explicit quantitative support to make the effectiveness claim directly evaluable. In the revised manuscript we will update the abstract to include representative metrics, such as PSNR/SSIM gains for image restoration and Chamfer distance or depth error reductions for 3D reconstruction relative to the base feed-forward model and other baselines on the DA3 benchmark.

revision: yes

Referee: [Abstract / Method] Method description (implicit in abstract): the design assumes the base feed-forward 3D model's feature extractor, trained only on clean data, still yields usable geometry-aware representations on arbitrarily degraded inputs. No robustness training, explicit degradation model, or analysis of feature collapse under degradation is described, leaving the precondition for the diffusion process unsupported.

Authors: The diffusion model is trained end-to-end on feature pairs extracted by the frozen base model from clean versus synthetically degraded multi-view inputs; the denoising objective therefore learns to recover geometry-aware features directly from degraded observations without requiring the base extractor to be retrained or an explicit degradation model to be specified. We acknowledge that an analysis of feature degradation (e.g., cosine similarity or reconstruction error before/after diffusion across degradation types) is currently absent and would strengthen the manuscript. We will add this analysis, including visualizations of feature collapse, in the revision.

revision: partial
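The cosine-similarity analysis mentioned in this response (and plotted per encoder layer in Figure 8) can be sketched in a few lines. The layout assumed here, `feats[layer][token]` holding one feature vector per token, and the function names are illustrative choices rather than the paper's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def layerwise_similarity(feats, clean_feats):
    """Mean token-wise cosine similarity to the clean features, one value per layer."""
    per_layer = []
    for layer_feats, layer_clean in zip(feats, clean_feats):
        sims = [cosine_similarity(f, c) for f, c in zip(layer_feats, layer_clean)]
        per_layer.append(sum(sims) / len(sims))
    return per_layer


# Hypothetical one-layer, two-token example: the second token has drifted.
clean = [[[1.0, 0.0], [0.0, 1.0]]]
degraded = [[[1.0, 0.0], [1.0, 0.0]]]
similarity = layerwise_similarity(degraded, clean)
```

Running the same function on degraded features and on restored features, for each degradation type, gives the before/after comparison the response describes; in the toy example the single layer scores 0.5 because one of its two tokens is orthogonal to its clean counterpart.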

Circularity Check

0 steps flagged

No circularity; method description introduces independent framework without self-referential reductions

The provided abstract and context describe GARD as a novel framework that applies diffusion-based restoration directly in the feature space of an existing feed-forward 3D reconstruction model, exploiting its geometry-aware representations. No equations, parameter fittings, derivations, or self-citations are shown that would reduce any prediction or claim to its own inputs by construction. The design choice is presented as an exploitation of pre-existing model properties rather than a self-definitional loop, fitted-input prediction, or ansatz smuggled via citation. The central claim remains an architectural proposal whose validity depends on external empirical validation rather than internal reduction to the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

domain assumption: Feed-forward 3D reconstruction models produce geometry-aware feature representations that can be effectively denoised via diffusion to recover scene geometry.
This premise is invoked as the core justification for operating the diffusion process in feature space rather than image space.

pith-pipeline@v0.9.1-grok · 5740 in / 1258 out tokens · 23423 ms · 2026-06-29T23:06:54.285049+00:00 · methodology

Reference graph

Works this paper leans on

73 extracted references · 20 canonical work pages · 7 internal anchors

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2):153–168, 2016.

[2] Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-view completion models are zero-shot correspondence estimators. arXiv preprint arXiv:2412.09072, 2024.

[3] Andrea Ceccarelli and Francesco Secci. Rgb cameras failures and their effects in autonomous driving applications. IEEE Transactions on Dependable and Secure Computing, 20(4):2731–2745, 2022.

[4] Xiang Chen, Hao Li, Jiangxin Dong, Jinshan Pan, Xin Li, Xin He, Naiwei Chen, Shengyuan Li, Fengning Liu, Haoyi Lv, Haowei Peng, Yilian Zhong, Yuxiang Chen, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Kaibin Chen, Xu Zhang, Xuhui Cao, Jiaqi Ma, Ziqi Wang, Shengkai Hu, Yuning Cui, Huan Zhang, Shi Chen, Bin Ren, Lefei Zhang, Guanglu Dong, Qiyao Z... Lovif 2026 challenge on real-world all-in-one image restoration: Methods and results, 2026.

[5] Zheng Chen, Yulun Zhang, Liu Ding, Xia Bin, Jinjin Gu, Linghe Kong, and Xin Yuan. Hierarchical integration diffusion model for realistic image deblurring. In NeurIPS, 2023.

[6] Marcos V Conde, Gregor Geigle, and Radu Timofte. Instructir: High-quality image restoration following human instructions. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.

[7] Andrei Cramariuc, Lukas Bernreiter, Florian Tschopp, Marius Fehr, Victor Reijgwart, Juan Nieto, Roland Siegwart, and Cesar Cadena. maplab 2.0–a modular and multi-modal mapping framework. IEEE Robotics and Automation Letters, 8(2):520–527, 2022.

[8] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[10] Zheng-Peng Duan, Jiawei Zhang, Xin Jin, Ziheng Zhang, Zheng Xiong, Dongqing Zou, Jimmy Ren, Chun-Le Guo, and Chongyi Li. Dit4sr: Taming diffusion transformer for real-world image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.

[11] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024.

[12] Yasutaka Furukawa, Brian Curless, Steven M Seitz, and Richard Szeliski. Towards internet-scale multi-view stereo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 1434–1441. IEEE, 2010.

[13] Yasutaka Furukawa and Carlos Hernández. Multi-View Stereo: A Tutorial. 2015.

[14] Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012, 2025.

[15] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.

[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022.

[17] Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, and Sainan Liu. Repurposing geometric foundation models for multi-view diffusion. arXiv preprint arXiv:2603.22275, 2026.

[18] Junjun Jiang, Zengyuan Zuo, Gang Wu, Kui Jiang, and Xianming Liu. A survey on all-in-one image restoration: Taxonomy, evaluation and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(12):11892–11911, December 2025.

[19] Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, and Seungryong Kim. Matrix: Mask track alignment for interaction-aware video generation, 2025.

[20] Maciej Kilian, Varun Jampani, and Luke Zettlemoyer. Computational tradeoffs in image synthesis: Diffusion, masked-token, and next-token prediction. arXiv preprint arXiv:2405.13218, 2024.

[21] Jin Hyeon Kim, Paul Hyunbin Cho, Claire Kim, Jaewon Min, Jaeeun Lee, Jihye Park, Yeji Choi, and Seungryong Kim. Unified diffusion transformer for high-fidelity text-aware image restoration, 2025.

[22] Michael D Kim and Jun Ueda. Real-time image de-blurring and image processing for a robotic vision system. In 2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1899–1904. IEEE, 2015.

[23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[24] Amandeep Kumar and Vishal M Patel. Learning on the manifold: Unlocking standard diffusion transformers with representation encoders. arXiv preprint arXiv:2602.10099, 2026.

[25] Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Min-Seop Kwak, Jin-Hwa Kim, and Seungryong Kim. Cameo: Correspondence-attention alignment for multi-view diffusion models. arXiv preprint arXiv:2512.03045, 2025.

[26] Gyuseong Lee, Wooseok Jang, Jin Hyeon Kim, Jaewoo Jung, and Seungryong Kim. Domain generalization using large pretrained models with mixture-of-adapters. arXiv preprint arXiv:2310.11031, 2023.

[27] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024.

[28] Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, and Luc Van Gool. Vrt: A video restoration transformer. IEEE Transactions on Image Processing, 33:2171–2182, 2024.

[29] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

[30] Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Diffbir: Towards blind image restoration with generative diffusion prior, 2024.

[31] Zhanwen Liu, Yuhang Li, Yang Wang, Bolin Gao, Yisheng An, and Xiangmo Zhao. Boosting visual recognition in real-world degradations via unsupervised feature enhancement module with deep channel prior. IEEE Transactions on Intelligent Vehicles, 2024.

[32] Chen-Chou Lo and Patrick Vandewalle. Depth estimation from monocular images and sparse radar using deep ordinal regression network. In 2021 IEEE International Conference on Image Processing (ICIP), pages 3343–3347, 2021.

[33] Alexandra Malyugina, Yini Li, Joanne Lin, and Nantheera Anantrasirichai. Unsupervised methods for video quality improvement: a survey of restoration and enhancement techniques. arXiv preprint arXiv:2507.08375, 2025.

[34] Yucheng Mao, Boyang Wang, Nilesh Kulkarni, and Jeong Joon Park. Sir-diff: Sparse image sets restoration with multi-view diffusion model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21620–21630, 2025.

[35] Jaewon Min, Jin Hyeon Kim, Paul Hyunbin Cho, Jaeeun Lee, Jihye Park, Minkyu Park, Sangpil Kim, Hyunhee Park, and Seungryong Kim. Text-aware image restoration with diffusion models. arXiv preprint arXiv:2506.09993, 2025.

[36] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3883–3891, 2017.

[37] Seungjun Nah, Tae Hyun Kim, and Kyoung Mu Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring, 2018.

[38] Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin, Junhwa Hur, and Seungryong Kim. Emergent temporal correspondences from video diffusion transformers, 2025.

[39] Bunyo Okumura, Masayuki Kanbara, and Naokazu Yokoya. Augmented reality based on estimation of defocusing and motion blurring from captured images. In 2006 IEEE/ACM International Symposium on Mixed and Augmented Reality, pages 219–225. IEEE, 2006.

[40] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

[41] Tianbo Pan, Xingyi Yang, Shizun Wang, and Xinchao Wang. Visual geometry transformer in the wild: Distractor-free 3d reconstruction.

[42] Youngmin Park, Vincent Lepetit, and Woontack Woo. Handling motion-blur in 3d tracking and rendering for augmented reality. IEEE transactions on visualization and computer graphics, 18(9):1449–1459, 2011.

[43] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023.

[44] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[45] Jaesung Rim, Haeyun Lee, Jucheol Won, and Sunghyun Cho. Real-world blur dataset for learning and benchmarking deblurring algorithms. In European conference on computer vision, pages 184–201. Springer, 2020.

[46] Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021.

[47] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

2022
[48]

Kimera: an open-source library for real-time metric-semantic localization and mapping

Antoni Rosinol, Marcus Abate, Yun Chang, and Luca Carlone. Kimera: an open-source library for real-time metric-semantic localization and mapping. In2020 IEEE international conference on robotics and automation (ICRA), pages 1689–1696. IEEE, 2020. 14

2020
[49]

Structure-from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

2016
[50]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InPro- ceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

2016
[51]

A multi-view stereo benchmark with high-resolution images and multi-camera videos

Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

2017
[52]

Loftr: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8922–8931, June 2021

2021
[53]

Masked depth modeling for spatial perception.arXiv preprint arXiv:[2601.17895], 2026

Bin Tan, Changjiang Sun, Xiage Qin, Hanat Adai, Zelin Fu, Tianxiang Zhou, Han Zhang, Yinghao Xu, Xing Zhu, Yujun Shen, and Nan Xue. Masked depth modeling for spatial perception.arXiv preprint arXiv:[2601.17895], 2026

work page arXiv 2026
[54]

A fast local descriptor for dense matching

Engin Tola, Vincent Lepetit, and Pascal Fua. A fast local descriptor for dense matching. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008

2008
[55]

Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint arXiv:2601.16208, 2026

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint arXiv:2601.16208, 2026

work page arXiv 2026
[56]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

An overview of autonomous vehicles sensors and their vulnerability to weather conditions.Sensors, 21(16):5397, 2021

Jorge Vargas, Suleiman Alsweiss, Onur Toker, Rahul Razdan, and Joshua Santos. An overview of autonomous vehicles sensors and their vulnerability to weather conditions.Sensors, 21(16):5397, 2021

2021
[58]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[59]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

2025
[60]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

work page arXiv 2025
[61]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

2024
[62]

Tartanair: A dataset to push the limits of visual slam

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020

2020
[63]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-Equivariant Visual Geometry Learning.arXiv preprint arXiv:2507.13347, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

Interactive real-time motion blur.The Visual Computer, 12(6):283–295, 1996

Matthias M Wloka and Robert C Zeleznik. Interactive real-time motion blur.The Visual Computer, 12(6):283–295, 1996. 15

1996
[65]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimiza- tion dilemma in latent diffusion models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025

2025
[66]

Scannet++: A high- fidelity dataset of 3d indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high- fidelity dataset of 3d indoor scenes. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

2023
[67]

Fma-net: Flow-guided dynamic filtering and iterative feature refinement with multi-attention for joint video super-resolution and deblurring

Geunhyuk Youk, Jihyong Oh, and Munchurl Kim. Fma-net: Flow-guided dynamic filtering and iterative feature refinement with multi-attention for joint video super-resolution and deblurring. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 44–55, June 2024

2024
[68]

Complexity experts are task-discriminative learners for any image restoration, 2024

Eduard Zamfir, Zongwei Wu, Nancy Mehta, Yuedong Tan, Danda Pani Paudel, Yulun Zhang, and Radu Timofte. Complexity experts are task-discriminative learners for any image restoration, 2024

2024
[69]

Restormer: Efficient transformer for high-resolution image restoration

Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739, 2022

2022
[70]

Ffdnet: Toward a fast and flexible solution for CNN based image denoising.IEEE Transactions on Image Processing, 2018

Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for CNN based image denoising.IEEE Transactions on Image Processing, 2018

2018
[71]

Diffusion Transformers with Representation Autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[72]

Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness.arXiv preprint arXiv:2409.18125, 2024

work page arXiv 2024
[73]

Springer International Publishing, Cham, 2018

Wangmeng Zuo, Kai Zhang, and Lei Zhang.Convolutional Neural Networks for Image Denoising and Restoration, pages 93–123. Springer International Publishing, Cham, 2018. 16 Appendix A Extended Experimental Results A.1 Feature Similarity Analysis To further validate the effectiveness of the proposed GARD framework, we investigate the feature similarity across...

2018
