Recognition: no theorem link
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Pith reviewed 2026-05-13 02:31 UTC · model grok-4.3
The pith
Lite3R replaces dense attention in 3D reconstruction transformers with Sparse Linear Attention and adds FP8-aware training, cutting latency by 1.7-2.0x and memory by 1.9-2.4x while preserving reconstruction quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lite3R is a teacher-student framework that replaces dense multi-view attention with Sparse Linear Attention to cut token-mixing cost, and it adds FP8-aware QAT with partial attention distillation that trains only lightweight linear-branch layers while freezing the backbone. This enables stable low-precision execution that retains pretrained geometric priors and delivers 1.7-2.0x lower latency and 1.9-2.4x lower memory on VGGT and DA3-Large, without sacrificing overall reconstruction quality on BlendedMVS and DTU64.
What carries the argument
Sparse Linear Attention that preserves geometric cross-view interactions at reduced cost, combined with FP8-aware quantization-aware training and partial attention distillation that freezes most backbone parameters and updates only lightweight projection layers.
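To make the token-mixing idea concrete, here is a minimal sketch of a sparse-plus-linear attention mix. The paper's exact SLA formulation, mask construction, and branch-combination rule are not reproduced here, so the shapes, names, and the gating scheme below are assumptions for illustration only.

```python
# Minimal sketch of a sparse + linear attention mix (hypothetical, not Lite3R's exact SLA).
# The linear branch uses the elu(x)+1 feature map from kernelized linear attention;
# the sparse branch runs exact softmax attention only over a caller-supplied mask.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (B, H, N, D). Cost scales as O(N * D^2) instead of O(N^2 * D).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                       # (B, H, D, D)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

def masked_softmax_attention(q, k, v, mask):
    # mask: (B, H, N, N) boolean, True = keep; assumed to keep at least the diagonal
    # so no query row is fully masked. Dense math shown for clarity; a real
    # implementation would use a block-sparse kernel.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def sparse_linear_attention(q, k, v, sparse_mask, gate):
    # gate: scalar (or per-head) mixing weight in [0, 1]; an illustrative assumption,
    # not necessarily how Lite3R combines the two branches.
    sparse_out = masked_softmax_attention(q, k, v, sparse_mask)
    linear_out = linear_attention(q, k, v)
    return gate * sparse_out + (1.0 - gate) * linear_out
```

The point of the split is that the linear branch handles bulk token mixing at linear cost in the number of tokens, while the masked branch keeps exact attention only for the cross-view token pairs marked as geometrically important.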
If this is right
- Large transformer 3D models can run on hardware with tighter memory and compute budgets.
- The same efficiency gains apply to other backbones without requiring architecture-specific redesign.
- Reconstruction quality on standard benchmarks stays competitive, so depth maps and camera poses remain usable.
- The co-design of attention sparsity and low-precision training offers a repeatable path for scaling feed-forward 3D pipelines.
Where Pith is reading between the lines
- The partial-distillation trick could transfer to other quantization bit-widths or compression techniques in vision transformers.
- Lower memory footprint may let practitioners increase input resolution or number of input views without new hardware.
- The approach may generalize to related multi-view tasks such as novel-view synthesis where attention cost is also a bottleneck.
Load-bearing premise
That sparse linear attention still captures the important geometric relations across views and that the partial FP8 training keeps depth, pose, and 3D consistency intact without full model retraining.
What would settle it
Running the original and Lite3R versions side by side on scenes where the dense-attention model produces accurate point clouds, and checking whether the sparse version shows measurable increases in depth error or pose inconsistency; a minimal per-scene check is sketched below.
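A minimal sketch of that per-scene comparison, assuming depth maps and rotation matrices can be extracted from both models' outputs; the actual output format of VGGT/DA3-Large is not reproduced here, and the tolerances are placeholders.

```python
# Hypothetical sketch of the side-by-side test: compare the dense-attention teacher
# and the Lite3R student on the same scenes and flag measurable regressions.
import numpy as np

def abs_rel(pred_depth, gt_depth, valid):
    # Absolute relative depth error over valid ground-truth pixels.
    return np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid])

def rot_angle_deg(R_pred, R_gt):
    # Geodesic distance between two rotation matrices, in degrees.
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def compare_scene(teacher_out, student_out, gt, depth_tol=0.01, rot_tol=0.5):
    # teacher_out, student_out, gt are assumed dicts with "depth" (H, W) and "R" (3, 3).
    valid = gt["depth"] > 0
    d_teacher = abs_rel(teacher_out["depth"], gt["depth"], valid)
    d_student = abs_rel(student_out["depth"], gt["depth"], valid)
    r_teacher = rot_angle_deg(teacher_out["R"], gt["R"])
    r_student = rot_angle_deg(student_out["R"], gt["R"])
    return {
        "depth_regression": d_student - d_teacher > depth_tol,
        "pose_regression": r_student - r_teacher > rot_tol,
    }
```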
Original abstract
Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lite3R, a model-agnostic teacher-student framework for efficient feed-forward 3D reconstruction. It replaces dense multi-view attention in transformer backbones (e.g., VGGT, DA3-Large) with Sparse Linear Attention and introduces FP8-aware quantization-aware training (QAT) with partial attention distillation that freezes most pretrained weights. Experiments on BlendedMVS and DTU64 report 1.7-2.0× latency reduction and 1.9-2.4× memory reduction while claiming competitive reconstruction quality.
Significance. If the efficiency claims hold with preserved geometric fidelity, the work provides a practical algorithm-system co-design for deploying large transformer-based 3D models. The model-agnostic framing and parameter-efficient adaptation strategy are strengths that could generalize beyond the tested backbones.
Major comments (3)
- [§3.1-3.2] Sparse Linear Attention formulation: the central claim that the sparsity pattern 'preserves important geometric interactions' across views is load-bearing for the quality-preservation argument, yet the manuscript provides no direct ablation measuring cross-view correspondence accuracy, epipolar consistency, or attention-map fidelity between dense and sparse variants. Aggregate metrics alone do not rule out degradation in long-range view-dependent geometry.
- [§4.3 and Table 3] FP8-aware QAT with partial distillation: the assertion that freezing the backbone and training only lightweight projections retains pretrained geometric priors is not supported by a controlled ablation isolating the effect of FP8 quantization on depth/pose accuracy before vs. after distillation. Without this, it remains unclear whether the reported quality is due to the method or to the specific datasets' tolerance.
- [Table 1 and §5.1] Latency/memory results: the 1.7-2.0× latency and 1.9-2.4× memory reductions are presented without error bars, multiple random seeds, or hardware-variation controls, undermining the reproducibility of the efficiency claims that constitute the paper's primary contribution.
Minor comments (2)
- [Abstract and §1] The phrase 'competitive reconstruction quality overall' is used without immediately defining the primary metrics (e.g., Abs Rel, δ<1.25, Chamfer distance) or the exact baselines (dense VGGT/DA3-Large vs. Lite3R).
- [Figure 4] The qualitative results would benefit from side-by-side error maps or zoomed insets highlighting any residual geometric artifacts introduced by the sparse attention.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and proposing targeted revisions to strengthen the paper where appropriate.
Point-by-point responses
Referee: [§3.1-3.2] Sparse Linear Attention formulation: the central claim that the sparsity pattern 'preserves important geometric interactions' across views is load-bearing for the quality-preservation argument, yet the manuscript provides no direct ablation measuring cross-view correspondence accuracy, epipolar consistency, or attention-map fidelity between dense and sparse variants. Aggregate metrics alone do not rule out degradation in long-range view-dependent geometry.
Authors: We agree that direct evidence on geometric interaction preservation would strengthen the argument. Our evaluation relies on end-to-end metrics (depth accuracy, pose estimation, and novel view synthesis on BlendedMVS and DTU64) that are sensitive to cross-view geometry errors; significant degradation would manifest in these benchmarks. However, we acknowledge the value of more targeted analysis. In the revised version, we will add qualitative attention-map visualizations comparing dense and sparse attention in §3.2, along with a quantitative epipolar consistency check on a subset of multi-view pairs in the supplementary material. This provides direct support without altering the core method.
Revision: partial
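For concreteness, here is one way such an epipolar consistency check could be scored, assuming ground-truth intrinsics and relative pose are available for each view pair. This is a hedged sketch of a standard symmetric epipolar distance, not the authors' protocol.

```python
# Sketch of an epipolar-consistency score for predicted correspondences between two views,
# given ground-truth intrinsics (K1, K2) and relative pose (R, t) mapping camera 1 to camera 2.
import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_pose(K1, K2, R, t):
    # F maps homogeneous points in image 1 to epipolar lines in image 2.
    E = skew(t) @ R
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def symmetric_epipolar_distance(F, pts1, pts2):
    # pts1, pts2: (N, 2) matched pixel coordinates predicted by a model.
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])             # homogeneous coordinates, image 1
    x2 = np.hstack([pts2, ones])             # homogeneous coordinates, image 2
    Fx1 = x1 @ F.T                           # epipolar lines in image 2
    Ftx2 = x2 @ F                            # epipolar lines in image 1
    num = np.abs(np.sum(x2 * Fx1, axis=1))   # |x2^T F x1|
    d = (num / np.sqrt(Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2)
         + num / np.sqrt(Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2))
    return d.mean()
```

Reporting this distance for both the dense teacher and the sparse student over the same view pairs would directly test the "geometric interactions are preserved" claim.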
Referee: [§4.3 and Table 3] FP8-aware QAT with partial distillation: the assertion that freezing the backbone and training only lightweight projections retains pretrained geometric priors is not supported by a controlled ablation isolating the effect of FP8 quantization on depth/pose accuracy before vs. after distillation. Without this, it remains unclear whether the reported quality is due to the method or to the specific datasets' tolerance.
Authors: We appreciate this observation. The partial distillation strategy is intended to retain priors by freezing the backbone while adapting only the projection layers for FP8 stability. The competitive quality relative to full-precision baselines supports this, but a controlled before/after ablation would clarify the quantization impact. We will add this ablation to §4.3 in the revision, reporting depth and pose accuracy for FP8 with and without the distillation step on the same backbones and datasets.
Revision: yes
Referee: [Table 1 and §5.1] Latency/memory results: the 1.7-2.0× latency and 1.9-2.4× memory reductions are presented without error bars, multiple random seeds, or hardware-variation controls, undermining the reproducibility of the efficiency claims that constitute the paper's primary contribution.
Authors: We agree that reproducibility details are important for the efficiency claims. The reported factors were measured on a fixed hardware configuration (NVIDIA A100 GPUs) with consistent input resolutions and batch sizes as described in §5.1. To improve this, the revised manuscript will include error bars derived from 5 independent runs with different random seeds in Table 1, along with explicit hardware specifications and a brief discussion of expected variation across platforms in §5.1.
Revision: yes
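A minimal sketch of a measurement harness consistent with that protocol, with warmup, repeated timed runs, and peak-memory tracking. This is a hypothetical helper, not the authors' benchmarking code; resolution, view count, and batch size would follow the paper's §5.1.

```python
# Sketch of GPU latency / peak-memory measurement with warmup and repeated runs.
import torch

@torch.no_grad()
def benchmark(model, views, warmup=5, repeats=5):
    model.eval().cuda()
    views = views.cuda()
    for _ in range(warmup):                   # warm up kernels and the allocator
        model(views)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    times = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(views)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    times = torch.tensor(times)
    return {
        "latency_ms_mean": times.mean().item(),
        "latency_ms_std": times.std().item(),
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```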
Circularity Check
No significant circularity; empirical framework validated on external benchmarks
Full rationale
The paper presents an empirical framework (Lite3R) that substitutes Sparse Linear Attention for dense multi-view attention and applies FP8-aware QAT with partial distillation, then reports measured latency/memory reductions and reconstruction metrics on public datasets (BlendedMVS, DTU64) using two external backbones (VGGT, DA3-Large). No derivation chain reduces a claimed result to its own fitted parameters or self-referential definitions; the central claims are performance deltas obtained from direct evaluation rather than algebraic identities or self-citation load-bearing premises. The method's assumptions about geometric preservation are tested rather than presupposed as outputs.
Appendix A: Sparse Linear Attention (SLA) summary
Sparse Linear Attention (SLA) is the lightweight attention module used to construct the Lite3R student. As described in the main method, ...
FP8-aware QAT recipe (excerpt)
1. Baseline Training. Train or fine-tune the model at its original precision (typically FP32, FP16, or BF16) until convergence. This establishes a strong baseline and ensures the model has learned the task-specific representations.
2. FP8 QAT Fine-tuning. Replace all target linear layers with FP8 fake-quantization layers and continue training for a small number of epochs (typically 1-5). During this stage:
- Forward pass: both weights and activations are quantized to FP8 E4M3 with dynamic per-tensor scaling.
- Backward pass: gradients flow through the STE as if no quantization had occurred, ...
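A minimal sketch of what one FP8 fake-quantization layer with a straight-through estimator could look like, assuming PyTorch's torch.float8_e4m3fn dtype is available (PyTorch >= 2.1). This illustrates the recipe above and is not the authors' implementation; layer names and scaling details are assumptions.

```python
# Sketch of FP8 (E4M3) fake quantization with dynamic per-tensor scaling and an STE.
import torch
import torch.nn.functional as F

class FakeQuantFP8(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        amax = x.abs().max().clamp(min=1e-12)
        scale = 448.0 / amax                           # 448 is the max normal value of E4M3
        x_q = (x * scale).to(torch.float8_e4m3fn)      # round-trip through FP8
        return x_q.to(x.dtype) / scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                                # STE: pass gradients through unchanged

class FP8Linear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuantFP8.apply(self.weight)          # fake-quantize weights
        x_q = FakeQuantFP8.apply(x)                    # fake-quantize activations
        return F.linear(x_q, w_q, self.bias)
```

In the partial-distillation setting described in the review, only a small set of such lightweight projection layers would be trainable while the rest of the backbone stays frozen.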