Recognition: no theorem link
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Pith reviewed 2026-05-13 02:31 UTC · model grok-4.3
The pith
Lite3R replaces dense attention in 3D reconstruction transformers with Sparse Linear Attention and adds FP8-aware training, cutting latency by 1.7-2.0x and memory by 1.9-2.4x while preserving reconstruction quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lite3R is a teacher-student framework that replaces dense multi-view attention with Sparse Linear Attention to cut token-mixing cost, and it adds FP8-aware QAT with partial attention distillation that trains only lightweight linear-branch layers while freezing the backbone. This enables stable low-precision execution that retains pretrained geometric priors and delivers 1.7-2.0x lower latency and 1.9-2.4x lower memory on VGGT and DA3-Large, without sacrificing overall reconstruction quality on BlendedMVS and DTU64.
What carries the argument
Sparse Linear Attention that preserves geometric cross-view interactions at reduced cost, combined with FP8-aware quantization-aware training and partial attention distillation that freezes most backbone parameters and updates only lightweight projection layers.
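To make the token-mixing idea concrete, here is a minimal sketch of a sparse-plus-linear attention mix. The paper's exact SLA formulation, mask construction, and branch-combination rule are not reproduced here, so the shapes, names, and the gating scheme below are assumptions for illustration only.

```python
# Minimal sketch of a sparse + linear attention mix (hypothetical, not Lite3R's exact SLA).
# The linear branch uses the elu(x)+1 feature map from kernelized linear attention;
# the sparse branch runs exact softmax attention only over a caller-supplied mask.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (B, H, N, D). Cost scales as O(N * D^2) instead of O(N^2 * D).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                       # (B, H, D, D)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

def masked_softmax_attention(q, k, v, mask):
    # mask: (B, H, N, N) boolean, True = keep; assumed to keep at least the diagonal
    # so no query row is fully masked. Dense math shown for clarity; a real
    # implementation would use a block-sparse kernel.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def sparse_linear_attention(q, k, v, sparse_mask, gate):
    # gate: scalar (or per-head) mixing weight in [0, 1]; an illustrative assumption,
    # not necessarily how Lite3R combines the two branches.
    sparse_out = masked_softmax_attention(q, k, v, sparse_mask)
    linear_out = linear_attention(q, k, v)
    return gate * sparse_out + (1.0 - gate) * linear_out
```

The point of the split is that the linear branch handles bulk token mixing at linear cost in the number of tokens, while the masked branch keeps exact attention only for the cross-view token pairs marked as geometrically important.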
If this is right
- Large transformer 3D models can run on hardware with tighter memory and compute budgets.
- The same efficiency gains apply to other backbones without requiring architecture-specific redesign.
- Reconstruction quality on standard benchmarks stays competitive, so depth maps and camera poses remain usable.
- The co-design of attention sparsity and low-precision training offers a repeatable path for scaling feed-forward 3D pipelines.
Where Pith is reading between the lines
- The partial-distillation trick could transfer to other quantization bit-widths or compression techniques in vision transformers.
- Lower memory footprint may let practitioners increase input resolution or number of input views without new hardware.
- The approach may generalize to related multi-view tasks such as novel-view synthesis where attention cost is also a bottleneck.
Load-bearing premise
That sparse linear attention still captures the important geometric relations across views and that the partial FP8 training keeps depth, pose, and 3D consistency intact without full model retraining.
What would settle it
Running the original and Lite3R versions side by side on scenes where the dense-attention model produces accurate point clouds, and checking whether the sparse version shows measurable increases in depth error or pose inconsistency; a minimal per-scene check is sketched below.
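A minimal sketch of that per-scene comparison, assuming depth maps and rotation matrices can be extracted from both models' outputs; the actual output format of VGGT/DA3-Large is not reproduced here, and the tolerances are placeholders.

```python
# Hypothetical sketch of the side-by-side test: compare the dense-attention teacher
# and the Lite3R student on the same scenes and flag measurable regressions.
import numpy as np

def abs_rel(pred_depth, gt_depth, valid):
    # Absolute relative depth error over valid ground-truth pixels.
    return np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid])

def rot_angle_deg(R_pred, R_gt):
    # Geodesic distance between two rotation matrices, in degrees.
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def compare_scene(teacher_out, student_out, gt, depth_tol=0.01, rot_tol=0.5):
    # teacher_out, student_out, gt are assumed dicts with "depth" (H, W) and "R" (3, 3).
    valid = gt["depth"] > 0
    d_teacher = abs_rel(teacher_out["depth"], gt["depth"], valid)
    d_student = abs_rel(student_out["depth"], gt["depth"], valid)
    r_teacher = rot_angle_deg(teacher_out["R"], gt["R"])
    r_student = rot_angle_deg(student_out["R"], gt["R"])
    return {
        "depth_regression": d_student - d_teacher > depth_tol,
        "pose_regression": r_student - r_teacher > rot_tol,
    }
```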
Original abstract
Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lite3R, a model-agnostic teacher-student framework for efficient feed-forward 3D reconstruction. It replaces dense multi-view attention in transformer backbones (e.g., VGGT, DA3-Large) with Sparse Linear Attention and introduces FP8-aware quantization-aware training (QAT) with partial attention distillation that freezes most pretrained weights. Experiments on BlendedMVS and DTU64 report 1.7-2.0× latency reduction and 1.9-2.4× memory reduction while claiming competitive reconstruction quality.
Significance. If the efficiency claims hold with preserved geometric fidelity, the work provides a practical algorithm-system co-design for deploying large transformer-based 3D models. The model-agnostic framing and parameter-efficient adaptation strategy are strengths that could generalize beyond the tested backbones.
Major comments (3)
- [§3.1-3.2] Sparse Linear Attention formulation: the central claim that the sparsity pattern 'preserves important geometric interactions' across views is load-bearing for the quality-preservation argument, yet the manuscript provides no direct ablation measuring cross-view correspondence accuracy, epipolar consistency, or attention-map fidelity between dense and sparse variants. Aggregate metrics alone do not rule out degradation in long-range view-dependent geometry.
- [§4.3 and Table 3] FP8-aware QAT with partial distillation: the assertion that freezing the backbone and training only lightweight projections retains pretrained geometric priors is not supported by a controlled ablation isolating the effect of FP8 quantization on depth/pose accuracy before vs. after distillation. Without this, it remains unclear whether the reported quality is due to the method or to the specific datasets' tolerance.
- [Table 1 and §5.1] Latency/memory results: the 1.7-2.0× latency and 1.9-2.4× memory reductions are presented without error bars, multiple random seeds, or hardware-variation controls, undermining the reproducibility of the efficiency claims that constitute the paper's primary contribution.
Minor comments (2)
- [Abstract and §1] The phrase 'competitive reconstruction quality overall' is used without immediately defining the primary metrics (e.g., Abs Rel, δ<1.25, Chamfer distance) or the exact baselines (dense VGGT/DA3-Large vs. Lite3R).
- [Figure 4] The qualitative results would benefit from side-by-side error maps or zoomed insets highlighting any residual geometric artifacts introduced by the sparse attention.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and proposing targeted revisions to strengthen the paper where appropriate.
Point-by-point responses
Referee: [§3.1-3.2] Sparse Linear Attention formulation: the central claim that the sparsity pattern 'preserves important geometric interactions' across views is load-bearing for the quality-preservation argument, yet the manuscript provides no direct ablation measuring cross-view correspondence accuracy, epipolar consistency, or attention-map fidelity between dense and sparse variants. Aggregate metrics alone do not rule out degradation in long-range view-dependent geometry.
Authors: We agree that direct evidence on geometric interaction preservation would strengthen the argument. Our evaluation relies on end-to-end metrics (depth accuracy, pose estimation, and novel view synthesis on BlendedMVS and DTU64) that are sensitive to cross-view geometry errors; significant degradation would manifest in these benchmarks. However, we acknowledge the value of more targeted analysis. In the revised version, we will add qualitative attention-map visualizations comparing dense and sparse attention in §3.2, along with a quantitative epipolar consistency check on a subset of multi-view pairs in the supplementary material. This provides direct support without altering the core method.
Revision: partial
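For concreteness, here is one way such an epipolar consistency check could be scored, assuming ground-truth intrinsics and relative pose are available for each view pair. This is a hedged sketch of a standard symmetric epipolar distance, not the authors' protocol.

```python
# Sketch of an epipolar-consistency score for predicted correspondences between two views,
# given ground-truth intrinsics (K1, K2) and relative pose (R, t) mapping camera 1 to camera 2.
import numpy as np

def skew(t):
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

def fundamental_from_pose(K1, K2, R, t):
    # F maps homogeneous points in image 1 to epipolar lines in image 2.
    E = skew(t) @ R
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def symmetric_epipolar_distance(F, pts1, pts2):
    # pts1, pts2: (N, 2) matched pixel coordinates predicted by a model.
    ones = np.ones((pts1.shape[0], 1))
    x1 = np.hstack([pts1, ones])             # homogeneous coordinates, image 1
    x2 = np.hstack([pts2, ones])             # homogeneous coordinates, image 2
    Fx1 = x1 @ F.T                           # epipolar lines in image 2
    Ftx2 = x2 @ F                            # epipolar lines in image 1
    num = np.abs(np.sum(x2 * Fx1, axis=1))   # |x2^T F x1|
    d = (num / np.sqrt(Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2)
         + num / np.sqrt(Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2))
    return d.mean()
```

Reporting this distance for both the dense teacher and the sparse student over the same view pairs would directly test the "geometric interactions are preserved" claim.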
Referee: [§4.3 and Table 3] FP8-aware QAT with partial distillation: the assertion that freezing the backbone and training only lightweight projections retains pretrained geometric priors is not supported by a controlled ablation isolating the effect of FP8 quantization on depth/pose accuracy before vs. after distillation. Without this, it remains unclear whether the reported quality is due to the method or to the specific datasets' tolerance.
Authors: We appreciate this observation. The partial distillation strategy is intended to retain priors by freezing the backbone while adapting only the projection layers for FP8 stability. The competitive quality relative to full-precision baselines supports this, but a controlled before/after ablation would clarify the quantization impact. We will add this ablation to §4.3 in the revision, reporting depth and pose accuracy for FP8 with and without the distillation step on the same backbones and datasets.
Revision: yes
Referee: [Table 1 and §5.1] Latency/memory results: the 1.7-2.0× latency and 1.9-2.4× memory reductions are presented without error bars, multiple random seeds, or hardware-variation controls, undermining the reproducibility of the efficiency claims that constitute the paper's primary contribution.
Authors: We agree that reproducibility details are important for the efficiency claims. The reported factors were measured on a fixed hardware configuration (NVIDIA A100 GPUs) with consistent input resolutions and batch sizes as described in §5.1. To improve this, the revised manuscript will include error bars derived from 5 independent runs with different random seeds in Table 1, along with explicit hardware specifications and a brief discussion of expected variation across platforms in §5.1.
Revision: yes
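A minimal sketch of a measurement harness consistent with that protocol, with warmup, repeated timed runs, and peak-memory tracking. This is a hypothetical helper, not the authors' benchmarking code; resolution, view count, and batch size would follow the paper's §5.1.

```python
# Sketch of GPU latency / peak-memory measurement with warmup and repeated runs.
import torch

@torch.no_grad()
def benchmark(model, views, warmup=5, repeats=5):
    model.eval().cuda()
    views = views.cuda()
    for _ in range(warmup):                   # warm up kernels and the allocator
        model(views)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    times = []
    for _ in range(repeats):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        model(views)
        end.record()
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))  # milliseconds
    times = torch.tensor(times)
    return {
        "latency_ms_mean": times.mean().item(),
        "latency_ms_std": times.std().item(),
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
    }
```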
Circularity Check
No significant circularity; empirical framework validated on external benchmarks
Full rationale
The paper presents an empirical framework (Lite3R) that substitutes Sparse Linear Attention for dense multi-view attention and applies FP8-aware QAT with partial distillation, then reports measured latency/memory reductions and reconstruction metrics on public datasets (BlendedMVS, DTU64) using two external backbones (VGGT, DA3-Large). No derivation chain reduces a claimed result to its own fitted parameters or self-referential definitions; the central claims are performance deltas obtained from direct evaluation rather than algebraic identities or self-citation load-bearing premises. The method's assumptions about geometric preservation are tested rather than presupposed as outputs.
Appendix A: Sparse Linear Attention (SLA) summary
Sparse Linear Attention (SLA) is the lightweight attention module used to construct the Lite3R student. As described in the main method, ...
FP8-aware QAT recipe (excerpt)
1. Baseline Training. Train or fine-tune the model at its original precision (typically FP32, FP16, or BF16) until convergence. This establishes a strong baseline and ensures the model has learned the task-specific representations.
2. FP8 QAT Fine-tuning. Replace all target linear layers with FP8 fake-quantization layers and continue training for a small number of epochs (typically 1-5). During this stage:
- Forward pass: both weights and activations are quantized to FP8 E4M3 with dynamic per-tensor scaling.
- Backward pass: gradients flow through the STE as if no quantization had occurred, ...
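A minimal sketch of what one FP8 fake-quantization layer with a straight-through estimator could look like, assuming PyTorch's torch.float8_e4m3fn dtype is available (PyTorch >= 2.1). This illustrates the recipe above and is not the authors' implementation; layer names and scaling details are assumptions.

```python
# Sketch of FP8 (E4M3) fake quantization with dynamic per-tensor scaling and an STE.
import torch
import torch.nn.functional as F

class FakeQuantFP8(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        amax = x.abs().max().clamp(min=1e-12)
        scale = 448.0 / amax                           # 448 is the max normal value of E4M3
        x_q = (x * scale).to(torch.float8_e4m3fn)      # round-trip through FP8
        return x_q.to(x.dtype) / scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                                # STE: pass gradients through unchanged

class FP8Linear(torch.nn.Linear):
    def forward(self, x):
        w_q = FakeQuantFP8.apply(self.weight)          # fake-quantize weights
        x_q = FakeQuantFP8.apply(x)                    # fake-quantize activations
        return F.linear(x_q, w_q, self.bias)
```

In the partial-distillation setting described in the review, only a small set of such lightweight projection layers would be trainable while the rest of the backbone stays frozen.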