QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

Hesong Wang; Huan Wang; Zhizhen Pan

arxiv: 2605.31124 · v1 · pith:RPLBO3KVnew · submitted 2026-05-29 · 💻 cs.CV


Pith reviewed 2026-06-28 22:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords post-training quantization · visual geometry transformer · 3D reconstruction · mixed-precision · transformer compression · edge deployment · geometric consistency


The pith

QVGGT quantizes the 1.2B-parameter VGGT model to W4A16 precision while keeping all 3D prediction heads accurate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a post-training quantization method that compresses VGGT, a transformer that predicts camera parameters, depth maps, and point clouds from images in a single forward pass. It begins by measuring how well each transformer block tolerates reduced precision and assigns higher bit widths only to the most sensitive blocks. High-variance camera and register tokens are excluded from activation calibration, and their information is restored through a PCA-derived compensation token. Scale factors are then chosen by checking not only layer-wise reconstruction error but also consistency across the three output heads. The reported result is near-lossless accuracy at 4-bit weights and 16-bit activations, together with large cuts in memory and runtime on real hardware. For readers interested in running 3D geometry models on UAVs or phones, this offers a path to deployment without retraining.

Core claim

QVGGT achieves near-lossless W4A16 quantization of VGGT through three coordinated steps: per-block sensitivity analysis that selects mixed precision for fragile transformer blocks, token filtering that excludes high-variance camera and register tokens from calibration while restoring their information via a PCA-derived global compensation token, and a task-aware scale search that scores candidate scales by both layer reconstruction error and multi-head geometric consistency among camera poses, depth maps, and point maps.
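The first of these steps, ranking blocks by sensitivity and giving extra bits to the most fragile ones, can be sketched on toy data. This is a minimal illustration and not the authors' code: each "block" is a single random weight matrix, sensitivity is approximated by output mean-squared error rather than the camera-pose accuracy drop the paper measures, and the names fake_quantize, block_sensitivity, assign_bits, make_toy_blocks, and the 4-bit/8-bit split are assumptions made for the example.

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-row quantize-dequantize (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def block_sensitivity(blocks, x, bits=4):
    """Output MSE when each block is quantized on its own (proxy metric)."""
    return np.array([
        np.mean((x @ fake_quantize(w, bits).T - x @ w.T) ** 2) for w in blocks
    ])

def assign_bits(scores, low=4, high=8, n_high=2):
    """Give `high` bits to the n_high most sensitive blocks, `low` to the rest."""
    bits = np.full(len(scores), low)
    bits[np.argsort(scores)[-n_high:]] = high
    return bits

def make_toy_blocks(seed=0, n_blocks=6, fragile=(1, 4)):
    """Hypothetical stand-in for transformer blocks; two get outlier weights."""
    rng = np.random.default_rng(seed)
    blocks = [rng.standard_normal((16, 32)) for _ in range(n_blocks)]
    for i in fragile:
        blocks[i][:, 0] = 40.0  # one large outlier weight per row
    x = rng.standard_normal((64, 32))
    return blocks, x

blocks, x = make_toy_blocks()
scores = block_sensitivity(blocks, x)
print(assign_bits(scores))
```

On this toy data the two blocks with injected outlier weights receive 8 bits and the others stay at 4 bits, because a single large weight per row inflates the quantization step for every other weight in that row.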

What carries the argument

Per-block quantization sensitivity analysis that drives selective mixed-precision allocation, combined with PCA-compensated token filtering and multi-head geometric consistency checks for scale selection.

If this is right

Memory footprint falls by a factor of 3 to 4.9 compared with the FP32 baseline.
Real hardware inference runs up to 2.8 times faster than the full-precision model.
All three 3D output heads (camera pose, depth map, point map) retain near-original accuracy after quantization.
Feed-forward 3D reconstruction becomes practical on memory-limited edge platforms such as UAVs and mobile AR devices.
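A back-of-envelope calculation helps place the reported memory figures. The sketch below assumes peak memory is simply weights plus activations, with weights stored at 4 bits and activations at 16 bits. It ignores quantization scales, any blocks kept at higher precision, and framework overhead, and the activation sizes are hypothetical values chosen only to show the trend.

```python
def weight_bytes(n_params, bits):
    """Storage for n_params weights at the given bit width."""
    return n_params * bits / 8

def peak_reduction(n_params, act_bytes_fp32, w_bits=4, a_bits=16):
    """Ratio of FP32 peak memory to quantized peak memory (weights + activations)."""
    fp32 = weight_bytes(n_params, 32) + act_bytes_fp32
    quant = weight_bytes(n_params, w_bits) + act_bytes_fp32 * a_bits / 32
    return fp32 / quant

N = 1.2e9  # parameter count stated in the abstract
print(weight_bytes(N, 32) / 1e9)  # FP32 weights in GB
print(weight_bytes(N, 4) / 1e9)   # 4-bit weights in GB
for act_gb in (0.0, 1.0, 4.0):    # hypothetical FP32 activation footprints
    print(act_gb, round(peak_reduction(N, act_gb * 1e9), 2))
```

Under these assumptions the weights shrink from 4.8 GB to 0.6 GB, an 8× cut on their own, and the overall ratio falls to 5.27 and then 3.38 as activation memory grows to 1 GB and 4 GB. That trend is one plausible reading of why the reported reduction is a range below 8×, given that Figure 4 reports peak memory under different numbers of inputs.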

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same block-wise sensitivity and consistency machinery could be tested on other large vision transformers that produce structured geometric outputs.
Extending the consistency checks to additional geometric relations might allow even lower average bit widths without retraining.
If the compensation token proves stable, the method could reduce power draw in battery-powered 3D perception systems.

Load-bearing premise

The sensitivity rankings and geometric consistency checks derived from the calibration set will continue to produce effective scales on data and tasks outside those used during development.

What would settle it

Evaluating the quantized model on a fresh 3D geometry benchmark never seen during calibration and measuring whether any prediction head shows more than a small accuracy drop relative to the original FP32 VGGT.

Figures

Figures reproduced from arXiv: 2605.31124 by Hesong Wang, Huan Wang, Zhizhen Pan.

**Figure 1.** Top: We introduce QVGGT, a three-stage post-training quantization framework that stabilizes VGGT by mixed-precision allocation, token filtering with PCA-based information compensation, and task-aware scale search preserving 3D geometric consistency. Bottom: Visualized 3D reconstruction results on real-world scenes show that QVGGT (W4A16) achieves comparable performance to VGGT (FP32), while delivering 4.2×…

**Figure 2.** Overview of QVGGT. Our framework consists of three components to preserve VGGT performance under low-bit quantization. Step 1: Sensitivity analysis. We estimate per-block quantization sensitivity across frame-wise and global transformer blocks, enabling selective mixed-precision assignment for critical layers. Step 2: Token filtering with camera information compensation. To mitigate outlier-dominated scale…

**Figure 3.** Motivation of selective mixed-precision quantization: (a) Per-block sensitivity analysis of VGGT. The abscissa shows the alternating arrangement of frame blocks and global blocks; the ordinate represents the accuracy drop of camera pose prediction on the CO3Dv2 dataset when each block is quantized individually. (b) The outlier distribution histogram of frame block 23 shows that some sensitive blocks have a…

**Figure 4.** Peak memory reduction under different numbers of input…

Original abstract

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3–4.9× memory reduction and up to 2.8× real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
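The task-aware scale search described in the abstract can be illustrated with a toy layer. The sketch below is an assumption-laden stand-in, not the paper's procedure: candidate scales are parameterized by a clip ratio alpha, the layer feeds two hypothetical linear heads (depth and the z-coordinate of a camera-frame point map), and the consistency term penalizes disagreement between them. The function names, the weight lam, and the way the toy heads are built to agree at full precision are all invented for the example.

```python
import numpy as np

def quantize_with_clip(w, alpha, bits=4):
    """Symmetric per-row quantize-dequantize; alpha < 1 clips the largest weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = alpha * np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def search_scale(w, x, heads, alphas, lam=1.0, bits=4):
    """Pick the clip ratio minimizing reconstruction + lam * cross-head mismatch."""
    w_depth, w_point_z = heads
    ref = x @ w.T
    best = None
    for alpha in alphas:
        feat = x @ quantize_with_clip(w, alpha, bits).T
        recon = np.mean((feat - ref) ** 2)
        # Toy consistency: the point map's z-coordinate should match depth.
        cons = np.mean((feat @ w_point_z - feat @ w_depth) ** 2)
        score = recon + lam * cons
        if best is None or score < best[1]:
            best = (float(alpha), float(score))
    return best

def make_toy(seed=0, d_in=16, d_out=32, n=128):
    """Hypothetical layer and heads that agree exactly at full precision."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((d_out, d_in))
    x = rng.standard_normal((n, d_in))
    w_depth = rng.standard_normal(d_out)
    # Add a component orthogonal to the column space of w, so the two heads
    # only disagree once quantization pushes features off that subspace.
    q, _ = np.linalg.qr(w)
    v = rng.standard_normal(d_out)
    w_point_z = w_depth + (v - q @ (q.T @ v))
    return w, x, (w_depth, w_point_z)

w, x, heads = make_toy()
alphas = np.linspace(0.5, 1.0, 11)
print(search_scale(w, x, heads, alphas, lam=0.0))  # reconstruction only
print(search_scale(w, x, heads, alphas, lam=1.0))  # with consistency term
```

Setting lam to zero recovers a plain reconstruction-error search, which makes it easy to compare which scale each criterion selects on the same calibration batch.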

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QVGGT shows a workable PTQ pipeline for VGGT that hits W4A16 with reported near-lossless results on the usual benchmarks, but the generalization of the calibration steps beyond that data is untested.


The main takeaway is that this paper gets a 1.2B-parameter 3D geometry transformer down to 4-bit weights and 16-bit activations while keeping accuracy on camera, depth, and point heads close to the original on standard benchmarks. They do it with per-block mixed precision, outlier token filtering plus PCA compensation, and a scale search that adds cross-head geometric consistency checks.

What stands out as new is the specific stacking of these pieces for VGGT. The token filtering step targets the high-variance camera and register tokens that blow up quantization error, then restores cues with a global PCA token. The scale search goes beyond simple layer reconstruction to enforce consistency across the three prediction heads. That is a reasonable engineering extension of existing PTQ work rather than a conceptual leap.
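The token filtering step can be pictured with a small synthetic example. The sketch below assumes each frame's sequence starts with one camera token and four register tokens, treats them as the outliers, computes the calibration range on patch tokens only, and summarizes the removed tokens with their rank-1 component from an uncentered SVD. How the paper actually builds and injects its compensation token is not specified in this summary, so compensation_token and the token layout are illustrative guesses.

```python
import numpy as np

N_SPECIAL = 5  # assumed layout: 1 camera token + 4 register tokens per frame

def split_tokens(tokens):
    """tokens: (frames, seq, dim) -> (special, patches), special tokens first."""
    return tokens[:, :N_SPECIAL], tokens[:, N_SPECIAL:]

def calibration_range(tokens, filter_special):
    """Abs-max activation range used to set a quantization scale."""
    _, patches = split_tokens(tokens)
    source = patches if filter_special else tokens
    return float(np.abs(source).max())

def compensation_token(tokens):
    """Summarize the removed special tokens by their rank-1 (top PCA) component."""
    special, _ = split_tokens(tokens)
    s = special.reshape(-1, special.shape[-1])          # (frames * N_SPECIAL, dim)
    u, sv, vt = np.linalg.svd(s, full_matrices=False)   # uncentered PCA via SVD
    coeff = (u[:, 0] * sv[0]).mean()                    # mean coordinate on PC 1
    return coeff * vt[0]                                # (dim,), sign-invariant

def make_toy_tokens(seed=0, frames=8, patches=100, dim=64):
    """Synthetic activations; special tokens carry two outlier channels."""
    rng = np.random.default_rng(seed)
    tokens = rng.standard_normal((frames, N_SPECIAL + patches, dim))
    tokens[:, :N_SPECIAL, [3, 17]] += 30.0
    return tokens

tokens = make_toy_tokens()
print(calibration_range(tokens, filter_special=False))
print(calibration_range(tokens, filter_special=True))
print(compensation_token(tokens).shape)
```

With the special tokens filtered out, the calibration range reflects the patch tokens alone, so the quantization step is no longer dictated by a handful of outlier channels, while the single summary vector keeps the dominant direction of what was removed.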

The paper does a solid job on the practical side. It reports 3-4.9× memory reduction and up to 2.8× hardware speedup, which matters for anyone who wants to run feed-forward 3D models on edge hardware. The focus on real deployment constraints is clear and useful.

The soft spot is the lack of evidence that the chosen scales and compensation token hold up outside the calibration distribution. All the sensitivity analysis, filtering, and consistency checks happen on the same family of geometry benchmarks used for final numbers. If new scenes change activation statistics or camera token behavior, the residual error could grow without the checks catching it. The abstract also gives no error bars or detailed baseline tables, so the "near-lossless" claim is hard to judge precisely from the summary.

This is for practitioners who need to compress large vision-geometry models for robotics or AR. A reader already working on PTQ for transformers will get concrete implementation ideas. It deserves a serious referee to verify the experimental setup, check for proper out-of-distribution tests, and confirm the hardware numbers.

Referee Report

2 major / 2 minor

Summary. The paper introduces QVGGT, a post-training quantization (PTQ) framework for the 1.2B-parameter VGGT model. It claims that per-block sensitivity analysis enables selective mixed-precision allocation, token filtering plus PCA-derived compensation mitigates high-variance camera/register token errors, and a task-aware scale search using multi-head geometric consistency yields near-lossless W4A16 quantization. This preserves accuracy across camera, depth, and point-map heads while delivering 3–4.9× memory reduction and up to 2.8× hardware speedup on geometry benchmarks.

Significance. If the reported near-lossless W4A16 performance generalizes, the work would meaningfully advance practical deployment of feed-forward 3D reconstruction models on edge hardware. The approach is empirically grounded in the specific challenges of geometry transformers (outlier tokens, cross-head consistency) rather than generic PTQ, and the multi-head supervision mechanism is a concrete strength that directly ties quantization to the downstream 3D tasks.

major comments (2)

[Abstract and experimental results section] The central claim of near-lossless performance rests on calibration-set scale selection (per-block sensitivity, token filtering, PCA compensation, and task-aware search). No experiments test transfer to out-of-distribution inputs (different lighting, camera trajectories, or scene scales), which is required to substantiate that the filtered tokens and global compensation token remain effective beyond the calibration distribution used for all reported benchmarks.
[Abstract] The abstract states that accuracy of all 3D prediction heads is preserved, yet no quantitative error bars, standard deviations across runs, or per-head absolute/relative error tables are referenced. Without these, it is impossible to verify whether observed differences fall within measurement noise or constitute statistically meaningful preservation.
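The kind of evidence requested in the second comment could take a form like the following sketch. It summarizes a per-head metric over repeated runs as mean and sample standard deviation, then checks whether the gap between the FP32 and quantized means is within k standard errors. The metric values are synthetic placeholders and not results from the paper, and the threshold k = 2 is a rough convention chosen here, not a formal significance test.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(values):
    """Mean and sample standard deviation across runs."""
    return mean(values), stdev(values)

def gap_within_noise(fp_runs, q_runs, k=2.0):
    """True if the mean gap is at most k standard errors of the difference."""
    se = sqrt(stdev(fp_runs) ** 2 / len(fp_runs) + stdev(q_runs) ** 2 / len(q_runs))
    return abs(mean(fp_runs) - mean(q_runs)) <= k * se

# Placeholder scores for one head over four runs (not real measurements).
fp32_runs = [0.90, 0.91, 0.89, 0.90]
quant_runs = [0.89, 0.90, 0.90, 0.89]

print(summarize(fp32_runs))
print(summarize(quant_runs))
print(gap_within_noise(fp32_runs, quant_runs))
```

Reporting such a table per head (camera, depth, point map) would let a reader judge whether "near-lossless" means a gap inside run-to-run variation or a small but systematic drop.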

minor comments (2)

[Method description] Notation for the PCA compensation token and the multi-head consistency loss should be defined explicitly with equations rather than descriptive text only.
[Experiments] The paper should include a direct comparison table against standard PTQ baselines (e.g., GPTQ, AWQ) using identical calibration data and the same VGGT backbone to isolate the contribution of the proposed token-filtering and consistency mechanisms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger evidence on generalization and statistical validation of our claims. We address each major comment below and commit to revisions where the manuscript requires strengthening.


Referee: [Abstract and experimental results section] The central claim of near-lossless performance rests on calibration-set scale selection (per-block sensitivity, token filtering, PCA compensation, and task-aware search). No experiments test transfer to out-of-distribution inputs (different lighting, camera trajectories, or scene scales), which is required to substantiate that the filtered tokens and global compensation token remain effective beyond the calibration distribution used for all reported benchmarks.

Authors: We agree that explicit evaluation on out-of-distribution inputs would provide stronger substantiation for the robustness of token filtering and PCA-derived compensation. While the reported benchmarks encompass diverse scenes and conditions, we will add targeted OOD experiments (e.g., altered lighting, varied camera trajectories, and different scene scales) in the revised manuscript to directly test transfer beyond the calibration distribution. revision: yes
Referee: [Abstract] The abstract states that accuracy of all 3D prediction heads is preserved, yet no quantitative error bars, standard deviations across runs, or per-head absolute/relative error tables are referenced. Without these, it is impossible to verify whether observed differences fall within measurement noise or constitute statistically meaningful preservation.

Authors: We concur that statistical measures are necessary to rigorously support the preservation claim. In the revised manuscript, we will include expanded per-head tables reporting absolute and relative errors for camera, depth, and point-map heads, along with standard deviations computed across multiple runs, to allow assessment of whether differences fall within measurement variability. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical PTQ method validated on external benchmarks


The paper describes an empirical post-training quantization pipeline (per-block sensitivity, token filtering + PCA compensation, task-aware scale search) with no mathematical derivation or closed-form predictions. All reported accuracy, memory, and speedup numbers are measured results on standard geometry benchmarks rather than quantities defined in terms of the method's own fitted parameters or internal calibration outputs. No self-citation chains, ansatzes, or uniqueness theorems are invoked to justify core claims; the work is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical details are summarized at high level.




Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. InICCV, 2015. 3

2015

[14] [14]

Cascade cost volume for high-resolution multi-view stereo and stereo matching

Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. InCVPR, 2020. 2

2020

[15] [15]

Deep com- pression: Compressing deep neural network with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally. Deep com- pression: Compressing deep neural network with pruning, trained quantization and huffman coding. InICLR, 2016. 2

2016

[16] [16]

Cambridge University Press,

Richard Hartley and Andrew Zisserman.Multiple View Ge- ometry in Computer Vision. Cambridge University Press,

[17] [17]

Cambridge University Press,

Richard Hartley and Andrew Zisserman.Multiple view ge- ometry in computer vision. Cambridge University Press,

[18] [18]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 2

work page internal anchor Pith review Pith/arXiv arXiv 2015

[19] [19]

Quantization and training of neural networks for efficient integer-arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. InCVPR,

[20] [20]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InECCV, 2024. 2, 3

2024

[21] [21]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Yao Fu, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. InMLSys, 2024. 2, 3, 6, 7

2024

[22] [22]

Robust incremental structure-from-motion with hybrid fea- tures

Shaohui Liu, Yidan Gao, Tianyi Zhang, R ´emi Pautrat, Jo- hannes L Sch ¨onberger, Viktor Larsson, and Marc Pollefeys. Robust incremental structure-from-motion with hybrid fea- tures. InECCV, 2025. 2, 3

2025

[23] [23]

Llm-qat: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In Findings of ACL, 2024. 3

2024

[24] [24]

SpinQuant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant–llm quantization with learned rotations.arXiv preprint arXiv:2405.16406, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Multiview stereo with cascaded epipolar raft

Zeyu Ma, Zachary Teed, and Jia Deng. Multiview stereo with cascaded epipolar raft. InECCV, 2022. 2, 3

2022

[26] [26]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 4

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

Pytorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019. 7

2019

[28] [28]

Slow learning and fast inference: Ef- ficient graph similarity computation via knowledge distilla- tion

Can Qin, Handong Zhao, Lichen Wang, Huan Wang, Yulun Zhang, and Yun Fu. Slow learning and fast inference: Ef- ficient graph similarity computation via knowledge distilla- tion. InAdvances in Neural Information Processing Systems,

[29] [29]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InICCV, 2021. 4, 8

2021

[30] [30]

Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Com- mon objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. InICCV, 2021. 2, 6, 7, 8, 1

2021

[31] [31]

Structure- from-motion revisited

Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InCVPR, 2016. 3

2016

[32] [32]

Pixelwise view selection for unstructured multi-view stereo

Johannes L Sch ¨onberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. InECCV, 2016. 3

2016

[33] [33]

Scene co- ordinate regression forests for camera relocalization in rgb-d images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene co- ordinate regression forests for camera relocalization in rgb-d images. InCVPR, 2013. 2, 6, 7

2013

[34] [34]

A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models.arXiv preprint arXiv:2306.11695, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds.arXiv preprint arXiv:2412.06974, 2024

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. Mv-dust3r+: Single-stage scene reconstruction from sparse views in 2 seconds.arXiv preprint arXiv:2412.06974, 2024. 3

work page arXiv 2024

[36] [36]

Plug-and-play 1

Keda Tao, Haoxuan You, Yang Sui, Can Qin, and Huan Wang. Plug-and-play 1. x-bit kv cache quantization for video large language models.arXiv preprint arXiv:2503.16257,

work page arXiv

[37] [37]

What makes a ”good” data augmentation in knowledge dis- tillation - a statistical perspective

Huan Wang, Suhas Lohit, Michael N Jones, and Yun Fu. What makes a ”good” data augmentation in knowledge dis- tillation - a statistical perspective. InNeurIPS, 2022. 2

2022

[38] [38]

Vggsfm: Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. InCVPR, 2024. 2, 3

2024

[39] [39]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InCVPR, 2025. 2, 3, 6, 7, 8, 1

2025

[40] [40]

Continuous 3d per- ception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d per- ception model with persistent state. InCVPR, 2025. 3, 7

2025

[41] [41]

Dust3r: Geometric 3d vi- sion made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InCVPR, 2024. 2, 3

2024

[42] [42]

Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo

Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. InICCV, 2021. 3

2021

[43] [43]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chau- mond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, R´emi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019. 7

work page internal anchor Pith review Pith/arXiv arXiv 1910

[44] [44]

Smoothquant: Accurate and ef- ficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and ef- ficient post-training quantization for large language models. InICML, 2023. 2, 3, 7

2023

[45] [45]

Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InCVPR, 2025. 3

2025

[46] [46]

Dopq-vit: To- wards distribution-friendly and outlier-aware post-training quantization for vision transformers.arXiv preprint arXiv:2408.03291, 2024

Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. Dopq-vit: To- wards distribution-friendly and outlier-aware post-training quantization for vision transformers.arXiv preprint arXiv:2408.03291, 2024. 3

work page arXiv 2024

[47] [47]

Depth anything: Unleashing the power of large-scale unlabeled data

Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. InCVPR, 2024. 3

2024

[48] [48]

Zeroquant: Ef- ficient and affordable post-training quantization for large- scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xi- aoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Ef- ficient and affordable post-training quantization for large- scale transformers. InNeurIPS, 2022. 2

2022

[49] [49]

Mquant: Unleashing the inference potential of multimodal large language models via static quantization

JiangYong Yu, Sifan Zhou, Dawei Yang, Shuoyu Li, Shuo Wang, Xing Hu, Chen Xu, Zukang Xu, Changyong Shu, and Zhihang Yuan. Mquant: Unleashing the inference potential of multimodal large language models via static quantization. InACM MM, 2025. 3

2025

[50] [50]

Ptq4vit: Post-training quantization for vi- sion transformers with twin uniform quantization

Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vi- sion transformers with twin uniform quantization. InECCV,

[51] [51]

Advancing model pruning via bi-level optimization

Yihua Zhang, Yuguang Yao, Parikshit Ram, Pu Zhao, Tian- long Chen, Mingyi Hong, Yanzhi Wang, and Sijia Liu. Advancing model pruning via bi-level optimization. In NeurIPS, 2022. 2

2022

[52] [52]

Decoupled knowledge distillation

Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. InCVPR, 2022. 2

2022

[53] [53]

Atom: Low-bit quantization for efficient and accurate llm serving

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, and Baris Kasikci. Atom: Low-bit quantization for efficient and accurate llm serving. InMLSys, 2024. 2, 3

2024

[54] [54]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2018

[55] [55]

Obs-diff: Accurate pruning for diffusion mod- els in one-shot.arXiv preprint arXiv:2510.06751, 2025

Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, and Huan Wang. Obs-diff: Accurate pruning for diffusion mod- els in one-shot.arXiv preprint arXiv:2510.06751, 2025. 2 QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer Supplementary Material A. Quantization Implementation Details To facilitate reproducibility, we summarize the key hy- per...

work page arXiv 2025