Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer
Pith reviewed 2026-05-25 06:34 UTC · model grok-4.3
The pith
Different geometry tasks in a shared transformer backbone have unequal sensitivity to quantization errors, so weighting calibration by per-task Fisher information preserves accuracy on the most affected outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task, yielding up to 39 percent relative improvement under 4-bit quantization on VGGT across camera pose estimation, point map reconstruction, and depth estimation.
What carries the argument
Diagonal Fisher information matrix that ranks quantization sensitivity per task, block, and channel and then scales the learnable affine transformations applied during post-training quantization calibration.
If this is right
- Quantized VGGT keeps higher accuracy on sensitive tasks such as depth estimation and pose prediction than uniform PTQ baselines.
- A single shared backbone can be compressed once and still serve multiple downstream geometry tasks without large per-task drops.
- The method delivers consistent gains across camera pose, point map, and depth outputs under 4-bit weights.
- Calibration now explicitly protects task-critical channels rather than averaging sensitivity across all outputs.
Where Pith is reading between the lines
- The same ranking step could be tested on other multi-task vision transformers whose backbone is shared across outputs with different error tolerances.
- Choosing calibration images that better sample the statistics of each individual task might further strengthen the Fisher ranks.
- If the Fisher diagonal already captures the dominant error modes, adding second-order cross terms between channels would likely add little extra benefit.
Load-bearing premise
The diagonal Fisher information matrix computed on calibration data accurately ranks which blocks and channels matter most for accuracy on each geometric task after quantization.
What would settle it
Applying the Fisher-guided scaling to the same VGGT model and calibration set produces no measurable reduction in the accuracy gap between sensitive and insensitive tasks relative to standard uniform calibration under identical 4-bit settings.
Figures
read the original abstract
Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under the 4-bit quantization. Code is available at https://github.com/ypzhng/FGQ.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fisher-Guided Quantization (FGQ) for the Visual Geometry Grounded Transformer (VGGT), a feed-forward model for multiple 3D geometry tasks including depth estimation, camera pose prediction, and point cloud reconstruction. The key observation is that these tasks have different sensitivities to quantization due to varying contributions from shared transformer blocks and channels. FGQ leverages the diagonal Fisher information matrix computed on calibration data to measure these sensitivities and adjusts the learnable affine transformations accordingly to better protect task-critical elements. This leads to improved performance over standard PTQ methods, with reported gains of up to 39% relative improvement in 4-bit quantization settings across the tasks.
Significance. Should the central mechanism hold, FGQ offers a task-aware approach to PTQ that could benefit other multi-task vision transformers by avoiding uniform treatment that harms sensitive tasks. The empirical results on VGGT are promising, and the open-sourced code at the provided GitHub link enhances the work's value for the community by supporting reproducibility.
major comments (1)
- [Method] Method section (Fisher-guided modulation of learnable affine transforms): The claim that the diagonal Fisher information matrix computed on calibration data produces a reliable per-task ranking of block and channel quantization sensitivity is load-bearing for the central contribution. However, the manuscript provides no direct validation (e.g., correlation between derived ranks and measured task-specific accuracy degradation when selectively increasing quantization error on high-Fisher components), leaving the stress-test concern unaddressed that local curvature may not predict effects of structured quantization perturbations and that the diagonal approximation ignores common channel correlations in transformer activations.
minor comments (1)
- [Abstract] Abstract: the phrase 'up to 39% relative improvement' should specify the exact baseline method and error metric for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. Below we address the major comment.
read point-by-point responses
-
Referee: The claim that the diagonal Fisher information matrix computed on calibration data produces a reliable per-task ranking of block and channel quantization sensitivity is load-bearing for the central contribution. However, the manuscript provides no direct validation (e.g., correlation between derived ranks and measured task-specific accuracy degradation when selectively increasing quantization error on high-Fisher components), leaving the stress-test concern unaddressed that local curvature may not predict effects of structured quantization perturbations and that the diagonal approximation ignores common channel correlations in transformer activations.
Authors: We agree that an explicit correlation analysis between Fisher-derived rankings and measured task-specific degradation under selective perturbations would strengthen the central claim. The current manuscript relies on the established role of (diagonal) Fisher information as a local sensitivity measure in the quantization and pruning literature, together with the observed 39% relative gains on sensitive tasks as indirect support. The diagonal approximation is adopted for tractability on billion-parameter models; the learnable affine transformations are intended to mitigate some effects of ignored correlations during calibration. To directly address the referee's stress-test concern we will add a new experiment subsection that (i) ranks blocks/channels by Fisher value per task, (ii) selectively increases quantization noise on high- versus low-Fisher elements, and (iii) reports the resulting task-specific accuracy degradation together with rank-correlation plots. This addition will be included in the revised manuscript. revision: yes
Circularity Check
No circularity: method is a heuristic incorporation of standard Fisher information into PTQ calibration
full rationale
The paper describes FGQ as using the diagonal Fisher information matrix (computed on calibration data) to rank task/block/channel sensitivities and modulate learnable affine transforms. This is presented as an empirical design choice rather than a derivation from first principles. No equations are shown that reduce a claimed prediction back to fitted inputs by construction, and no self-citation chain is invoked to justify uniqueness or load-bearing premises. Results are reported as empirical improvements on held-out geometric tasks, consistent with the reader's assessment of score 2.0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The diagonal of the Fisher information matrix computed on calibration data ranks the relative importance of transformer blocks and channels for each geometric task under quantization noise.
Reference graph
Works this paper leans on
-
[1]
Quantized visual geometry grounded transformer
Weilun Feng, Haotong Qin, Mingqiang Wu, Chuanguang Yang, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Yulun Zhang, Michele Magno, et al. Quantized visual geometry grounded transformer. arXiv preprint arXiv:2509.21302,
-
[2]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, and Jiangmiao Pang. g2vlm: Geometry grounded vision language model with unified 3d reconstruction and spatial reasoning.arXiv preprint arXiv:2511.21688,
-
[4]
Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li, and Wei Gao. Vgd: Visual geometry gaussian splatting for feed-forward surround-view driving reconstruction.arXiv preprint arXiv:2510.19578,
-
[5]
SpinQuant: LLM quantization with learned rotations
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krish- namoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations.arXiv preprint arXiv:2405.16406,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Tail-aware post-training quantization for 3d geometry models.arXiv preprint arXiv:2602.01741,
Sicheng Pan, Chen Tang, Shuzhao Xie, Ke Yang, Weixiang Zhang, Jiawei Li, Bin Chen, Shu-Tao Xia, and Zhi Wang. Tail-aware post-training quantization for 3d geometry models.arXiv preprint arXiv:2602.01741,
-
[8]
FastVGGT: Training-Free Acceleration of Visual Geometry Transformer
You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liu- juan Cao. Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Litevggt: Boosting vanilla vggt via geometry-aware cached token merging
Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, et al. Litevggt: Boosting vanilla vggt via geometry-aware cached token merging. arXiv preprint arXiv:2512.04939,
-
[10]
Avggt: Rethinking global attention for accelerating vggt.arXiv preprint arXiv:2512.02541, 2025a
Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, and Jianfu Zhang. Avggt: Rethinking global attention for accelerating vggt.arXiv preprint arXiv:2512.02541, 2025a. Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. Flatquant: Flatness matters for llm quan...
-
[11]
Huanrui Yang, Yafeng Huang, Zhen Dong, Denis A Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Yuan Du, Kurt Keutzer, and Shanghang Zhang. Fisher-aware quantization for detr detectors with critical-category objectives.arXiv preprint arXiv:2407.03442,
-
[12]
Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction
Sizhe Yang, Linning Xu, Hao Li, Juncheng Mu, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Robo3r: Enhancing robotic manipulation with accurate feed-forward 3d reconstruction.arXiv preprint arXiv:2602.10101,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Yipu Zhang, Jintao Cheng, Xingyu Liu, Zeyu Li, Carol Jingyi Li, Jin Wu, Lin Jiang, Yuan Xie, Jiang Xu, and Wei Zhang. Versaq-3d: A reconfigurable accelerator enabling feed-forward and generalizable 3d reconstruction via versatile quantization.arXiv preprint arXiv:2601.20317,
-
[14]
Stereo Magnification: Learning View Synthesis using Multiplane Images
Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
The same argument applies to discrete labels by replacing integrals with sums
11 A Proof of Proposition 3.1 We prove the identity for a continuous label space. The same argument applies to discrete labels by replacing integrals with sums. Fix an input x. Since pz(y|x) is a normalized conditional distribution, Z pz(y|x)dy= 1.(17) Under the stated regularity conditions, we may differentiate Eq.(17) with respect to z. Differentiating ...
work page 2021
-
[16]
Table 7 summarizes FGQ’s performance onπ3
D FGQ Evaluation Result onπ 3 We further extend our method to π3 [Wang et al., 2025b], the most recent feed-forward 3D recon- struction model. Table 7 summarizes FGQ’s performance onπ3. On camera pose estimation, RTN collapses under 4/4 quantization, with AUC@3 on Co3Dv2 dropping from 0.5140 (FP16) to 0.0246, showing that π3 is highly sensitive to quantiz...
work page 2021
-
[17]
and 7-Scenes [Shotton et al., 2013] (Fig. 6). Moreover, we provide the visualization result for π3 [Wang et al., 2025b] on 7-Scenes [Shotton et al., 2013] in Fig. 7 to show FGQ’s improvement over RTN. 16 Ground Truth𝛑3 RTN W4A4𝛑3 FGQ W4A4 (Ours) Figure 7: Visual comparison results for π3 on 7-Scenes [Shotton et al., 2013] for Ground Truth, RTN W4A4, and F...
work page 2013
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.