PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers
Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3
The pith
Pre-AA token pruning with a distilled scorer lets frozen VGGT cut latency by over 5x while keeping reconstruction quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaceVGGT prunes DINO patch tokens before the first AA block of a frozen VGGT. A lightweight Token Scorer estimates per-token importance from DINO features, distilled against AA-internal attention targets and then refined under downstream camera, depth, and point-map losses. Per-frame keep budgets fix the visible sequence length while an importance-adaptive merge/prune assignment respects a fixed total merge budget. The Feature-guided Restoration module reconstructs the dense spatial grid required by the heads. On ScanNet-50 this yields 5.1 times lower inference latency than unmodified VGGT at N=300 and 1.47 times lower than LiteVGGT at N=1000 while remaining on the reconstruction quality-latency frontier.
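As a rough illustration of that pipeline, the sketch below scores DINO patch tokens with a small MLP and keeps a fixed top-k per frame before any AA computation. The module layout, tensor shapes, and scorer architecture here are assumptions for exposition, not the paper's implementation.

```python
# Minimal sketch of pre-AA token pruning (illustrative; not the paper's code).
# Assumed shapes: DINO patch tokens are (F, P, D) for F frames, P patches per
# frame, D channels; the per-frame keep budget K is fixed.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    """Lightweight per-token importance scorer over DINO features (assumed MLP)."""
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (F, P, D) -> importance scores (F, P)
        return self.mlp(tokens).squeeze(-1)

def prune_pre_aa(tokens: torch.Tensor, scorer: TokenScorer, keep: int):
    """Keep the top-`keep` tokens of each frame so the AA-visible length is fixed."""
    scores = scorer(tokens)                          # (F, P)
    keep_idx = scores.topk(keep, dim=1).indices      # (F, keep)
    kept = torch.gather(
        tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    )                                                # (F, keep, D)
    return kept, keep_idx, scores

# Toy usage: 4 frames, 196 patches, 384-dim DINO features, keep 64 per frame.
tokens = torch.randn(4, 196, 384)
scorer = TokenScorer(384)
kept, keep_idx, scores = prune_pre_aa(tokens, scorer, keep=64)
print(kept.shape)  # torch.Size([4, 64, 384])
```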
What carries the argument
The Token Scorer, a lightweight network that ranks DINO patch tokens by estimated importance using distilled attention targets and downstream geometry losses, paired with the Feature-guided Restoration module that reconstructs the pruned dense spatial grid for the prediction heads.
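One plausible form of the restoration step is sketched below, under the assumption that pruned grid positions are filled from a learned projection of the original DINO features while kept positions take their post-AA values. The fill strategy and projection are illustrative; the paper's module may differ.

```python
# Sketch of a feature-guided restoration step (assumed design, not the paper's).
import torch
import torch.nn as nn

class FeatureGuidedRestoration(nn.Module):
    """Rebuild the dense (F, P, D) grid from the kept tokens' AA outputs.
    Pruned positions are filled from a projection of the original DINO features."""
    def __init__(self, dim: int):
        super().__init__()
        self.fill_proj = nn.Linear(dim, dim)  # maps DINO features into the head space

    def forward(self, kept: torch.Tensor, keep_idx: torch.Tensor,
                dino_tokens: torch.Tensor) -> torch.Tensor:
        # kept: (F, K, D) AA outputs for kept tokens; keep_idx: (F, K) grid positions;
        # dino_tokens: (F, P, D) original features used to fill pruned slots.
        dense = self.fill_proj(dino_tokens)  # default fill for every position
        dense = dense.scatter(
            1, keep_idx.unsqueeze(-1).expand(-1, -1, kept.shape[-1]), kept
        )                                    # overwrite the kept positions
        return dense                         # (F, P, D), as expected by the heads

# Toy usage: restore a (4, 196, 384) grid from 64 kept tokens per frame.
keep_idx = torch.argsort(torch.rand(4, 196), dim=1)[:, :64]  # unique positions
restore = FeatureGuidedRestoration(384)
dense = restore(torch.randn(4, 64, 384), keep_idx, torch.randn(4, 196, 384))
print(dense.shape)  # torch.Size([4, 196, 384])
```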
If this is right
- Pruning occurs before any AA block, so the backbone-visible sequence length remains fixed and computation is predictable.
- Importance-adaptive merge and prune assignment preserves residual content from high-saliency frames under a fixed total merge budget (a simplified allocation sketch follows this list).
- The backbone stays frozen, so the acceleration applies to any pretrained VGGT without retraining the geometry transformer itself.
- The method identifies pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
- Latency reductions of 5.1 times versus the original model at low token counts and 1.47 times versus the optimized LiteVGGT at high token counts are observed on ScanNet-50.
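The allocation of a fixed total merge budget across frames can be sketched as a proportional split by frame saliency, with leftover budget going to the most salient frames. This is a deliberately simplified stand-in for the paper's importance-adaptive merge/prune assignment: here the non-kept tokens of each frame are either averaged into a single merged token or dropped.

```python
# Simplified stand-in for importance-adaptive merge/prune allocation.
import torch

def allocate_merge_budget(frame_saliency: torch.Tensor, total_merge: int) -> torch.Tensor:
    """Split a fixed total merge budget across frames in proportion to saliency.
    frame_saliency: (F,) nonnegative importance per frame (e.g. mean token score).
    Returns an integer allocation that sums exactly to total_merge."""
    weights = frame_saliency / frame_saliency.sum().clamp_min(1e-8)
    alloc = torch.floor(weights * total_merge).long()
    remainder = total_merge - int(alloc.sum())
    if remainder > 0:
        # give the leftover budget to the most salient frames
        order = torch.argsort(frame_saliency, descending=True)
        alloc[order[:remainder]] += 1
    return alloc

def merge_or_prune(residual: torch.Tensor, scores: torch.Tensor, n_merge: int):
    """For one frame: average the n_merge highest-scoring residual tokens into one
    merged token (a simplification of token merging); the rest are dropped."""
    if n_merge == 0:
        return residual.new_zeros(0, residual.shape[-1])
    idx = scores.topk(min(n_merge, residual.shape[0])).indices
    return residual[idx].mean(dim=0, keepdim=True)   # (1, D) merged token

# Toy usage: 3 frames with different saliency and a total merge budget of 10 tokens.
saliency = torch.tensor([0.7, 0.2, 0.1])
print(allocate_merge_budget(saliency, 10))  # tensor([7, 2, 1])
```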
Where Pith is reading between the lines
- The same pre-attention pruning pattern could accelerate other vision transformers that use quadratic attention without requiring full task-specific retraining.
- In real-time 3D reconstruction pipelines such as SLAM, the reduced latency might allow processing of longer video sequences on edge hardware.
- Success of early DINO-feature scoring implies that token saliency for 3D geometry can often be predicted before any attention computation occurs.
- Dynamic, scene-dependent keep budgets could further improve the quality-latency trade-off beyond the fixed-budget scheme tested here.
Load-bearing premise
A lightweight scorer trained on DINO features and AA-internal attention can accurately rank which tokens matter for downstream camera, depth, and point-map accuracy so that the restoration module can recover enough spatial information without degrading final geometry outputs.
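Read operationally, this premise corresponds to a two-stage objective: first regress the scorer's token distribution onto an AA-internal attention target from the unpruned backbone, then fine-tune through the pruned pipeline under camera, depth, and point-map losses. The sketch below shows that schema; the KL distillation loss, the L1 task losses, and the equal weighting are assumptions rather than the paper's exact choices.

```python
# Schematic two-stage training objective (assumed loss forms).
import torch
import torch.nn.functional as F

def distillation_loss(pred_scores: torch.Tensor, attn_target: torch.Tensor) -> torch.Tensor:
    """Stage 1: match the scorer's per-token distribution to an AA-internal
    attention target from the unpruned backbone (KL divergence is an assumption)."""
    log_p = F.log_softmax(pred_scores, dim=-1)                              # (F, P)
    q = attn_target / attn_target.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(log_p, q, reduction="batchmean")

def refinement_loss(pred: dict, gt: dict, w_cam=1.0, w_depth=1.0, w_pts=1.0) -> torch.Tensor:
    """Stage 2: refine the scorer end-to-end under downstream geometry losses.
    `pred` and `gt` are dicts with 'camera', 'depth', 'points' tensors (assumed keys)."""
    return (w_cam * F.l1_loss(pred["camera"], gt["camera"])
            + w_depth * F.l1_loss(pred["depth"], gt["depth"])
            + w_pts * F.l1_loss(pred["points"], gt["points"]))

# Toy usage of the stage-1 loss on random scores and a random attention target.
print(distillation_loss(torch.randn(4, 196), torch.rand(4, 196)).item())
```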
What would settle it
A measurable increase in camera-pose error, depth error, or point-map error on ScanNet-50 when comparing the pruned model to the unpruned VGGT at the same effective N would falsify the claim that quality is preserved.
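In practice that test is a paired comparison: evaluate the pruned and unpruned models on the same ScanNet-50 scenes at the same effective N and check whether any error metric degrades beyond a stated tolerance. A minimal sketch, with placeholder metric names, threshold, and numbers:

```python
# Placeholder falsification check: does pruning degrade any geometry metric?
def quality_preserved(metrics_pruned: dict, metrics_unpruned: dict,
                      tolerance: float = 0.01) -> bool:
    """Return False if any error metric (lower is better) increases by more than
    `tolerance` in absolute terms; the 0.01 threshold is a placeholder."""
    keys = ("camera_pose_error", "depth_error", "pointmap_error")
    return all(metrics_pruned[k] - metrics_unpruned[k] <= tolerance for k in keys)

# Hypothetical numbers, for illustration only.
print(quality_preserved(
    {"camera_pose_error": 0.031, "depth_error": 0.072, "pointmap_error": 0.055},
    {"camera_pose_error": 0.029, "depth_error": 0.070, "pointmap_error": 0.054},
))  # True under the placeholder tolerance
```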
Original abstract
Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality-latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by 5.1× over unmodified VGGT at N=300 and 1.47× over LiteVGGT at N=1000. These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaceVGGT, a pre-Alternating-Attention (pre-AA) token pruning framework for frozen Visual Geometry Transformers (VGGT). A lightweight Token Scorer is trained on DINO patch features, first distilled against AA-internal attention maps from the unpruned backbone and then refined under downstream camera, depth, and point-map losses. Tokens are pruned before the first AA block using a per-frame keep budget and importance-adaptive merge/prune assignment under a fixed total merge budget; a Feature-guided Restoration module reconstructs the dense spatial grid for the prediction heads. On ScanNet-50 and 7-Scenes, the method is claimed to remain on the reconstruction quality-latency frontier, delivering 5.1× latency reduction versus unmodified VGGT at N=300 and 1.47× versus LiteVGGT at N=1000.
Significance. If the empirical results hold under fuller validation, the work demonstrates that pre-AA pruning with a distilled lightweight scorer can accelerate frozen VGGT-style geometry transformers without retraining the backbone, addressing the quadratic scaling of the AA stack for longer sequences. This is a practical efficiency contribution for feed-forward 3D reconstruction pipelines, and the combination of attention distillation followed by task-specific refinement plus feature-guided restoration is a concrete engineering advance that could transfer to related transformer-based geometry models.
major comments (2)
- [Experimental evaluation and §3 (Token Scorer training)] The central quality-preservation claim rests on the Token Scorer's ability to rank tokens by geometric utility (camera, depth, point-map accuracy) rather than merely by AA-internal attention saliency. The manuscript provides no direct analysis or ablation (e.g., correlation between scorer outputs and per-token error contribution to downstream heads, or performance when using only the attention-distillation target versus the refined scorer) that would confirm transfer from the AA attention target to the final geometry tasks; this is load-bearing for the reported frontier results.
- [Abstract and results tables] Latency and quality numbers (5.1× at N=300 on ScanNet-50, 1.47× at N=1000) are reported without error bars, standard deviations across runs, or full experimental protocol details (random seeds, exact hardware, batching). This weakens confidence in the precise speedup factors and in the claim that quality is preserved on the frontier.
minor comments (2)
- [Method] Notation for the per-frame keep budget and total merge budget should be introduced with explicit symbols and constraints in the method section to clarify how the importance-adaptive assignment is computed.
- [§3.3] The Feature-guided Restoration module is described at a high level; a diagram or pseudocode showing how restored features are injected into the prediction heads would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing where the manuscript is incomplete and outlining the revisions we will make.
Point-by-point responses
- Referee: [Experimental evaluation and §3 (Token Scorer training)] The central quality-preservation claim rests on the Token Scorer's ability to rank tokens by geometric utility (camera, depth, point-map accuracy) rather than merely by AA-internal attention saliency. The manuscript provides no direct analysis or ablation (e.g., correlation between scorer outputs and per-token error contribution to downstream heads, or performance when using only the attention-distillation target versus the refined scorer) that would confirm transfer from the AA attention target to the final geometry tasks; this is load-bearing for the reported frontier results.
Authors: We agree that an explicit ablation isolating the contribution of the task-specific refinement stage (versus attention distillation alone) would strengthen the central claim. The current manuscript describes the two-stage training but does not report the downstream metrics for the distilled-only scorer. We will add this ablation to §3 and the experimental results, showing camera, depth, and point-map accuracy for both variants. We will also include a correlation analysis between scorer outputs and per-token reconstruction error contributions where the data permit. Revision: yes.
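The promised correlation analysis could be as simple as a rank correlation between scorer outputs and a per-token error attribution, e.g. the increase in point-map error observed when that token is pruned. A minimal sketch under that assumption:

```python
# Rank correlation between scorer importance and per-token error attribution.
import numpy as np
from scipy.stats import spearmanr

def scorer_error_correlation(scores: np.ndarray, error_contrib: np.ndarray) -> float:
    """Spearman rank correlation between per-token importance scores and the
    per-token contribution to downstream error (e.g. error increase when the token
    is pruned), both 1-D arrays of equal length. A strongly positive value means
    the scorer ranks highly exactly those tokens whose removal hurts geometry most."""
    rho, _ = spearmanr(scores, error_contrib)
    return float(rho)

# Toy illustration with synthetic, positively correlated data.
rng = np.random.default_rng(0)
scores = rng.random(1000)
error_contrib = scores + 0.1 * rng.standard_normal(1000)
print(round(scorer_error_correlation(scores, error_contrib), 2))
```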
- Referee: [Abstract and results tables] Latency and quality numbers (5.1× at N=300 on ScanNet-50, 1.47× at N=1000) are reported without error bars, standard deviations across runs, or full experimental protocol details (random seeds, exact hardware, batching). This weakens confidence in the precise speedup factors and in the claim that quality is preserved on the frontier.
Authors: We acknowledge that the absence of error bars, standard deviations, and complete protocol details reduces confidence in the exact reported factors. In the revised manuscript we will report standard deviations over multiple runs with documented random seeds, specify the exact hardware, batch sizes, and measurement methodology for all latency and quality numbers, and update the abstract and tables accordingly. Revision: yes.
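A reproducible latency protocol of the kind described typically involves warm-up passes, device synchronization around the timed region, and a mean with standard deviation over repeated runs. The harness below is a generic sketch with a stand-in model, not the paper's measurement code:

```python
# Generic latency-benchmark harness (stand-in model and input).
import time
import statistics
import torch

@torch.no_grad()
def benchmark_latency(model, inputs, warmup: int = 5, runs: int = 20):
    """Return (mean_ms, std_ms) over `runs` timed forward passes after `warmup`
    untimed passes; synchronizes the GPU so kernel time is actually measured."""
    for _ in range(warmup):
        model(inputs)
    times_ms = []
    for _ in range(runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times_ms.append((time.perf_counter() - t0) * 1e3)
    return statistics.mean(times_ms), statistics.stdev(times_ms)

# Toy usage with a stand-in module in place of the full geometry transformer.
model = torch.nn.Linear(384, 384)
mean_ms, std_ms = benchmark_latency(model, torch.randn(300, 384))
print(f"{mean_ms:.2f} ms ± {std_ms:.2f} ms")
```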
Circularity Check
No significant circularity: training and evaluation remain independent of reported gains
full rationale
The paper's chain consists of (1) distilling a Token Scorer from AA-internal attention targets of the frozen backbone, (2) refining it under separate downstream losses, (3) applying a fixed-budget prune/merge plus Feature-guided Restoration, and (4) measuring latency and reconstruction metrics on ScanNet-50/7-Scenes. None of these steps reduce the headline latency reductions (5.1×, 1.47×) to fitted parameters by construction, nor invoke self-cited uniqueness theorems or ansatzes that presuppose the final result. The central claim is therefore an empirical observation rather than a tautology.
Axiom & Free-Parameter Ledger
free parameters (2)
- per-frame keep budget
- total merge budget
axioms (1)
- domain assumption: DINO patch features contain sufficient signal to train a scorer that ranks token importance for VGGT geometry tasks
invented entities (2)
- Token Scorer (no independent evidence)
- Feature-guided Restoration module (no independent evidence)
Reference graph
Works this paper leans on
- [1] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [2] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
- [3] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024.
- [4] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928, 2025.
- [5] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- [7] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers and distillation through attention. In ICML, pages 10347–10357, 2021.
- [8] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
- [9] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021.
- [10] You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025.
- [11] Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging. arXiv preprint arXiv:2512.04939, 2025.
- [12] Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, and Sebastian Scherer. Co-Me: Confidence-guided token merging for visual geometric transformers, 2025.
- [13] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, pages 9650–9660, 2021.
- [14] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, pages 16000–16009, 2022.
- [15] Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [16] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023.
- [17] Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2D feature representations by 3D-aware fine-tuning. In ECCV, 2024.
- [18] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025.
- [19] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025.
- [20] Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026.
- [21] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. In NeurIPS, 2021.
- [22] Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. In NeurIPS, pages 24898–24911, 2021.
- [23] Hongxu Yin, Arash Vahdat, Jose M. Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-ViT: Adaptive tokens for efficient vision transformer. In CVPR, pages 10809–10818, 2022.
- [24] Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022.
- [25] Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In ECCV, pages 396–414, 2022.
- [26] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022.
- [27] Michael S. Ryoo, A. J. Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. TokenLearner: Adaptive space-time tokenization for videos. In NeurIPS, pages 12786–12797, 2021.
- [28] Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer. In CVPR, pages 2061–2070, 2023.
- [29] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013.
- [30] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
- [31] Ben Glocker, Shahram Izadi, Jamie Shotton, and Antonio Criminisi. Real-time RGB-D camera relocalization. In ISMAR, 2013.
- [32] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024.
- [33] Dejan Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In CVPR, 2022.
- [34] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.