pith. machine review for the scientific record.

arxiv: 2605.08371 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

Authors on Pith no claims yet

Pith reviewed 2026-05-12 02:14 UTC · model grok-4.3

classification 💻 cs.CV
keywords token pruning · visual geometry transformer · alternating attention · 3D reconstruction · DINO features · latency reduction · feature restoration · ScanNet

The pith

Pre-AA token pruning with a distilled scorer lets frozen VGGT cut latency by over 5x while keeping reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PaceVGGT as a pre-alternating-attention token pruning method for Visual Geometry Transformers. It trains a lightweight Token Scorer on DINO patch features to estimate importance, first distilling from the backbone's internal attention maps and then refining with losses on camera pose, depth, and point maps. A per-frame keep budget and adaptive merge/prune scheme control the input length and total merges, after which a Feature-guided Restoration module rebuilds the dense spatial grid for the prediction heads. Experiments on ScanNet-50 and 7-Scenes show the approach stays on the quality-latency frontier, delivering 5.1 times lower latency than the unmodified VGGT at N=300 and 1.47 times lower than LiteVGGT at N=1000.

Core claim

PaceVGGT prunes DINO patch tokens before the first AA block of a frozen VGGT. A lightweight Token Scorer estimates per-token importance from DINO features, distilled against AA-internal attention targets and then refined under downstream camera, depth, and point-map losses. Per-frame keep budgets fix the visible sequence length while an importance-adaptive merge/prune assignment respects a fixed total merge budget. The Feature-guided Restoration module reconstructs the dense spatial grid required by the heads. On ScanNet-50 this yields 5.1 times lower inference latency than unmodified VGGT at N=300 and 1.47 times lower than LiteVGGT at N=1000 while remaining on the reconstruction quality-latency frontier.
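
A minimal sketch of that pre-AA pruning step, assuming a PyTorch-style interface; the names prune_before_aa, token_scorer, and keep_budget are illustrative, not the paper's API.

    import torch

    def prune_before_aa(dino_tokens, token_scorer, keep_budget):
        """Hypothetical pre-AA pruning: dino_tokens is (frames, tokens, dim).

        The scorer ranks DINO patch tokens per frame; only the top keep_budget
        tokens per frame enter the frozen alternating-attention stack, so the
        backbone-visible sequence length is fixed in advance."""
        frames, tokens, dim = dino_tokens.shape
        scores = token_scorer(dino_tokens)                  # (frames, tokens) importance
        keep_idx = scores.topk(keep_budget, dim=1).indices  # per-frame keep budget
        keep_idx, _ = keep_idx.sort(dim=1)                  # preserve spatial order
        kept = torch.gather(dino_tokens, 1,
                            keep_idx.unsqueeze(-1).expand(-1, -1, dim))
        return kept, keep_idx                               # (frames, keep_budget, dim)

In the full pipeline the merge of discarded tokens and the restoration step would sit on either side of this call.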

What carries the argument

The Token Scorer, a lightweight network that ranks DINO patch tokens by estimated importance using distilled attention targets and downstream geometry losses, paired with the Feature-guided Restoration module that reconstructs the pruned dense spatial grid for the prediction heads.
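
As a rough illustration of these two components, a small scorer and restoration module might look like the sketch below; the architectures are guesses rather than the paper's, and they assume the pruning interface sketched under the core claim.

    import torch
    import torch.nn as nn

    class TokenScorer(nn.Module):
        """Illustrative lightweight scorer over DINO patch features."""
        def __init__(self, dim, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                     nn.GELU(), nn.Linear(hidden, 1))

        def forward(self, tokens):                 # tokens: (frames, tokens, dim)
            return self.mlp(tokens).squeeze(-1)    # (frames, tokens) importance logits

    class FeatureGuidedRestoration(nn.Module):
        """Sketch of grid restoration: kept tokens are scattered back to their
        original positions and pruned slots are filled from a projection of the
        raw DINO features, so the heads see a dense spatial grid again."""
        def __init__(self, dim):
            super().__init__()
            self.fill = nn.Linear(dim, dim)

        def forward(self, kept, keep_idx, dino_tokens):
            dim = dino_tokens.shape[-1]
            dense = self.fill(dino_tokens)         # coarse fill for pruned slots
            dense = dense.scatter(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim), kept)
            return dense                           # (frames, tokens, dim)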

If this is right

  • Pruning occurs before any AA block, so the backbone-visible sequence length remains fixed and computation is predictable.
  • Importance-adaptive merge and prune assignment preserves residual content from high-saliency frames under a fixed total merge budget (one possible allocation is sketched after this list).
  • The backbone stays frozen, so the acceleration applies to any pretrained VGGT without retraining the geometry transformer itself.
  • The method identifies pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
  • Latency reductions of 5.1 times versus the original model at low token counts and 1.47 times versus the optimized LiteVGGT at high token counts are observed on ScanNet-50.
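
One way the importance-adaptive assignment mentioned above could allocate a fixed total merge budget across frames, as a hedged sketch rather than the paper's exact rule:

    import torch

    def merges_per_frame(scores, keep_budget, total_merge_budget):
        """Frames whose discarded tokens still carry high importance receive a
        larger share of the fixed total merge budget, so their residual content
        is merged into kept tokens rather than dropped outright."""
        sorted_scores, _ = scores.sort(dim=1, descending=True)   # scores: (frames, tokens)
        residual = sorted_scores[:, keep_budget:].sum(dim=1)     # saliency outside the keep set
        weights = residual / residual.sum().clamp(min=1e-8)
        return (weights * total_merge_budget).round().long()     # merges allotted per frame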

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-attention pruning pattern could accelerate other vision transformers that use quadratic attention without requiring full task-specific retraining.
  • In real-time 3D reconstruction pipelines such as SLAM, the reduced latency might allow processing of longer video sequences on edge hardware.
  • Success of early DINO-feature scoring implies that token saliency for 3D geometry can often be predicted before any attention computation occurs.
  • Dynamic, scene-dependent keep budgets could further improve the quality-latency trade-off beyond the fixed-budget scheme tested here.

Load-bearing premise

A lightweight scorer trained on DINO features and AA-internal attention can accurately rank which tokens matter for downstream camera, depth, and point-map accuracy so that the restoration module can recover enough spatial information without degrading final geometry outputs.
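
The two training stages behind that premise could be expressed roughly as the losses below; the KL form of the distillation term and the L1 geometry terms are assumptions, since the paper's exact loss functions are not reproduced here.

    import torch.nn.functional as F

    def distillation_loss(scorer_logits, aa_attention_target):
        """Stage 1: match the scorer's per-token distribution to an AA-internal
        attention target taken from the unpruned, frozen backbone."""
        log_p = F.log_softmax(scorer_logits, dim=-1)
        q = aa_attention_target / aa_attention_target.sum(dim=-1, keepdim=True)
        return F.kl_div(log_p, q, reduction="batchmean")

    def refinement_loss(pose, pose_gt, depth, depth_gt, pts, pts_gt,
                        w_pose=1.0, w_depth=1.0, w_pts=1.0):
        """Stage 2: refine the scorer under downstream camera, depth, and
        point-map losses computed through the pruned-then-restored pipeline."""
        return (w_pose * F.l1_loss(pose, pose_gt)
                + w_depth * F.l1_loss(depth, depth_gt)
                + w_pts * F.l1_loss(pts, pts_gt))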

What would settle it

A measurable increase in camera-pose error, depth error, or point-map error on ScanNet-50 when comparing the pruned model to the unpruned VGGT at the same effective N would falsify the claim that quality is preserved.
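
Stated as a check, with an illustrative tolerance that the paper does not specify:

    def quality_preserved(pruned, unpruned, tolerance=0.01):
        """Compare pose, depth, and point-map errors of the pruned and unpruned
        models at the same effective N; a relative increase beyond the tolerance
        on any metric would falsify the quality-preservation claim."""
        for key in ("pose_error", "depth_error", "pointmap_error"):
            if (pruned[key] - unpruned[key]) / unpruned[key] > tolerance:
                return False
        return True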

Figures

Figures reproduced from arXiv: 2605.08371 by Haotang Li, Huanrui Yang, Kebin Peng, Qing Guo, Sen He, Shaohan Henry Wang, Zhenyu Qi, Zi Wang.

Figure 1: Empirical observation for pre-AA token pruning in VGGT. PaceVGGT moves token reduc…
Figure 2: Overview of PaceVGGT. The Token Scorer assigns an importance score to every DINO…
Figure 3: Qualitative results. Top row: point cloud reconstructions on ScanNet-50. Bottom row: …
original abstract

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality--latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by \(5.1\times\) over unmodified VGGT at \(N=300\) and \(1.47\times\) over LiteVGGT at \(N=1000\). These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PaceVGGT, a pre-Alternating-Attention (pre-AA) token pruning framework for frozen Visual Geometry Transformers (VGGT). A lightweight Token Scorer is trained on DINO patch features, first distilled against AA-internal attention maps from the unpruned backbone and then refined under downstream camera, depth, and point-map losses. Tokens are pruned before the first AA block using a per-frame keep budget and importance-adaptive merge/prune assignment under a fixed total merge budget; a Feature-guided Restoration module reconstructs the dense spatial grid for the prediction heads. On ScanNet-50 and 7-Scenes, the method is claimed to remain on the reconstruction quality-latency frontier, delivering 5.1× latency reduction versus unmodified VGGT at N=300 and 1.47× versus LiteVGGT at N=1000.

Significance. If the empirical results hold under fuller validation, the work demonstrates that pre-AA pruning with a distilled lightweight scorer can accelerate frozen VGGT-style geometry transformers without retraining the backbone, addressing the quadratic scaling of the AA stack for longer sequences. This is a practical efficiency contribution for feed-forward 3D reconstruction pipelines, and the combination of attention distillation followed by task-specific refinement plus feature-guided restoration is a concrete engineering advance that could transfer to related transformer-based geometry models.

major comments (2)
  1. [Experimental evaluation and §3 (Token Scorer training)] The central quality-preservation claim rests on the Token Scorer's ability to rank tokens by geometric utility (camera, depth, point-map accuracy) rather than merely by AA-internal attention saliency. The manuscript provides no direct analysis or ablation (e.g., correlation between scorer outputs and per-token error contribution to downstream heads, or performance when using only the attention-distillation target versus the refined scorer) that would confirm transfer from the AA attention target to the final geometry tasks; this is load-bearing for the reported frontier results.
  2. [Abstract and results tables] Latency and quality numbers (5.1× at N=300 on ScanNet-50, 1.47× at N=1000) are reported without error bars, standard deviations across runs, or full experimental protocol details (random seeds, exact hardware, batching). This weakens confidence in the precise speedup factors and in the claim that quality is preserved on the frontier.
minor comments (2)
  1. [Method] Notation for the per-frame keep budget and total merge budget should be introduced with explicit symbols and constraints in the method section to clarify how the importance-adaptive assignment is computed.
  2. [§3.3] The Feature-guided Restoration module is described at a high level; a diagram or pseudocode showing how restored features are injected into the prediction heads would improve reproducibility.
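
A sketch of the kind of measurement protocol major comment 2 asks for, assuming a CUDA device and a callable model; the warmup count, run count, and use of wall-clock timing are choices made here, not the paper's documented setup.

    import time
    import statistics
    import torch

    def measure_latency(model, inputs, warmup=10, runs=30):
        """Warmup iterations, synchronization around each timed forward pass,
        and mean plus standard deviation over repeated runs on fixed hardware."""
        with torch.no_grad():
            for _ in range(warmup):
                model(*inputs)
            torch.cuda.synchronize()
            times = []
            for _ in range(runs):
                start = time.perf_counter()
                model(*inputs)
                torch.cuda.synchronize()
                times.append(time.perf_counter() - start)
        return statistics.mean(times), statistics.stdev(times)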

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing where the manuscript is incomplete and outlining the revisions we will make.

point-by-point responses
  1. Referee: [Experimental evaluation and §3 (Token Scorer training)] The central quality-preservation claim rests on the Token Scorer's ability to rank tokens by geometric utility (camera, depth, point-map accuracy) rather than merely by AA-internal attention saliency. The manuscript provides no direct analysis or ablation (e.g., correlation between scorer outputs and per-token error contribution to downstream heads, or performance when using only the attention-distillation target versus the refined scorer) that would confirm transfer from the AA attention target to the final geometry tasks; this is load-bearing for the reported frontier results.

    Authors: We agree that an explicit ablation isolating the contribution of the task-specific refinement stage (versus attention distillation alone) would strengthen the central claim. The current manuscript describes the two-stage training but does not report the downstream metrics for the distilled-only scorer. We will add this ablation to §3 and the experimental results, showing camera, depth, and point-map accuracy for both variants. We will also include a correlation analysis between scorer outputs and per-token reconstruction error contributions where the data permit. revision: yes

  2. Referee: [Abstract and results tables] Latency and quality numbers (5.1× at N=300 on ScanNet-50, 1.47× at N=1000) are reported without error bars, standard deviations across runs, or full experimental protocol details (random seeds, exact hardware, batching). This weakens confidence in the precise speedup factors and in the claim that quality is preserved on the frontier.

    Authors: We acknowledge that the absence of error bars, standard deviations, and complete protocol details reduces confidence in the exact reported factors. In the revised manuscript we will report standard deviations over multiple runs with documented random seeds, specify the exact hardware, batch sizes, and measurement methodology for all latency and quality numbers, and update the abstract and tables accordingly. revision: yes
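
The correlation analysis promised in response 1 could take a form like the following; the use of Spearman rank correlation and of a leave-one-token-out or gradient-based error attribution are assumptions, since the authors do not specify how per-token error contributions would be computed.

    from scipy.stats import spearmanr

    def scorer_error_correlation(scores, per_token_errors):
        """Rank-correlate scorer importance estimates with each token's
        contribution to downstream reconstruction error; a strongly negative
        rho would support the claim that high-scoring tokens are the ones
        the geometry heads actually need."""
        rho, p_value = spearmanr(scores.flatten(), per_token_errors.flatten())
        return rho, p_value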

Circularity Check

0 steps flagged

No significant circularity: training and evaluation remain independent of reported gains

full rationale

The paper's chain consists of (1) distilling a Token Scorer from AA-internal attention targets of the frozen backbone, (2) refining it under separate downstream losses, (3) applying a fixed-budget prune/merge plus Feature-guided Restoration, and (4) measuring latency and reconstruction metrics on ScanNet-50/7-Scenes. None of these steps builds the headline latency reductions (5.1×, 1.47×) into fitted parameters by construction, nor invokes self-cited uniqueness theorems or ansatzes that presuppose the final result. The central claim is therefore an empirical observation rather than a tautology.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The approach rests on a small number of new trained components and fixed budgets rather than many free parameters or invented physical entities.

free parameters (2)
  • per-frame keep budget
    Fixed hyperparameter controlling backbone-visible sequence length entering AA.
  • total merge budget
    Fixed hyperparameter governing importance-adaptive merge/prune assignment across frames.
axioms (1)
  • domain assumption: DINO patch features contain sufficient signal to train a scorer that ranks token importance for VGGT geometry tasks
    The Token Scorer takes DINO features as input and is distilled against AA attention.
invented entities (2)
  • Token Scorer · no independent evidence
    purpose: Lightweight network estimating per-token importance from DINO features
    New trained module introduced to enable pre-AA pruning.
  • Feature-guided Restoration module · no independent evidence
    purpose: Reconstructs the dense spatial grid required by prediction heads after pruning
    New module to compensate for token removal.

pith-pipeline@v0.9.0 · 5590 in / 1470 out tokens · 41993 ms · 2026-05-12T02:14:16.573019+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  2. [2]

    DUSt3R: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. DUSt3R: Geometric 3d vision made easy. In CVPR, 2024

  3. [3]

    Grounding image matching in 3d with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with MASt3R. In ECCV, 2024

  4. [4]

    Fast3R: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3d reconstruction of 1000+ images in one forward pass. arXiv preprint arXiv:2501.13928, 2025

  5. [5]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025

  6. [6]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  7. [7]

    Training data-efficient image transformers and distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers and distillation through attention. In International Conference on Machine Learning (ICML), pages 10347–10357, 2021

  8. [8]

    Swin Transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10012–10022, 2021

  9. [9]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 568–578, 2021

  10. [10]

    FastVGGT: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  11. [11]

    LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging

    Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, and Xiao-Xiao Long. LiteVGGT: Boosting vanilla VGGT via geometry-aware cached token merging. arXiv preprint arXiv:2512.04939, 2025

  12. [12]

    Co-me: Confidence-guided token merging for visual geometric transformers

    Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, and Sebastian Scherer. Co-me: Confidence-guided token merging for visual geometric transformers, 2025

  13. [13]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021

  14. [14]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022

  15. [15]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  16. [16]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In ICLR, 2023

  17. [17]

    Improving 2D Feature Representations by 3D-Aware Fine-Tuning

    Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2D Feature Representations by 3D-Aware Fine-Tuning. In European Conference on Computer Vision (ECCV), 2024

  18. [18]

    VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing VGGT's limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025

  19. [19]

    Streaming 4d visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

  20. [20]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  21. [21]

    DynamicViT: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. In NeurIPS, 2021

  22. [22]

    IA-RED2: Interpretability-aware redundancy reduction for vision transformers

    Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. IA-RED2: Interpretability-aware redundancy reduction for vision transformers. In Advances in Neural Information Processing Systems (NeurIPS), pages 24898–24911, 2021

  23. [23]

    A-ViT: Adaptive tokens for efficient vision transformer

    Hongxu Yin, Arash Vahdat, Jose M. Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-ViT: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10809–10818, 2022

  24. [24]

    Evo-ViT: Slow-fast token evolution for dynamic vision transformer

    Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-ViT: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022

  25. [25]

    Adaptive token sampling for efficient vision transformers

    Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Jürgen Gall. Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision (ECCV), pages 396–414, 2022

  26. [26]

    Not all patches are what you need: Expediting vision transformers via token reorganizations

    Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. In ICLR, 2022

  27. [27]

    TokenLearner: Adaptive space-time tokenization for videos

    Michael S. Ryoo, A. J. Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia Angelova. TokenLearner: Adaptive space-time tokenization for videos. In Advances in Neural Information Processing Systems (NeurIPS), pages 12786–12797, 2021

  28. [28]

    SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer

    Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. SparseViT: Revisiting activation sparsity for efficient high-resolution vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2061–2070, 2023

  29. [29]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013

  30. [30]

    ScanNet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

  31. [31]

    Real-time RGB-D camera relocalization

    Ben Glocker, Shahram Izadi, Jamie Shotton, and Antonio Criminisi. Real-time RGB-D camera relocalization. In International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, October 2013

  32. [32]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

  33. [33]

    Neural RGB-D surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In CVPR, 2022

  34. [34]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021
