Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Pith reviewed 2026-05-20 23:19 UTC · model grok-4.3
The pith
Asymmetric token reduction in attention layers speeds up 3D reconstruction by up to 28 times
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Applying distinct reduction factors with intra-group merging for queries and lightweight pruning for key-values, plus adaptive adjustment across layers, yields up to 28x speedup on 1000-frame inputs while maintaining competitive reconstruction quality.
What carries the argument
Asymmetric token reduction framework that decouples the compression rates for query tokens and key-value tokens according to their distinct roles in the attention mechanism.
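The decoupling described above can be illustrated with a minimal single-head NumPy sketch. It is not the authors' implementation: the function name `asymmetric_attention`, the factors `r_q` and `r_kv`, the mean-pooling merge over consecutive query groups, and the fixed-stride key-value pruning are all assumptions chosen for illustration, since this page does not specify the paper's merging or pruning criteria.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(q, k, v, r_q=2, r_kv=8):
    """Single-head attention with separate reduction factors (illustrative).

    q, k, v: arrays of shape (n, d); n must be divisible by r_q.
    r_q:  queries are averaged in consecutive groups of r_q (assumed merge rule).
    r_kv: every r_kv-th key/value token is kept (assumed pruning rule).
    """
    n, d = q.shape
    assert n % r_q == 0, "this sketch requires n divisible by r_q"
    # Mild reduction on the query side: merge each group into its mean.
    q_merged = q.reshape(n // r_q, r_q, d).mean(axis=1)
    # Aggressive reduction on the key-value side: keep a strided subset.
    k_kept, v_kept = k[::r_kv], v[::r_kv]
    scores = q_merged @ k_kept.T / np.sqrt(d)   # (n / r_q, ceil(n / r_kv))
    out_merged = softmax(scores) @ v_kept       # (n / r_q, d)
    # Unmerge: every original query position receives its group's output.
    return np.repeat(out_merged, r_q, axis=0)

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 4)) for _ in range(3))
out = asymmetric_attention(q, k, v, r_q=2, r_kv=8)
print(out.shape)  # one output row per original query token
```

Setting `r_q=1` and `r_kv=1` recovers ordinary full attention, which makes the sketch easy to check against an unreduced baseline.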
If this is right
- Up to 28 times faster processing for inputs with 1000 frames
- Compatible with multiple pretrained models including VGGT and π³ without any retraining
- Reconstruction quality stays competitive with the uncompressed versions
- Adaptive per-layer adjustment for key-value tokens improves the quality-speed balance
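A rough cost model shows why reducing both sides of global attention pays off. The sketch below counts query-key score entries only; the function names, the token count per frame, and the example factors are hypothetical, and the reported 28x figure is a measured speedup that this arithmetic does not attempt to reproduce.

```python
def attention_score_entries(n_tokens, r_q=1, r_kv=1):
    """Number of query-key score entries after reduction (ceil division)."""
    n_q = -(-n_tokens // r_q)    # queries remaining after merging
    n_kv = -(-n_tokens // r_kv)  # key-value tokens remaining after pruning
    return n_q * n_kv

def score_reduction(n_tokens, r_q, r_kv):
    """Ratio of full to reduced score-matrix size."""
    full = attention_score_entries(n_tokens)
    reduced = attention_score_entries(n_tokens, r_q, r_kv)
    return full / reduced

frames = 1000
tokens_per_frame = 1000  # hypothetical value, not taken from the paper
n = frames * tokens_per_frame
print(score_reduction(n, r_q=2, r_kv=8))  # 16.0
```

In this model the score matrix shrinks by the product of the two factors, so a large key-value factor can carry most of the saving while the query factor stays small.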
Where Pith is reading between the lines
- Similar role-based compression differences may exist in other transformer applications for vision or language
- Extending the method to even longer sequences or real-time video could be tested directly
- The approach might reduce memory use enough to run these models on consumer hardware
Load-bearing premise
The difference in compression tolerance between query tokens and key-value tokens stays reliable across the tested models and sequence lengths.
What would settle it
Measure reconstruction accuracy on a long video sequence with both the original model and its Spark3R-accelerated version, and check whether quality drops significantly at the reported speedup.
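One way to run such a check is to compare the point clouds produced by the two models with a Chamfer distance, the metric the referee report also names. The sketch below is a brute-force NumPy version under stated assumptions: the function name is hypothetical, both clouds are assumed to be already aligned in a common frame and scale, and the convention used (mean nearest-neighbour Euclidean distance, summed over both directions) is only one of several in use.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (n, 3) and b (m, 3).

    Brute force with O(n * m) memory, so suitable for small clouds only.
    """
    diff = a[:, None, :] - b[None, :, :]   # (n, m, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)   # (n, m) pairwise distances
    a_to_b = dist.min(axis=1).mean()       # each point in a to its nearest in b
    b_to_a = dist.min(axis=0).mean()       # each point in b to its nearest in a
    return a_to_b + b_to_a

baseline = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
accelerated = baseline + np.array([0.0, 0.1, 0.0])
print(chamfer_distance(baseline, accelerated))
```

For clouds with millions of points, a spatial index such as a k-d tree would replace the dense distance matrix.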
Original abstract
Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $\pi^3$, Depth-Anything-3, and VGGT-$\Omega$, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Spark3R, a training-free acceleration framework for feed-forward 3D reconstruction models based on Vision Transformers. It exploits an observed asymmetry in token roles: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and can tolerate aggressive compression. By applying distinct reduction strategies—intra-group merging for queries and pruning for KV, with layer-adaptive KV factors—it achieves up to 28× speedup on 1000-frame inputs across models like VGGT and π³ while maintaining competitive reconstruction quality.
Significance. If the identified property of differential compression sensitivity holds generally, Spark3R could enable scaling of feed-forward 3D reconstruction to long video sequences without retraining or significant quality loss, addressing a key bottleneck in quadratic attention costs. The plug-and-play integration and empirical speedups on multiple pretrained models represent a practical contribution to efficient 3D vision pipelines.
major comments (2)
- [§3.1–3.2] The core claim that query tokens are distinctly more sensitive to compression than KV tokens (enabling asymmetric reduction without retraining) is presented as an empirical property of pretrained models, yet the manuscript reports no quantitative ablation measuring reconstruction metrics (e.g., Chamfer distance, normal consistency) when applying separate reduction ratios to queries versus KV tokens, nor any sensitivity curves or error bars across scenes. This leaves the stability of the sensitivity gap insufficiently supported for the claimed training-free deployment.
- [§4.3, Table 2] The reported 28× speedup on 1000-frame inputs is given without per-scene variance or comparison against uniform token-merging baselines at equivalent total token budgets; it is therefore unclear whether the quality-efficiency tradeoff is driven by the asymmetric design or simply by overall token count reduction.
minor comments (2)
- [Abstract] Abstract and §1: The phrase 'competitive reconstruction quality' should be accompanied by the specific metrics and reference methods used for comparison.
- [§3] Notation in §3: Define the intra-group merging operation and the layer-adaptive KV factor selection heuristic more formally (e.g., via pseudocode or equations) to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical grounding of our claims without altering the core contributions.
- Referee: [§3.1–3.2] The core claim that query tokens are distinctly more sensitive to compression than KV tokens (enabling asymmetric reduction without retraining) is presented as an empirical property of pretrained models, yet the manuscript reports no quantitative ablation measuring reconstruction metrics (e.g., Chamfer distance, normal consistency) when applying separate reduction ratios to queries versus KV tokens, nor any sensitivity curves or error bars across scenes. This leaves the stability of the sensitivity gap insufficiently supported for the claimed training-free deployment.
Authors: We agree that dedicated quantitative ablations would provide stronger support for the observed asymmetry. In the revised manuscript we add a new set of experiments in §3.2 that plot Chamfer distance and normal consistency as functions of independent query and KV reduction ratios. The curves include error bars computed over 10 diverse scenes and confirm that query tokens remain markedly more sensitive across the tested range, while KV tokens tolerate higher compression with minimal degradation. These results directly substantiate the stability of the sensitivity gap and the suitability of the training-free approach.
Revision: yes
- Referee: [§4.3, Table 2] The reported 28× speedup on 1000-frame inputs is given without per-scene variance or comparison against uniform token-merging baselines at equivalent total token budgets; it is therefore unclear whether the quality-efficiency tradeoff is driven by the asymmetric design or simply by overall token count reduction.
Authors: We acknowledge the value of variance reporting and controlled baselines. We have updated Table 2 to include per-scene means and standard deviations for both quality metrics and runtime on the 1000-frame setting. In addition, we introduce a new comparison in §4.3 against uniform token merging performed at identical total token budgets; the results show that the asymmetric strategy consistently yields lower Chamfer distance than the uniform baseline at the same speedup, indicating that the quality-efficiency advantage is attributable to the differential treatment of queries and KV tokens rather than token count reduction alone.
Revision: yes
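What "identical total token budgets" means for the uniform baseline depends on how the budget is counted, and neither the report nor the rebuttal says which reading is intended. The sketch below works out the matching uniform factor under two assumed readings: equal score-matrix size (geometric mean of the two factors) and equal number of retained tokens (harmonic mean). The function name and the `budget` labels are hypothetical.

```python
import math

def matched_uniform_factor(r_q, r_kv, budget="scores"):
    """Uniform reduction factor r that matches an asymmetric (r_q, r_kv) setting.

    budget="scores": equal query-key score entries, (N/r)^2 = (N/r_q)(N/r_kv).
    budget="tokens": equal retained tokens, 2N/r = N/r_q + N/r_kv.
    """
    if budget == "scores":
        return math.sqrt(r_q * r_kv)
    if budget == "tokens":
        return 2 * r_q * r_kv / (r_q + r_kv)
    raise ValueError(f"unknown budget: {budget!r}")

print(matched_uniform_factor(2, 8))            # 4.0
print(matched_uniform_factor(2, 8, "tokens"))  # 3.2
```

The two readings give different baselines for the same asymmetric setting, so a controlled comparison needs to state which one it uses.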
Circularity Check
No significant circularity in derivation chain
The paper identifies an empirical distinction between query-token sensitivity and KV-token tolerance to compression as a guiding observation, then applies distinct reduction factors (intra-group merging on queries, pruning on KV, plus layer-adaptive KV factors) in a training-free manner. No equations or self-citations reduce the claimed speedup or quality retention to a fitted constant or input by construction; the performance numbers are measured outcomes on external pretrained models rather than tautological re-statements of the initial observation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression.