pith. sign in

arxiv: 2605.06270 · v2 · pith:6GRWJT5Ynew · submitted 2026-05-07 · 💻 cs.CV

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

Pith reviewed 2026-05-20 23:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords asymmetric token reduction3D reconstructionVision Transformersfeed-forward modelstoken compressionacceleration frameworkquery key-value roles
0
0 comments X

The pith

Asymmetric token reduction in attention layers speeds up 3D reconstruction by up to 28 times

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Feed-forward 3D reconstruction models based on Vision Transformers estimate scene geometry directly from images but face quadratic costs when processing hundreds of frames. The work identifies that query tokens, which request view-specific geometry, lose quality quickly under compression while key-value tokens, which provide shared scene information, can be reduced much more. By merging query tokens in groups and pruning key-value tokens at different rates, with the key-value rate changing per layer, the method delivers large efficiency gains. This training-free approach plugs into existing models and makes long-sequence reconstruction practical.

Core claim

Query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Applying distinct reduction factors with intra-group merging for queries and lightweight pruning for key-values, plus adaptive adjustment across layers, yields up to 28x speedup on 1000-frame inputs while maintaining competitive reconstruction quality.

What carries the argument

Asymmetric token reduction framework that decouples the compression rates for query tokens and key-value tokens according to their distinct roles in the attention mechanism.

If this is right

  • Up to 28 times faster processing for inputs with 1000 frames
  • Compatible with multiple pretrained models including VGGT and π³ without any retraining
  • Reconstruction quality stays competitive with the uncompressed versions
  • Adaptive per-layer adjustment for key-value tokens improves the quality-speed balance

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar role-based compression differences may exist in other transformer applications for vision or language
  • Extending the method to even longer sequences or real-time video could be tested directly
  • The approach might reduce memory use enough to run these models on consumer hardware

Load-bearing premise

The difference in compression tolerance between query tokens and key-value tokens stays reliable across the tested models and sequence lengths.

What would settle it

Measure the reconstruction accuracy on a long video sequence using both the original model and the Spark3R version to check if quality drops significantly despite the speedup.

Figures

Figures reproduced from arXiv: 2605.06270 by Haijie Li, Jian Zhang, Jiaqi Zhang, Jiaye Fu, Qiankun Gao, Siwei Ma, Yanmin Wu, Zecheng Tang.

Figure 1
Figure 1. Figure 1: Compression sensitivity of different token roles in VGGT. We separately compress query tokens (orange), key-value tokens (blue), and both jointly (red) at increasing reduction factors and report pose error (ATE ↓). Key-value tokens tolerate aggressive compression with negligible quality loss, while query tokens degrade sharply beyond a reduction factor of 12. Joint uniform compression yields the steepest c… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Spark3R. (Top) Spark3R applies asymmetric token reduction to the global attention layers of a feed-forward 3D reconstruction model, with separate reduction factors rQ and rKV (rKV > rQ in general). (Middle) A layer-adaptive key-value reduction schedule assigns each layer a large or small rKV based on its measured sensitivity to compression. (Bottom) Detailed illustration of the asymmetric reduc… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of inter-frame distances between merged source– view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of cosine similarities between matched source–destination view at source ↗
Figure 6
Figure 6. Figure 6: Per-layer sensitivity to key-value pruning in view at source ↗
Figure 5
Figure 5. Figure 5: Per-layer sensitivity to key-value pruning in VGGT (24 layers). Each [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison with unaccelerated base models. Each pair shows the original model and its Spark3R-accelerated counterpart. Spark3R preserves fine-grained geometric details and produces point clouds visually comparable to the unaccelerated baselines. Notably, for VGGT, Spark3R even improves the reconstruction quality by alleviating attention dilution on long sequences. FastVGGT TTT3R ZipMap Ours+𝜋 O… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison with other acceleration methods. FastVGGT produces blurred geometry, while TTT3R exhibits noticeable artifacts. ZipMap yields more complete results but still suffers from subtle structural distortions (e.g., misaligned door parts in the red dashed box). Spark3R+VGGT substantially sharpens the reconstruction over FastVGGT, and Spark3R applied to π 3 and DA3 further surpasses ZipMap wi… view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative comparison with unaccelerated baselines. Each pair shows the original model and its Spark3R-accelerated counterpart. Spark3R preserves fine-grained geometric details and produces point clouds visually comparable to the unaccelerated baselines. Notably, for VGGT, Spark3R even improves the reconstruction quality by alleviating attention dilution on long sequences. this regime. Merging vs. Pruning… view at source ↗
Figure 9
Figure 9. Figure 9: ATE and wall-clock merging time as a function of the group size view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison with other acceleration methods. The streaming methods CUT3R and TTT3R produce fragmented and noisy reconstructions. FastVGGT produces blurred geometry in the blue dashed box, while ZipMap exhibits subtle structural distortions in the red dashed box (i.e., misaligned door parts). Spark3R+VGGT sharpens the reconstruction over FastVGGT, and Spark3R applied to π 3 , DA3, and VGGT-Ω furt… view at source ↗
Figure 10
Figure 10. Figure 10: Merging vs. pruning for key-value tokens. Both strategies use the same temporal stride partitioning into source and destination tokens. Top: ATE as a function of the number of input frames; both achieve nearly identical pose error. Bottom: wall-clock token reduction time. Token merging grows superlinearly due to the bipartite similarity computation, while token pruning remains near zero throughout. wall-c… view at source ↗
Figure 9
Figure 9. Figure 9: ATE and wall-clock merging time as a function of the group size [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Merging vs. pruning for key-value tokens. Both strategies use the same temporal stride partitioning into source and destination tokens. Top: ATE as a function of the number of input frames; both achieve nearly identical pose error. Bottom: wall-clock token reduction time. Token merging grows superlinearly due to the bipartite similarity computation, while token pruning remains near zero throughout [PITH_… view at source ↗
read the original abstract

Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $\pi^3$, Depth-Anything-3, and VGGT-$\Omega$, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Spark3R, a training-free acceleration framework for feed-forward 3D reconstruction models based on Vision Transformers. It exploits an observed asymmetry in token roles: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and can tolerate aggressive compression. By applying distinct reduction strategies—intra-group merging for queries and pruning for KV, with layer-adaptive KV factors—it achieves up to 28× speedup on 1000-frame inputs across models like VGGT and π³ while maintaining competitive reconstruction quality.

Significance. If the identified property of differential compression sensitivity holds generally, Spark3R could enable scaling of feed-forward 3D reconstruction to long video sequences without retraining or significant quality loss, addressing a key bottleneck in quadratic attention costs. The plug-and-play integration and empirical speedups on multiple pretrained models represent a practical contribution to efficient 3D vision pipelines.

major comments (2)
  1. [§3.1–3.2] §3.1–3.2: The core claim that query tokens are distinctly more sensitive to compression than KV tokens (enabling asymmetric reduction without retraining) is presented as an empirical property of pretrained models, yet the manuscript reports no quantitative ablation measuring reconstruction metrics (e.g., Chamfer distance, normal consistency) when applying separate reduction ratios to queries versus KV tokens, nor any sensitivity curves or error bars across scenes. This leaves the stability of the sensitivity gap insufficiently supported for the claimed training-free deployment.
  2. [§4.3, Table 2] §4.3, Table 2: The reported 28× speedup on 1000-frame inputs is given without per-scene variance or comparison against uniform token-merging baselines at equivalent total token budgets; it is therefore unclear whether the quality-efficiency tradeoff is driven by the asymmetric design or simply by overall token count reduction.
minor comments (2)
  1. [Abstract] Abstract and §1: The phrase 'competitive reconstruction quality' should be accompanied by the specific metrics and reference methods used for comparison.
  2. [§3] Notation in §3: Define the intra-group merging operation and the layer-adaptive KV factor selection heuristic more formally (e.g., via pseudocode or equations) to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the empirical grounding of our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [§3.1–3.2] §3.1–3.2: The core claim that query tokens are distinctly more sensitive to compression than KV tokens (enabling asymmetric reduction without retraining) is presented as an empirical property of pretrained models, yet the manuscript reports no quantitative ablation measuring reconstruction metrics (e.g., Chamfer distance, normal consistency) when applying separate reduction ratios to queries versus KV tokens, nor any sensitivity curves or error bars across scenes. This leaves the stability of the sensitivity gap insufficiently supported for the claimed training-free deployment.

    Authors: We agree that dedicated quantitative ablations would provide stronger support for the observed asymmetry. In the revised manuscript we add a new set of experiments in §3.2 that plot Chamfer distance and normal consistency as functions of independent query and KV reduction ratios. The curves include error bars computed over 10 diverse scenes and confirm that query tokens remain markedly more sensitive across the tested range, while KV tokens tolerate higher compression with minimal degradation. These results directly substantiate the stability of the sensitivity gap and the suitability of the training-free approach. revision: yes

  2. Referee: [§4.3, Table 2] §4.3, Table 2: The reported 28× speedup on 1000-frame inputs is given without per-scene variance or comparison against uniform token-merging baselines at equivalent total token budgets; it is therefore unclear whether the quality-efficiency tradeoff is driven by the asymmetric design or simply by overall token count reduction.

    Authors: We acknowledge the value of variance reporting and controlled baselines. We have updated Table 2 to include per-scene means and standard deviations for both quality metrics and runtime on the 1000-frame setting. In addition, we introduce a new comparison in §4.3 against uniform token merging performed at identical total token budgets; the results show that the asymmetric strategy consistently yields lower Chamfer distance than the uniform baseline at the same speedup, indicating that the quality-efficiency advantage is attributable to the differential treatment of queries and KV tokens rather than token count reduction alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper identifies an empirical distinction between query-token sensitivity and KV-token tolerance to compression as a guiding observation, then applies distinct reduction factors (intra-group merging on queries, pruning on KV, plus layer-adaptive KV factors) in a training-free manner. No equations or self-citations reduce the claimed speedup or quality retention to a fitted constant or input by construction; the performance numbers are measured outcomes on external pretrained models rather than tautological re-statements of the initial observation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The report rests on the domain assumption that query and key-value tokens exhibit reliably different compression sensitivities in the cited feed-forward 3D models; no free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption Query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression.
    This property is stated as the guiding insight for the asymmetric reduction design.

pith-pipeline@v0.9.0 · 5815 in / 1265 out tokens · 55365 ms · 2026-05-20T23:19:00.393096+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 10 internal anchors

  1. [1]

    Structure-from-motion revisited,

    J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” inProc. of CVPR, 2016, pp. 4104–4113

  2. [2]

    Model reduction for real-time fluids , year =

    N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: exploring photo collections in 3d,”ACM Trans. Graph., vol. 25, no. 3, p. 835–846, Jul. 2006. [Online]. Available: https://doi.org/10.1145/1141911.1141964

  3. [3]

    Pixelwise view selection for unstructured multi-view stereo,

    J. L. Sch ¨onberger, E. Zheng, J.-M. Frahm, and M. Pollefeys, “Pixelwise view selection for unstructured multi-view stereo,” inProc. of ECCV. Springer, 2016, pp. 501–518

  4. [4]

    Mvsnet: Depth inference for unstructured multi-view stereo,

    Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan, “Mvsnet: Depth inference for unstructured multi-view stereo,” inProc. of ECCV, 2018, pp. 767– 783

  5. [5]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021

  6. [6]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering.”ACM Transactions on Graphics, vol. 42, no. 4, pp. 139–1, 2023

  7. [7]

    Scaffold- gs: Structured 3d gaussians for view-adaptive rendering,

    T. Lu, M. Yu, L. Xu, Y . Xiangli, L. Wang, D. Lin, and B. Dai, “Scaffold- gs: Structured 3d gaussians for view-adaptive rendering,” inProc. of CVPR, 2024, pp. 20 654–20 664

  8. [8]

    HiCoM: Hierarchical coherent motion for dynamic streamable scenes with 3D gaussian splatting,

    Q. Gao, J. Meng, C. Wen, J. Chen, and J. Zhang, “HiCoM: Hierarchical coherent motion for dynamic streamable scenes with 3D gaussian splatting,” inProc. of NeurIPS, 2024

  9. [9]

    Recon- gs: Continuum-preserved gaussian streaming for fast and compact re- construction of dynamic scenes,

    J. Fu, Q. Gao, C. Wen, Y . Wu, S. Ma, J. Zhang, and J. Zhang, “Recon- gs: Continuum-preserved gaussian streaming for fast and compact re- construction of dynamic scenes,” inProc. of NeurIPS, 2025

  10. [10]

    Virpnet: A multimodal virtual point generation network for 3d object detection,

    L. Wang, S. Sun, and J. Zhao, “Virpnet: A multimodal virtual point generation network for 3d object detection,”IEEE Transactions on Multimedia, vol. 26, pp. 10 597–10 609, 2024

  11. [11]

    Robo3R: Enhancing Robotic Manipulation with Accurate Feed-Forward 3D Reconstruction

    S. Yang, L. Xu, H. Li, J. Mu, J. Zeng, D. Lin, and J. Pang, “Robo3r: Enhancing robotic manipulation with accurate feed-forward 3d recon- struction,”arXiv preprint arXiv:2602.10101, 2026

  12. [12]

    Language-assisted 3d scene understanding,

    Y . Wu, Q. Gao, R. Zhang, H. Li, and J. Zhang, “Language-assisted 3d scene understanding,”IEEE Transactions on Multimedia, vol. 27, pp. 3869–3879, 2025

  13. [13]

    3ur-llm: An end- to-end multimodal large language model for 3d scene understanding,

    H. Xiong, Y . Zhuge, J. Zhu, L. Zhang, and H. Lu, “3ur-llm: An end- to-end multimodal large language model for 3d scene understanding,” IEEE Transactions on Multimedia, vol. 27, pp. 2899–2911, 2025

  14. [14]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProc. of CVPR, 2025

  15. [15]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Y . Wang, J. Zhou, H. Zhu, W. Chang, Y . Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “π 3: Scalable permutation-equivariant visual geometry learning,”arXiv preprint arXiv:2507.13347, 2025

  16. [16]

    Depth Anything 3: Recovering the Visual Space from Any Views

    H. Lin, S. Chen, J. H. Liew, D. Y . Chen, Z. Li, G. Shi, J. Feng, and B. Kang, “Depth anything 3: Recovering the visual space from any views,”arXiv preprint arXiv:2511.10647, 2025

  17. [17]

    J. Wang, M. Chen, S. Zhang, N. Karaev, J. Sch ¨onberger, P. Labatut, P. Bojanowski, D. Novotny, A. Vedaldi, and C. Rupprecht, “Vggt-Ω,” inProc. of CVPR, 2026

  18. [18]

    Continuous 3d perception model with persistent state,

    Q. Wang, Y . Zhang, A. Holynski, A. A. Efros, and A. Kanazawa, “Continuous 3d perception model with persistent state,” inProc. of CVPR, 2025, pp. 10 510–10 522

  19. [19]

    TTT3R: 3D Reconstruction as Test-Time Training

    X. Chen, Y . Chen, Y . Xiu, A. Geiger, and A. Chen, “Ttt3r: 3d reconstruction as test-time training,”arXiv preprint arXiv:2509.26645, 2025

  20. [20]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    S. Yuan, Y . Yang, X. Yang, X. Zhang, Z. Zhao, L. Zhang, and Z. Zhang, “Infinitevggt: Visual geometry grounded transformer for endless streams,”arXiv preprint arXiv:2601.02281, 2026

  21. [21]

    VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

    K. Deng, Z. Ti, J. Xu, J. Yang, and J. Xie, “Vggt-long: Chunk it, loop it, align it–pushing vggt’s limits on kilometer-scale long rgb sequences,” arXiv preprint arXiv:2507.16443, 2025

  22. [22]

    Laser: Layer-wise scale alignment for training-free streaming 4d reconstruction,

    T. Ding, Y . Xie, Y . Liang, M. Chatterjee, P. Miraldo, and H. Jiang, “Laser: Layer-wise scale alignment for training-free streaming 4d reconstruction,” 2026. [Online]. Available: https://arxiv.org/abs/2512. 13680

  23. [23]

    FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    Y . Shen, Z. Zhang, Y . Qu, X. Zheng, J. Ji, S. Zhang, and L. Cao, “Fastvggt: Training-free acceleration of visual geometry transformer,” arXiv preprint arXiv:2509.02560, 2025

  24. [24]

    Litevggt: Boosting vanilla vggt via geometry-aware cached token merging

    Z. Shu, C. Lin, T. Xie, W. Yin, B. Li, Z. Pu, W. Li, Y . Yao, X. Cao, X. Guo, and X.-X. Long, “Litevggt: Boosting vanilla vggt via geometry-aware cached token merging,” 2025. [Online]. Available: https://arxiv.org/abs/2512.04939

  25. [25]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inProc. of CVPR, 2024, pp. 20 697– 20 709

  26. [26]

    Grounding image matching in 3d with mast3r,

    V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” inProc. of ECCV. Springer, 2024, pp. 71–91

  27. [27]

    Point3R: Streaming 3D reconstruction with explicit spatial pointer memory.arXiv preprint arXiv:2507.02863, 2025

    Y . Wu, W. Zheng, J. Zhou, and J. Lu, “Point3r: Streaming 3d reconstruction with explicit spatial pointer memory,” 2025. [Online]. Available: https://arxiv.org/abs/2507.02863

  28. [28]

    Streaming 4D Visual Geometry Transformer

    D. Zhuo, W. Zheng, J. Guo, Y . Wu, J. Zhou, and J. Lu, “Streaming 4d visual geometry transformer,” 2026. [Online]. Available: https://arxiv.org/abs/2507.11539

  29. [29]

    Vgg-t 3: Offline feed-forward 3d reconstruction at scale,

    S. Elflein, R. Li, S. Agostinho, Z. Gojcic, L. Leal-Taix ´e, Q. Zhou, and A. Osep, “Vgg-t 3: Offline feed-forward 3d reconstruction at scale,”

  30. [30]

    Available: https://arxiv.org/abs/2602.23361

    [Online]. Available: https://arxiv.org/abs/2602.23361

  31. [31]

    Zipmap: Linear-time stateful 3d reconstruction via test-time training,

    H. Jin, R. Wu, T. Zhang, R. Gao, J. T. Barron, N. Snavely, and A. Holynski, “Zipmap: Linear-time stateful 3d reconstruction via test-time training,” 2026. [Online]. Available: https://arxiv.org/abs/2603. 04385

  32. [32]

    Test-Time Training Done Right

    T. Zhang, S. Bi, Y . Hong, K. Zhang, F. Luan, S. Yang, K. Sunkavalli, W. T. Freeman, and H. Tan, “Test-time training done right,” 2025. [Online]. Available: https://arxiv.org/abs/2505.23884

  33. [33]

    Attention Is All You Need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023. [Online]. Available: https://arxiv.org/abs/1706.03762

  34. [34]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness,

    T. Dao, D. Y . Fu, S. Ermon, A. Rudra, and C. R ´e, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” inProc. of NeurIPS, 2022

  35. [35]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”Transactions on Machine Learning Research Journal, pp. 1–31, 2024

  36. [36]

    Vision transformers for dense prediction,

    R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inProc. of CVPR, 2021, pp. 12 179–12 188

  37. [37]

    Token merging: Your ViT but faster,

    D. Bolya, C.-Y . Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your ViT but faster,” inICLR, 2023

  38. [38]

    Token merging for fast stable diffusion,

    D. Bolya and J. Hoffman, “Token merging for fast stable diffusion,”

  39. [39]

    Bolya and J

    [Online]. Available: https://arxiv.org/abs/2303.17604

  40. [40]

    Fast kv compaction via attention matching.arXiv preprint arXiv:2602.16284, 2026

    A. Zweiger, X. Fu, H. Guo, and Y . Kim, “Fast kv compaction via attention matching,” 2026. [Online]. Available: https://arxiv.org/abs/ 2602.16284

  41. [41]

    Scene coordinate regression forests for camera relocalization in rgb-d images,

    J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgib- bon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” inProc. of CVPR, 2013, pp. 2930–2937. 12

  42. [42]

    Neural rgb-d surface reconstruction,

    D. Azinovi ´c, R. Martin-Brualla, D. B. Goldman, M. Nießner, and J. Thies, “Neural rgb-d surface reconstruction,” inProc. of CVPR, 2022, pp. 6290–6301

  43. [43]

    A benchmark for the evaluation of rgb-d slam systems,

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” inProc. of IROS. IEEE, 2012, pp. 573–580

  44. [44]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes,

    A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” inProc. of CVPR, 2017, pp. 5828–5839

  45. [45]

    A naturalistic open source movie for optical flow evaluation,

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” inProc. of ECCV. Springer, 2012, pp. 611–625

  46. [46]

    Re- fusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals,

    E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, “Re- fusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals,” inProc. of IROS. IEEE, 2019, pp. 7855–7862

  47. [47]

    Vision meets robotics: The kitti dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013. Zecheng Tangis currently pursuing the M.S. degree in computer science and technology with Peking University Shenzhen Graduate School, Shen- zhen, China. His research interests incl...