ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
Pith reviewed 2026-05-15 16:15 UTC · model grok-4.3
The pith
ZipMap zips entire image collections into a compact hidden state in one pass to enable linear-time 3D reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than 20× faster than state-of-the-art methods such as VGGT, while matching or surpassing their accuracy in bidirectional 3D reconstruction.
What carries the argument
Test-time training layers that zip the image collection into a compact hidden scene state in a single forward pass, carrying the state for subsequent reconstruction steps.
Load-bearing premise
Test-time training layers can compress an arbitrary image collection into a compact hidden state that preserves reconstruction accuracy without any post-hoc tuning or scene-specific assumptions.
What would settle it
Running ZipMap on a large, diverse image collection and finding that the 3D reconstruction error is substantially higher than that of quadratic-time methods like VGGT.
Original abstract
Feed-forward transformer models have driven rapid progress in 3D vision, but state-of-the-art methods such as VGGT and $\pi^3$ have a computational cost that scales quadratically with the number of input images, making them inefficient when applied to large image collections. Sequential-reconstruction approaches reduce this cost but sacrifice reconstruction quality. We introduce ZipMap, a stateful feed-forward model that achieves linear-time, bidirectional 3D reconstruction while matching or surpassing the accuracy of quadratic-time methods. ZipMap employs test-time training layers to zip an entire image collection into a compact hidden scene state in a single forward pass, enabling reconstruction of over 700 frames in under 10 seconds on a single H100 GPU, more than $20\times$ faster than state-of-the-art methods such as VGGT. Moreover, we demonstrate the benefits of having a stateful representation in real-time scene-state querying and its extension to sequential streaming reconstruction.
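The scaling contrast in the abstract can be illustrated with a rough cost model. This is a back-of-the-envelope sketch, not a measurement and not the paper's own analysis: it assumes global attention costs one unit per token pair and that a stateful model does a fixed amount of work per token against a fixed-size state. The names `tokens_per_image` and `state_cost` are hypothetical placeholders, not values from the paper.

```python
def quadratic_cost(num_images: int, tokens_per_image: int) -> int:
    """Token-pair interactions under global attention: (N * P) ** 2."""
    total_tokens = num_images * tokens_per_image
    return total_tokens * total_tokens


def linear_cost(num_images: int, tokens_per_image: int, state_cost: int) -> int:
    """Every token does `state_cost` units of work against a fixed-size state."""
    return num_images * tokens_per_image * state_cost


def cost_ratio(num_images: int, tokens_per_image: int, state_cost: int) -> float:
    """How many times more work the quadratic model does than the linear one."""
    return quadratic_cost(num_images, tokens_per_image) / linear_cost(
        num_images, tokens_per_image, state_cost
    )


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the trend as N grows.
    for n in (50, 100, 200, 400, 800):
        print(n, cost_ratio(n, tokens_per_image=100, state_cost=50))
```

Under these assumptions the ratio simplifies to `N * P / state_cost`, so the gap between the two models grows in proportion to the number of images. Real speedups also depend on kernels, memory traffic, and constant factors that this model ignores.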
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ZipMap, a stateful feed-forward transformer model that uses test-time training layers to compress an arbitrary collection of images into a compact hidden scene state in a single forward pass. This enables linear-time bidirectional 3D reconstruction that matches or exceeds the accuracy of quadratic-cost methods such as VGGT, with reported performance of over 700 frames in under 10 seconds on a single H100 GPU (more than 20× faster), plus extensions to real-time scene-state querying and sequential streaming reconstruction.
Significance. If the central claims hold, the work would represent a meaningful advance in scalable 3D vision by removing the quadratic bottleneck of attention-based feed-forward reconstructors while preserving bidirectional accuracy. The stateful hidden representation could enable new streaming and interactive applications. The use of test-time training layers as the compression mechanism is the architectural innovation, and the authors are credited for it.
major comments (2)
- [Section 3] Section 3 (architecture): the description of the test-time training layers does not supply an explicit recurrence relation or compression operator whose per-image cost is strictly O(1) independent of collection size N while guaranteeing preservation of long-range geometric constraints; without this, the linear-time claim and the fixed-size hidden state as a lossless summary remain unverified.
- [Abstract] Abstract and experimental sections: the specific speed and accuracy claims (700 frames <10 s, parity with quadratic baselines) are presented without error bars, ablations on hidden-state capacity versus reconstruction error, or quantitative tables comparing to VGGT/π³ on standard metrics; these omissions make it impossible to assess whether the compact state truly supports the bidirectional guarantee.
minor comments (2)
- [Abstract] The abstract states 'matching or surpassing the accuracy' but does not name the evaluation metrics or the precise baselines used for the 20× speedup comparison.
- [Section 3] Notation for the hidden scene state size and its independence from N should be formalized with an equation or pseudocode in Section 3.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We agree that clarifying the architectural details and strengthening the experimental validation will improve the manuscript. We address each major comment below and will incorporate the suggested changes in the revised version.
- Referee: [Section 3] Section 3 (architecture): the description of the test-time training layers does not supply an explicit recurrence relation or compression operator whose per-image cost is strictly O(1) independent of collection size N while guaranteeing preservation of long-range geometric constraints; without this, the linear-time claim and the fixed-size hidden state as a lossless summary remain unverified.
  Authors: We thank the referee for this observation. In the revised manuscript we will augment Section 3 with an explicit recurrence relation for the test-time training layers: the hidden state is updated as h_t = Compress(h_{t-1}, x_t; θ), where Compress is realized by a fixed-size MLP-based operator whose per-image cost is strictly O(1) with respect to total collection size N. Because the state dimension is constant, the overall complexity remains linear in N. Long-range geometric constraints are preserved by the bidirectional querying mechanism that reads from the same compact state; we will add a short proof sketch showing that the operator maintains the necessary cross-view consistency invariants. These additions will make the linear-time and stateful claims fully verifiable.
  revision: yes
- Referee: [Abstract] Abstract and experimental sections: the specific speed and accuracy claims (700 frames <10 s, parity with quadratic baselines) are presented without error bars, ablations on hidden-state capacity versus reconstruction error, or quantitative tables comparing to VGGT/π³ on standard metrics; these omissions make it impossible to assess whether the compact state truly supports the bidirectional guarantee.
  Authors: We agree that the current experimental presentation is insufficient for rigorous assessment. In the revised version we will (i) report all timing and accuracy numbers with error bars computed over multiple runs, (ii) add an ablation study that varies hidden-state capacity and plots the resulting reconstruction error, and (iii) include a full quantitative comparison table against VGGT and π³ using standard metrics (PSNR, SSIM, camera-pose error, etc.). These results will be placed in the experimental section and will directly substantiate that the compact state supports bidirectional reconstruction at the claimed accuracy.
  revision: yes
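The recurrence h_t = Compress(h_{t-1}, x_t; θ) quoted in the first response can be sketched in a few lines. This is a minimal toy under loud assumptions, not ZipMap's operator: the state is a single linear map rather than the MLP the rebuttal mentions, each input is a hypothetical (key, value) pair, and the update is one gradient step on a squared reconstruction loss, in the general spirit of test-time training layers. All function names are illustrative.

```python
from typing import Iterable, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def init_state(dim: int) -> Matrix:
    """Fixed-size state: a dim x dim weight matrix, independent of N."""
    return [[0.0] * dim for _ in range(dim)]


def read(state: Matrix, query: Vector) -> Vector:
    """Query the state with a matrix-vector product, O(dim^2)."""
    return [sum(w * q for w, q in zip(row, query)) for row in state]


def compress(state: Matrix, key: Vector, value: Vector, lr: float = 1.0) -> Matrix:
    """One update h_t = Compress(h_{t-1}, x_t).

    Takes a gradient step on 0.5 * ||state @ key - value||^2.
    The cost is O(dim^2) and does not depend on t or on N.
    """
    error = [p - v for p, v in zip(read(state, key), value)]
    return [
        [w - lr * e * k for w, k in zip(row, key)]
        for row, e in zip(state, error)
    ]


def zip_collection(inputs: Iterable[Tuple[Vector, Vector]], dim: int) -> Matrix:
    """Fold a whole collection into the state with one pass, O(N * dim^2)."""
    state = init_state(dim)
    for key, value in inputs:
        state = compress(state, key, value)
    return state


if __name__ == "__main__":
    # Toy data with orthonormal keys, so each value is stored without interference.
    keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    final_state = zip_collection(zip(keys, values), dim=3)
    # Every query reads the final state, which has seen all inputs.
    print(read(final_state, keys[0]))
```

The toy recovers stored values exactly only because the keys are orthonormal and the state has as many dimensions as there are inputs. Once the number of inputs exceeds the state's capacity, or keys overlap, later updates interfere with earlier ones. That is the regime the referee's requested ablation on hidden-state capacity versus reconstruction error would probe.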
Circularity Check
No circularity: architectural claim is independent of inputs
The paper introduces ZipMap as a feed-forward architecture that uses test-time training layers to produce a fixed-size hidden scene state from an arbitrary image collection. No equations, fitted parameters, or self-citations are shown that would make the linear-time bidirectional reconstruction equivalent to the input by construction. The compression into a compact state and the resulting O(N) scaling are presented as consequences of the chosen layer design rather than a renaming or re-derivation of the quadratic baselines. Empirical timing claims (700 frames in <10s) are external performance assertions, not tautological outputs of the method definition itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Test-time training layers can compress an arbitrary collection of images into a compact hidden state that supports accurate bidirectional reconstruction.
invented entities (1)
- Compact hidden scene state: no independent evidence
Forward citations
Cited by 4 Pith papers
- TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention
  TurboVGGT uses adaptive sparse global attention with varying sparsity levels across frames and layers plus frame attention to enable faster multi-view 3D reconstruction while keeping competitive quality versus prior s...
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
  Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
- Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
  The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...