
arxiv: 2605.23889 · v1 · pith:SZUGHSIVnew · submitted 2026-05-22 · 💻 cs.CV

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming 3D reconstruction · long-horizon attention · geometric propagation · causal transformer · online reconstruction · evidence influence kernel · spatiotemporal attention

The pith

HorizonStream factorizes geometric evidence influence to enable stable streaming 3D reconstruction over sequences exceeding 10,000 frames with constant memory and linear time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the drift, jitter, and collapse that occur when existing methods reconstruct 3D scenes from continuous video streams under causal and bounded-memory constraints. It traces these failures to uniform influence patterns that cannot handle the mix of short-lived and persistent geometric evidence. HorizonStream addresses this by defining geometric propagation as an evidence influence kernel and splitting that kernel into a long-range temporal part and a short-range spatial part. The resulting architecture trains on short 48-frame clips yet runs stably on sequences over 10,000 frames, with fixed memory and total compute that grows only linearly with sequence length. If the approach holds, streaming 3D reconstruction could become reliable for extended real-world operation without restarts or unbounded resources.

Core claim

HorizonStream formalizes geometric propagation as an evidence influence kernel and explicitly factorizes it into independent components. The long-range temporal factor uses Geometric Linear Attention that learns channel-wise decay rates for bounded, multi-timescale propagation of evidence. The short-range spatial factor employs Geometric Local Attention with Spatiotemporal RoPE to perform reliable 3D matching while avoiding attention sinks. Metric Readout Tokens then extract stable scale and rigid pose from the accumulated geometric state. This design yields state-of-the-art performance on long sequences after training only on short clips.
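To make the long-range factor concrete, here is a minimal NumPy sketch of a linear-attention recurrence with channel-wise retention, following the update S_t = diag(γ_t) S_{t−1} + φ(k_t) v_t^T that this page quotes from the paper further down. It is an illustration under stated assumptions, not the authors' implementation: the feature map (ELU + 1), the readout `S.T @ feature_map(q)`, the projection matrices, and the use of a plain value vector in place of the paper's ṽ_t are all hypothetical choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def feature_map(x):
    # Assumed positive feature map (ELU + 1); the paper's phi may differ.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def stream_linear_attention(xs, Wq, Wk, Wv, Wg, bg):
    """Process a stream frame by frame with a fixed-size recurrent state.

    xs: (T, d) inputs. Wq, Wk, Wg: (d_k, d). Wv: (d_v, d). bg: (d_k,).
    Returns the per-step outputs (T, d_v) and the final state (d_k, d_v).
    """
    d_k, d_v = Wk.shape[0], Wv.shape[0]
    S = np.zeros((d_k, d_v))  # state size does not depend on T
    outputs = []
    for x in xs:
        q, k, v = Wq @ x, Wk @ x, Wv @ x
        gamma = sigmoid(Wg @ x + bg)  # per-channel retention in (0, 1)
        # S_t = diag(gamma_t) S_{t-1} + phi(k_t) v_t^T
        S = gamma[:, None] * S + np.outer(feature_map(k), v)
        outputs.append(S.T @ feature_map(q))  # assumed readout
    return np.stack(outputs), S
```

Because the state `S` has a fixed shape, memory use in this sketch is independent of stream length, and each step costs the same amount of work, which is the sense in which total time is linear.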

What carries the argument

The evidence influence kernel, factorized into Geometric Linear Attention for long-range temporal propagation with channel-wise decay rates and Geometric Local Attention with Spatiotemporal RoPE for short-range spatial matching, together with Metric Readout Tokens for scale and pose recovery.

If this is right

  • Streaming 3D reconstruction scales to arbitrary sequence lengths without drift or collapse.
  • Models trained on short clips generalize directly to long real-world streams.
  • Memory remains constant and total compute grows linearly with sequence length rather than quadratically.
  • Causal, bounded-memory constraints no longer force accuracy trade-offs on extended inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same kernel-factorization pattern could stabilize other long-horizon causal vision tasks such as online SLAM or video depth estimation.
  • Constant-memory operation would make continuous 3D mapping practical on resource-limited platforms like mobile robots.
  • Testing the factorization on non-rigid or highly dynamic scenes would reveal whether the temporal-spatial split remains sufficient.

Load-bearing premise

Geometric propagation can be usefully formalized as an evidence influence kernel that factorizes cleanly into independent long-range temporal and short-range spatial components without loss of critical 3D consistency information.

What would settle it

Measure reconstruction error and memory usage on a held-out 15,000-frame sequence; the claim fails if error grows with length or memory exceeds the constant bound observed on shorter training clips.
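One way such a test could be scored is sketched below. This is a hypothetical checker, not a protocol from the paper: the function names, the linear-fit criterion for "error grows with length", and both thresholds (`slope_tol`, `mem_bound_mb`) are assumptions a reader would have to set for their own setup.

```python
import numpy as np


def error_growth_slope(frame_idx, errors):
    """Least-squares slope of error against frame index."""
    x = np.asarray(frame_idx, dtype=float)
    y = np.asarray(errors, dtype=float)
    slope, _intercept = np.polyfit(x, y, deg=1)
    return float(slope)


def long_horizon_check(frame_idx, errors, peak_mem_mb, mem_bound_mb, slope_tol):
    """Return True when error shows no upward trend and memory stays bounded.

    slope_tol and mem_bound_mb are hypothetical thresholds chosen by the tester.
    """
    slope_ok = error_growth_slope(frame_idx, errors) <= slope_tol
    mem_ok = float(np.max(peak_mem_mb)) <= mem_bound_mb
    return bool(slope_ok and mem_ok)
```

A single linear fit is a coarse criterion; windowed error statistics would catch late collapse that a global slope can average away.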

Figures

Figures reproduced from arXiv: 2605.23889 by Chong Cheng, Guanzhi Ding, Hao Wang, Nanjie Yao, Peilin Tao, Qian Zhang, Weiqiang Ren, Wei Yin, Xianda Chen, Xiaoyang Guo, Yuansen Du, Zhengqing Chen.

Figure 1: Geometric evidence influence patterns on KITTI and long-sequence scaling on VBR. Prior …
Figure 2: Visualization of long-range streaming 3D reconstruction across diverse scenes. Our …
Figure 3: Overview of HorizonStream. Given an RGB stream, the model causally processes the most recent W frames. Geometric Local Attention handles local matching, Geometric Linear Attention propagates long-range geometry with an O(1) recurrent geometric state, and Metric Readout Tokens recover stable scale and pose. An optional loop-closure module refines the trajectory. To systematically address these requirements, …
Figure 4: Qualitative comparison on long-sequence 3D reconstruction. As sequence length grows, existing methods show pose degradation, drift, or collapse. Lingbot-map exhibits progressively stronger pose jitter over longer rollouts, while HorizonStream maintains stable pose estimation. Metric Readout Tokens (MRT) and relative pose fusion. Long streaming reconstruction requires metric scale and pose to remain consist…
Figure 5: Qualitative comparison on 3D reconstruction. Left: trajectory. Right: 3D reconstruction. HorizonStream maintains stable geometry. Lingbot-map preserves trajectory direction but exhibits increasing jitter, causing point cloud overlap. driving, large-scale reconstruction, and synthetic environments, including ScanNet++ [53], Hypersim [29], Replica [34], 7Scenes [33], ARKitScenes [1], WildRGB-D [49], Waymo […
Figure 6: (a) Learned retention spectra in Geometric Linear Attention. Effective lifetimes τ = −1/log γ̄ vary across channels and layers. Layer 4 exhibits broad mid-range retention, while Layer 17 develops a sharper long-retention tail, supporting channel-wise multi-timescale propagation. (b) Retention-band ablation. Replacing the learned channel-wise retention spectrum with fixed short-, medium-, or long-horizon b…
Figure 7: Memory and runtime scaling. HorizonStream keeps peak memory nearly constant and …
Figure 8: Training convergence under different attention mechanisms for cross-window propagation. …
Figure 9: Effect of loop closure on long sequences. Loop closure reduces ATE on sequences with …
Figure 10: Failure cases on ultra-long sequences. Ground-truth trajectory (red), online prediction …
Figure 11: Linear probing of frozen Geometric Linear …

Original abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
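The contrast the abstract draws between hard cutoffs and bounded decay can be pictured with two toy temporal influence profiles. The sketch below is illustrative only: it assumes a constant retention `gamma` per step (the paper's retention is learned and channel-wise), and the window size and ages are arbitrary values chosen for the example.

```python
import numpy as np


def window_influence(age, window):
    """Sliding window: full influence inside the window, none outside it."""
    return (np.asarray(age) < window).astype(float)


def decay_influence(age, gamma):
    """Constant per-step retention gamma gives influence gamma ** age."""
    return gamma ** np.asarray(age, dtype=float)


ages = np.array([0, 10, 100, 1000])  # frames elapsed since the evidence arrived
print("window  :", window_influence(ages, window=32))
print("slow ch.:", decay_influence(ages, gamma=0.999))
print("fast ch.:", decay_influence(ages, gamma=0.9))
```

With several retention values side by side, old evidence fades at different rates per channel instead of vanishing at a single cutoff, which is the multi-timescale behaviour the abstract describes.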

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes HorizonStream, a Transformer for online 3D reconstruction that formalizes geometric propagation via an evidence influence kernel explicitly factorized into (1) Geometric Linear Attention with learned per-channel decay rates for long-range multi-timescale temporal propagation, (2) Geometric Local Attention augmented by Spatiotemporal RoPE for short-range 3D matching, and (3) Metric Readout Tokens that recover scale and rigid pose from the persistent state. The central claim is that a model trained only on 48-frame clips generalizes stably to sequences exceeding 10,000 frames while maintaining constant memory and linear time complexity, outperforming prior streaming methods.

Significance. If the generalization and factorization claims hold with supporting experiments, the work would be significant for causal, bounded-memory 3D reconstruction on long sequences. The explicit multi-timescale decay mechanism and separation of temporal and spatial factors address a recognized mismatch between uniform attention patterns and heterogeneous geometric evidence lifetimes. The project page link indicates potential for reproducibility checks.

major comments (2)
  1. [Abstract] Abstract: the claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.
  2. [Abstract] Abstract: the factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'pathological influence patterns' is used without a precise definition or citation to the specific failure modes in sliding-window or causal-attention baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback. We address the two major comments on the abstract below, providing clarifications from the full manuscript while noting opportunities for revision.

  1. Referee: [Abstract] Abstract: the claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.

    Authors: We agree the abstract is a high-level summary and omits specific numbers. The full manuscript (Sections 4–5) includes quantitative results: tables reporting reconstruction metrics (e.g., absolute trajectory error and depth accuracy) on sequences exceeding 10,000 frames, ablation studies on training clip length, and plots of cumulative drift versus sequence length demonstrating stable generalization with O(1) memory. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., “maintaining <X cm ATE on 10k-frame sequences”) to improve evaluability while preserving conciseness. revision: yes

  2. Referee: [Abstract] Abstract: the factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.

    Authors: The abstract summarizes the approach; the full manuscript (Section 3) formally defines the evidence influence kernel K(t,s) and derives its factorization into the product of a channel-wise linear temporal term (with per-channel decay) and a local spatiotemporal term (with RoPE). The derivation shows that the product preserves the required separation of timescales while the Metric Readout Tokens recover coupled scale-pose quantities from the persistent state. A short invariance sketch appears in the supplementary material. We can add a one-sentence reference to this derivation in the abstract if desired, but the core mathematical argument is already in the body. revision: partial

Circularity Check

0 steps flagged

No circularity: claims rest on architectural design and experiments, not self-definition or fitted inputs


The abstract presents the evidence influence kernel factorization and HorizonStream architecture as a proposed solution to identified failure modes (drift, cache saturation), with the long-horizon generalization claim supported by training on 48-frame clips and reported experimental outcomes on longer sequences. No equations, self-referential definitions, or self-citations appear that would reduce the claimed performance or factorization to an input by construction. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing prior results from the same authors.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

Review is based solely on the abstract; full details on parameters, assumptions, and entities are unavailable. The ledger reflects only what is explicitly named in the abstract.

free parameters (1)
  • channel-wise decay rates
    Learned rates in Geometric Linear Attention for multi-timescale propagation.
axioms (2)
  • domain assumption: Streaming geometry is temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale.
    This is presented as the root cause of failures in existing methods.
  • ad hoc to paper: Geometric propagation can be formalized as an evidence influence kernel that can be explicitly factorized into long-range temporal and short-range spatial factors.
    This formalization is the stated foundation for the HorizonStream design.
invented entities (3)
  • Geometric Linear Attention (no independent evidence)
    purpose: Enable bounded multi-timescale propagation of geometric evidence
    New attention mechanism introduced for the long-range factor.
  • Geometric Local Attention with Spatiotemporal RoPE (no independent evidence)
    purpose: Perform reliable 3D matching while suppressing attention sinks
    New attention variant for the short-range factor.
  • Metric Readout Tokens (no independent evidence)
    purpose: Recover stable scale and rigid pose from persistent geometric state
    Introduced component for metric readout.

pith-pipeline@v0.9.0 · 5806 in / 1508 out tokens · 49635 ms · 2026-05-25T04:31:40.693278+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

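    We formalize geometric propagation as an evidence influence kernel... $K(t,i) = K_{\mathrm{spatial}}(t,i) \cdot K_{\mathrm{time}}(t,i)$. ... channel-wise retention vector $\gamma_t = \sigma(W_\gamma x_t + b_\gamma) \in (0,1)^d$, $S_t = \operatorname{diag}(\gamma_t)\, S_{t-1} + \phi(k_t)\, \tilde{v}_t^\top$. ... $\tau^{(c)} = -1/\log \bar{\gamma}^{(c)}$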
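    The lifetime formula at the end of the quoted passage, τ = −1/log γ̄, is easy to evaluate by hand. The short sketch below does so for a few mean retention values; the helper name `effective_lifetime` and the sample values are illustrative choices, not taken from the paper.

```python
import math


def effective_lifetime(mean_gamma):
    """Effective lifetime tau = -1 / log(mean_gamma), in steps (frames)."""
    if not 0.0 < mean_gamma < 1.0:
        raise ValueError("mean retention must lie strictly between 0 and 1")
    return -1.0 / math.log(mean_gamma)


for g in (0.5, 0.9, 0.99, 0.999):
    print(f"gamma={g}: tau={effective_lifetime(g):.1f} frames")
```

    A channel with mean retention 0.9 forgets within roughly ten frames, while one at 0.999 retains evidence for on the order of a thousand frames, so small changes in retention near 1 correspond to large changes in horizon.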

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 28 canonical work pages · 17 internal anchors

  1. [1]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. arXiv preprint arXiv:2111.08897, 2021

  2. [2]

    VBR: A vision benchmark in Rome

    Leonardo Brizi, Emanuele Giacomini, Luca Di Giammarino, Simone Ferrari, Omar Salem, Lorenzo De Rebotti, and Giorgio Grisetti. VBR: A vision benchmark in Rome. arXiv preprint arXiv:2404.11322, 2024

  3. [3]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020

  4. [4]

    ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM

    Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6), 2021

  5. [5]

    Geometric Context Transformer for Streaming 3D Reconstruction

    Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, et al. Geometric context transformer for streaming 3d reconstruction. arXiv preprint arXiv:2604.14141, 2026

  6. [6]

    TTT3R: 3D Reconstruction as Test-Time Training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  7. [7]

    Reggs: Unposed sparse views gaussian splatting with 3dgs registration

    Chong Cheng, Yu Hu, Sicheng Yu, Beizhen Zhao, Zijian Wang, and Hao Wang. Reggs: Unposed sparse views gaussian splatting with 3dgs registration, 2025. URL https://arxiv.org/abs/2507.08136

  8. [8]

    Graph-guided scene reconstruction from images with 3d gaussian splatting

    Chong Cheng, Gaochao Song, Yiyang Yao, Qinzheng Zhou, Gangjian Zhang, and Hao Wang. Graph-guided scene reconstruction from images with 3d gaussian splatting, 2025. URL https://arxiv.org/abs/2502.17377

  9. [9]

    Unposed 3dgs reconstruction with probabilistic procrustes mapping

    Chong Cheng, Zijian Wang, Sicheng Yu, Yu Hu, Nanjie Yao, and Hao Wang. Unposed 3dgs reconstruction with probabilistic procrustes mapping, 2025. URL https://arxiv.org/abs/2507.18541

  10. [10]

    Outdoor monocular slam with global scale-consistent 3d gaussian pointmaps

    Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, and Hao Wang. Outdoor monocular slam with global scale-consistent 3d gaussian pointmaps, 2025. URL https://arxiv.org/abs/2507.03737

  11. [11]

    LongStream: Long-sequence streaming autoregressive visual geometry

    Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, and Hao Wang. LongStream: Long-sequence streaming autoregressive visual geometry. In CVPR, 2026

  12. [12]

    VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences, 2025. URL https://arxiv.org/abs/2507.16443

  13. [13]

    Are we ready for autonomous driving? The KITTI Vision Benchmark Suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In CVPR, 2012

  14. [14]

    When attention sink emerges in language models: An empirical view

    Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In ICLR, 2025

  15. [15]

    Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction

    Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction, 2025. URL https://arxiv.org/abs/2511.19971

  16. [16]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francois Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020

  17. [17]

    Mapanything: Universal feed-forward metric 3d reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstruction. URL https://arxiv.org/abs/2509.13414

  19. [19]

    Stream3r: Scalable sequential 3d reconstruction with causal transformer

    Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025. URL https://arxiv.org/abs/2508.10893

  20. [20]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024

  21. [21]

    MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond

    Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. In ICCV, 2023

  22. [22]

    MegaDepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018

  23. [23]

    DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In CVPR, 2024

  24. [24]

    Test-Time Training with KV Binding Is Secretly Linear Attention

    Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, and Ruilong Li. Test-time training with kv binding is secretly linear attention, 2026. URL https://arxiv.org/abs/2602.21204

  25. [25]

    VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. arXiv preprint arXiv:2505.12549, 2025

  26. [26]

    MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors

    Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  28. [28]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025

  29. [29]

    Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021

  30. [30]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021

  31. [31]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016

  32. [32]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017

  33. [33]

    FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

    You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  34. [34]

    Scene coordinate regression forests for camera relocalization in RGB-D images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013

  35. [35]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces. arXiv preprint arXiv:1906.05797, 2019

  36. [36]

    A benchmark for the evaluation of RGB-D SLAM systems

    Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012

  37. [37]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 2024

  38. [38]

    Scalability in perception for autonomous driving: Waymo Open Dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. In CVPR, 2020

  39. [39]

    The Oxford Spires dataset: Benchmarking large-scale LiDAR-visual localisation, reconstruction and radiance field methods

    Yifu Tao, Miguel Ángel Muñoz-Bañón, Lintong Zhang, Jiahao Wang, Lanke Frank Tarimo Fu, and Maurice Fallon. The Oxford Spires dataset: Benchmarking large-scale LiDAR-visual localisation, reconstruction and radiance field methods. International Journal of Robotics Research, 2025

  40. [40]

    DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

    Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In NeurIPS, 2021

  41. [41]

    Deep patch visual odometry

    Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. In NeurIPS, 2023

  42. [42]

    3D reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In 3DV, 2025

  43. [43]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025

  44. [44]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025

  45. [45]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024

  46. [46]

    TartanAir: A dataset to push the limits of visual SLAM

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IROS, 2020

  47. [47]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning, 2025. URL https://arxiv.org/abs/2507.13347

  48. [48]

    Mapillary street-level sequences: A dataset for lifelong place recognition

    Frederik Warburg, Søren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In CVPR, 2020

  49. [49]

    Point3R: Streaming 3D reconstruction with explicit spatial pointer memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025

  50. [50]

    RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos

    Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos. In CVPR, 2024

  51. [51]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024

  52. [52]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule, 2025. URL https://arxiv.org/abs/2412.06464

  53. [53]

    BlendedMVS: A large-scale dataset for generalized multi-view stereo networks

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020

  54. [54]

    ScanNet++: A high-fidelity dataset of 3D indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023

  55. [55]

    Rgb-only gaussian splatting slam for unbounded outdoor scenes

    Sicheng Yu, Chong Cheng, Yifan Zhou, Xiaojun Yang, and Hao Wang. Rgb-only gaussian splatting slam for unbounded outdoor scenes, 2025. URL https://arxiv.org/abs/2502.15633

  56. [56]

    InfiniteVGGT: Visual geometry grounded transformer for endless streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  57. [57]

    MonST3R: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. In ICLR, 2025

  58. [58]

    LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory, 2026. URL https://arxiv.org/abs/2603.03269

  59. [59]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

  60. [60]

    PointOdyssey: A large-scale synthetic dataset for long-term point tracking

    Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023

  61. [61]

    Omniworld: A multi-domain and multi-modal dataset for 4d world modeling

    Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling, 2025. URL https://arxiv.org/abs/2509.12201

  62. [62]

    Streaming 4D Visual Geometry Transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer, 2026. URL https://arxiv.org/abs/2507.11539

A Geometric Attention Dilution

We formalize why causal softmax attention cannot serve as long-range cross-window memory in streaming 3D reconstruction. Let Ωt denote the set of 3D points visible at t...