HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3
The pith
HorizonStream factorizes geometric evidence influence to enable stable streaming 3D reconstruction over sequences exceeding 10,000 frames with constant memory and linear time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HorizonStream formalizes geometric propagation as an evidence influence kernel and explicitly factorizes it into independent components. The long-range temporal factor uses Geometric Linear Attention that learns channel-wise decay rates for bounded, multi-timescale propagation of evidence. The short-range spatial factor employs Geometric Local Attention with Spatiotemporal RoPE to perform reliable 3D matching while avoiding attention sinks. Metric Readout Tokens then extract stable scale and rigid pose from the accumulated geometric state. This design yields state-of-the-art performance on long sequences after training only on short clips.
What carries the argument
The evidence influence kernel, factorized into Geometric Linear Attention for long-range temporal propagation with channel-wise decay rates and Geometric Local Attention with Spatiotemporal RoPE for short-range spatial matching, together with Metric Readout Tokens for scale and pose recovery.
If this is right
- Streaming 3D reconstruction scales to arbitrary sequence lengths without drift or collapse.
- Models trained on short clips generalize directly to long real-world streams.
- Memory remains constant and total compute scales linearly with sequence length rather than quadratically.
- Causal, bounded-memory constraints no longer force accuracy trade-offs on extended inputs.
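The constant-memory point can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only: the function names, the token count per frame, and the dimensions are hypothetical and not taken from the paper. It compares a causal-attention key/value cache, which grows with every frame, against a fixed-size recurrent state matrix of the kind linear attention maintains (per layer and per head).

```python
def kv_cache_floats(num_frames, tokens_per_frame, dim):
    """Floats stored by a causal-attention KV cache after num_frames frames.

    Keys and values are both kept for every past token, hence the factor 2.
    """
    return 2 * num_frames * tokens_per_frame * dim


def recurrent_state_floats(key_dim, value_dim):
    """Floats stored by a fixed-size linear-attention state matrix.

    The count does not depend on how many frames have been processed.
    """
    return key_dim * value_dim


# Hypothetical sizes, chosen only to show the trend.
for frames in (48, 1_000, 10_000):
    cache = kv_cache_floats(frames, tokens_per_frame=256, dim=64)
    state = recurrent_state_floats(key_dim=64, value_dim=64)
    print(f"frames={frames:>6}  kv_cache={cache:>12}  recurrent_state={state}")
```

A bounded local attention window would add a second term that is also independent of the total frame count, so the overall picture stays the same under these assumptions.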
Where Pith is reading between the lines
- The same kernel-factorization pattern could stabilize other long-horizon causal vision tasks such as online SLAM or video depth estimation.
- Constant-memory operation would make continuous 3D mapping practical on resource-limited platforms like mobile robots.
- Testing the factorization on non-rigid or highly dynamic scenes would reveal whether the temporal-spatial split remains sufficient.
Load-bearing premise
Geometric propagation can be usefully formalized as an evidence influence kernel that factorizes cleanly into independent long-range temporal and short-range spatial components without loss of critical 3D consistency information.
What would settle it
Measure reconstruction error and memory usage on a held-out 15,000-frame sequence; the claim fails if error grows with length or memory exceeds the constant bound observed on shorter training clips.
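One way to turn this test into a pass/fail check is sketched below. It is a minimal sketch under stated assumptions: the function names, the slope tolerance, and the sample numbers are hypothetical, and the paper may use a different drift criterion. The idea is to fit a straight line to error versus sequence length and to compare peak memory against the bound measured on short clips.

```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def claim_survives(lengths, errors, memory_bytes, memory_bound, slope_tol):
    """True if error does not trend upward with length and memory stays bounded.

    slope_tol is an assumed tolerance on error growth per frame.
    """
    error_is_flat = least_squares_slope(lengths, errors) <= slope_tol
    memory_is_bounded = max(memory_bytes) <= memory_bound
    return error_is_flat and memory_is_bounded


# Hypothetical measurements at four prefix lengths of one long sequence.
lengths = [1_000, 5_000, 10_000, 15_000]
flat_errors = [0.10, 0.11, 0.10, 0.11]
drifting_errors = [0.10, 0.50, 1.00, 1.50]
memory = [512, 512, 512, 512]

print(claim_survives(lengths, flat_errors, memory, memory_bound=512, slope_tol=1e-6))
print(claim_survives(lengths, drifting_errors, memory, memory_bound=512, slope_tol=1e-6))
```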
Original abstract
Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
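The contrast the abstract draws between influence patterns can be illustrated with three toy kernels. This is a simplified sketch, not the paper's formulation: each function returns the weight that evidence from `lag` frames ago still carries, and the window size and decay rates are hypothetical values.

```python
def sliding_window_influence(lag, window):
    """Hard cutoff: full weight inside the window, none outside it."""
    return 1.0 if 0 <= lag < window else 0.0


def ungated_influence(lag):
    """Ungated recurrence: every past frame keeps full weight forever."""
    return 1.0 if lag >= 0 else 0.0


def decayed_influence(lag, gamma):
    """Per-channel exponential decay with retention gamma in (0, 1)."""
    return gamma ** lag if lag >= 0 else 0.0


# A fast channel (gamma=0.9) and a slow channel (gamma=0.999) together
# cover short-lived and persistent evidence.
for lag in (1, 47, 48, 1_000, 10_000):
    print(
        lag,
        sliding_window_influence(lag, window=48),
        ungated_influence(lag),
        round(decayed_influence(lag, gamma=0.9), 6),
        round(decayed_influence(lag, gamma=0.999), 6),
    )
```

Under this toy model, the sliding window drops evidence abruptly at the window edge, the ungated sum never forgets and so accumulates without bound, and the decayed kernels fade smoothly at channel-specific rates.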
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes HorizonStream, a Transformer for online 3D reconstruction that formalizes geometric propagation via an evidence influence kernel explicitly factorized into (1) Geometric Linear Attention with learned per-channel decay rates for long-range multi-timescale temporal propagation, (2) Geometric Local Attention augmented by Spatiotemporal RoPE for short-range 3D matching, and (3) Metric Readout Tokens that recover scale and rigid pose from the persistent state. The central claim is that a model trained only on 48-frame clips generalizes stably to sequences exceeding 10,000 frames while maintaining constant memory and linear time complexity, outperforming prior streaming methods.
Significance. If the generalization and factorization claims hold with supporting experiments, the work would be significant for causal, bounded-memory 3D reconstruction on long sequences. The explicit multi-timescale decay mechanism and separation of temporal and spatial factors address a recognized mismatch between uniform attention patterns and heterogeneous geometric evidence lifetimes. The project page link indicates potential for reproducibility checks.
major comments (2)
- [Abstract] The claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.
- [Abstract] The factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.
minor comments (1)
- [Abstract] The phrase 'pathological influence patterns' is used without a precise definition or citation to the specific failure modes in sliding-window or causal-attention baselines.
Simulated Author's Rebuttal
We thank the referee for their detailed feedback. We address the two major comments on the abstract below, providing clarifications from the full manuscript while noting opportunities for revision.
Point-by-point responses
- Referee: [Abstract] The claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.
  Authors: We agree the abstract is a high-level summary and omits specific numbers. The full manuscript (Sections 4–5) includes quantitative results: tables reporting reconstruction metrics (e.g., absolute trajectory error and depth accuracy) on sequences exceeding 10,000 frames, ablation studies on training clip length, and plots of cumulative drift versus sequence length demonstrating stable generalization with O(1) memory. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., “maintaining <X cm ATE on 10k-frame sequences”) to improve evaluability while preserving conciseness.
  Revision: yes
- Referee: [Abstract] The factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.
  Authors: The abstract summarizes the approach; the full manuscript (Section 3) formally defines the evidence influence kernel K(t,s) and derives its factorization into the product of a channel-wise linear temporal term (with per-channel decay) and a local spatiotemporal term (with RoPE). The derivation shows that the product preserves the required separation of timescales while the Metric Readout Tokens recover coupled scale-pose quantities from the persistent state. A short invariance sketch appears in the supplementary material. We can add a one-sentence reference to this derivation in the abstract if desired, but the core mathematical argument is already in the body.
  Revision: partial
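For readers without access to Section 3, the temporal term can be worked out from the state recurrence quoted elsewhere on this page, $S_t=\mathrm{diag}(\gamma_t)S_{t-1}+\phi(k_t)\tilde v_t^\top$. The following is a sketch under assumptions, not the authors' derivation: it assumes $S_0=0$, writes $\gamma_r(c)$ for the $c$-th component of $\gamma_r$, and identifies the resulting weight with the temporal factor of $K(t,s)$, which the page does not confirm.

```latex
% Unrolling the recurrence from S_0 = 0 (products are taken per channel):
S_t = \sum_{s=1}^{t} \mathrm{diag}\!\Big(\prod_{r=s+1}^{t} \gamma_r\Big)\, \phi(k_s)\, \tilde v_s^{\top}

% Weight that evidence written at step s still carries at step t, in channel c:
K_{\mathrm{time}}^{(c)}(t,s) = \prod_{r=s+1}^{t} \gamma_r(c) \in (0,1]

% With a constant retention \bar\gamma(c), the weight is an exponential in the lag:
K_{\mathrm{time}}^{(c)}(t,s) = \bar\gamma(c)^{\,t-s}
  = \exp\!\Big(-\frac{t-s}{\tau(c)}\Big),
\qquad \tau(c) = -\frac{1}{\log \bar\gamma(c)}

% Total accumulated weight is bounded by a geometric series, independent of t:
\sum_{s=1}^{t} \bar\gamma(c)^{\,t-s} \le \frac{1}{1-\bar\gamma(c)}
```

The last line is what makes the propagation bounded in this simplified setting: no channel can accumulate more than $1/(1-\bar\gamma(c))$ units of weight, however long the stream runs.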
Circularity Check
No circularity: claims rest on architectural design and experiments, not self-definition or fitted inputs
Full rationale
The abstract presents the evidence influence kernel factorization and HorizonStream architecture as a proposed solution to identified failure modes (drift, cache saturation), with the long-horizon generalization claim supported by training on 48-frame clips and reported experimental outcomes on longer sequences. No equations, self-referential definitions, or self-citations appear that would reduce the claimed performance or factorization to an input by construction. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing prior results from the same authors.
Axiom & Free-Parameter Ledger
free parameters (1)
- channel-wise decay rates
axioms (2)
- domain assumption: Streaming geometry is temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale.
- ad hoc to paper: Geometric propagation can be formalized as an evidence influence kernel that can be explicitly factorized into long-range temporal and short-range spatial factors.
invented entities (3)
- Geometric Linear Attention: no independent evidence
- Geometric Local Attention with Spatiotemporal RoPE: no independent evidence
- Metric Readout Tokens: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: echoes)
  Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Paper passage: "We formalize geometric propagation as an evidence influence kernel... $K(t,i)=K_{\mathrm{spatial}}(t,i)\cdot K_{\mathrm{time}}(t,i)$. ... channel-wise retention vector $\gamma_t=\sigma(W_\gamma x_t+b_\gamma)\in(0,1)^d$, $S_t=\mathrm{diag}(\gamma_t)S_{t-1}+\phi(k_t)\tilde v_t^\top$. ... $\tau(c)=-1/\log\bar\gamma(c)$"
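The recurrence in the quoted passage can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the paper's implementation: the feature map φ is taken to be ELU + 1 (a common choice in linear attention), the value vector is used as-is in place of the passage's ṽ, and all function names, weights, and sizes are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature_map(z):
    """ELU(z) + 1, a positive feature map; an assumed choice for phi."""
    return z + 1.0 if z > 0 else math.exp(z)


def retention(x, W_gamma, b_gamma):
    """gamma_t = sigmoid(W_gamma x_t + b_gamma): one value in (0, 1) per channel."""
    return [sigmoid(sum(w * xj for w, xj in zip(row, x)) + b)
            for row, b in zip(W_gamma, b_gamma)]


def update_state(S, x, k, v, W_gamma, b_gamma):
    """S_t = diag(gamma_t) S_{t-1} + phi(k_t) v_t^T; the state keeps its shape."""
    gamma = retention(x, W_gamma, b_gamma)
    phi_k = [feature_map(kc) for kc in k]
    return [[g * s + p * vj for s, vj in zip(row, v)]
            for g, p, row in zip(gamma, phi_k, S)]


def time_constant(gamma_bar):
    """tau = -1 / log(gamma_bar): frames until influence decays by a factor of e."""
    return -1.0 / math.log(gamma_bar)


# Toy run: 2 channels, 1 value dimension. Zero gate weights make gamma depend
# only on the bias, giving a fast channel (0.5) and a slow one (about 0.982).
W_gamma = [[0.0], [0.0]]
b_gamma = [0.0, 4.0]
S = [[0.0], [0.0]]
for _ in range(2000):
    S = update_state(S, x=[1.0], k=[0.0, 0.0], v=[1.0],
                     W_gamma=W_gamma, b_gamma=b_gamma)

print("state:", S)
print("time constants:", [time_constant(sigmoid(b)) for b in b_gamma])
```

With a constant unit input, each channel settles at 1 / (1 - gamma) instead of growing without limit, and the state matrix has the same shape after 2,000 steps as after one. That is the bounded, multi-timescale behavior the passage describes, shown here only for this toy configuration.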
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.