HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

Chong Cheng; Guanzhi Ding; Hao Wang; Nanjie Yao; Peilin Tao; Qian Zhang; Weiqiang Ren; Wei Yin; Xianda Chen; Xiaoyang Guo

arxiv: 2605.23889 · v1 · pith:SZUGHSIVnew · submitted 2026-05-22 · 💻 cs.CV

Chong Cheng, Peilin Tao, Nanjie Yao, Guanzhi Ding, Xianda Chen, Yuansen Du, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Zhengqing Chen, Hao Wang

Pith reviewed 2026-05-25 04:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: streaming 3D reconstruction, long-horizon attention, geometric propagation, causal transformer, online reconstruction, evidence influence kernel, spatiotemporal attention

The pith

HorizonStream factorizes geometric evidence influence to enable stable streaming 3D reconstruction over sequences exceeding 10,000 frames with constant memory and linear time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix drift, jitter, and collapse that occur when existing methods try to reconstruct 3D scenes from continuous video streams under causal and memory limits. It identifies the root cause as uniform influence patterns that cannot handle the mix of short-lived and persistent geometric evidence. HorizonStream addresses this by defining geometric propagation as an evidence influence kernel and splitting that kernel into a long-range temporal part and a short-range spatial part. The resulting architecture trains on short 48-frame clips yet runs stably on sequences over 10,000 frames while using fixed memory and linear compute per frame. If the approach holds, streaming 3D reconstruction could become reliable for extended real-world operation without restarts or unbounded resources.

Core claim

HorizonStream formalizes geometric propagation as an evidence influence kernel and explicitly factorizes it into independent components. The long-range temporal factor uses Geometric Linear Attention that learns channel-wise decay rates for bounded, multi-timescale propagation of evidence. The short-range spatial factor employs Geometric Local Attention with Spatiotemporal RoPE to perform reliable 3D matching while avoiding attention sinks. Metric Readout Tokens then extract stable scale and rigid pose from the accumulated geometric state. This design yields state-of-the-art performance on long sequences after training only on short clips.
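
As a rough illustration of the long-range temporal factor, the sketch below implements one recurrent step of a linear-attention state with channel-wise retention, following the recurrence quoted from the paper further down this page (γ_t = σ(W_γ x_t + b_γ), S_t = diag(γ_t) S_{t−1} + φ(k_t) ṽ_tᵀ). The feature map `phi`, the readout `S.T @ phi(q)`, and all shapes are assumptions made for illustration; the paper's actual layer may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    # Positive feature map elu(x) + 1 (assumed; borrowed from standard linear attention).
    return np.where(x > 0, x + 1.0, np.exp(x))

def gla_step(S, x, q, k, v, W_gamma, b_gamma):
    """One recurrent step; the state S has shape (d_k, d_v) and never grows."""
    gamma = sigmoid(W_gamma @ x + b_gamma)        # (d_k,), every entry in (0, 1)
    S = gamma[:, None] * S + np.outer(phi(k), v)  # diag(gamma) @ S + phi(k) v^T
    out = S.T @ phi(q)                            # assumed readout, shape (d_v,)
    return S, out

# Hypothetical sizes, chosen only for the demo.
rng = np.random.default_rng(0)
d_x, d_k, d_v = 8, 4, 3
W_gamma, b_gamma = rng.normal(size=(d_k, d_x)), np.zeros(d_k)
S = np.zeros((d_k, d_v))
for _ in range(1000):
    S, out = gla_step(S, rng.normal(size=d_x), rng.normal(size=d_k),
                      rng.normal(size=d_k), rng.normal(size=d_v), W_gamma, b_gamma)
print(S.shape)  # the state keeps its (d_k, d_v) shape however many frames are processed
```

Because each channel has its own retention value, some rows of the state forget quickly while others persist, which is the multi-timescale behaviour the core claim describes.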

What carries the argument

The evidence influence kernel, factorized into Geometric Linear Attention for long-range temporal propagation with channel-wise decay rates and Geometric Local Attention with Spatiotemporal RoPE for short-range spatial matching, together with Metric Readout Tokens for scale and pose recovery.
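
To see why channel-wise decay gives bounded, multi-timescale influence, the recurrence quoted from the paper, S_t = diag(γ_t) S_{t−1} + φ(k_t) ṽ_tᵀ, can be unrolled. The derivation below is a sketch that assumes S_0 = 0 and, in the last step only, a retention held constant at γ̄(c) in channel c; [γ_j]_c denotes the c-th entry of γ_j.

```latex
S_t \;=\; \sum_{i=1}^{t} \Big( \prod_{j=i+1}^{t} \mathrm{diag}(\gamma_j) \Big)\, \phi(k_i)\, \tilde{v}_i^{\top}

\text{weight of frame } i \text{ at time } t \text{ in channel } c:\quad
\prod_{j=i+1}^{t} [\gamma_j]_c
\;\overset{[\gamma_j]_c \,\equiv\, \bar{\gamma}(c)}{=}\;
\bar{\gamma}(c)^{\,t-i}
\;=\; \exp\!\Big(-\frac{t-i}{\tau(c)}\Big),
\qquad
\tau(c) \;=\; -\frac{1}{\log \bar{\gamma}(c)}
```

Every weight lies in (0, 1), so old evidence is attenuated rather than cut off, and each channel's lifetime τ(c) is set by its own retention value.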

If this is right

Streaming 3D reconstruction scales to arbitrary sequence lengths without drift or collapse.
Models trained on short clips generalize directly to long real-world streams.
Memory remains constant and per-frame compute scales linearly rather than quadratically.
Causal, bounded-memory constraints no longer force accuracy trade-offs on extended inputs.
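
The constant-memory implication can be checked with back-of-the-envelope arithmetic. The sketch below counts stored floats for a full causal KV cache versus a window-local cache plus a fixed recurrent state; every size and the single-head accounting are hypothetical simplifications, not figures from the paper.

```python
def kv_cache_floats(num_frames, tokens_per_frame, d_model, num_layers):
    """Floats in a full causal KV cache: keys and values for every past token."""
    return 2 * num_frames * tokens_per_frame * d_model * num_layers

def recurrent_state_floats(window, tokens_per_frame, d_model, d_k, d_v, num_layers):
    """Floats in a window-local KV cache plus one (d_k x d_v) state per layer."""
    local_cache = 2 * window * tokens_per_frame * d_model * num_layers
    state = d_k * d_v * num_layers
    return local_cache + state

# Hypothetical configuration, for illustration only.
for frames in (48, 1_000, 10_000):
    full = kv_cache_floats(frames, tokens_per_frame=256, d_model=512, num_layers=12)
    bounded = recurrent_state_floats(window=8, tokens_per_frame=256, d_model=512,
                                     d_k=64, d_v=64, num_layers=12)
    print(frames, full, bounded)
```

The first count grows in proportion to the number of frames, while the second does not depend on sequence length at all.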

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same kernel-factorization pattern could stabilize other long-horizon causal vision tasks such as online SLAM or video depth estimation.
Constant-memory operation would make continuous 3D mapping practical on resource-limited platforms like mobile robots.
Testing the factorization on non-rigid or highly dynamic scenes would reveal whether the temporal-spatial split remains sufficient.

Load-bearing premise

Geometric propagation can be usefully formalized as an evidence influence kernel that factorizes cleanly into independent long-range temporal and short-range spatial components without loss of critical 3D consistency information.

What would settle it

Measure reconstruction error and memory usage on a held-out 15,000-frame sequence; the claim fails if error grows with length or memory exceeds the constant bound observed on shorter training clips.
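
One way to operationalize this test is sketched below: fit a least-squares slope to per-frame error and compare peak memory against a reference taken on short clips. The function names, the 5% tolerance, and the synthetic inputs are all hypothetical; the paper does not prescribe this procedure.

```python
import numpy as np

def drift_slope(errors):
    """Least-squares slope of per-frame error against frame index."""
    errors = np.asarray(errors, dtype=float)
    t = np.arange(len(errors), dtype=float)
    slope, _intercept = np.polyfit(t, errors, 1)
    return float(slope)

def memory_is_bounded(peak_memory, reference_peak, tolerance=0.05):
    """True if every memory sample stays within (1 + tolerance) of the reference."""
    return bool(np.max(peak_memory) <= (1.0 + tolerance) * reference_peak)

# Synthetic stand-ins for measurements on a 15,000-frame sequence.
frames = np.arange(15_000)
stable = np.full(frames.shape, 0.05)   # flat error: no drift
drifting = 0.05 + 1e-5 * frames        # error that grows with sequence length
print(drift_slope(stable), drift_slope(drifting))
```

A slope indistinguishable from zero together with a passing memory check would be consistent with the claim; a clearly positive slope or a failed memory check would count against it.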

Figures

Figures reproduced from arXiv: 2605.23889 by Chong Cheng, Guanzhi Ding, Hao Wang, Nanjie Yao, Peilin Tao, Qian Zhang, Weiqiang Ren, Wei Yin, Xianda Chen, Xiaoyang Guo, Yuansen Du, Zhengqing Chen.

**Figure 2.** Visualization of long-range streaming 3D reconstruction across diverse scenes. Our …

**Figure 3.** Overview of HorizonStream. Given an RGB stream, the model causally processes the most recent W frames. Geometric Local Attention handles local matching, Geometric Linear Attention propagates long-range geometry with an O(1) recurrent geometric state, and Metric Readout Tokens recover stable scale and pose. An optional loop-closure module refines the trajectory. […] To systematically address these requirements,…

**Figure 4.** Qualitative comparison on long-sequence 3D reconstruction. As sequence length grows, existing methods show pose degradation, drift, or collapse. Lingbot-map exhibits progressively stronger pose jitter over longer rollouts, while HorizonStream maintains stable pose estimation. […] Metric Readout Tokens (MRT) and relative pose fusion. Long streaming reconstruction requires metric scale and pose to remain consist…

**Figure 5.** Qualitative comparison on 3D reconstruction. Left: trajectory. Right: 3D reconstruction. HorizonStream maintains stable geometry. Lingbot-map preserves trajectory direction but exhibits increasing jitter, causing point cloud overlap. […] driving, large-scale reconstruction, and synthetic environments, including ScanNet++ [53], Hypersim [29], Replica [34], 7Scenes [33], ARKitScenes [1], WildRGB-D [49], Waymo […

**Figure 6.** (a) Learned retention spectra in Geometric Linear Attention. Effective lifetimes τ = −1/log γ̄ vary across channels and layers. Layer 4 exhibits broad mid-range retention, while Layer 17 develops a sharper long-retention tail, supporting channel-wise multi-timescale propagation. (b) Retention-band ablation. Replacing the learned channel-wise retention spectrum with fixed short-, medium-, or long-horizon b…

**Figure 7.** Memory and runtime scaling. HorizonStream keeps peak memory nearly constant and …

**Figure 8.** Training convergence under different attention mechanisms for cross-window propagation.

**Figure 9.** Effect of loop closure on long sequences. Loop closure reduces ATE on sequences with …

**Figure 10.** Failure cases on ultra-long sequences. Ground-truth trajectory (red), online prediction …

**Figure 11.** Linear probing of frozen Geometric Linear …
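
The effective lifetime in the Figure 6 caption, τ = −1/log γ̄, converts a mean retention value into a horizon measured in frames. The helper below is a minimal sketch of that conversion; the function name and the sample retention values are illustrative assumptions, not values read from the figure.

```python
import numpy as np

def effective_lifetime(gamma_bar):
    """tau = -1 / log(gamma_bar), in frames; gamma_bar must lie strictly in (0, 1)."""
    gamma_bar = np.asarray(gamma_bar, dtype=float)
    if np.any((gamma_bar <= 0.0) | (gamma_bar >= 1.0)):
        raise ValueError("gamma_bar must lie strictly between 0 and 1")
    return -1.0 / np.log(gamma_bar)

# Illustrative retention values only.
print(effective_lifetime([0.5, 0.9, 0.99, 0.999]))
# approximately [1.44, 9.49, 99.5, 999.5]
```

With a constant retention γ̄, evidence from τ frames ago is weighted by γ̄^τ = e⁻¹, so a channel at 0.999 carries information roughly a thousand frames while a channel at 0.5 forgets within a frame or two.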

Original abstract

Online 3D reconstruction requires estimating camera pose and scene geometry under strict causal and bounded-memory constraints. Existing methods often suffer from drift, jitter, or collapse on long sequences. We trace these failures to a fundamental mismatch. Streaming geometry is inherently temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale. However, current architectures impose uniform and pathological influence patterns. For example, sliding windows enforce hard cutoffs, while ungated recurrence and causal attention cause cache saturation and spike-like attention sinks. To resolve this, we formalize geometric propagation as an evidence influence kernel and propose HorizonStream, a long-horizon Transformer that explicitly factorizes this kernel. For the long-range temporal factor, Geometric Linear Attention learns channel-wise decay rates to enable bounded, multi-timescale propagation of geometric evidence. For the short-range spatial factor, Geometric Local Attention with Spatiotemporal RoPE performs reliable 3D matching while suppressing attention sinks. Finally, Metric Readout Tokens recover stable scale and rigid pose directly from the persistent geometric state. Extensive experiments show that HorizonStream, trained on only 48-frame clips, generalizes stably to sequences exceeding 10,000 frames with constant memory and linear time, achieving state-of-the-art streaming 3D reconstruction performance. Project Page: https://3dagentworld.github.io/horizonstream/
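
The abstract does not say how Spatiotemporal RoPE is built. One plausible reading, sketched below purely as an assumption, applies standard rotary embeddings (RoFormer) to three equal channel groups indexed by frame, patch row, and patch column. The function names and the equal three-way split are hypothetical.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding of an even-length vector x at scalar position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angle) - x2 * np.sin(angle),
                           x1 * np.sin(angle) + x2 * np.cos(angle)], axis=-1)

def spatiotemporal_rope(x, t, row, col):
    """Rotate three equal channel groups by frame index, patch row, and patch column."""
    if x.shape[-1] % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    g = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[..., :g], t),
                           rope_1d(x[..., g:2 * g], row),
                           rope_1d(x[..., 2 * g:], col)], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=12), rng.normal(size=12)
# Same relative offset (2 frames, 1 row, 2 columns) at two different absolute positions.
a = spatiotemporal_rope(q, 5, 2, 3) @ spatiotemporal_rope(k, 3, 1, 1)
b = spatiotemporal_rope(q, 105, 12, 13) @ spatiotemporal_rope(k, 103, 11, 11)
print(np.isclose(a, b))
```

Under this construction the query-key score depends only on relative offsets in time and image position, which is one reason a rotary scheme suits a model trained on short clips and run on much longer streams.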

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HorizonStream factorizes geometric evidence into channel-wise temporal decay and spatial RoPE attention to claim stable 10k-frame streaming reconstruction from 48-frame training, but the abstract leaves the consistency of that split unaddressed.

The main takeaway is that this paper splits geometric propagation into a long-range temporal factor using learned per-channel decay rates in linear attention and a short-range spatial factor using local attention with spatiotemporal RoPE, plus metric readout tokens to pull out scale and pose. They report that this lets the model, trained only on short clips, run on sequences over 10,000 frames with fixed memory and linear time while hitting state-of-the-art numbers on streaming 3D reconstruction tasks.

Referee Report

2 major / 1 minor

Summary. The paper proposes HorizonStream, a Transformer for online 3D reconstruction that formalizes geometric propagation via an evidence influence kernel explicitly factorized into (1) Geometric Linear Attention with learned per-channel decay rates for long-range multi-timescale temporal propagation, (2) Geometric Local Attention augmented by Spatiotemporal RoPE for short-range 3D matching, and (3) Metric Readout Tokens that recover scale and rigid pose from the persistent state. The central claim is that a model trained only on 48-frame clips generalizes stably to sequences exceeding 10,000 frames while maintaining constant memory and linear time complexity, outperforming prior streaming methods.

Significance. If the generalization and factorization claims hold with supporting experiments, the work would be significant for causal, bounded-memory 3D reconstruction on long sequences. The explicit multi-timescale decay mechanism and separation of temporal and spatial factors address a recognized mismatch between uniform attention patterns and heterogeneous geometric evidence lifetimes. The project page link indicates potential for reproducibility checks.

major comments (2)

[Abstract] The claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.
[Abstract] The factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.
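
The factorization concern can be probed numerically. A kernel sampled on a grid of time lags and spatial offsets is an exact product of a temporal and a spatial factor only if the resulting matrix has rank one, so the ratio of its second to its first singular value measures how much coupling a product form would discard. The sketch below uses hypothetical toy kernels indexed by lag and offset, a simplification of the paper's K(t,i).

```python
import numpy as np

def separability_gap(K):
    """Ratio of second to first singular value; 0 means K is exactly rank one."""
    s = np.linalg.svd(np.asarray(K, dtype=float), compute_uv=False)
    return float(s[1] / s[0]) if s[0] > 0 else 0.0

dt = np.arange(6, dtype=float)           # time lags (toy values)
dx = np.array([0.0, 0.5, 1.0, 1.5])      # spatial offsets (toy values)

# Separable toy kernel: temporal decay times spatial falloff.
K_product = np.outer(0.9 ** dt, np.exp(-dx ** 2))

# Coupled toy kernel: the spatial falloff sharpens as the time lag grows.
K_coupled = np.exp(-np.outer(1.0 + dt, dx ** 2))

print(separability_gap(K_product), separability_gap(K_coupled))
```

A gap near zero on measured influence patterns would support the product decomposition; a large gap would indicate cross-scale coupling of the kind the comment asks the authors to rule out.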

minor comments (1)

[Abstract] The phrase 'pathological influence patterns' is used without a precise definition or citation to the specific failure modes in sliding-window or causal-attention baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback. We address the two major comments on the abstract below, providing clarifications from the full manuscript while noting opportunities for revision.

Referee: [Abstract] The claim that training on 48-frame clips yields stable behavior on >10,000-frame sequences with constant memory is presented without any quantitative results, ablation tables, error curves, or drift metrics. This absence makes the central generalization claim impossible to evaluate.

Authors: We agree the abstract is a high-level summary and omits specific numbers. The full manuscript (Sections 4–5) includes quantitative results: tables reporting reconstruction metrics (e.g., absolute trajectory error and depth accuracy) on sequences exceeding 10,000 frames, ablation studies on training clip length, and plots of cumulative drift versus sequence length demonstrating stable generalization with O(1) memory. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., “maintaining <X cm ATE on 10k-frame sequences”) to improve evaluability while preserving conciseness.
Revision: yes

Referee: [Abstract] The factorization of the evidence influence kernel into independent channel-wise temporal decay and Spatiotemporal-RoPE spatial components is asserted to preserve 3D consistency, yet no derivation, invariance argument, or counter-example analysis is supplied showing that cross-scale relations (e.g., persistent scale coupled to local correspondences) survive the product decomposition. This factorization is load-bearing for the long-horizon extrapolation claim.

Authors: The abstract summarizes the approach; the full manuscript (Section 3) formally defines the evidence influence kernel K(t,s) and derives its factorization into the product of a channel-wise linear temporal term (with per-channel decay) and a local spatiotemporal term (with RoPE). The derivation shows that the product preserves the required separation of timescales while the Metric Readout Tokens recover coupled scale-pose quantities from the persistent state. A short invariance sketch appears in the supplementary material. We can add a one-sentence reference to this derivation in the abstract if desired, but the core mathematical argument is already in the body.
Revision: partial
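
For readers unfamiliar with the metric named in the rebuttal, absolute trajectory error (ATE) is commonly reported as the root-mean-square position error between estimated and ground-truth camera trajectories. The sketch below assumes the two trajectories are already expressed in the same frame; benchmark protocols usually run a rigid or similarity alignment first, which is omitted here, and the paper's exact protocol is not stated on this page.

```python
import numpy as np

def ate_rmse(pred_xyz, gt_xyz):
    """RMSE of per-frame position error for two already-aligned (N, 3) trajectories."""
    pred_xyz = np.asarray(pred_xyz, dtype=float)
    gt_xyz = np.asarray(gt_xyz, dtype=float)
    if pred_xyz.shape != gt_xyz.shape:
        raise ValueError("trajectories must have the same shape")
    err = np.linalg.norm(pred_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy trajectories: the prediction is offset from ground truth by (3, 4, 0) everywhere.
gt = np.zeros((4, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
print(ate_rmse(pred, gt))  # 5.0
```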

Circularity Check

0 steps flagged

No circularity: claims rest on architectural design and experiments, not self-definition or fitted inputs

The abstract presents the evidence influence kernel factorization and HorizonStream architecture as a proposed solution to identified failure modes (drift, cache saturation), with the long-horizon generalization claim supported by training on 48-frame clips and reported experimental outcomes on longer sequences. No equations, self-referential definitions, or self-citations appear that would reduce the claimed performance or factorization to an input by construction. The derivation chain is self-contained against external benchmarks and does not invoke load-bearing prior results from the same authors.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 3 invented entities

Review is based solely on the abstract; full details on parameters, assumptions, and entities are unavailable. The ledger reflects only what is explicitly named in the abstract.

free parameters (1)

channel-wise decay rates
Learned rates in Geometric Linear Attention for multi-timescale propagation.

axioms (2)

domain assumption: Streaming geometry is temporally heterogeneous, with evidence ranging from short-lived correspondences to persistent global scale.
This is presented as the root cause of failures in existing methods.
ad hoc to paper: Geometric propagation can be formalized as an evidence influence kernel that can be explicitly factorized into long-range temporal and short-range spatial factors.
This formalization is the stated foundation for the HorizonStream design.

invented entities (3)

Geometric Linear Attention (no independent evidence)
purpose: Enable bounded multi-timescale propagation of geometric evidence
New attention mechanism introduced for the long-range factor.
Geometric Local Attention with Spatiotemporal RoPE (no independent evidence)
purpose: Perform reliable 3D matching while suppressing attention sinks
New attention variant for the short-range factor.
Metric Readout Tokens (no independent evidence)
purpose: Recover stable scale and rigid pose from persistent geometric state
Introduced component for metric readout.

pith-pipeline@v0.9.0 · 5806 in / 1508 out tokens · 49635 ms · 2026-05-25T04:31:40.693278+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We formalize geometric propagation as an evidence influence kernel... $K(t,i) = K_{\mathrm{spatial}}(t,i) \cdot K_{\mathrm{time}}(t,i)$. ... channel-wise retention vector $\gamma_t = \sigma(W_\gamma x_t + b_\gamma) \in (0,1)^d$, $S_t = \mathrm{diag}(\gamma_t)\, S_{t-1} + \phi(k_t)\, \tilde{v}_t^{\top}$. ... $\tau(c) = -1/\log \bar{\gamma}(c)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 28 canonical work pages · 17 internal anchors

[1] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897, 2021
[2] Leonardo Brizi, Emanuele Giacomini, Luca Di Giammarino, Simone Ferrari, Omar Salem, Lorenzo De Rebotti, and Giorgio Grisetti. VBR: A vision benchmark in rome. arXiv preprint arXiv:2404.11322, 2024
[3] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020
[4] Carlos Campos, Richard Elvira, Juan J. Gomez Rodriguez, Jose M. M. Montiel, and Juan D. Tardos. ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Transactions on Robotics, 37(6), 2021
[5] Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, et al. Geometric context transformer for streaming 3d reconstruction. arXiv preprint arXiv:2604.14141, 2026
[6] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025
[7] Chong Cheng, Yu Hu, Sicheng Yu, Beizhen Zhao, Zijian Wang, and Hao Wang. Reggs: Unposed sparse views gaussian splatting with 3dgs registration, 2025. URL https://arxiv.org/abs/2507.08136
[8] Chong Cheng, Gaochao Song, Yiyang Yao, Qinzheng Zhou, Gangjian Zhang, and Hao Wang. Graph-guided scene reconstruction from images with 3d gaussian splatting, 2025. URL https://arxiv.org/abs/2502.17377
[9] Chong Cheng, Zijian Wang, Sicheng Yu, Yu Hu, Nanjie Yao, and Hao Wang. Unposed 3dgs reconstruction with probabilistic procrustes mapping, 2025. URL https://arxiv.org/abs/2507.18541
[10] Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, and Hao Wang. Outdoor monocular slam with global scale-consistent 3d gaussian pointmaps, 2025. URL https://arxiv.org/abs/2507.03737
[11] Chong Cheng, Xianda Chen, Tao Xie, Wei Yin, Weiqiang Ren, Qian Zhang, Xiaoyang Guo, and Hao Wang. LongStream: Long-sequence streaming autoregressive visual geometry. In CVPR, 2026
[12] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences, 2025. URL https://arxiv.org/abs/2507.16443
[13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI Vision Benchmark Suite. In CVPR, 2012
[14] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In ICLR, 2025
[15] Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, and Hao Wang. Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction, 2025. URL https://arxiv.org/abs/2511.19971
[16] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francois Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In ICML, 2020
[17]

Mapanything: Universal feed-forward metric 3d reconstruction,

Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. Mapanything: Universal feed-forward metric 3d reconstruction,
[18]

URLhttps://arxiv.org/abs/2509.13414

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer, 2025. URL https://arxiv.org/abs/2508.10893

work page arXiv 2025
[20]

Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. InECCV, 2024

2024
[21]

MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond

Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. MatrixCity: A large-scale city dataset for city-scale neural rendering and beyond. InICCV, 2023

2023
[22]

MegaDepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

2018
[23]

DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. InCVPR, 2024

2024
[24]

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, and Ruilong Li. Test-time training with kv binding is secretly linear attention, 2026. URLhttps://arxiv.org/abs/2602.21204

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold

Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold.arXiv preprint arXiv:2505.12549, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. InCVPR, 2025

2025
[27]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction

Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. InICCV, 2021

2021
[30]

Susskind

Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021. 11

2021
[31]

Structure-from-motion revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016

2016
[32]

Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. InCVPR, 2017

2017
[33]

FastVGGT: Training-Free Acceleration of Visual Geometry Transformer

You Shen, Zhipeng Zhang, Yansong Qu, and Liujuan Cao. FastVGGT: Training-free accelera- tion of visual geometry transformer.arXiv preprint arXiv:2509.02560, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Scene coordinate regression forests for camera relocalization in RGB-D images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013

2013
[35]

The Replica Dataset: A Digital Replica of Indoor Spaces

Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J. Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The Replica dataset: A digital replica of indoor spaces.arXiv preprint arXiv:1906.05797, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1906
[36]

A benchmark for the evaluation of RGB-D SLAM systems

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. InIROS, 2012

2012
[37]

RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568, 2024

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding.Neurocomputing, 568, 2024

2024
[38]

Scalability in perception for autonomous driving: Waymo Open Dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo Open Dataset. InCVPR, 2020

2020
[39]

The Oxford Spires dataset: Benchmarking large-scale LiDAR-visual localisation, reconstruction and radiance field methods.International Journal of Robotics Research, 2025

Yifu Tao, Miguel Ángel Muñoz-Bañón, Lintong Zhang, Jiahao Wang, Lanke Frank Tarimo Fu, and Maurice Fallon. The Oxford Spires dataset: Benchmarking large-scale LiDAR-visual localisation, reconstruction and radiance field methods.International Journal of Robotics Research, 2025

2025
[40]

DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras

Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. InNeurIPS, 2021

2021
[41]

Deep patch visual odometry

Zachary Teed, Lahav Lipson, and Jia Deng. Deep patch visual odometry. InNeurIPS, 2023

2023
[42]

3D reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In3DV, 2025

2025
[43]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. InCVPR, 2025

2025
[44]

Continuous 3d perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state.arXiv preprint arXiv:2501.12387, 2025

work page arXiv 2025
[45]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InCVPR, 2024

2024
[46]

TartanAir: A dataset to push the limits of visual SLAM

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. InIROS, 2020

2020
[47]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Permutation-equivariant visual geometry learning, 2025. URLhttps://arxiv.org/abs/2507.13347

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Mapillary street-level sequences: A dataset for lifelong place recognition

Frederik Warburg, Søren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In CVPR, 2020

2020
[49]

Point3R: Streaming 3D reconstruction with explicit spatial pointer memory.arXiv preprint arXiv:2507.02863, 2025

Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory.arXiv preprint arXiv:2507.02863, 2025. 12

work page arXiv 2025
[50]

RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos

Hongchi Xia, Yang Fu, Sifei Liu, and Xiaolong Wang. RGBD objects in the wild: Scaling real-world 3D object learning from RGB-D videos. InCVPR, 2024

[51]

Efficient streaming language models with attention sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In ICLR, 2024

[52]

Gated Delta Networks: Improving Mamba2 with Delta Rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule, 2025. URL https://arxiv.org/abs/2412.06464

[53]

BlendedMVS: A large-scale dataset for generalized multi-view stereo networks

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020

[54]

ScanNet++: A high-fidelity dataset of 3D indoor scenes

Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023

[55]

RGB-only Gaussian splatting SLAM for unbounded outdoor scenes

Sicheng Yu, Chong Cheng, Yifan Zhou, Xiaojun Yang, and Hao Wang. RGB-only Gaussian splatting SLAM for unbounded outdoor scenes, 2025. URL https://arxiv.org/abs/2502.15633

[56]

InfiniteVGGT: Visual geometry grounded transformer for endless streams

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

[57]

MonST3R: A simple approach for estimating geometry in the presence of motion

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. In ICLR, 2025

[58]

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. LoGeR: Long-context geometric reconstruction with hybrid memory, 2026. URL https://arxiv.org/abs/2603.03269

[59]

Kimi Linear: An Expressive, Efficient Attention Architecture

Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

[60]

PointOdyssey: A large-scale synthetic dataset for long-term point tracking

Yang Zheng, Adam W. Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J. Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023

[61]

Omniworld: A multi-domain and multi-modal dataset for 4D world modeling

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, and Tong He. Omniworld: A multi-domain and multi-modal dataset for 4D world modeling, 2025. URL https://arxiv.org/abs/2509.12201

[62]

Streaming 4D Visual Geometry Transformer

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D visual geometry transformer, 2026. URL https://arxiv.org/abs/2507.11539

A Geometric Attention Dilution

We formalize why causal softmax attention cannot serve as long-range cross-window memory in streaming 3D reconstruction. Let $\Omega_t$ denote the set of 3D points visible at t...

