LIST3R: Long-sequence Instance-aware 3D Reconstruction

Feiran Wang; Jing Gao; Wei Wang; Yan Yan

arxiv: 2607.00375 · v1 · pith:SU3BVDZOnew · submitted 2026-07-01 · 💻 cs.CV

Pith reviewed 2026-07-02 15:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords long-sequence 3D reconstruction · instance-aware reconstruction · persistent anchors · video fragment alignment · object library · subsequence matching · global consistency

The pith

Persistent instance anchors reconnect video subsequences to produce consistent global 3D reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LIST3R partitions long input videos into overlapping subsequences and maintains a local instance library for each, where stable objects serve as trackable anchors carrying semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and supply object-aware constraints that align the fragments into one coherent scene. The local libraries are then progressively merged into a single global 3D instance library as new geometric evidence arrives. A sympathetic reader would care because conventional reconstruction pipelines accumulate drift over long horizons and struggle to close loops reliably; anchoring the process to persistent objects offers a way to organize memory and correct alignment without relying solely on low-level feature matching.
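The partitioning step can be pictured with a short sketch. It is a minimal illustration, assuming fixed-length windows with a fixed overlap; the function name `partition_overlapping` and the `window` and `overlap` parameters are hypothetical, and this page does not state the window sizes or the splitting rule LIST3R actually uses.

```python
from typing import List, Tuple


def partition_overlapping(n_frames: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Split frame indices [0, n_frames) into half-open spans sharing `overlap` frames.

    Hypothetical helper for illustration only; not taken from the LIST3R code.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    stride = window - overlap
    spans = []
    start = 0
    while start < n_frames:
        end = min(start + window, n_frames)
        spans.append((start, end))
        if end == n_frames:
            break  # the last span already reaches the end of the video
        start += stride
    return spans


# Ten frames, windows of four frames, two frames shared by neighbours.
spans = partition_overlapping(n_frames=10, window=4, overlap=2)
print(spans)  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

The shared frames are what make neighbouring fragments alignable at all; a larger overlap gives more common evidence at the cost of more redundant computation.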

Core claim

Given a long video, the method partitions it into overlapping subsequences, builds structured local instance libraries that keep persistent trackable anchors with semantic and geometric evidence, matches those anchors across subsequences to recover revisited regions and enforce object-aware alignment constraints, and progressively updates the libraries until they form a unified global 3D instance library that yields a single consistent reconstruction.

What carries the argument

Persistent instance anchors: stable objects that carry semantic and geometric evidence to enable cross-subsequence matching and object-aware fragment alignment.
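To make "semantic and geometric evidence" concrete, here is a minimal sketch of what an anchor record and a combined match score might look like. Everything in it is an assumption for illustration: the `Anchor` fields, the cosine and box-size terms, and the weight `w_sem` are hypothetical and are not described on this page as the paper's actual representation or metric.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Anchor:
    """Hypothetical anchor record: one tracked object inside one subsequence."""
    instance_id: int
    embedding: Sequence[float]  # semantic evidence (assumed feature vector)
    extent: Sequence[float]     # geometric evidence (assumed box size along x, y, z)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def extent_agreement(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean per-axis min/max size ratio in [0, 1]; 1 means identical box sizes."""
    ratios = [min(x, y) / max(x, y) for x, y in zip(a, b) if max(x, y) > 0]
    return sum(ratios) / len(ratios) if ratios else 0.0


def anchor_score(p: Anchor, q: Anchor, w_sem: float = 0.5) -> float:
    """Weighted mix of semantic and geometric agreement (assumed form)."""
    sem = cosine(p.embedding, q.embedding)
    geo = extent_agreement(p.extent, q.extent)
    return w_sem * sem + (1.0 - w_sem) * geo


chair_a = Anchor(0, [1.0, 0.0], [0.5, 0.5, 1.0])
chair_b = Anchor(7, [1.0, 0.0], [0.5, 0.5, 1.0])
table = Anchor(3, [0.0, 1.0], [1.0, 2.0, 0.5])
print(anchor_score(chair_a, chair_b), anchor_score(chair_a, table))
```

Comparing axis-aligned box sizes is a deliberate simplification; it ignores object orientation, which a real geometric descriptor would have to handle.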

If this is right

More accurate camera trajectories result from the additional object-aware constraints supplied by matched anchors.
Higher-quality 3D reconstructions are obtained because local observations are consolidated around consistent instance identities rather than drifting independently.
Revisited regions are reconnected without requiring exhaustive global optimization at every step.
Local instance libraries evolve into a unified global library that maintains object-level organization throughout the process.
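The last point, local libraries folding into one global library, can be sketched as a small bookkeeping structure. This is an illustrative assumption, not the paper's data model: the `GlobalLibrary` class, its fields, and the running-mean centroid update are hypothetical, and centroids are assumed to be already expressed in the global frame.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class GlobalInstance:
    label: str
    centroid: Point                 # assumed to be in the global frame
    n_obs: int = 1
    sources: List[Tuple[int, int]] = field(default_factory=list)  # (subsequence, local id)


class GlobalLibrary:
    """Hypothetical global instance library built from local records."""

    def __init__(self) -> None:
        self.records: Dict[int, GlobalInstance] = {}
        self._next_id = 0

    def merge(self, subseq: int, local_id: int, label: str, centroid: Point,
              matched_global_id: Optional[int] = None) -> int:
        if matched_global_id is None:
            # No match: the local record becomes a new global instance.
            gid = self._next_id
            self._next_id += 1
            self.records[gid] = GlobalInstance(label, centroid, 1, [(subseq, local_id)])
            return gid
        # Matched: update the existing record with a running mean of the centroid.
        rec = self.records[matched_global_id]
        n = rec.n_obs
        rec.centroid = tuple((c * n + x) / (n + 1) for c, x in zip(rec.centroid, centroid))
        rec.n_obs = n + 1
        rec.sources.append((subseq, local_id))
        return matched_global_id


lib = GlobalLibrary()
g0 = lib.merge(subseq=0, local_id=2, label="chair", centroid=(1.0, 0.0, 0.0))
g1 = lib.merge(subseq=3, local_id=5, label="chair", centroid=(1.2, 0.0, 0.0),
               matched_global_id=g0)
print(g0, g1, lib.records[g0].centroid)
```

Keeping the `sources` list means every global instance can be traced back to the local records it absorbed, which is the object-level organization the bullet refers to.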

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Object-centric constraints may scale better than purely geometric loop closure when sequences grow to hours rather than minutes.
The same anchor-matching logic could be tested on dynamic scenes by allowing anchors to update their geometric descriptors over time.
Robotics navigation systems that already maintain object maps might integrate this alignment step to reduce map fragmentation in extended environments.

Load-bearing premise

Instance anchors can be reliably detected, persistently tracked, and correctly matched across subsequences without introducing alignment errors that propagate through the global reconstruction.
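One common guard against the failure this premise worries about is to accept a cross-subsequence match only when it is mutual and scores above a threshold. The sketch below shows that generic filter; the function name `mutual_best_matches` and the threshold `min_score` are hypothetical, and nothing on this page says LIST3R uses this particular rule.

```python
from typing import List, Tuple


def mutual_best_matches(scores: List[List[float]], min_score: float) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where row i and column j pick each other and score >= min_score.

    scores[i][j] is the similarity between anchor i of one subsequence and
    anchor j of another (any similarity in which higher means more alike).
    """
    if not scores or not scores[0]:
        return []
    n_rows, n_cols = len(scores), len(scores[0])
    best_col = [max(range(n_cols), key=lambda j: scores[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: scores[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col)
            if best_row[j] == i and scores[i][j] >= min_score]


scores = [
    [0.90, 0.20, 0.10],
    [0.30, 0.40, 0.35],  # ambiguous anchor: no confident, mutual partner
    [0.10, 0.80, 0.20],
]
matches = mutual_best_matches(scores, min_score=0.5)
print(matches)  # [(0, 0), (2, 1)]
```

The ambiguous middle anchor is dropped rather than forced into a match, which trades recall for fewer wrong constraints.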

What would settle it

Running the method on standard long-sequence benchmarks and finding no improvement in trajectory accuracy or reconstruction quality relative to non-instance-aware baselines would falsify the central claim.
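The falsification test above amounts to comparing a trajectory error between methods. As a minimal sketch, the function below computes a root-mean-square position error between an estimated and a ground-truth trajectory, assuming both are already aligned and sampled at the same frames. The name `position_rmse` is hypothetical, the numbers are made-up toy values rather than results from the paper, and the paper's own metrics (the figure captions name RTE and RRE) may be defined differently.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def position_rmse(estimated: Sequence[Point], ground_truth: Sequence[Point]) -> float:
    """RMSE of camera positions for two aligned, equally sampled trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and the same length")
    squared = [sum((e - g) ** 2 for e, g in zip(p, q))
               for p, q in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values for illustration only.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
drifting = [(0.0, 0.0, 0.0), (1.0, 0.3, 0.0), (2.0, 0.6, 0.0)]
corrected = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0)]

err_drifting = position_rmse(drifting, gt)
err_corrected = position_rmse(corrected, gt)
print(err_drifting, err_corrected)
```

Under this framing, the central claim would be in trouble if the instance-aware error were not lower than the baseline error across the benchmark sequences.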

Figures

Figures reproduced from arXiv: 2607.00375 by Feiran Wang, Jing Gao, Wei Wang, Yan Yan.

**Figure 2.** Method overview. LIST3R first builds a local instance library for each subsequence to capture recognizable object cues. These cues are then used to connect subsequences and assemble fragmented local reconstructions into a coherent global scene. As reconstruction progresses, instance evidence is continuously updated and finally consolidated into a global instance library.

**Figure 3.** Instance-guided Cross-subsequence Association. Given local instance libraries and per-subsequence reconstruction outputs, LIST3R establishes reliable cross-subsequence connections through long-range revisit discovery and instance-aware subsequence merging. The resulting adjacent and loop edges are optimized in a global association graph to obtain globally aligned subsequences.

**Figure 4.** Global Instance Library. Based on local instance libraries and reconstructed geometry, LIST3R updates instance evidence and organizes local instance records into a global 3D instance library. Each global instance record maintains object-level geometry, global location, and spatial relations, providing a persistent scene-level representation.

**Figure 5.** Visualization of Estimated Long-Sequence Camera Trajectories.

**Figure 6.** Qualitative results for long-sequence 3D reconstruction. On TUM and ETH3D, LIST3R also achieves the best RTE, showing that the recovered long-range constraints further improve relative translation consistency after subsequence fusion. For RTE and RRE, some baselines can still be favored by local trajectory characteristics. These metrics evaluate relative motion over local trajectory segments, and thus can …

**Figure 7.** Comparison of local instance anchor initialization strategies. We compare three strategies for constructing local instance anchors across nearby frames: directly using EntitySAM predictions, directly propagating all discovered masks with SAM3, and our final anchor-enhanced initialization with prompt selection. From top to bottom, the three rows show the results of these strategies, respectively.

**Figure 8.** Representative long-range revisit discovery results. We compare LIST3R with the baseline frame-level loop detection strategy on several challenging revisit pairs. For each pair, we report the frame-pair ID, frame-level similarity, ground-truth relation, and whether the revisit is detected by the baseline and LIST3R. Although these pairs correspond to the same scene, the baseline can miss them due to viewpo…

**Figure 9.** Explanation of instance-aware subsequence merging. The left block shows scenes with dynamic objects, while the right block shows mostly static scenes. In each block, the columns show the overlapping frame, the alignment supports selected by the baseline, and the results of LIST3R. Yellow points show the supports originally selected by the baseline, red regions mark dynamic points that are associated with t…

**Figure 10.** Additional visual comparisons of camera trajectory estimation results. We present per-scene trajectory comparisons between the estimated camera trajectories and the ground truth. Gray curves denote the ground-truth trajectories, while colored curves denote the predictions of different methods. The zoomed regions highlight challenging trajectory segments where LIST3R better preserves global consistency and…

**Figure 11.** Additional visual comparisons of point cloud reconstruction results, Part I. We present more per-scene visual comparisons of point cloud reconstruction. These examples highlight cases where LIST3R recovers more complete geometry and preserves clearer scene structures.

**Figure 12.** Additional visual comparisons of point cloud reconstruction results, Part II. We present more per-scene visual comparisons of point cloud reconstruction. These examples highlight cases where LIST3R recovers more complete geometry and preserves clearer scene structures.

**Figure 13.** Additional visual comparisons of point cloud reconstruction results, Part III. We present more per-scene visual comparisons of point cloud reconstruction. These examples highlight cases where LIST3R recovers more complete geometry and preserves clearer scene structures.
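The Figure 9 caption contrasts alignment supports chosen by a baseline with supports that avoid dynamic points. One way to picture why that matters is a least-squares similarity alignment (Umeyama's closed form, which appears in the reference list) fitted only on supports flagged as static. The sketch below assumes NumPy; the `umeyama` helper, the `static` mask, and the synthetic data are illustrative assumptions, and this page does not confirm that LIST3R uses exactly this estimator.

```python
import numpy as np


def umeyama(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity transform with dst ≈ scale * R @ src + t (points as rows)."""
    n = len(src)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / n                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # keep a proper rotation, not a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / n
    scale = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - scale * R @ mu_s
    return scale, R, t


# Synthetic overlap: 10 support points, the last two belong to a moving object.
rng = np.random.default_rng(0)
src = rng.normal(size=(10, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = 2.0 * src @ R_true.T + t_true
dst[8:] += np.array([[3.0, 0.0, 0.0], [0.0, -3.0, 0.0]])  # simulated object motion

static = np.ones(len(src), dtype=bool)
static[8:] = False                            # hypothetical dynamic-point mask

scale, R, t = umeyama(src[static], dst[static])
print(scale, t)
```

With the moving points masked out, the fit recovers the synthetic transform; including them would bias scale, rotation, and translation together, since all three are estimated from the same supports.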

Original abstract

We present LIST3R, an instance-aware framework for long-sequence 3D reconstruction inspired by the way humans organize spatial memory around stable and recognizable objects. LIST3R organizes long-sequence reconstruction around instance anchors, using them to reconnect fragmented subsequences and consolidate local observations into a coherent global 3D scene. Given a long video, our approach partitions it into overlapping subsequences and builds a structured local instance library for each partial reconstruction, maintaining persistent trackable anchors with semantic and geometric evidence. These anchors are matched across subsequences to recover revisited regions and provide object-aware constraints for fragment alignment, producing a consistent global reconstruction. During this process, the evolving geometric evidence updates the local instance libraries and progressively organizes them into a unified global 3D instance library. Experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions, highlighting the effectiveness of persistent instance anchors for organizing long-horizon 3D reconstruction. Our code is available on the project page: https://yixn965.github.io/LIST3R/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LIST3R frames long-sequence reconstruction around persistent instance anchors to reconnect subsequences, but the matching step lacks the quantitative checks needed to confirm it avoids error buildup.

The core idea here is using stable object instances as anchors to stitch together overlapping video subsequences into one coherent 3D model. The method splits a long video, builds local instance libraries with semantic and geometric cues for each chunk, matches anchors across chunks to recover overlaps, and folds everything into a global library. This object-centric memory approach is the main novelty, and it gives a concrete way to reduce fragmentation that standard SLAM pipelines often hit on extended runs.

The paper does release code, which is useful for anyone wanting to reproduce or extend the work. The abstract reports better trajectory accuracy and reconstruction quality on long-sequence benchmarks, so there is at least an empirical claim to evaluate.

The soft spot is the cross-subsequence matching. The description stays high-level and does not include the actual matching algorithm, match accuracy numbers, or tests of how wrong matches would affect the global optimization. If a bad anchor link gets accepted, it would inject incorrect constraints that could degrade the very trajectory improvements claimed. The concern stands because the abstract supplies no failure-mode data on this load-bearing piece.
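The size of that risk is easy to see on a toy problem. The sketch below solves a one-dimensional graph of four subsequence positions by linear least squares, first with a correct loop edge, then with a single false one, then with the false edge down-weighted. It assumes NumPy and is only an illustration: `solve_chain`, the edge format, and the weights are hypothetical, and the real system optimizes full poses in an association graph rather than scalars.

```python
import numpy as np


def solve_chain(n_nodes, edges):
    """Least-squares 1D positions with node 0 fixed at the origin.

    edges: list of (i, j, offset, weight), each asserting x[j] - x[i] ≈ offset.
    """
    A = np.zeros((len(edges), n_nodes - 1))
    b = np.zeros(len(edges))
    for row, (i, j, offset, weight) in enumerate(edges):
        w = np.sqrt(weight)
        if j > 0:
            A[row, j - 1] += w
        if i > 0:
            A[row, i - 1] -= w
        b[row] = w * offset
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.concatenate([[0.0], x])


adjacent = [(0, 1, 1.0, 1.0), (1, 2, 1.0, 1.0), (2, 3, 1.0, 1.0)]

good = solve_chain(4, adjacent + [(0, 3, 3.0, 1.0)])   # correct loop edge
bad = solve_chain(4, adjacent + [(0, 3, 0.0, 1.0)])    # false "revisit" of the start
soft = solve_chain(4, adjacent + [(0, 3, 0.0, 0.01)])  # same false edge, down-weighted

print(good, bad, soft)
```

In this toy setup one false loop edge at full weight drags the last node from 3.0 to 0.75, while the down-weighted version stays near 2.91. That is the reason match accuracy and robust rejection deserve their own numbers in the paper.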

This is for computer vision groups working on scalable video-based 3D reconstruction, especially those already using instance segmentation or object-aware SLAM. A reader who needs a practical handle on long-horizon consistency would get value from the framing and the released code.

The work is coherent enough on its own terms to deserve referee time rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The manuscript presents LIST3R, an instance-aware framework for long-sequence 3D reconstruction. It partitions input videos into overlapping subsequences, constructs local instance libraries maintaining persistent trackable anchors with semantic and geometric evidence, matches anchors across subsequences to recover revisited regions and supply object-aware alignment constraints, and progressively consolidates local libraries into a unified global 3D instance library. The central claim is that this organization around persistent instance anchors yields more accurate trajectories and higher-quality 3D reconstructions on long-sequence benchmarks, with code released at the project page.

Significance. If the performance claims are substantiated, the approach could meaningfully advance long-horizon reconstruction by using stable object instances to mitigate fragmentation and supply additional consistency constraints beyond pure geometric features. The public code release is a clear strength for reproducibility.

major comments (2)

[Abstract] The claim that 'experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions' supplies no quantitative results, baselines, error bars, or experimental details, so the data-to-claim link cannot be evaluated.
[Abstract, pipeline description] The cross-subsequence anchor matching step is described only at a high level ('matched across subsequences to recover revisited regions and provide object-aware constraints') with no algorithm, similarity metric, threshold, or failure-mode analysis; because an incorrect match would inject erroneous relative-pose or object-consistency terms into the global optimization, this assumption is load-bearing for the trajectory-accuracy claim yet remains unverified.

minor comments (1)

[Abstract] The sentence on updating local libraries and organizing them into a global library could be clarified to distinguish what is updated versus what is newly created.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity where appropriate.

Referee: [Abstract] The claim that 'experiments on long-sequence benchmarks show that our method produces more accurate trajectories and higher-quality 3D reconstructions' supplies no quantitative results, baselines, error bars, or experimental details, so the data-to-claim link cannot be evaluated.

Authors: We agree that the abstract would be strengthened by including specific quantitative support for the performance claims. In the revised version, we will incorporate key results (e.g., relative trajectory error reductions and reconstruction metrics versus baselines) with references to the corresponding tables and figures.

Revision: yes

Referee: [Abstract, pipeline description] The cross-subsequence anchor matching step is described only at a high level ('matched across subsequences to recover revisited regions and provide object-aware constraints') with no algorithm, similarity metric, threshold, or failure-mode analysis; because an incorrect match would inject erroneous relative-pose or object-consistency terms into the global optimization, this assumption is load-bearing for the trajectory-accuracy claim yet remains unverified.

Authors: Abstracts conventionally provide high-level pipeline summaries. The full algorithm for anchor matching, including the combined semantic-geometric similarity metric, adaptive thresholds, and robust outlier rejection via the global optimization, is detailed in Section 3.2. Section 4 presents quantitative validation on long-sequence benchmarks together with ablations that isolate the contribution of cross-subsequence matching; these results directly support the trajectory-accuracy claim. We can add a concise reference to the matching criteria in the abstract if the referee prefers.

Revision: partial

Circularity Check

0 steps flagged

No circularity: algorithmic framework with experimental validation only

The paper describes an instance-aware pipeline for partitioning videos, maintaining local instance libraries, matching anchors across subsequences, and producing global reconstructions. No equations, fitted parameters presented as predictions, self-definitional steps, or load-bearing self-citations appear in the abstract or described method. Effectiveness claims rest on benchmark experiments rather than any derivation that reduces to its own inputs by construction. The central assumption about reliable anchor matching is an empirical premise, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

28 extracted references · 8 canonical work pages · 6 internal anchors

[1] Pratik Agarwal, Gian Diego Tipaldi, Luciano Spinello, Cyrill Stachniss, and Wolfram Burgard. Robust map optimization using dynamic covariance scaling. In 2013 IEEE International Conference on Robotics and Automation, pages 62–69. IEEE, 2013.
[2] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6290–6301, 2022.
[3] Iva K Brunec, Jessica Robin, Eva Zita Patai, Jason D Ozubko, Amir-Homayoun Javadi, Morgan D Barense, Hugo J Spiers, and Morris Moscovitch. Cognitive mapping style relates to posterior–anterior hippocampal volume ratio. Hippocampus, 29(8):748–754, 2019.
[4] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719, 2025.
[5] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D Reconstruction as Test-Time Training. arXiv preprint arXiv:2509.26645, 2025.
[6] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences. arXiv preprint arXiv:2507.16443, 2025.
[7] James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, aug 2007.
[8] James J. DiCarlo, Davide Zoccolan, and Nicole C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415–434, feb 2012.
[9] Russell Epstein and Nancy Kanwisher. A cortical representation of the local visual environment. Nature, 392(6676):598–601, apr 1998.
[10] Gabriele Janzen and Miranda van Turennout. Selective neural representation of objects relevant for navigation. Nature Neuroscience, 7(6):673–677, jun 2004.
[11] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal Feed-Forward Metric 3D Reconstruction. arXiv preprint arXiv:2509.13414, 2025.
[12] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
[13] John O’Keefe and Lynn Nadel. Précis of O’Keefe & Nadel’s The Hippocampus as a Cognitive Map. Behavioral and Brain Sciences, 2(4):487–494, 1979.
[14] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019.
[15] Thomas Schops, Torsten Sattler, and Marc Pollefeys. Bad slam: Bundle adjusted direct rgb-d slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 134–144, 2019.
[16] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3260–3269, 2017.
[17] Hauke Strasdat, J Montiel, Andrew J Davison, et al. Scale drift-aware large scale monocular slam. Robotics: Science and Systems VI, 2(3):7, 2010.
[18] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 573–580. IEEE, 2012.
[19] Niko Sünderhauf and Peter Protzel. Switchable constraints for robust pose graph slam. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1879–1884. IEEE, 2012.
[20] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991.
[21] Feiran Wang, Junyi Wu, Dawen Cai, Yuan Hong, and Yan Yan. Cognimap3d: Cognitive 3d mapping and rapid retrieval. arXiv preprint arXiv:2601.08175, 2026.
[22] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
[23] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
[24] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
[25] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347, 2025.
[26] Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, et al. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction. arXiv preprint arXiv:2604.08542, 2026.
[27] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
[28] Mingqiao Ye, Seoung Wug Oh, Lei Ke, and Joon-Young Lee. Entitysam: Segment everything in video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24234–24243, 2025.
