Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

· 2026 · cs.CV · arXiv 2604.08542

10 Pith papers cite this work. Polarity classification is still indexing.

abstract

This paper addresses the task of large-scale 3D scene reconstruction from long video sequences. Recent feed-forward reconstruction models have shown promising results by directly regressing 3D geometry from RGB images without explicit 3D priors or geometric constraints. However, these methods often struggle to maintain reconstruction accuracy and consistency over long sequences due to limited memory capacity and the inability to effectively capture global contextual cues. In contrast, humans can naturally exploit the global understanding of the scene to inform local perception. Motivated by this, we propose a novel neural global context representation that efficiently compresses and retains long-range scene information, enabling the model to leverage extensive contextual cues for enhanced reconstruction accuracy and consistency. The context representation is realized through a set of lightweight neural sub-networks that are rapidly adapted during test time via self-supervised objectives, which substantially increases memory capacity without incurring significant computational overhead. The experiments on multiple large-scale benchmarks, including the KITTI Odometry (Geiger et al., 2012) and Oxford Spires (Tao et al., 2025) datasets, demonstrate the effectiveness of our approach in handling ultra-large scenes, achieving leading pose accuracy and state-of-the-art 3D reconstruction accuracy while maintaining efficiency. Code is available at https://zju3dv.github.io/scal3r.

representative citing papers

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · novelty 8.0

SpatialBench evaluates 41 spatial foundation models across 6 paradigms and 5 task suites, finds they are not all-round players, and introduces the DA-Next-5M dataset plus DA-Next baseline model.

DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

DrivingDepth achieves SOTA metric depth on nuScenes by residual pixel-wise scale correction on frozen foundation models using sparse LiDAR prompts, preserving geometric consistency.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

Distillation aligns compression mechanisms between full-history and recurrent transformers, enabling linear-time recurrent memory that narrows the performance gap for streaming vision and robotics tasks.

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

Anchor3R reframes feed-forward 3D reconstruction as current-centric local measurement prediction, using loop-closure and motion averaging to produce coherent global maps from visual streams.

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.

LIST3R: Long-sequence Instance-aware 3D Reconstruction

cs.CV · 2026-07-01 · unverdicted · novelty 5.0

LIST3R reconnects fragmented video subsequences using persistent instance anchors with semantic and geometric evidence to produce consistent global 3D reconstructions.

$R^3$: 3D Reconstruction via Relative Regression

cs.CV · 2026-05-26 · unverdicted · novelty 5.0

R³ uses relative regression with confidence-weighted constraints from an MLP to support long-context offline and streaming 3D reconstruction without global coordinate assumptions.

MemoryWAM: Efficient World Action Modeling with Persistent Memory

cs.RO · 2026-06-18 · unverdicted · novelty 4.0

MemoryWAM is a world action model with a hybrid memory design using recent frames, anchor frames, and gist tokens for efficient long-horizon robotic manipulation.

citing papers explorer

Showing 10 of 10 citing papers.

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

cs.CV · 2026-05-26 · unverdicted · none · ref 118 · internal anchor

DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation

cs.CV · 2026-06-30 · unverdicted · none · ref 35 · internal anchor

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · none · ref 33 · internal anchor

Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers

cs.CV · 2026-06-19 · unverdicted · none · ref 55 · internal anchor

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

cs.CV · 2026-06-03 · unverdicted · none · ref 29 · internal anchor

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

cs.CV · 2026-05-17 · unverdicted · none · ref 40 · internal anchor

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · none · ref 25 · internal anchor

LIST3R: Long-sequence Instance-aware 3D Reconstruction

cs.CV · 2026-07-01 · unverdicted · none · ref 26 · internal anchor

$R^3$: 3D Reconstruction via Relative Regression

cs.CV · 2026-05-26 · unverdicted · none · ref 72 · internal anchor

MemoryWAM: Efficient World Action Modeling with Persistent Memory

cs.RO · 2026-06-18 · unverdicted · none · ref 49 · internal anchor
