Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.
hub Canonical reference
TTT3R: 3D Reconstruction as Test-Time Training
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available in https://rover-xingyu.github.io/TTT3R
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference 1.32-1.47x.
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
ViT³ is a Test-Time Training vision model that achieves linear complexity, matches or exceeds other linear models like Mamba on classification, generation, detection and segmentation, and narrows the gap to standard vision Transformers.
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
IVGT implicitly models continuous neural scene representations from pose-free multi-view images to enable coherent surface extraction, novel view synthesis, and related 3D tasks via SDF and color prediction.
citing papers explorer
-
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.
-
Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
-
Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model
A feature-free monocular VINS initialization method that uses feed-forward 3D model point cloud predictions achieves over 90% success rate with under 1.2 seconds of data and performs robustly in degraded environments.
-
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
-
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
-
RobotPan: A 360$^\circ$ Surround-View Robotic Vision System for Embodied Perception
RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM3D benchmarks.
-
Learning 3D Reconstruction with Priors in Test Time
Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
-
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
-
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
-
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
-
UniT: Unified Geometry Learning with Group Autoregressive Transformer
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
-
LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos
LongDPM introduces an overlap-aware chunk-based framework that registers and fuses local dynamic reconstructions to achieve coherent long-range 4D geometry and tracking from monocular video.
-
Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
-
Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
-
Linearizing Vision Transformer with Test-Time Training
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while speeding inference 1.32-1.47x.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
-
ViT$^3$: Unlocking Test-Time Training in Vision
ViT³ is a Test-Time Training vision model that achieves linear complexity, matches or exceeds other linear models like Mamba on classification, generation, detection and segmentation, and narrows the gap to standard vision Transformers.
-
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
-
IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation
IVGT implicitly models continuous neural scene representations from pose-free multi-view images to enable coherent surface extraction, novel view synthesis, and related 3D tasks via SDF and color prediction.
-
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.