hub Canonical reference

arXiv preprint arXiv:2507.02863 (2025)

Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu · 2025 · arXiv 2507.02863

Canonical reference. 100% of citing Pith papers cite this work as background.

21 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 21 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7

citation-polarity summary

background 7

representative citing papers

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

cs.CV · 2026-05-17 · unverdicted · novelty 7.0

Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.

Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

cs.CV · 2026-04-08 · unverdicted · novelty 7.0

Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.

STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction

cs.CV · 2026-03-18 · unverdicted · novelty 7.0

STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.

FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT

cs.CV · 2026-03-08 · unverdicted · novelty 7.0

FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.

ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training

cs.CV · 2026-03-04 · unverdicted · novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.

3AM: 3egment Anything with Geometric Consistency in Videos

cs.CV · 2026-01-13 · unverdicted · novelty 7.0

3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

cs.CV · 2025-12-17 · unverdicted · novelty 7.0

MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.

Cambrian-P: Pose-Grounded Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

UniT: Unified Geometry Learning with Group Autoregressive Transformer

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.

Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.

Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval

cs.CV · 2026-05-10 · unverdicted · novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.

Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.

Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.

Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.

DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

cs.CV · 2026-04-01 · unverdicted · novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

cs.CV · 2025-10-20 · unverdicted · novelty 6.0

PAGE-4D is a feedforward extension of VGGT that uses a dynamics-aware aggregator and mask to disentangle pose estimation from geometry reconstruction in videos with moving objects.

HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

TTT3R: 3D Reconstruction as Test-Time Training

cs.CV · 2025-09-30 · unverdicted · novelty 5.0

TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20

citing papers explorer

Showing 21 of 21 citing papers.

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory cs.CV · 2026-05-17 · unverdicted · none · ref 32
Mamba-VGGT introduces a Sliding Window Mamba memory module and Zero-Init Spatial Memory Injector to enable persistent long-range geometric reasoning in VGGT for extended video sequences.
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training cs.CV · 2026-04-08 · unverdicted · none · ref 63
Mem3R achieves better long-sequence 3D reconstruction by decoupling tracking and mapping with a hybrid memory of TTT-updated MLP and explicit tokens, reducing model size and trajectory errors.
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction cs.CV · 2026-03-18 · unverdicted · none · ref 46
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT cs.CV · 2026-03-08 · unverdicted · none · ref 30
FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training cs.CV · 2026-03-04 · unverdicted · none · ref 78
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
3AM: 3egment Anything with Geometric Consistency in Videos cs.CV · 2026-01-13 · unverdicted · none · ref 89
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors cs.CV · 2025-12-17 · unverdicted · none · ref 53
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
Cambrian-P: Pose-Grounded Video Understanding cs.CV · 2026-05-21 · unverdicted · none · ref 103
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
UniT: Unified Geometry Learning with Group Autoregressive Transformer cs.CV · 2026-05-20 · unverdicted · none · ref 33
UniT unifies online and offline 3D geometry perception via a Group Autoregressive Transformer that processes observation groups with anchor-free point map prediction and a scale-adaptive loss.
Rethinking the State Update Gate for Long-Sequence Recurrent 3D Reconstruction cs.CV · 2026-05-16 · unverdicted · none · ref 24
A closed-form scalar frame-level gate α_t derived from internal feature changes extends effective memory in recurrent 3D reconstruction and improves accuracy on long sequences up to 4541 frames.
Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval cs.CV · 2026-05-10 · unverdicted · none · ref 24
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 26 · 2 links
Spark3R achieves up to 28x speedup on 1000-frame 3D reconstruction inputs by asymmetrically reducing query and key-value tokens in Vision Transformers while keeping competitive quality.
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction cs.CV · 2026-05-07 · unverdicted · none · ref 42 · 3 links
The paper proposes ray-aware pointer memory with adaptive retain-or-replace updates to improve long-term stability and pose accuracy in streaming 3D reconstruction.
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective cs.CV · 2026-04-15 · unverdicted · none · ref 110
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction cs.CV · 2026-04-09 · unverdicted · none · ref 85
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale cs.CV · 2026-04-01 · unverdicted · none · ref 75
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation cs.CV · 2025-10-20 · unverdicted · none · ref 16
PAGE-4D is a feedforward extension of VGGT that uses a dynamics-aware aggregator and mask to disentangle pose estimation from geometry reconstruction in videos with moving objects.
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction cs.CV · 2026-05-22 · unverdicted · none · ref 49
HorizonStream is a long-horizon Transformer that factorizes geometric evidence influence into channel-wise linear attention for long-range temporal propagation and local spatiotemporal attention for short-range matching, claiming stable generalization from 48-frame training to over 10,000-frame test
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression cs.CV · 2026-04-16 · unverdicted · none · ref 38
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
TTT3R: 3D Reconstruction as Test-Time Training cs.CV · 2025-09-30 · unverdicted · none · ref 95
TTT3R derives a closed-form learning rate from memory-observation alignment confidence to boost length generalization in RNN-based 3D reconstruction by 2x in global pose estimation.
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory cs.CV · 2026-05-20 · unreviewed · ref 77

arXiv preprint arXiv:2507.02863 (2025)

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer