
arxiv: 2605.21472 · v1 · pith:Q5IF6XB5new · submitted 2026-05-20 · 💻 cs.CV

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

Pith reviewed 2026-05-21 04:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming 3D generation · evidential memory · view-conditioned generator · temporal consistency · training-free · monocular video · memory management · 3D reconstruction

The pith

Stream3D turns any frozen view-conditioned 3D generator into a streaming system by keeping a fixed-size evidential memory of past frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

View-conditioned 3D generators produce strong single-frame results but yield inconsistent 3D outputs when applied independently to each frame in a long monocular video stream. Stream3D solves this by maintaining a compact evidential memory that scores and retains only the most informative historical frames while discarding the rest. The memory size stays constant as the stream lengthens, so neither storage nor computation grows with sequence duration. Because the underlying generator remains untouched and no retraining or extra losses are added, the method works with existing models such as SAM 3D or TRELLIS. Experiments on realistic and synthetic streaming benchmarks show gains in both photometric and geometric consistency over simple KV-cache or flow-editing baselines.

Core claim

Stream3D is the first training-free streaming mechanism that converts a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. It does so by maintaining a compact evidential memory that selectively caches the most informative historical frames according to a proposed evidence score. As the stream progresses, the memory updates dynamically to retain a fixed number of frames, which prevents linear memory growth and degradation over long sequences without any retraining, architectural modifications, or auxiliary losses.

What carries the argument

A compact evidential memory that scores incoming frames and retains only a fixed number of the highest-scoring ones to supply context to the generator.
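
The fixed-size retention rule can be sketched as a bounded min-heap. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `EvidentialMemory`, the `capacity` parameter, and the scalar `score` supplied by the caller are all hypothetical, and the paper's actual evidence score is not reproduced here.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    # Ordering uses (score, frame_index); the frame payload is ignored.
    score: float
    frame_index: int
    frame: object = field(compare=False)


class EvidentialMemory:
    """Hypothetical fixed-capacity memory that keeps the highest-scoring frames."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap = []  # min-heap: the weakest retained frame sits at index 0

    def update(self, frame_index, frame, score):
        entry = _Entry(score, frame_index, frame)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Evict the weakest retained frame and admit the stronger newcomer.
            # On equal scores the later frame wins (assumed tie-break).
            heapq.heapreplace(self._heap, entry)

    def frames(self):
        """Indices of the retained frames, in temporal order."""
        return sorted(e.frame_index for e in self._heap)

    def __len__(self):
        return len(self._heap)


# Usage with made-up scores: the memory never holds more than `capacity` frames.
memory = EvidentialMemory(capacity=3)
for i, s in enumerate([0.2, 0.9, 0.1, 0.5, 0.7, 0.3]):
    memory.update(i, frame=f"frame-{i}", score=s)
print(memory.frames())  # [1, 3, 4]
```

Each update costs O(log K) for capacity K, so the per-frame cost and the storage are independent of how long the stream has been running, which is the property the core claim depends on.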

If this is right

  • Arbitrarily long monocular streams can be processed without memory footprint growing linearly with length.
  • Temporal consistency is preserved across the entire generated 3D sequence.
  • Any pre-trained view-conditioned 3D generator can be used as-is without retraining or code changes.
  • Photometric and geometric metrics improve over KV-cache reuse and flow-based feature editing on both realistic and synthetic benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective-memory idea could be tested on other sequential generation tasks that currently suffer from context explosion.
  • An online variant might adapt the evidence threshold according to observed scene change rate.
  • Integration with real-time capture pipelines could enable continuous 3D reconstruction for robotics or AR without full history storage.

Load-bearing premise

The evidence score mechanism can reliably pick a fixed set of frames that is sufficient to stop inconsistency from accumulating across arbitrarily long sequences.

What would settle it

Running Stream3D on a very long monocular video and measuring whether geometric or photometric consistency metrics begin to degrade after several hundred frames despite the memory update rule.
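
A check of this kind could be scripted as a trend test on a per-frame consistency metric. The sketch below fits a least-squares line to metric values against frame index and flags drift when the slope falls below a tolerance. The function names, the tolerance, and the synthetic metric values are illustrative assumptions, not results from the paper.

```python
def drift_slope(metric_per_frame):
    """Least-squares slope of a metric against frame index (0, 1, 2, ...)."""
    n = len(metric_per_frame)
    if n < 2:
        raise ValueError("need at least two frames")
    mean_x = (n - 1) / 2
    mean_y = sum(metric_per_frame) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(metric_per_frame))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def degrades(metric_per_frame, tolerance=1e-3):
    """True if a higher-is-better metric falls faster than `tolerance` per frame."""
    return drift_slope(metric_per_frame) < -tolerance


# Synthetic curves only: a flat metric and one that loses 0.01 per frame.
stable = [30.0] * 500
drifting = [30.0 - 0.01 * i for i in range(500)]
print(degrades(stable), degrades(drifting))  # False True
```

A single slope can hide late-onset drift, so in practice one might also fit the slope over only the final few hundred frames and compare it with the slope over the full stream.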

Figures

Figures reproduced from arXiv: 2605.21472 by Fangneng Zhan, Kaichen Zhou, Mengyu Wang, Paul Liang, Xinhai Chang, Zeyang Bai.

Figure 1: Stream3D takes streaming input views as additional conditioning signals to improve the performance of pretrained single-view-conditioned 3D generation models. Compared with SAM 3D, this demo shows that incorporating views from the input stream can substantially improve 3D generation quality.
Figure 2: Framework of Stream3D. Given a streaming video, Stream3D processes frames chunk by chunk. A lightweight warmup pass extracts token-wise evidence scores from cross-attention; these are stored and used to vote for informative frames that update the evidential memory (the Adaptive Evidential Memory). The top-K informative frames are then passed to the frozen 3D generator for evidence-based multi-view generation. …
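
One way to read the voting step in the Figure 2 caption is sketched below. The exact evidence score is not given on this page, so the sketch assumes a stand-in: each query token's attention mass is summed over a view's patch tokens, each query token votes for the view it attends to most, and the K views with the most votes are kept. The array layout `(Q, V, P)`, the function names, and the vote rule are hypothetical.

```python
import numpy as np


def token_evidence(attn):
    """attn: (Q, V, P) cross-attention weights for Q query tokens, V views,
    and P patch tokens per view. Returns the (Q, V) attention mass per view."""
    if attn.ndim != 3:
        raise ValueError("expected attention of shape (Q, V, P)")
    return attn.sum(axis=-1)


def vote_top_k(attn, k):
    """Return the indices of the k views that win the most token votes."""
    evidence = token_evidence(attn)      # (Q, V)
    winners = evidence.argmax(axis=1)    # each query token votes for one view
    votes = np.bincount(winners, minlength=attn.shape[1])
    # Stable sort on negated counts: ties go to the earlier view (assumed).
    order = np.argsort(-votes, kind="stable")
    return sorted(order[:k].tolist())


# Toy attention: 4 query tokens, 3 views, 2 patch tokens per view.
attn = np.zeros((4, 3, 2))
attn[0, 1] = 0.5   # token 0 attends to view 1
attn[1, 0] = 0.5   # token 1 attends to view 0
attn[2, 1] = 0.5   # token 2 attends to view 1
attn[3, 2] = 0.5   # token 3 attends to view 2
print(vote_top_k(attn, k=2))  # [0, 1]
```

Hard voting discards how strongly each token prefers its winning view; summing the `(Q, V)` evidence over tokens would be a softer alternative under the same assumptions.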
Figure 3: Adaptive Evidential Memory. Given streaming input chunks, our memory is updated automatically by retaining the most informative historical views. The color transition from blue to pink indicates increasing frame indices, from earlier to later observations. As the memory accumulates stronger evidence over time, the reconstruction quality progressively improves.
Figure 4: Qualitative results on GSO and NAVI. Stream3D produces more consistent and geometrically faithful 3D generations than single-view and multi-view diffusion baselines.
Figure 5: Qualitative results of our ablation studies. FlowEdit denotes SAM3D with FlowEdit, and KV-Cache denotes SAM3D with KV-cache reuse. MV-SAM3D denotes MV-SAM3D applied to the last input chunk, while MV-SAM3D(R) denotes MV-SAM3D with K randomly selected views.
Original abstract

View-conditioned 3D generators such as SAM 3D, TRELLIS and Hunyuan3D produce high-quality object reconstructions from a single view, but real-world visual observation often arrives as long monocular streams. Naively applying these generators to each streaming frame independently leads to severe temporal inconsistency in the generated results. To address this problem, we propose Stream3D, the first training-free streaming mechanism that turns a frozen view-conditioned 3D generator into a streaming generator with constant cross-chunk memory. Stream3D achieves this by maintaining a compact evidential memory, which selectively caches the most informative historical frames based on a proposed evidence score mechanism. As the stream progresses, the memory dynamically updates to retain a fixed number of informative frames, preventing the memory footprint from growing linearly with sequence length. This also prevents degradation over long sequences and keeps the underlying generator completely unchanged without retraining, architectural modifications, or auxiliary losses. Evaluated on both realistic and synthetic streaming benchmarks, Stream3D outperforms latent-transport baselines, including KV-cache reuse and flow-based feature editing, across both photometric and geometric metrics. More details can be found at: https://anonymous-submission-20.github.io/streaming3D.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Stream3D, a training-free streaming mechanism that converts a frozen view-conditioned 3D generator (e.g., SAM 3D, TRELLIS) into a streaming generator for long monocular sequences. It maintains a fixed-size evidential memory that selectively caches the most informative historical frames via a proposed evidence score, dynamically updating to prevent linear memory growth and temporal inconsistency without retraining, architectural changes, or auxiliary losses. The method is evaluated on realistic and synthetic streaming benchmarks, where it reportedly outperforms latent-transport baselines (KV-cache reuse, flow-based feature editing) on photometric and geometric metrics.

Significance. If the evidence score reliably selects frames that sustain consistency without long-term drift, the result would be significant for deploying high-quality 3D generators in streaming settings such as video processing or robotics. The training-free property, constant cross-chunk memory, and preservation of the original generator are clear strengths that address a practical limitation in sequential 3D generation.

major comments (2)
  1. §4 (Experiments): The reported benchmarks do not include quantitative results or error accumulation analysis on sequences whose length exceeds the fixed memory capacity by an order of magnitude or more. This is load-bearing for the central claim that the evidential memory prevents degradation over arbitrarily long streams, as local evidence scoring may discard frames whose utility emerges only after many steps.
  2. §3.2 (Evidence Score Mechanism): No bound, stability analysis, or ablation is provided showing that the evidence score ranks frames by long-term utility for future views rather than immediate reconstruction quality or feature novelty. Without this, the assumption that a fixed number of retained frames suffices for consistency over unbounded sequences remains unverified.
minor comments (2)
  1. Abstract: The claim of outperformance on photometric and geometric metrics is stated without any numerical values, error bars, dataset sizes, or specific baseline scores, reducing the ability to gauge the practical improvement.
  2. §3: The manuscript would benefit from a clearer notation table or pseudocode for the memory update rule and evidence score computation to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback and for identifying key areas where additional evidence would strengthen the central claims of the paper. We address each major comment below and commit to revisions that directly respond to the concerns while remaining faithful to the scope and contributions of the work.

Point-by-point responses
  1. Referee: §4 (Experiments): The reported benchmarks do not include quantitative results or error accumulation analysis on sequences whose length exceeds the fixed memory capacity by an order of magnitude or more. This is load-bearing for the central claim that the evidential memory prevents degradation over arbitrarily long streams, as local evidence scoring may discard frames whose utility emerges only after many steps.

    Authors: We agree that longer-sequence evaluation is necessary to substantiate the claim of robustness over arbitrarily long streams. Our existing benchmarks already include sequences several times longer than the memory capacity and show stable photometric and geometric metrics, but we will add new quantitative results on sequences exceeding the memory size by an order of magnitude (e.g., 500–1000 frames with memory size 20–50). These will include error-accumulation curves plotted against frame index to demonstrate absence of drift. The revised manuscript will report these experiments in Section 4. revision: yes

  2. Referee: §3.2 (Evidence Score Mechanism): No bound, stability analysis, or ablation is provided showing that the evidence score ranks frames by long-term utility for future views rather than immediate reconstruction quality or feature novelty. Without this, the assumption that a fixed number of retained frames suffices for consistency over unbounded sequences remains unverified.

    Authors: The evidence score is constructed to balance immediate reconstruction quality with forward-looking information gain, which is why it outperforms pure novelty or reconstruction-error baselines in the reported ablations. We will add a targeted ablation in the revised Section 3.2 that measures long-term consistency when frames are selected by the evidence score versus immediate-quality-only or novelty-only alternatives, using held-out future views as the evaluation criterion. A formal stability bound or convergence analysis would require additional theoretical assumptions not developed in the current work; we will explicitly note this limitation and flag it as future work while emphasizing the empirical support from both synthetic and real streaming benchmarks. revision: partial

standing simulated objections not resolved
  • A formal mathematical bound or stability analysis proving that the evidence score ranks frames by long-term utility rather than immediate quality or novelty.

Circularity Check

0 steps flagged

No significant circularity; Stream3D mechanism is an independent algorithmic proposal

full rationale

The paper presents Stream3D as a training-free streaming mechanism that maintains a fixed-size evidential memory updated via a proposed evidence score to cache informative historical frames from a frozen view-conditioned 3D generator. This is described as a novel algorithmic contribution evaluated empirically on external realistic and synthetic benchmarks, with outperformance reported against baselines such as KV-cache reuse. No equations, derivations, or first-principles results are indicated that reduce the claimed consistency or performance to fitted parameters, self-definitions, or self-citation chains by construction. The central claim relies on the independent design of the evidence score and memory update rule rather than any tautological equivalence to inputs, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven effectiveness of the evidence score for frame selection and the assumption that a fixed-size memory suffices for long sequences; no free parameters or invented entities are quantified in the abstract.

axioms (1)
  • domain assumption: An evidence score can be computed that reliably ranks historical frames by informativeness for 3D consistency without training.
    This assumption underpins the memory update rule and is invoked to justify keeping memory size constant.
invented entities (1)
  • evidential memory (no independent evidence)
    Purpose: to store a fixed number of the most informative past frames for cross-chunk consistency.
    New data structure introduced to enable streaming without retraining the base generator.



