arXiv: 2604.13905 · v1 · submitted 2026-04-15 · 💻 cs.CV

Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords image-to-3D generation · sparse queries · 3D Gaussian primitives · input-view bias · efficient 3D modeling · rectified flow · anchor queries · expansion operator

The pith

Sparse learned 3D anchor queries expanded into local Gaussians replace dense grids for faster image-to-3D generation with lower input-view bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SparseGen to model 3D scenes from single images using a compact sparse set of learned 3D anchor queries instead of dense volumetric grids or triplanes. Each query is transformed and decoded by a learned expansion operator into a small local collection of 3D Gaussian primitives. The system trains end-to-end on 2D images only under a rectified-flow objective, learning to place capacity where geometry and appearance are needed. Experiments show this yields large drops in memory and runtime while maintaining multi-view consistency and reducing overfitting to the input images.

Core claim

SparseGen models scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, the model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity.
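For orientation, the generic rectified-flow target (as in the Liu et al. entry of the reference graph) interpolates linearly between a noise sample and a data sample and regresses the constant velocity between them. The sketch below shows only that generic target on plain Python lists. It is not the paper's exact reconstruction variant, which this page does not spell out, and the names `interpolate`, `velocity_target`, and `flow_loss` are illustrative.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Rectified flow regresses the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def flow_loss(predicted_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)

rng = random.Random(0)
x1 = [0.2, 0.8, 0.5]                     # stand-in for clean pixel values (assumption)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = 0.3
xt = interpolate(x0, x1, t)              # noisy input a model would be shown
oracle = velocity_target(x0, x1)         # a perfect prediction, for illustration
print(flow_loss(oracle, x0, x1))         # 0.0 for the oracle
```

In a trained system the oracle would be replaced by a network prediction conditioned on `xt` and `t`; how SparseGen routes that prediction through rendered Gaussians is not specified here.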

What carries the argument

Sparse set-latent expansion, in which a small set of learned 3D anchor queries is decoded by a learned operator into local clusters of 3D Gaussian primitives.
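The anchor-plus-expansion idea can be pictured with a toy sketch, assuming a plain linear map in place of the learned expansion operator and decoding only Gaussian centers (real primitives also carry scale, rotation, opacity, and color). The sizes `N`, `K`, `D`, the `RADIUS` bound, and the tanh-limited offsets are all hypothetical choices for illustration, not values or design decisions taken from the paper.

```python
import math
import random

def expand_anchor(anchor_xyz, feature, weights, radius):
    """Decode one anchor into K Gaussian centers near the anchor.

    `weights` is a toy K x 3 x D linear map standing in for the learned
    expansion operator; tanh bounds each offset to [-radius, radius].
    """
    centers = []
    for w_k in weights:  # one 3 x D block per output Gaussian
        offset = [radius * math.tanh(sum(w * f for w, f in zip(row, feature)))
                  for row in w_k]
        centers.append([a + o for a, o in zip(anchor_xyz, offset)])
    return centers

rng = random.Random(0)
N, K, D, RADIUS = 4, 8, 16, 0.1  # hypothetical sizes, not from the paper
anchors = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(N)]
features = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
weights = [[[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(3)]
           for _ in range(K)]

# Each anchor contributes K centers, so the scene has N * K Gaussians.
gaussians = [c for a, f in zip(anchors, features)
             for c in expand_anchor(a, f, weights, RADIUS)]
print(len(gaussians))  # 32
```

Bounding the offsets keeps every decoded center close to its anchor, which is one simple way to obtain the query-induced locality described in the Figure 8 caption.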

If this is right

  • Memory footprint and inference time drop substantially relative to dense volumetric or triplane methods.
  • Overfitting to the single conditioning view is measurably reduced.
  • Representation capacity is concentrated automatically on regions that matter for geometry and appearance.
  • Multi-view fidelity remains comparable to dense baselines despite the sparsity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchor-plus-expansion pattern could be tested on text-conditioned or video-conditioned 3D generation without changing the core machinery.
  • The introduced metrics for input-view bias and utilization could serve as standard evaluation tools for future 3D generators.
  • If the expansion operator generalizes, it may allow scaling to higher-resolution outputs by simply increasing the number of anchor queries rather than grid density.
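As a rough illustration of what a utilization measure could look like, the sketch below counts the fraction of primitives whose opacity is non-trivial, in the spirit of the opacity histogram in Figure 7. The function name `opacity_utilization`, the 0.05 threshold, and the opacity lists are assumptions made up for this example; the paper's own definition is not reproduced on this page.

```python
def opacity_utilization(opacities, threshold=0.05):
    """Fraction of primitives whose opacity exceeds `threshold`."""
    if not opacities:
        return 0.0
    active = sum(1 for o in opacities if o > threshold)
    return active / len(opacities)

# Made-up opacity lists purely for illustration, not measured results.
mostly_transparent = [0.0, 0.01, 0.02, 0.9, 0.0, 0.03, 0.8, 0.01]
mostly_opaque = [0.7, 0.9, 0.6, 0.02, 0.8, 0.95, 0.5, 0.85]
print(opacity_utilization(mostly_transparent))  # 0.25
print(opacity_utilization(mostly_opaque))       # 0.875
```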

Load-bearing premise

A compact sparse set of learned 3D anchor queries plus a learned expansion operator can capture sufficient geometry and appearance for complex real-world scenes without dense representations or explicit 3D supervision.

What would settle it

A quantitative comparison on scenes with fine surface detail or heavy occlusion, showing that the sparse model yields lower multi-view PSNR or more visible artifacts than a dense triplane or grid baseline trained under the same compute budget.

Figures

Figures reproduced from arXiv: 2604.13905 by Chenfeng Xu, Chensheng Peng, Jiuming Liu, Masayoshi Tomizuka, Yuxin Chen, Zhiyuan Xu.

Figure 1
Comparison with Prior 3D Generation Paradigms. Left: iterative diffusion generation (e.g., Viewset Diffusion) synthesizes multi-view images but requires multiple denoising steps, suffering from poor efficiency. Middle: deterministic feed-forward reconstruction methods (e.g., Splatter Image, LRM, pixelNeRF) demonstrate good synthesis quality on views close to the input ones yet degrade on held-out novel vi…
Figure 2
Overview of SPARSEGEN. Given V input views (clean and/or noisy) with known camera poses, an image encoder (with adaLN timesteps) and a 3D position encoder generate position-aware image features. A sparse set of learnable 3D anchor queries attends to these fused features in a transformer-based expansion network and is decoded into a compact set of 3D Gaussians. Finally, the generated Gaussians are rendered …
Figure 3
3D Position-Aware Encoder. Pixels are unprojected along the camera frustum to fixed depths and mapped by a small Conv–ReLU–Conv into per-pixel 3D features. These 3D features are merged with DINO-extracted image features and flattened into 3D position-aware tokens for the following transformer. … depth samples. These 3D points are then encoded using a 1×1 convolutional neural network to align with the feature…
Figure 4
Back-view conditioning qualitative example. The input view observes the object from the back, providing limited information about the front. SPARSEGEN produces plausible novel views, while deterministic feed-forward baselines fail. … PSNR, SSIM, LPIPS, and FID, revealing that these anchors furnish critical spatial priors that guide coherent Gaussian generation. In summary, sparse learnable anchors provide …
Figure 5
Qualitative results on ShapeNet-SRN Cars. We show examples under one-view conditioning. SPARSEGEN yields sharper details and cleaner boundaries, with the fastest generation speed. … ter Image [33]) exhibit significantly larger gaps, while SPARSEGEN achieves the smallest gaps across all metrics, which indicates more view-unbiased generation: it maintains comparable fidelity on viewpoints far from the conditioning …
Figure 8
Qualitative visualization of query utilization and locality. (a) Projection of per-query mean decoded Gaussian centers onto the image plane. (b) Corresponding RGB image. (c) A subset of decoded Gaussian centers, colored by the anchor query that generated them, illustrating query-induced locality. … late into effective capacity rather than redundant tokens. Utilization of representation. Beyond scaling, we …
Figure 7
Opacity/density utilization. Histogram comparison across methods. Splatter Image produces many near-transparent Gaussians, whereas SPARSEGEN yields predominantly non-trivial opacities. Viewset Diffusion has many zero-density voxels. For Viewset Diffusion, densities are passed through a sigmoid and rescaled to [0,1] for visualization. … view, whereas SPARSEGEN can synthesize plausible novel views …
Figure 9
Qualitative results on CO3D subsets. We show examples under one-view conditioning settings. Compared to other methods, SPARSEGEN yields better visual quality, with significantly faster speed and smaller representation size. … shown, our SPARSEGEN effectively generates high-quality novel views with fine details while maintaining ultra-fast inference speed compared to prior methods. A.7. Qualitative Results on…
Figure 10
Additional qualitative results on the ShapeNet-SRN dataset. Each group features one input image (left) and two novel view renderings (right) from all tested methods. Generally, deterministic feed-forward methods (i.e., Splatter Image [33]) tend to produce unsatisfactory details on regions not well observed in the input view, while generative methods with iterative diffusion (i.e., Viewset Diffusion [32]) …
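The unprojection step named in the Figure 3 caption (pixels lifted along the camera frustum to fixed depths) can be sketched with a standard pinhole camera model. This is a minimal illustration in camera coordinates only: the intrinsics `fx`, `fy`, `cx`, `cy`, the depth list, and the function names are hypothetical, and the Conv–ReLU–Conv feature mapping and any camera-to-world transform are left out.

```python
def unproject_pixel(u, v, depths, fx, fy, cx, cy):
    """Return camera-space 3D points for pixel (u, v) at each depth."""
    x_dir = (u - cx) / fx  # ray direction per unit depth along x
    y_dir = (v - cy) / fy  # ray direction per unit depth along y
    return [(x_dir * d, y_dir * d, d) for d in depths]

def frustum_points(width, height, depths, fx, fy, cx, cy):
    """Unproject every pixel center; the result is indexed [v][u][depth]."""
    return [[unproject_pixel(u + 0.5, v + 0.5, depths, fx, fy, cx, cy)
             for u in range(width)]
            for v in range(height)]

# Hypothetical 4x4 image with the principal point at the image center.
DEPTHS = [0.5, 1.0, 2.0]
points = frustum_points(4, 4, DEPTHS, fx=4.0, fy=4.0, cx=2.0, cy=2.0)
print(points[0][0])  # three points along the ray through the top-left pixel
```

Each pixel therefore yields one 3D point per depth sample, which is the per-pixel input a position encoder of this kind would consume.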
Original abstract

We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SparseGen, a framework for image-to-3D generation that represents scenes via a compact sparse set of learned 3D anchor queries. A learned expansion operator decodes each query into a small local set of 3D Gaussian primitives. The model is trained end-to-end under a rectified-flow reconstruction objective using only 2D supervision, with the goal of reducing input-view bias, lowering memory and inference costs relative to dense grids or triplanes, and adaptively allocating capacity. New quantitative metrics for input-view bias and representation utilization are proposed to support these claims.

Significance. If the empirical results and new metrics hold up under scrutiny, the work offers a practical alternative to dense volumetric or triplane representations for 3D generation. The emphasis on sparse learned anchors, 2D-only training, and explicit bias/utilization measures addresses real efficiency and generalization issues in the field. Credit is due for avoiding explicit 3D supervision and for attempting to quantify input-view bias, which could influence subsequent work on capacity-efficient 3D models.

major comments (2)
  1. [Abstract] The central efficiency and bias-reduction claims are quantitative in nature ('significant reductions in memory and inference time', 'low input-view bias', 'preserving multi-view fidelity'), yet the abstract supplies no numerical values, baseline comparisons, ablation tables, or error bars. Without these data the load-bearing assertions cannot be evaluated.
  2. [Method] Anchor queries and expansion operator: the assumption that a small fixed number of learned 3D anchors plus a learned local expansion operator suffices for complex real-world geometry and appearance is load-bearing for the 'principled alternative' claim. The manuscript should provide ablations on anchor count, scene complexity, and failure cases to test capacity limits.
minor comments (2)
  1. [Evaluation] The new bias and utilization metrics should be given explicit mathematical definitions (equations) and pseudocode for reproducibility.
  2. [Figures] Figure captions and axis labels for any qualitative multi-view results should explicitly state the number of anchor queries and the conditioning views used.
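One way an input-view bias measure could be made explicit is as a gap between quality on views near the conditioning view and quality on views far from it, which matches the "gaps" language in the Figure 5 caption. The sketch below is an assumption-laden illustration, not the paper's definition: `view_bias_gap`, the 45-degree cutoff, and the per-view scores are all hypothetical.

```python
def view_bias_gap(scores, angles_deg, near_cutoff_deg=45.0):
    """Mean score on views near the input minus mean score on far views.

    `angles_deg[i]` is the angular distance of view i from the conditioning
    view. A gap near zero would indicate little input-view bias.
    """
    near = [s for s, a in zip(scores, angles_deg) if a <= near_cutoff_deg]
    far = [s for s, a in zip(scores, angles_deg) if a > near_cutoff_deg]
    if not near or not far:
        raise ValueError("need at least one near and one far view")
    return sum(near) / len(near) - sum(far) / len(far)

# Made-up per-view PSNR values (dB) for illustration only.
scores = [28.0, 27.0, 22.0, 21.0]
angles = [10.0, 30.0, 120.0, 170.0]
print(view_bias_gap(scores, angles))  # 6.0
```

The same function applies to any per-view metric where higher is better; for LPIPS or FID-style scores the sign convention would need to be flipped.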

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating where revisions will be made to strengthen the presentation of our claims and experiments.

Point-by-point responses
  1. Referee: [Abstract] The central efficiency and bias-reduction claims are quantitative in nature ('significant reductions in memory and inference time', 'low input-view bias', 'preserving multi-view fidelity'), yet the abstract supplies no numerical values, baseline comparisons, ablation tables, or error bars. Without these data the load-bearing assertions cannot be evaluated.

    Authors: We agree that the abstract would benefit from concrete numbers to support its claims. In the revised manuscript we will update the abstract to include specific quantitative results drawn from our experiments and tables, such as the observed reductions in memory footprint and inference time relative to triplane and volumetric baselines, along with the measured improvement in the input-view bias metric. These additions will be kept concise while directing readers to the supporting tables and figures.
    Revision: yes

  2. Referee: [Method] Anchor queries and expansion operator: the assumption that a small fixed number of learned 3D anchors plus a learned local expansion operator suffices for complex real-world geometry and appearance is load-bearing for the 'principled alternative' claim. The manuscript should provide ablations on anchor count, scene complexity, and failure cases to test capacity limits.

    Authors: The current manuscript already reports experiments on varying anchor counts in Section 4.3 and the supplement, demonstrating that performance saturates beyond a modest number of anchors for the evaluated scenes. We acknowledge, however, that more explicit discussion of capacity limits for complex geometry is needed. We will add a new subsection with additional ablations on scene complexity (including higher-detail subsets) and a dedicated analysis of failure cases, such as thin structures or fine textures, to better substantiate the capacity claims.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper presents SparseGen as using a compact sparse set of learned 3D anchor queries decoded by a learned expansion operator into local 3D Gaussians, trained end-to-end under a rectified-flow reconstruction objective with no 3D supervision. The central claim that this yields efficient capacity allocation and reduced input-view bias is supported by new quantitative metrics for bias and utilization, plus reported gains in memory, speed, and multi-view fidelity. No equations, self-citations, or fitted parameters are shown that reduce any prediction or uniqueness claim to a tautology by construction; the training objective and evaluation criteria remain externally grounded and independent of the target result. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on a small number of architectural choices and one domain assumption about supervision; no new physical entities are postulated.

free parameters (1)
  • number of anchor queries
    Sparsity level is an architectural hyperparameter that controls capacity and must be chosen to balance efficiency against reconstruction quality.
axioms (1)
  • domain assumption: Rectified-flow reconstruction from multiple rendered views is sufficient to learn accurate 3D structure without explicit 3D labels
    Invoked in the training description; if false, the model may overfit 2D appearance without consistent 3D geometry.
invented entities (1)
  • learned 3D anchor queries (no independent evidence)
    purpose: Compact scene representation that allocates capacity where needed
    Core new representational primitive introduced by the method



Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

  1. [1]

    Volume rendering

    Robert A. Drebin, Loren Carpenter, and Pat Hanrahan. Volume rendering. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 363–372. Association for Computing Machinery, New York, NY, USA, 1998.

  2. [2]

    End-to-End Object Detection with Transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.

  3. [3]

    Computer display of curved surfaces

    Edwin Catmull. Computer display of curved surfaces. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 35–41. Association for Computing Machinery.

  4. [4]

    Objaverse: A Universe of Annotated 3D Objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

  5. [5]

    Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560, Philadelphia, PA, USA, 2022. IEEE.

  6. [6]

    Digital Pictures: Representation, Compression, and Standards

    Barry G. Haskell and Arun N. Netravali. Digital Pictures: Representation, Compression, and Standards. Perseus Publishing, 2nd edition, 1997.

  7. [7]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.

  8. [8]

    ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

    Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models, 2024.

  9. [9]

    LRM: Large Reconstruction Model for Single Image to 3D

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations, 2023.

  10. [10]

    Surface reconstruction from unorganized points

    Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, pages 71–78. Association for Computing Machinery.

  11. [11]

    Planning-oriented Autonomous Driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented Autonomous Driving. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862, 2023.

  12. [12]

    LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. In The Thirteenth International Conference on Learning Representations, 2024.

  13. [13]

    Perceptual Losses for Real-Time Style Transfer and Super-Resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Computer Vision – ECCV 2016, pages 694–711, Cham, 2016. Springer International Publishing.

  15. [15]

    Gen2sim: Scaling up robot learning in simulation with generative models

    Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6672–6679. IEEE, 2024.

  16. [16]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. 42(4):139:1–139:14.

  17. [17]

    Grounding Image Matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding Image Matching in 3D with MASt3R. In Computer Vision – ECCV 2024, pages 71–91, Cham, 2025. Springer Nature Switzerland.

  18. [18]

    LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

    Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching. pages 6517–6526.

  19. [19]

    Magic3D: High-Resolution Text-to-3D Content Creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. pages 300–309.

  20. [20]

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations, 2022.

  21. [21]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022.

  22. [22]

    PETR: Position Embedding Transformation for Multi-view 3D Object Detection

    Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position Embedding Transformation for Multi-view 3D Object Detection. In Computer Vision – ECCV 2022, pages 531–548. Springer Nature Switzerland, Cham, 2022.

  23. [23]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. 65(1):99–106.

  24. [24]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ...

  25. [25]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.

  26. [26]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations, 2022.

  27. [27]

    Compositing digital images

    Thomas Porter and Tom Duff. Compositing digital images. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 253–259, New York, NY, USA, 1984. Association for Computing Machinery.

  28. [28]

    Common Objects in 3D: Large-Scale Learning and Evaluation of Real-Life 3D Category Reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-Life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.

  29. [29]

    Mehdi S.M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In 2022 IEEE/CVF Conference on Co...

  30. [30]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. In The Twelfth International Conference on Learning Representations, 2023.

  31. [31]

    Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

    Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.

  32. [32]

    Toward realistic 3d avatar generation with dynamic 3d gaussian splatting for ar/vr communication

    Hail Song. Toward realistic 3d avatar generation with dynamic 3d gaussian splatting for ar/vr communication. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 869–870. IEEE, 2024.

  33. [33]

    Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8829–8839, 2023.

  34. [34]

    Splatter Image: Ultra-Fast Single-View 3D Reconstruction

    Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter Image: Ultra-Fast Single-View 3D Reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10217, 2024.

  36. [36]

    DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation.

  37. [37]

    Continuous 3D Perception Model with Persistent State

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10510–10522, 2025.

  38. [38]

    DUSt3R: Geometric 3D Vision Made Easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

  39. [39]

    DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

    Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, pages 180–191. PMLR, 2022.

  40. [40]

    Image quality assessment: From error visibility to structural similarity

    Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  41. [41]

    MeshLRM: Large Reconstruction Model for High-Quality Meshes

    Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. MeshLRM: Large Reconstruction Model for High-Quality Meshes, 2025.

  42. [42]

    Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21924–21935, 2025.

  43. [43]

    Holodeck: Language guided generation of 3d embodied ai environments

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024.

  44. [44]

    pixelNeRF: Neural Radiance Fields from One or Few Images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4576–4585, 2021.

  45. [45]

    GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting, 2024.

  46. [46]

    The Unreasonable Effectiveness of Deep Features as a Perceptual Metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, Salt Lake City, UT, 2018. IEEE.

  47. [47]

    Free3D: Consistent Novel View Synthesis Without 3D Representation

    Chuanxia Zheng and Andrea Vedaldi. Free3D: Consistent Novel View Synthesis Without 3D Representation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9720–9731, Seattle, WA, USA, 2024. IEEE.