Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias

Chenfeng Xu; Chensheng Peng; Jiuming Liu; Masayoshi Tomizuka; Yuxin Chen; Zhiyuan Xu

arxiv: 2604.13905 · v1 · submitted 2026-04-15 · 💻 cs.CV

Pith reviewed 2026-05-10 13:59 UTC · model grok-4.3

keywords image-to-3D generation · sparse queries · 3D Gaussian primitives · input-view bias · efficient 3D modeling · rectified flow · anchor queries · expansion operator

The pith

Sparse learned 3D anchor queries expanded into local Gaussians replace dense grids for faster image-to-3D generation with lower input-view bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SparseGen to model 3D scenes from single images using a compact sparse set of learned 3D anchor queries instead of dense volumetric grids or triplanes. Each query is transformed and decoded by a learned expansion operator into a small local collection of 3D Gaussian primitives. The system trains end-to-end on 2D images only under a rectified-flow objective, learning to place capacity where geometry and appearance are needed. Experiments show this yields large drops in memory and runtime while maintaining multi-view consistency and reducing overfitting to the input images.

Core claim

SparseGen models scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, the model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity.
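For orientation, the generic rectified-flow objective (as in Liu et al., "Flow Straight and Fast") can be written as below. This is a sketch of the standard formulation only: the symbols $x_0$, $x_1$, $t$, and $v_\theta$ are notation assumed here, and the paper's reconstruction variant, which conditions on input views and supervises through rendered 2D images, may differ in its details.

```latex
% Generic rectified-flow objective; the paper's exact variant is not stated on this page.
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1,
      \qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1], \\
  \mathcal{L}(\theta) &= \mathbb{E}_{x_0,\,x_1,\,t}
      \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|_2^2 .
\end{align}
```

The target velocity $x_1 - x_0$ is constant along each straight path from noise to data, which is what allows sampling with few integration steps.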

What carries the argument

Sparse set-latent expansion, in which a small set of learned 3D anchor queries is decoded by a learned operator into local clusters of 3D Gaussian primitives.
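The anchor-plus-expansion idea can be illustrated with a toy decoder. The sketch below is not the paper's architecture (which uses a transformer-based expansion network): the names `expand_anchor` and `expand_all`, the linear `weights`, the `radius` bound, and the per-query count `k` are all assumptions made for illustration. Only Gaussian centers are decoded here; a full 3D Gaussian also carries covariance, opacity, and color.

```python
import math
import random


def expand_anchor(anchor_xyz, query_feat, weights, k, radius):
    """Decode one anchor query into k Gaussian centers near its anchor.

    weights[j][axis] is a coefficient vector for the j-th Gaussian (toy linear decoder).
    tanh bounds each offset, so every center stays within `radius` of the anchor per axis.
    """
    centers = []
    for j in range(k):
        offset = [
            radius * math.tanh(sum(w * f for w, f in zip(weights[j][axis], query_feat)))
            for axis in range(3)
        ]
        centers.append(tuple(a + o for a, o in zip(anchor_xyz, offset)))
    return centers


def expand_all(anchors, feats, weights, k=4, radius=0.1):
    """Expand every anchor query into its own local cluster of centers."""
    return [expand_anchor(a, f, weights, k, radius) for a, f in zip(anchors, feats)]


# Hypothetical sizes: 16 anchor queries, 8-dim features, 4 Gaussians per query.
rng = random.Random(0)
dim, k = 8, 4
anchors = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(16)]
feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
weights = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)] for _ in range(k)]

gaussians = expand_all(anchors, feats, weights, k=k, radius=0.1)
total = sum(len(g) for g in gaussians)
print(total)  # 64 = 16 anchors x 4 Gaussians each
```

Bounding the offsets is one simple way to obtain the query-induced locality that the paper visualizes; how the actual model achieves locality is not specified on this page.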

If this is right

Memory footprint and inference time drop substantially relative to dense volumetric or triplane methods.
Overfitting to the single conditioning view is measurably reduced.
Representation capacity is concentrated automatically on regions that matter for geometry and appearance.
Multi-view fidelity remains comparable to dense baselines despite the sparsity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchor-plus-expansion pattern could be tested on text-conditioned or video-conditioned 3D generation without changing the core machinery.
The introduced metrics for input-view bias and utilization could serve as standard evaluation tools for future 3D generators.
If the expansion operator generalizes, it may allow scaling to higher-resolution outputs by simply increasing the number of anchor queries rather than grid density.
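The scaling argument in the last point can be made concrete with simple counting. The sketch below compares how the number of stored elements grows for a dense grid, a triplane, and a sparse query set; the function names and the sizes used are illustrative assumptions, not figures from the paper.

```python
def dense_grid_cells(resolution):
    """A dense voxel grid stores one cell per (x, y, z) location."""
    return resolution ** 3


def triplane_cells(resolution):
    """A triplane stores three axis-aligned 2D feature planes."""
    return 3 * resolution ** 2


def sparse_primitives(num_queries, per_query):
    """A sparse query set stores per_query primitives for each anchor query."""
    return num_queries * per_query


# Doubling the resolution multiplies grid cells by 8 and triplane cells by 4,
# while doubling the number of queries only doubles the primitive count.
for r in (32, 64, 128):
    print(r, dense_grid_cells(r), triplane_cells(r))
for n in (256, 512, 1024):
    print(n, sparse_primitives(n, per_query=8))
```

Element counts are only a proxy for memory, since per-element feature sizes differ across representations.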

Load-bearing premise

A compact sparse set of learned 3D anchor queries plus a learned expansion operator can capture sufficient geometry and appearance for complex real-world scenes without dense representations or explicit 3D supervision.

What would settle it

Quantitative comparison on scenes with fine surface detail or heavy occlusion showing that the sparse model produces lower multi-view PSNR or visible artifacts compared with a dense triplane or grid baseline trained to the same compute budget.

Figures

Figures reproduced from arXiv: 2604.13905 by Chenfeng Xu, Chensheng Peng, Jiuming Liu, Masayoshi Tomizuka, Yuxin Chen, Zhiyuan Xu.

**Figure 1.** Comparison with Prior 3D Generation Paradigms. Left: iterative diffusion generation (e.g., Viewset Diffusion) synthesizes multi-view images but requires multiple denoising steps, suffering from poor efficiency. Middle: deterministic feed-forward reconstruction methods (e.g., Splatter Image, LRM, pixelNeRF) demonstrate good synthesis quality on views close to the input ones yet degrade on heldout novel vi…

**Figure 2.** Overview of SPARSEGEN. Given V input views (clean and/or noisy) with known camera poses, an image encoder (with adaLN timesteps) and a 3D position encoder generate position-aware image features. A sparse set of learnable 3D anchor queries attends to these fused features in a transformer-based expansion network and is decoded into a compact set of 3D Gaussians. Finally, the generated Gaussians are rendered …

**Figure 3.** 3D Position-Aware Encoder. Pixels are unprojected along the camera frustum to fixed depths and mapped by a small Conv–ReLU–Conv into per-pixel 3D features. These 3D features are merged with DINO-extracted image features and flattened into 3D position-aware tokens for the following transformer.

**Figure 4.** Back-view conditioning qualitative example. The input view observes the object from the back, providing limited information about the front. SPARSEGEN produces plausible novel views, while deterministic feed-forward baselines fail.

**Figure 5.** Qualitative results on ShapeNet-SRN Cars. We show examples under one-view conditioning. SPARSEGEN yields sharper details and cleaner boundaries, with the fastest generation speed.

**Figure 8.** Qualitative visualization of query utilization and locality. (a) Projection of per-query mean decoded Gaussian centers onto the image plane. (b) Corresponding RGB image. (c) A subset of decoded Gaussian centers, colored by the anchor query that generated them, illustrating query-induced locality.

**Figure 7.** Opacity/density utilization. Histogram comparison across methods. Splatter Image produces many near-transparent Gaussians, whereas SPARSEGEN yields predominantly non-trivial opacities. Viewset Diffusion has many zero-density voxels. For Viewset Diffusion, densities are passed through a sigmoid and rescaled to [0,1] for visualization.

**Figure 9.** Qualitative results on CO3D subsets. We show examples under one-view conditioning settings. Compared to other methods, SPARSEGEN yields better visual quality, with significantly faster speed and smaller representation size.

**Figure 10.** Additional qualitative results on the ShapeNet-SRN dataset. Each group features one input image (left) and two novel view renderings (right) from all tested methods. Generally, deterministic feed-forward methods (i.e., Splatter Image [33]) tend to produce unsatisfactory details on regions not well observed in the input view, while generative methods with iterative diffusion (i.e., Viewset Diffusion [32]) …
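The unprojection step described in the Figure 3 caption can be sketched with a standard pinhole camera model. The function name `unproject_pixel`, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the choice of z-depth rather than distance along the ray are assumptions made here; the paper's camera conventions and any transform into world coordinates are not given on this page.

```python
def unproject_pixel(u, v, depths, fx, fy, cx, cy):
    """Lift pixel (u, v) to camera-space 3D points, one per fixed depth.

    Uses the pinhole model: x = (u - cx) / fx * z, y = (v - cy) / fy * z.
    `depths` are z-depths along the optical axis (an assumption of this sketch).
    """
    x_dir = (u - cx) / fx
    y_dir = (v - cy) / fy
    return [(x_dir * z, y_dir * z, z) for z in depths]


# Hypothetical 128x128 image with focal length 100 and a centered principal point.
points = unproject_pixel(164, 64, depths=[1.0, 2.0], fx=100.0, fy=100.0, cx=64.0, cy=64.0)
print(points)  # [(1.0, 0.0, 1.0), (2.0, 0.0, 2.0)]
```

Each pixel therefore contributes a short list of candidate 3D positions along its viewing ray, which a small network can then encode into position-aware features.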

read the original abstract

We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.
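One plausible form of a utilization measure, in the spirit of the opacity histogram in Figure 7, is the fraction of primitives whose opacity is non-trivial. The sketch below is an assumption about what such a measure could look like, not the paper's definition; the name `opacity_utilization` and the threshold `eps` are hypothetical.

```python
def opacity_utilization(opacities, eps=0.05):
    """Fraction of primitives whose opacity exceeds eps (hypothetical threshold)."""
    if not opacities:
        raise ValueError("need at least one primitive")
    return sum(1 for o in opacities if o > eps) / len(opacities)


# Toy example: three near-transparent primitives and two clearly visible ones.
print(opacity_utilization([0.0, 0.01, 0.02, 0.6, 0.9]))  # 0.4
```

A representation that spends most of its primitives on near-transparent Gaussians would score low on such a measure regardless of its rendering quality.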

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SparseGen trades dense 3D representations for a small set of learned anchor queries that expand into Gaussians, delivering efficiency and lower input-view bias under 2D-only rectified-flow training.

The core move here is replacing dense grids or triplanes with a compact set of learned 3D anchor queries. Each query gets decoded by a learned expansion operator into a local handful of Gaussian primitives. Training uses rectified flow on 2D images alone, no explicit 3D supervision, and the model learns to place capacity where geometry and appearance actually matter. They also define new quantitative scores for input-view bias and representation utilization, which lets them show the sparse setup reduces overfitting to the conditioning views compared with denser baselines. That combination is the main novelty relative to prior volumetric or pixel-aligned work.

The efficiency claims are the strongest part. If the reported drops in memory and inference time hold in the full experiments, this setup could matter for anyone running image-to-3D pipelines on modest hardware. The bias metrics are a practical addition; they give a concrete way to measure whether the output is genuinely using the sparse latent or just echoing the input view.

The soft spot is capacity for complex scenes. A fixed small number of anchors plus learned expansion can allocate adaptively, but it is not obvious this will preserve fine detail or handle multi-object clutter without the expansion operator becoming a bottleneck. The paper would be tighter with more failure-case analysis or ablations that isolate how many queries are truly needed versus how much the flow objective is doing the heavy lifting.

Overall this is aimed at people building deployable 3D generators who already work with Gaussians or flow models. It is worth a serious referee because the framing is clean, the evaluation targets a real practical problem, and the central efficiency argument is testable rather than circular.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SparseGen, a framework for image-to-3D generation that represents scenes via a compact sparse set of learned 3D anchor queries. A learned expansion operator decodes each query into a small local set of 3D Gaussian primitives. The model is trained end-to-end under a rectified-flow reconstruction objective using only 2D supervision, with the goal of reducing input-view bias, lowering memory and inference costs relative to dense grids or triplanes, and adaptively allocating capacity. New quantitative metrics for input-view bias and representation utilization are proposed to support these claims.

Significance. If the empirical results and new metrics hold up under scrutiny, the work offers a practical alternative to dense volumetric or triplane representations for 3D generation. The emphasis on sparse learned anchors, 2D-only training, and explicit bias/utilization measures addresses real efficiency and generalization issues in the field. Credit is due for avoiding explicit 3D supervision and for attempting to quantify input-view bias, which could influence subsequent work on capacity-efficient 3D models.

major comments (2)

[Abstract] The central efficiency and bias-reduction claims are quantitative in nature ('significant reductions in memory and inference time', 'low input-view bias', 'preserving multi-view fidelity'), yet no numerical values, baseline comparisons, ablation tables, or error bars are supplied. Without these data the load-bearing assertions cannot be evaluated.
[Method] Anchor queries and expansion operator: the assumption that a small fixed number of learned 3D anchors plus a learned local expansion operator suffices for complex real-world geometry and appearance is load-bearing for the 'principled alternative' claim. The manuscript should provide ablations on anchor count, scene complexity, and failure cases to test capacity limits.

minor comments (2)

[Evaluation] The new bias and utilization metrics should be given explicit mathematical definitions (equations) and pseudocode for reproducibility.
[Figures] Figure captions and axis labels for any qualitative multi-view results should explicitly state the number of anchor queries and the conditioning views used.
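The kind of explicit definition requested in the [Evaluation] comment might look like the sketch below, assuming the input-view bias measure is a fidelity gap between novel views near the conditioning view and views far from it. That reading, the name `view_bias_gap`, and the 30-degree split `near_deg` are assumptions made here; the paper's actual definition is not reproduced on this page.

```python
def view_bias_gap(scores, angles_deg, near_deg=30.0):
    """Mean score on views near the conditioning view minus the mean on far views.

    scores: per-view quality values where higher is better (e.g., PSNR in dB).
    angles_deg: angular distance of each evaluated view from the conditioning view.
    A gap close to zero suggests little bias toward the input view.
    """
    near = [s for s, a in zip(scores, angles_deg) if a <= near_deg]
    far = [s for s, a in zip(scores, angles_deg) if a > near_deg]
    if not near or not far:
        raise ValueError("need at least one near view and one far view")
    return sum(near) / len(near) - sum(far) / len(far)


# Toy numbers, not results from the paper: two near views and two far views.
print(view_bias_gap([30.0, 28.0, 22.0, 20.0], [0, 20, 90, 170]))  # 8.0
```

For metrics where lower is better, such as LPIPS or FID, the sign convention would need to be flipped before gaps are compared.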

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating where revisions will be made to strengthen the presentation of our claims and experiments.

Referee: [Abstract] The central efficiency and bias-reduction claims are quantitative in nature ('significant reductions in memory and inference time', 'low input-view bias', 'preserving multi-view fidelity'), yet no numerical values, baseline comparisons, ablation tables, or error bars are supplied. Without these data the load-bearing assertions cannot be evaluated.

Authors: We agree that the abstract would benefit from concrete numbers to support its claims. In the revised manuscript we will update the abstract to include specific quantitative results drawn from our experiments and tables, such as the observed reductions in memory footprint and inference time relative to triplane and volumetric baselines, along with the measured improvement in the input-view bias metric. These additions will be kept concise while directing readers to the supporting tables and figures.

revision: yes

Referee: [Method] Anchor queries and expansion operator: the assumption that a small fixed number of learned 3D anchors plus a learned local expansion operator suffices for complex real-world geometry and appearance is load-bearing for the 'principled alternative' claim. The manuscript should provide ablations on anchor count, scene complexity, and failure cases to test capacity limits.

Authors: The current manuscript already reports experiments on varying anchor counts in Section 4.3 and the supplement, demonstrating that performance saturates beyond a modest number of anchors for the evaluated scenes. We acknowledge, however, that more explicit discussion of capacity limits for complex geometry is needed. We will add a new subsection with additional ablations on scene complexity (including higher-detail subsets) and a dedicated analysis of failure cases, such as thin structures or fine textures, to better substantiate the capacity claims.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper presents SparseGen as using a compact sparse set of learned 3D anchor queries decoded by a learned expansion operator into local 3D Gaussians, trained end-to-end under a rectified-flow reconstruction objective with no 3D supervision. The central claim that this yields efficient capacity allocation and reduced input-view bias is supported by new quantitative metrics for bias and utilization, plus reported gains in memory, speed, and multi-view fidelity. No equations, self-citations, or fitted parameters are shown that reduce any prediction or uniqueness claim to a tautology by construction; the training objective and evaluation criteria remain externally grounded and independent of the target result. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on a small number of architectural choices and one domain assumption about supervision; no new physical entities are postulated.

free parameters (1)

number of anchor queries
Sparsity level is an architectural hyperparameter that controls capacity and must be chosen to balance efficiency against reconstruction quality.

axioms (1)

domain assumption: Rectified-flow reconstruction from multiple rendered views is sufficient to learn accurate 3D structure without explicit 3D labels.
Invoked in the training description; if false, the model may overfit 2D appearance without consistent 3D geometry.

invented entities (1)

learned 3D anchor queries (no independent evidence)
purpose: Compact scene representation that allocates capacity where needed.
Core new representational primitive introduced by the method.

pith-pipeline@v0.9.0 · 5472 in / 1306 out tokens · 37524 ms · 2026-05-10T13:59:04.534251+00:00 · methodology

Reference graph

Works this paper leans on

[1] Robert A. Drebin, Loren Carpenter, and Pat Hanrahan. Volume rendering. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 363–372. Association for Computing Machinery, New York, NY, USA, 1998.

[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.

[3] Edwin Catmull. Computer display of curved surfaces. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 35–41. Association for Computing Machinery.

[4] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

[5] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560, Philadelphia, PA, USA, 2022. IEEE.

[6] Barry G. Haskell and Arun N. Netravali. Digital Pictures: Representation, Compression, and Standards. Perseus Publishing, 2nd edition, 1997.

[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.

[8] Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models, 2024.

[9] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations, 2023.

[10] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, pages 71–78. Association for Computing Machinery.

[11] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented Autonomous Driving. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862.

[12] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. In The Thirteenth International Conference on Learning Representations, 2024.

[13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Computer Vision – ECCV 2016, pages 694–711, Cham. Springer International Publishing.

[14] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6672–6679. IEEE.

[15] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. 42(4):139:1–139:14.

[16] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding Image Matching in 3D with MASt3R. In Computer Vision – ECCV 2024, pages 71–91, Cham, 2025. Springer Nature Switzerland.

[17] Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching. pages 6517–6526.

[18] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. pages 300–309.

[19] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations, 2022.

[20] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022.

[21] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position Embedding Transformation for Multi-view 3D Object Detection. In Computer Vision – ECCV 2022, pages 531–548. Springer Nature Switzerland, Cham, 2022.

[22] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. 65(1):99–106.

[23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... DINOv2: Learning Robust Visual Features without Supervision, 2024.

[24] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.

[25] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations, 2022.

[26] Thomas Porter and Tom Duff. Compositing digital images. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 253–259, New York, NY, USA, 1984. Association for Computing Machinery.

[27] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-Life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.

[28] Mehdi S.M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In 2022 IEEE/CVF Conference on Co...

[29] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. In The Twelfth International Conference on Learning Representations, 2023.

[30] Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.

[31] Hail Song. Toward realistic 3d avatar generation with dynamic 3d gaussian splatting for ar/vr communication. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 869–870. IEEE, 2024.

[32] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8829–8839, 2023.

[33] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter Image: Ultra-Fast Single-View 3D Reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10217.

[34] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation.

[35] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10510–10522, 2025.

[36] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

[37] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, pages 180–191. PMLR, 2022.

[38] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[39] Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. MeshLRM: Large Reconstruction Model for High-Quality Meshes, 2025.

[40] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21924–21935, 2025.

[41] Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024.

[42] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields from One or Few Images. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4576–4585, 2021.

[43] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting, 2024.

[44] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, Salt Lake City, UT, 2018. IEEE.

[45] Chuanxia Zheng and Andrea Vedaldi. Free3D: Consistent Novel View Synthesis Without 3D Representation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9720–9731, Seattle, WA, USA, 2024. IEEE.

[1] Robert A. Drebin, Loren Carpenter, and Pat Hanrahan. Volume rendering. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 363–372. Association for Computing Machinery, New York, NY, USA, 1998.

[2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In Computer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing.

[3] Edwin Catmull. Computer display of curved surfaces. In Seminal Graphics: Pioneering Efforts That Shaped the Field, Volume 1, pages 35–41. Association for Computing Machinery.

[4] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

[5] Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560, Philadelphia, PA, USA, 2022. IEEE.

[6] Barry G. Haskell and Arun N. Netravali. Digital Pictures: Representation, Compression, and Standards. Perseus Publishing, 2nd edition, 1997.

[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.

[8] Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, and Matthias Nießner. ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models, 2024.

[9] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large Reconstruction Model for Single Image to 3D. In The Twelfth International Conference on Learning Representations, 2023.

[10] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In Proceedings of the 19th Annual Conference on Computer Graphics and Interactive Techniques, pages 71–78. Association for Computing Machinery.

[11] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, and Hongyang Li. Planning-oriented Autonomous Driving. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17853–17862.

[12] Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias. In The Thirteenth International Conference on Learning Representations, 2024.

[13] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Computer Vision – ECCV 2016, pages 694–711, Cham, 2016. Springer International Publishing.

[15] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6672–6679. IEEE.

[16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. 42(4):139:1–139:14.

[17] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding Image Matching in 3D with MASt3R. In Computer Vision – ECCV 2024, pages 71–91, Cham, 2025. Springer Nature Switzerland.

[18] Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching. pages 6517–6526.

[19] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-Resolution Text-to-3D Content Creation. pages 300–309.

[20] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow Matching for Generative Modeling. In The Eleventh International Conference on Learning Representations, 2022.

[21] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022.

[22] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position Embedding Transformation for Multi-view 3D Object Detection. In Computer Vision – ECCV 2022, pages 531–548. Springer Nature Switzerland, Cham, 2022.

[23] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. 65(1):99–106.

[24] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, ... DINOv2: Learning Robust Visual Features without Supervision, 2024.

[25] William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4172–4182, 2023.

[26] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D Diffusion. In The Eleventh International Conference on Learning Representations, 2022.

[27] Thomas Porter and Tom Duff. Compositing digital images. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 253–259, New York, NY, USA, 1984. Association for Computing Machinery.

[28] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common Objects in 3D: Large-Scale Learning and Evaluation of Real-Life 3D Category Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10901–10911, 2021.

[29] Mehdi S.M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. In 2022 IEEE/CVF Conference on Co...

[30] Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. MVDream: Multi-view Diffusion for 3D Generation. In The Twelfth International Conference on Learning Representations, 2023.

[31] Vincent Sitzmann, Michael Zollhoefer, and Gordon Wetzstein. Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2019.

[32] Hail Song. Toward realistic 3d avatar generation with dynamic 3d gaussian splatting for ar/vr communication. In 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages 869–870. IEEE, 2024.

[33] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8829–8839, 2023.

[34] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter Image: Ultra-Fast Single-View 3D Reconstruction. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10208–10217.

[36] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation.

[37] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10510–10522, 2025.

[38] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

[39] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. In Proceedings of the 5th Conference on Robot Learning, pages 180–191. PMLR, 2022.

[40] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[41] Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, and Zexiang Xu. MeshLRM: Large Reconstruction Model for High-Quality Meshes, 2025.
