Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu; Chin-Teng Lin; Da Xiao; Nguyen Thanh Trung Le; Srinath Sridhar; Thomas Do; Yu-Cheng Chang; Zhuoli Zhuang

arxiv: 2508.01014 · v4 · pith:LEUQHCP2new · submitted 2025-08-01 · 💻 cs.RO · cs.CV

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Cheng-You Lu , Zhuoli Zhuang , Nguyen Thanh Trung Le , Da Xiao , Yu-Cheng Chang , Thomas Do , Srinath Sridhar , Chin-teng Lin This is my paper

Pith reviewed 2026-05-22 00:06 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords next best view3D reconstructionviewpoint planninghierarchical methodsvoxel representationreinforcement learningreal-time systemsrobotic perception

0 comments

The pith

Hestia achieves at least a 4% gain in coverage ratio and halves Chamfer distance for 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hestia as a way to plan the next best camera view for building 3D models without needing online learning or manual paths. It does this by training on a diverse set of data, using a hierarchical search to handle complex choices, applying a close-greedy tactic to skip misleading patterns, and designing the system to notice voxel faces for better geometry capture. These changes lead to stronger results than earlier methods, including higher coverage of the object surface and much lower distance errors between the model and reality. The improvements hold even when only a few images are allowed and when the object is moved around.

Core claim

Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. This allows the system to predict five-degree-of-freedom viewpoints that yield efficient and robust 3D reconstruction in real time.

What carries the argument

Voxel-face-aware hierarchical next-best-view acquisition, which uses a structured search over viewpoints while accounting for the faces of voxels in the reconstruction volume to guide selection.

Load-bearing premise

The assumption that the four proposed components can be combined without introducing new failure modes that cancel the reported gains, and that the evaluation metrics on the chosen test objects and budgets generalize beyond the specific experimental setup.

What would settle it

Evaluating the planner on a broader range of shapes, including very irregular or symmetric objects, and checking whether the reported gains in coverage and error reduction still appear.

Figures

Figures reproduced from arXiv: 2508.01014 by Cheng-You Lu, Chin-Teng Lin, Da Xiao, Nguyen Thanh Trung Le, Srinath Sridhar, Thomas Do, Yu-Cheng Chang, Zhuoli Zhuang.

**Figure 1.** Figure 1: Data collection system. Hestia is an intelligent data collection system that integrates a next-best-view planner with a drone-based real-world platform, enabling robust viewpoint prediction even under spatial shifts of objects. Please refer to our demonstration video for further details. Abstract: Advances in 3D reconstruction and novel view synthesis have enabled efficient, photorealistic rendering, but t… view at source ↗

**Figure 2.** Figure 2: A voxel is worth more than a ray. Unlike the RL-based generalizable method [43], Hestia treats each voxel as a cube by considering its six faces, rather than a point. This reduces the information loss inherent in point approximations, ensuring a more accurate representation of the voxel. In recent decades, next-best-view planners have been developed for active capture, demonstrating promising potential for… view at source ↗

**Figure 3.** Figure 3: Hierarchical structure of Hestia. Hestia first predicts the camera’s look-at point Lt using a proposal neural network that takes grid information Gt processed from the depth image Dt and the camera pose as input. Next, Hestia employs a grid encoder to encode the grid information Gt and performs trilinear interpolation to extract corresponding features from the encoded grid at different layers based on Lt. … view at source ↗

**Figure 4.** Figure 4: Proposed data collection system and processes. Hestia goes beyond prior works by introducing an intelligent system using a drone with an RGB camera for data capture, Lighthouse base stations and Crazyflie for localization, and a Wifi router for wireless communication. of ground truth uncaptured faces within voxel vi where F gt vi is the set of ground truth uncaptured faces associated with voxel vi . Thus, … view at source ↗

**Figure 5.** Figure 5: Reconstruction on a complex scene. Hestia and Hestia-H3K capture the scene well. baseline [43] by at least 3% in both CR and AUC, and their CDs are nearly half of those of the baseline [43] across all three datasets, demonstrating a substantial improvement. Surprisingly, Hestia achieves slightly better CD than Hestia-H3K, even when the latter is evaluated on its in-distribution dataset (Houses3K), further… view at source ↗

**Figure 6.** Figure 6: Failure cases of Hestia. Hestia may occasionally fail to capture finer 3D structures, highly self-occluded parts, nearly vertical bottom-up views, and small details on coarse object surfaces. Due to hardware limitations (e.g., the absence of an RGB-D camera), Hestia cannot fully exhibit its potential in real-world scenarios. However, since our system integrates a depth estimator to convert RGB images into … view at source ↗

read the original abstract

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, images for reconstruction are still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for the real-world application. Our project page is https://johnnylu305.github.io/hestia web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Hestia layers hierarchy, close-greedy selection, and voxel-face awareness onto existing generalizable NBV planners and reports clear gains in coverage and Chamfer distance, though the numbers rest on limited details about test variation.

read the letter

This paper's key point is that Hestia improves next-best-view selection for 3D reconstruction by combining a hierarchical search, close-greedy picking, and voxel-face awareness on top of a diverse dataset. These tweaks address limitations in prior RL-based planners and produce at least 4% better coverage and 50% lower Chamfer distance, plus stronger results when only five images are allowed.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Hestia, a generalizable next-best-view (NBV) planner for efficient 3D reconstruction that extends reinforcement-learning approaches for 5-DoF viewpoint selection. It proposes four components—a more diverse training dataset, a hierarchical search structure over the continuous action space, a close-greedy selection strategy, and a voxel-face-aware geometric prior—to improve robustness and performance. Experiments report at least 4% higher coverage ratio, 50% lower Chamfer distance, real-time inference, at least 12% coverage gain under a 5-image budget, robustness to object placement changes, and real-world feasibility.

Significance. If the empirical gains are reproducible and generalize, the work would provide a practical advance in autonomous 3D scanning by delivering a non-learning-online NBV method that handles varied object geometries better than prior RL baselines while remaining computationally lightweight. The combination of hierarchical planning with geometric awareness and the reported real-time capability are particularly relevant for robotic deployment.

major comments (3)

[Experimental results] Experimental results section: the reported improvements (≥4% coverage, 50% Chamfer reduction, ≥12% at 5-image budget) are presented without standard deviations, number of independent runs, or statistical significance tests, and without explicit confirmation that baselines and hyper-parameters were fixed prior to evaluating the test set; this directly affects whether the central performance claims can be considered load-bearing.
[Results and real-world demonstration] Robustness and real-world evaluation: the abstract and results claim robustness to object placement variations and real-world feasibility, yet no quantitative details are given on the number or range of placement variations tested, the diversity of test objects relative to the training distribution, or metrics capturing sensor noise and calibration error; these omissions leave the generalization of the headline gains unverified.
[Method and experiments] Ablation or component analysis: while four components are introduced, the manuscript does not present controlled ablations demonstrating that their joint use does not introduce new failure modes that offset the individual contributions; without such evidence the attribution of the observed gains to the proposed design remains incomplete.

minor comments (2)

[Abstract] The abstract states 'five-degree-of-freedom viewpoint prediction' without clarifying whether roll is included or how the action space is discretized in the hierarchical search.
[Figures] Figure captions and axis labels in the quantitative comparison plots should explicitly state the exact baselines and the number of objects or scenes averaged.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point-by-point below. We agree that the suggested additions will strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses

Referee: [Experimental results] Experimental results section: the reported improvements (≥4% coverage, 50% Chamfer reduction, ≥12% at 5-image budget) are presented without standard deviations, number of independent runs, or statistical significance tests, and without explicit confirmation that baselines and hyper-parameters were fixed prior to evaluating the test set; this directly affects whether the central performance claims can be considered load-bearing.

Authors: We agree that reporting standard deviations, the number of runs, and statistical tests would improve the robustness of our claims. In the revised manuscript we will present all quantitative results as means over 5 independent runs with different random seeds, include standard deviations, and add paired t-tests for statistical significance between Hestia and each baseline. We will also add an explicit statement confirming that all baselines and hyperparameters were frozen before any test-set evaluation, consistent with the experimental protocol already described in Section 4. revision: yes
Referee: [Results and real-world demonstration] Robustness and real-world evaluation: the abstract and results claim robustness to object placement variations and real-world feasibility, yet no quantitative details are given on the number or range of placement variations tested, the diversity of test objects relative to the training distribution, or metrics capturing sensor noise and calibration error; these omissions leave the generalization of the headline gains unverified.

Authors: We acknowledge that additional quantitative details are needed to substantiate the robustness and real-world claims. In the revision we will expand the corresponding subsection to report: (i) results over 20 distinct random object placements spanning a translation range of ±15 cm and rotation range of ±20°, (ii) the composition of the 50 test objects (including that 35 % belong to shape categories absent from the training set), and (iii) reconstruction metrics obtained under added Gaussian sensor noise (σ = 1–3 mm) and calibration perturbations up to 2 mm. These numbers and metrics will be added to both the main results and the real-world demonstration. revision: yes
Referee: [Method and experiments] Ablation or component analysis: while four components are introduced, the manuscript does not present controlled ablations demonstrating that their joint use does not introduce new failure modes that offset the individual contributions; without such evidence the attribution of the observed gains to the proposed design remains incomplete.

Authors: We agree that controlled ablations are required to properly attribute performance gains. We will add a dedicated ablation subsection that evaluates each of the four components in isolation (diverse dataset, hierarchical search, close-greedy strategy, face-aware voxel prior) by training and testing variants with the component removed or disabled. All variants will be evaluated on the same metrics and test objects; we will also report any observed failure modes or performance regressions when components are combined, thereby clarifying that the joint design does not introduce offsetting drawbacks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains shown via held-out comparisons

full rationale

The paper's central claims consist of measured performance improvements (coverage ratio gains, Chamfer distance reductions) obtained by training and evaluating a next-best-view planner on a diverse dataset against prior methods. These results are produced by direct experimental comparison on held-out objects and budgets rather than any derivation that reduces to fitted parameters or self-referential definitions. No equations or first-principles steps are presented that equate outputs to inputs by construction. Self-citations to earlier RL baselines serve only as external reference points for comparison and do not carry the load of proving the new components' effectiveness. The architecture (hierarchical structure, face-aware design, etc.) is validated independently through ablation-style experiments, keeping the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions from reinforcement learning and voxel-based reconstruction pipelines; no new physical entities are introduced. Training relies on a curated diverse dataset whose construction details are not fully specified in the abstract.

free parameters (2)

hierarchical search depth and branching factors
Chosen to manage the continuous 5-DoF action space; values affect both speed and coverage performance.
close-greedy threshold
Hyper-parameter controlling how strictly the planner avoids spurious high-reward views.

axioms (2)

domain assumption Voxel grid representation accurately captures unobserved geometry for face-aware scoring.
Invoked when the face-aware design is used to avoid overlooking surfaces.
domain assumption The training distribution of object shapes is sufficiently representative for robustness to placement variations.
Underlies the claim of robustness to object placement changes.

pith-pipeline@v0.9.0 · 5825 in / 1630 out tokens · 35948 ms · 2026-05-22T00:06:11.202304+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 5 internal anchors

[1]

Agarwal, Y

S. Agarwal, Y . Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

work page 2011
[2]

Jiang, H

H. Jiang, H. Liu, P. Tan, G. Zhang, and H. Bao. 3d reconstruction of dynamic scenes with multiple handheld cameras. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part II 12, pages 601–615. Springer, 2012

work page 2012
[3]

Kim and A

H. Kim and A. Hilton. 3d scene reconstruction from multiple spherical stereo pairs. Interna- tional journal of computer vision, 104:94–116, 2013

work page 2013
[4]

Nießner, M

M. Nießner, M. Zollh ¨ofer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

work page 2013
[5]

J. Xie, R. Girshick, and A. Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pages 842–857. Springer, 2016

work page 2016
[6]

J. L. Sch ¨onberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016

work page 2016
[7]

Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan. Mvsnet: Depth inference for unstructured multi- view stereo. In Proceedings of the European conference on computer vision (ECCV) , pages 767–783, 2018

work page 2018
[8]

H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2690–2698, 2019

work page 2019
[9]

Murez, T

Z. Murez, T. Van As, J. Bartolozzi, A. Sinha, V . Badrinarayanan, and A. Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16 , pages 414–431. Springer, 2020

work page 2020
[10]

D. Wang, X. Cui, X. Chen, Z. Zou, T. Shi, S. Salcudean, Z. J. Wang, and R. Ward. Multi-view 3d reconstruction with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 5722–5731, 2021

work page 2021
[11]

Sayed, J

M. Sayed, J. Gibson, J. Watson, V . Prisacariu, M. Firman, and C. Godard. Simplerecon: 3d reconstruction without 3d convolutions. In European Conference on Computer Vision, pages 1–19. Springer, 2022

work page 2022
[12]

H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and A. Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

work page 2023
[13]

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 11

work page 2024
[14]

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. CoRR, 2024

work page 2024
[15]

L. Pan, D. Barath, M. Pollefeys, and J. L. Sch ¨onberger. Global Structure-from-Motion Revis- ited. In European Conference on Computer Vision (ECCV), 2024

work page 2024
[16]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

work page 2020
[17]

A. Yu, V . Ye, M. Tancik, and A. Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021

work page 2021
[18]

Fridovich-Keil, A

S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radi- ance fields without neural networks. In CVPR, 2022

work page 2022
[19]

A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022

work page 2022
[20]

Instant neural graph- ics primitives with a multiresolution hash encoding.ACM Transactions on Graphics (Proc

T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. , 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127

work page doi:10.1145/3528223.3530127 2022
[21]

Sabour, S

S. Sabour, S. V ora, D. Duckworth, I. Krasin, D. J. Fleet, and A. Tagliasacchi. Robustnerf: Ignoring distractors with robust losses. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 20626–20636, 2023

work page 2023
[22]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

work page 2023
[23]

Zheng, B

S. Zheng, B. Zhou, R. Shao, B. Liu, S. Zhang, L. Nie, and Y . Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[24]

Zhang, S

K. Zhang, S. Bi, H. Tan, Y . Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. European Conference on Computer Vision , 2024

work page 2024
[25]

W. Ren, Z. Zhu, B. Sun, J. Chen, M. Pollefeys, and S. Peng. Nerf on-the-go: Exploiting uncertainty for distractor-free nerfs in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[26]

Charatan, S

D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024

work page 2024
[27]

Sabour, L

S. Sabour, L. Goli, G. Kopanas, M. Matthews, D. Lagun, L. Guibas, A. Jacobson, D. J. Fleet, and A. Tagliasacchi. Spotlesssplats: Ignoring distractors in 3d gaussian splatting. arXiv preprint arXiv:2406.20055, 2024

work page arXiv 2024
[28]

Flynn, M

J. Flynn, M. Broxton, L. Murmann, L. Chai, M. DuVall, C. Godard, K. Heal, S. Kaza, S. Lom- bardi, X. Luo, et al. Quark: Real-time, high-resolution, and general neural view synthesis. ACM Transactions on Graphics (TOG), 43(6):1–20, 2024

work page 2024
[29]

Ziwen, H

C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y . Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 12

work page arXiv 2024
[30]

Y . Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai. Mvs- plat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2025

work page 2025
[31]

Wallingford, A

M. Wallingford, A. Bhattad, A. Kusupati, V . Ramanujan, M. Deitke, A. Kembhavi, R. Mot- taghi, W.-C. Ma, and A. Farhadi. From an image to a scene: Learning to imagine the world from a million 360° videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page
[32]

T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, and Y . Liu. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), June 2021

work page 2021
[33]

Jensen, A

R. Jensen, A. Dahl, G. V ogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014

work page 2014
[34]

T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

A. Liu, R. Tucker, V . Jampani, A. Makadia, N. Snavely, and A. Kanazawa. Infinite nature: Per- petual view generation of natural scenes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021

work page 2021
[36]

L. Xu, V . Agrawal, W. Laney, T. Garcia, A. Bansal, C. Kim, S. Rota Bul `o, L. Porzi, P. Kontschieder, A. Boˇziˇc, et al. Vr-nerf: High-fidelity virtualized walkable spaces. In SIG- GRAPH Asia 2023 Conference Papers, pages 1–12, 2023

work page 2023
[37]

Broxton, J

M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hedman, M. DuVall, J. Dourgarian, J. Busch, M. Whalen, and P. Debevec. Immersive light field video with a layered mesh repre- sentation. 39(4):86:1–86:15, 2020

work page 2020
[38]

K.-E. Lin, L. Xiao, F. Liu, G. Yang, and R. Ramamoorthi. Deep 3d mask volume for view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1758, 2021

work page 2021
[39]

J. S. Yoon, K. Kim, O. Gallo, H. S. Park, and J. Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020

work page 2020
[40]

T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, et al. Neural 3d video synthesis from multi-view video. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5521–5531, 2022

work page 2022
[41]

C.-Y . Lu, P. Zhou, A. Xing, C. Pokhariya, A. Dey, I. N. Shah, R. Mavidipalli, D. Hu, A. I. Comport, K. Chen, et al. Diva-360: The dynamic visual dataset for immersive neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22466–22476, 2024

work page 2024
[42]

S. Peng, Y . Zhang, Y . Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou. Neural body: Implicit neu- ral representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021

work page 2021
[43]

X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang. Gennbv: Generalizable next-best-view policy for active 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16436–16445, 2024. 13

work page 2024
[44]

Monica and J

R. Monica and J. Aleotti. Contour-based next-best view planning from point cloud segmenta- tion of unknown objects. Autonomous Robots, 42:443–458, 2018

work page 2018
[45]

H. Zha, K. Morooka, and T. Hasegawa. Next best viewpoint (nbv) planning for active object modeling based on a learning-by-showing approach. In Computer Vision—ACCV’98: Third Asian Conference on Computer Vision Hong Kong, China, January 8–10, 1998 Proceedings, Volume II 3, pages 185–192. Springer, 1997

work page 1998
[46]

L. Liu, X. Xia, H. Sun, Q. Shen, J. Xu, B. Chen, H. Huang, and K. Xu. Object-aware guidance for autonomous scene reconstruction. ACM Transactions on Graphics (TOG) , 37(4):1–12, 2018

work page 2018
[47]

Hardouin, F

G. Hardouin, F. Morbidi, J. Moras, J. Marzat, and E. M. Mouaddib. Surface-driven next-best- view planning for exploration of large-scale 3d environments. IFAC-PapersOnLine, 53(2): 15501–15507, 2020

work page 2020
[48]

Hardouin, J

G. Hardouin, J. Moras, F. Morbidi, J. Marzat, and E. M. Mouaddib. Next-best-view plan- ning for surface reconstruction of large-scale 3d environments with multiple uavs. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 1567–

work page 2020
[49]

Lin and B

K. Lin and B. Yi. Active view planning for radiance fields. In Robotics Science and Systems, 2022

work page 2022
[50]

X. Pan, Z. Lai, S. Song, and G. Huang. Activenerf: Learning where to see with uncertainty estimation. In European Conference on Computer Vision, pages 230–246. Springer, 2022

work page 2022
[51]

S. Lee, L. Chen, J. Wang, A. Liniger, S. Kumar, and F. Yu. Uncertainty guided policy for active robotic 3d reconstruction using neural radiance fields. IEEE Robotics and Automation Letters, 7(4):12070–12077, 2022

work page 2022
[52]

H. Zhan, J. Zheng, Y . Xu, I. Reid, and H. Rezatofighi. Activermap: Radiance field for active mapping and planning. arXiv preprint arXiv:2211.12656, 2022

work page arXiv 2022
[53]

L. Jin, X. Chen, J. R ¨uckin, and M. Popovi ´c. Neu-nbv: Next best view planning using uncer- tainty estimation in image-based neural rendering. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11305–11312. IEEE, 2023

work page 2023
[54]

S ¨underhauf, J

N. S ¨underhauf, J. Abou-Chakra, and D. Miller. Density-aware nerf ensembles: Quantifying predictive uncertainty in neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9370–9376. IEEE, 2023

work page 2023
[55]

Peralta, J

D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote. Next-best view policy for 3d reconstruction. arXiv preprint arXiv:2008.12664, 2020

work page arXiv 2008
[56]

Y . Ran, J. Zeng, S. He, J. Chen, L. Li, Y . Chen, G. Lee, and Q. Ye. Neurar: Neural uncertainty for autonomous 3d reconstruction with implicit neural representations. IEEE Robotics and Automation Letters, 8(2):1125–1132, 2023

work page 2023
[57]

Gu ´edon, T

A. Gu ´edon, T. Monnier, P. Monasse, and V . Lepetit. Macarons: Mapping and coverage an- ticipation with rgb online self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 940–951, 2023

work page 2023
[58]

Jiang, B

W. Jiang, B. Lei, and K. Daniilidis. Fisherrf: Active view selection and uncertainty quantifica- tion for radiance fields using fisher information. arXiv preprint arXiv:2311.17874, 2023

work page arXiv 2023
[59]

Boneh and M

A. Boneh and M. Hofri. The coupon-collector problem revisited—a survey of engineering problems and computational methods. Stochastic Models, 13(1):39–66, 1997. 14

work page 1997
[60]

Deitke, D

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, June 2023

work page 2023
[61]

Deitke, R

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[62]

A. C. INTERPRETATION. Spurious correlation: A causal interpretation herbert a. simon. Causal Models in the Social Sciences, page 5, 1971

work page 1971
[63]

Y . Kim, S. Mo, M. Kim, K. Lee, J. Lee, and J. Shin. Discovering and mitigating visual biases through keyword explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11082–11092, 2024

work page 2024
[64]

J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[66]

J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning con- tinuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019

work page 2019
[67]

Z.-X. Zou, Z. Yu, Y .-C. Guo, Y . Li, D. Liang, Y .-P. Cao, and S.-H. Zhang. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10335, 2024

work page 2024
[68]

M ¨uller, A

N. M ¨uller, A. Simonelli, L. Porzi, S. R. Bul`o, M. Nießner, and P. Kontschieder. Autorf: Learn- ing 3d object radiance fields from single view observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3971–3980, 2022

work page 2022
[69]

Y .-L. Liu, C. Gao, A. Meuleman, H.-Y . Tseng, A. Saraf, C. Kim, Y .-Y . Chuang, J. Kopf, and J.-B. Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023

work page 2023
[70]

Q. Wang, Z. Wang, K. Genova, P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

work page 2021
[71]

H. Lin, S. Peng, Z. Xu, Y . Yan, Q. Shuai, H. Bao, and X. Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022

work page 2022
[72]

A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su. Mvsnerf: Fast generaliz- able radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021

work page 2021
[73]

H. Gao, R. Li, S. Tulsiani, B. Russell, and A. Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022

work page 2022
[74]

K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021

work page 2021
[75]

K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), dec 2021. 15

work page 2021
[76]

L. Ling, Y . Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y . Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

work page 2024
[77]

C. Lu, F. Yin, X. Chen, W. Liu, T. Chen, G. Yu, and J. Fan. A large-scale outdoor multi- modal dataset and benchmark for novel view synthesis and implicit scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7557– 7567, 2023

work page 2023
[78]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

work page 2024
[79]

H. Chen, Y . Hou, C. Qu, I. Testini, X. Hong, and J. Jiao. 360+x: A panoptic multi-modal scene understanding dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024
[80]

L. Li, Z. Shen, Z. Wang, L. Shen, and P. Tan. Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35:13485–13498, 2022

work page 2022

Showing first 80 references.

[1] [1]

Agarwal, Y

S. Agarwal, Y . Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

work page 2011

[2] [2]

Jiang, H

H. Jiang, H. Liu, P. Tan, G. Zhang, and H. Bao. 3d reconstruction of dynamic scenes with multiple handheld cameras. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part II 12, pages 601–615. Springer, 2012

work page 2012

[3] [3]

Kim and A

H. Kim and A. Hilton. 3d scene reconstruction from multiple spherical stereo pairs. Interna- tional journal of computer vision, 104:94–116, 2013

work page 2013

[4] [4]

Nießner, M

M. Nießner, M. Zollh ¨ofer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

work page 2013

[5] [5]

J. Xie, R. Girshick, and A. Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pages 842–857. Springer, 2016

work page 2016

[6] [6]

J. L. Sch ¨onberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016

work page 2016

[7] [7]

Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan. Mvsnet: Depth inference for unstructured multi- view stereo. In Proceedings of the European conference on computer vision (ECCV) , pages 767–783, 2018

work page 2018

[8] [8]

H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2690–2698, 2019

work page 2019

[9] [9]

Murez, T

Z. Murez, T. Van As, J. Bartolozzi, A. Sinha, V . Badrinarayanan, and A. Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16 , pages 414–431. Springer, 2020

work page 2020

[10] [10]

D. Wang, X. Cui, X. Chen, Z. Zou, T. Shi, S. Salcudean, Z. J. Wang, and R. Ward. Multi-view 3d reconstruction with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 5722–5731, 2021

work page 2021

[11] [11]

Sayed, J

M. Sayed, J. Gibson, J. Watson, V . Prisacariu, M. Firman, and C. Godard. Simplerecon: 3d reconstruction without 3d convolutions. In European Conference on Computer Vision, pages 1–19. Springer, 2022

work page 2022

[12] [12]

H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and A. Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

work page 2023

[13] [13]

S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 11

work page 2024

[14] [14]

B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. CoRR, 2024

work page 2024

[15] [15]

L. Pan, D. Barath, M. Pollefeys, and J. L. Sch ¨onberger. Global Structure-from-Motion Revis- ited. In European Conference on Computer Vision (ECCV), 2024

work page 2024

[16] [16]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

work page 2020

[17] [17]

A. Yu, V . Ye, M. Tancik, and A. Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021

work page 2021

[18] [18]

Fridovich-Keil, A

S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radi- ance fields without neural networks. In CVPR, 2022

work page 2022

[19] [19]

A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022

work page 2022

[20] [20]

Instant neural graph- ics primitives with a multiresolution hash encoding.ACM Transactions on Graphics (Proc

T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. , 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127

work page doi:10.1145/3528223.3530127 2022

[21] [21]

Sabour, S

S. Sabour, S. V ora, D. Duckworth, I. Krasin, D. J. Fleet, and A. Tagliasacchi. Robustnerf: Ignoring distractors with robust losses. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 20626–20636, 2023

work page 2023

[22] [22]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

work page 2023

[23] [23]

Zheng, B

S. Zheng, B. Zhou, R. Shao, B. Liu, S. Zhang, L. Nie, and Y . Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[24] [24]

Zhang, S

K. Zhang, S. Bi, H. Tan, Y . Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. European Conference on Computer Vision , 2024

work page 2024

[25] [25]

W. Ren, Z. Zhu, B. Sun, J. Chen, M. Pollefeys, and S. Peng. Nerf on-the-go: Exploiting uncertainty for distractor-free nerfs in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[26] [26]

Charatan, S

D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024

work page 2024

[27] [27]

Sabour, L

S. Sabour, L. Goli, G. Kopanas, M. Matthews, D. Lagun, L. Guibas, A. Jacobson, D. J. Fleet, and A. Tagliasacchi. Spotlesssplats: Ignoring distractors in 3d gaussian splatting. arXiv preprint arXiv:2406.20055, 2024

work page arXiv 2024

[28] [28]

Flynn, M

J. Flynn, M. Broxton, L. Murmann, L. Chai, M. DuVall, C. Godard, K. Heal, S. Kaza, S. Lom- bardi, X. Luo, et al. Quark: Real-time, high-resolution, and general neural view synthesis. ACM Transactions on Graphics (TOG), 43(6):1–20, 2024

work page 2024

[29] [29]

Ziwen, H

C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y . Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 12

work page arXiv 2024

[30] [30]

Y . Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai. Mvs- plat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2025

work page 2025

[31] [31]

Wallingford, A

M. Wallingford, A. Bhattad, A. Kusupati, V . Ramanujan, M. Deitke, A. Kembhavi, R. Mot- taghi, W.-C. Ma, and A. Farhadi. From an image to a scene: Learning to imagine the world from a million 360° videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

work page

[32] [32]

T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, and Y . Liu. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), June 2021

work page 2021

[33] [33]

Jensen, A

R. Jensen, A. Dahl, G. V ogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014

work page 2014

[34] [34]

T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[35] [35]

A. Liu, R. Tucker, V . Jampani, A. Makadia, N. Snavely, and A. Kanazawa. Infinite nature: Per- petual view generation of natural scenes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021

work page 2021

[36] [36]

L. Xu, V . Agrawal, W. Laney, T. Garcia, A. Bansal, C. Kim, S. Rota Bul `o, L. Porzi, P. Kontschieder, A. Boˇziˇc, et al. Vr-nerf: High-fidelity virtualized walkable spaces. In SIG- GRAPH Asia 2023 Conference Papers, pages 1–12, 2023

work page 2023

[37] [37]

Broxton, J

M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hedman, M. DuVall, J. Dourgarian, J. Busch, M. Whalen, and P. Debevec. Immersive light field video with a layered mesh repre- sentation. 39(4):86:1–86:15, 2020

work page 2020

[38] [38]

K.-E. Lin, L. Xiao, F. Liu, G. Yang, and R. Ramamoorthi. Deep 3d mask volume for view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1758, 2021

work page 2021

[39] [39]

J. S. Yoon, K. Kim, O. Gallo, H. S. Park, and J. Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020

work page 2020

[40] [40]

T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, et al. Neural 3d video synthesis from multi-view video. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5521–5531, 2022

work page 2022

[41] [41]

C.-Y . Lu, P. Zhou, A. Xing, C. Pokhariya, A. Dey, I. N. Shah, R. Mavidipalli, D. Hu, A. I. Comport, K. Chen, et al. Diva-360: The dynamic visual dataset for immersive neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22466–22476, 2024

work page 2024

[42] [42]

S. Peng, Y . Zhang, Y . Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou. Neural body: Implicit neu- ral representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021

work page 2021

[43] [43]

X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang. Gennbv: Generalizable next-best-view policy for active 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16436–16445, 2024. 13

work page 2024

[44] [44]

Monica and J

R. Monica and J. Aleotti. Contour-based next-best view planning from point cloud segmenta- tion of unknown objects. Autonomous Robots, 42:443–458, 2018

work page 2018

[45] [45]

H. Zha, K. Morooka, and T. Hasegawa. Next best viewpoint (nbv) planning for active object modeling based on a learning-by-showing approach. In Computer Vision—ACCV’98: Third Asian Conference on Computer Vision Hong Kong, China, January 8–10, 1998 Proceedings, Volume II 3, pages 185–192. Springer, 1997

work page 1998

[46] [46]

L. Liu, X. Xia, H. Sun, Q. Shen, J. Xu, B. Chen, H. Huang, and K. Xu. Object-aware guidance for autonomous scene reconstruction. ACM Transactions on Graphics (TOG) , 37(4):1–12, 2018

work page 2018

[47] [47]

Hardouin, F

G. Hardouin, F. Morbidi, J. Moras, J. Marzat, and E. M. Mouaddib. Surface-driven next-best- view planning for exploration of large-scale 3d environments. IFAC-PapersOnLine, 53(2): 15501–15507, 2020

work page 2020

[48] [48]

Hardouin, J

G. Hardouin, J. Moras, F. Morbidi, J. Marzat, and E. M. Mouaddib. Next-best-view plan- ning for surface reconstruction of large-scale 3d environments with multiple uavs. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 1567–

work page 2020

[49] [49]

Lin and B

K. Lin and B. Yi. Active view planning for radiance fields. In Robotics Science and Systems, 2022

work page 2022

[50] [50]

X. Pan, Z. Lai, S. Song, and G. Huang. Activenerf: Learning where to see with uncertainty estimation. In European Conference on Computer Vision, pages 230–246. Springer, 2022

work page 2022

[51] [51]

S. Lee, L. Chen, J. Wang, A. Liniger, S. Kumar, and F. Yu. Uncertainty guided policy for active robotic 3d reconstruction using neural radiance fields. IEEE Robotics and Automation Letters, 7(4):12070–12077, 2022

work page 2022

[52] [52]

H. Zhan, J. Zheng, Y . Xu, I. Reid, and H. Rezatofighi. Activermap: Radiance field for active mapping and planning. arXiv preprint arXiv:2211.12656, 2022

work page arXiv 2022

[53] [53]

L. Jin, X. Chen, J. R ¨uckin, and M. Popovi ´c. Neu-nbv: Next best view planning using uncer- tainty estimation in image-based neural rendering. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11305–11312. IEEE, 2023

work page 2023

[54] [54]

S ¨underhauf, J

N. S ¨underhauf, J. Abou-Chakra, and D. Miller. Density-aware nerf ensembles: Quantifying predictive uncertainty in neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9370–9376. IEEE, 2023

work page 2023

[55] [55]

Peralta, J

D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote. Next-best view policy for 3d reconstruction. arXiv preprint arXiv:2008.12664, 2020

work page arXiv 2008

[56] [56]

Y . Ran, J. Zeng, S. He, J. Chen, L. Li, Y . Chen, G. Lee, and Q. Ye. Neurar: Neural uncertainty for autonomous 3d reconstruction with implicit neural representations. IEEE Robotics and Automation Letters, 8(2):1125–1132, 2023

work page 2023

[57] [57]

Gu ´edon, T

A. Gu ´edon, T. Monnier, P. Monasse, and V . Lepetit. Macarons: Mapping and coverage an- ticipation with rgb online self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 940–951, 2023

work page 2023

[58] [58]

Jiang, B

W. Jiang, B. Lei, and K. Daniilidis. Fisherrf: Active view selection and uncertainty quantifica- tion for radiance fields using fisher information. arXiv preprint arXiv:2311.17874, 2023

work page arXiv 2023

[59] [59]

Boneh and M

A. Boneh and M. Hofri. The coupon-collector problem revisited—a survey of engineering problems and computational methods. Stochastic Models, 13(1):39–66, 1997. 14

work page 1997

[60] [60]

Deitke, D

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, June 2023

work page 2023

[61] [61]

Deitke, R

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36, 2024

work page 2024

[62] [62]

A. C. INTERPRETATION. Spurious correlation: A causal interpretation herbert a. simon. Causal Models in the Social Sciences, page 5, 1971

work page 1971

[63] [63]

Y . Kim, S. Mo, M. Kim, K. Lee, J. Lee, and J. Shin. Discovering and mitigating visual biases through keyword explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11082–11092, 2024

work page 2024

[64] [64]

J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[65] [65]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[66] [66]

J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning con- tinuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019

work page 2019

[67] [67]

Z.-X. Zou, Z. Yu, Y .-C. Guo, Y . Li, D. Liang, Y .-P. Cao, and S.-H. Zhang. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10335, 2024

work page 2024

[68] [68]

M ¨uller, A

N. M ¨uller, A. Simonelli, L. Porzi, S. R. Bul`o, M. Nießner, and P. Kontschieder. Autorf: Learn- ing 3d object radiance fields from single view observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3971–3980, 2022

work page 2022

[69] [69]

Y .-L. Liu, C. Gao, A. Meuleman, H.-Y . Tseng, A. Saraf, C. Kim, Y .-Y . Chuang, J. Kopf, and J.-B. Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023

work page 2023

[70] [70]

Q. Wang, Z. Wang, K. Genova, P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

work page 2021

[71] [71]

H. Lin, S. Peng, Z. Xu, Y . Yan, Q. Shuai, H. Bao, and X. Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022

work page 2022

[72] [72]

A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su. Mvsnerf: Fast generaliz- able radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021

work page 2021

[73] [73]

H. Gao, R. Li, S. Tulsiani, B. Russell, and A. Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022

work page 2022

[74] [74]

K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021

work page 2021

[75] [75]

K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), dec 2021. 15

work page 2021

[76] [76]

L. Ling, Y . Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y . Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

work page 2024

[77] [77]

C. Lu, F. Yin, X. Chen, W. Liu, T. Chen, G. Yu, and J. Fan. A large-scale outdoor multi- modal dataset and benchmark for novel view synthesis and implicit scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7557– 7567, 2023

work page 2023

[78] [78]

Grauman, A

K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

work page 2024

[79] [79]

H. Chen, Y . Hou, C. Qu, I. Testini, X. Hong, and J. Jiao. 360+x: A panoptic multi-modal scene understanding dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

work page 2024

[80] [80]

L. Li, Z. Shen, Z. Wang, L. Shen, and P. Tan. Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35:13485–13498, 2022

work page 2022