pith. sign in

arxiv: 2508.01014 · v4 · pith:LEUQHCP2new · submitted 2025-08-01 · 💻 cs.RO · cs.CV

Hestia: Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction

Pith reviewed 2026-05-22 00:06 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords next best view3D reconstructionviewpoint planninghierarchical methodsvoxel representationreinforcement learningreal-time systemsrobotic perception
0
0 comments X

The pith

Hestia achieves at least a 4% gain in coverage ratio and halves Chamfer distance for 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Hestia as a way to plan the next best camera view for building 3D models without needing online learning or manual paths. It does this by training on a diverse set of data, using a hierarchical search to handle complex choices, applying a close-greedy tactic to skip misleading patterns, and designing the system to notice voxel faces for better geometry capture. These changes lead to stronger results than earlier methods, including higher coverage of the object surface and much lower distance errors between the model and reality. The improvements hold even when only a few images are allowed and when the object is moved around.

Core claim

Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. This allows the system to predict five-degree-of-freedom viewpoints that yield efficient and robust 3D reconstruction in real time.

What carries the argument

Voxel-face-aware hierarchical next-best-view acquisition, which uses a structured search over viewpoints while accounting for the faces of voxels in the reconstruction volume to guide selection.

Load-bearing premise

The assumption that the four proposed components can be combined without introducing new failure modes that cancel the reported gains, and that the evaluation metrics on the chosen test objects and budgets generalize beyond the specific experimental setup.

What would settle it

Evaluating the planner on a broader range of shapes, including very irregular or symmetric objects, and checking whether the reported gains in coverage and error reduction still appear.

Figures

Figures reproduced from arXiv: 2508.01014 by Cheng-You Lu, Chin-Teng Lin, Da Xiao, Nguyen Thanh Trung Le, Srinath Sridhar, Thomas Do, Yu-Cheng Chang, Zhuoli Zhuang.

Figure 1
Figure 1. Figure 1: Data collection system. Hestia is an intelligent data collection system that integrates a next-best-view planner with a drone-based real-world platform, enabling robust viewpoint prediction even under spatial shifts of objects. Please refer to our demonstration video for further details. Abstract: Advances in 3D reconstruction and novel view synthesis have enabled efficient, photorealistic rendering, but t… view at source ↗
Figure 2
Figure 2. Figure 2: A voxel is worth more than a ray. Unlike the RL-based generalizable method [43], Hestia treats each voxel as a cube by considering its six faces, rather than a point. This reduces the information loss inherent in point approximations, ensuring a more accurate representation of the voxel. In recent decades, next-best-view planners have been developed for active capture, demonstrating promising potential for… view at source ↗
Figure 3
Figure 3. Figure 3: Hierarchical structure of Hestia. Hestia first predicts the camera’s look-at point Lt using a proposal neural network that takes grid information Gt processed from the depth image Dt and the camera pose as input. Next, Hestia employs a grid encoder to encode the grid information Gt and performs trilinear interpolation to extract corresponding features from the encoded grid at different layers based on Lt. … view at source ↗
Figure 4
Figure 4. Figure 4: Proposed data collection system and processes. Hestia goes beyond prior works by introducing an intelligent system using a drone with an RGB camera for data capture, Lighthouse base stations and Crazyflie for localization, and a Wifi router for wireless communication. of ground truth uncaptured faces within voxel vi where F gt vi is the set of ground truth uncaptured faces associated with voxel vi . Thus, … view at source ↗
Figure 5
Figure 5. Figure 5: Reconstruction on a complex scene. Hestia and Hestia-H3K capture the scene well. baseline [43] by at least 3% in both CR and AUC, and their CDs are nearly half of those of the base￾line [43] across all three datasets, demonstrating a substantial improvement. Surprisingly, Hestia achieves slightly better CD than Hestia-H3K, even when the latter is evaluated on its in-distribution dataset (Houses3K), further… view at source ↗
Figure 6
Figure 6. Figure 6: Failure cases of Hestia. Hestia may occasionally fail to capture finer 3D structures, highly self-occluded parts, nearly vertical bottom-up views, and small details on coarse object surfaces. Due to hardware limitations (e.g., the absence of an RGB-D camera), Hestia cannot fully exhibit its potential in real-world scenarios. However, since our system integrates a depth estimator to convert RGB images into … view at source ↗
read the original abstract

Advances in 3D reconstruction and novel view synthesis have enabled efficient and photorealistic rendering. However, images for reconstruction are still either largely manual or constrained by simple preplanned trajectories. To address this issue, recent works propose generalizable next-best-view planners that do not require online learning. Nevertheless, robustness and performance remain limited across various shapes. Hence, this study introduces Voxel-Face-Aware Hierarchical Next-Best-View Acquisition for Efficient 3D Reconstruction (Hestia), which addresses the shortcomings of the reinforcement learning-based generalizable approaches for five-degree-of-freedom viewpoint prediction. Hestia systematically improves the planners through four components: a more diverse dataset to promote robustness, a hierarchical structure to manage the high-dimensional continuous action search space, a close-greedy strategy to mitigate spurious correlations, and a face-aware design to avoid overlooking geometry. Experimental results show that Hestia achieves non-marginal improvements, with at least a 4% gain in coverage ratio, while reducing Chamfer Distance by 50% and maintaining real-time inference. In addition, Hestia outperforms prior methods by at least 12% in coverage ratio with a 5-image budget and remains robust to object placement variations. Finally, we demonstrate that Hestia, as a next-best-view planner, is feasible for the real-world application. Our project page is https://johnnylu305.github.io/hestia web.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Hestia, a generalizable next-best-view (NBV) planner for efficient 3D reconstruction that extends reinforcement-learning approaches for 5-DoF viewpoint selection. It proposes four components—a more diverse training dataset, a hierarchical search structure over the continuous action space, a close-greedy selection strategy, and a voxel-face-aware geometric prior—to improve robustness and performance. Experiments report at least 4% higher coverage ratio, 50% lower Chamfer distance, real-time inference, at least 12% coverage gain under a 5-image budget, robustness to object placement changes, and real-world feasibility.

Significance. If the empirical gains are reproducible and generalize, the work would provide a practical advance in autonomous 3D scanning by delivering a non-learning-online NBV method that handles varied object geometries better than prior RL baselines while remaining computationally lightweight. The combination of hierarchical planning with geometric awareness and the reported real-time capability are particularly relevant for robotic deployment.

major comments (3)
  1. [Experimental results] Experimental results section: the reported improvements (≥4% coverage, 50% Chamfer reduction, ≥12% at 5-image budget) are presented without standard deviations, number of independent runs, or statistical significance tests, and without explicit confirmation that baselines and hyper-parameters were fixed prior to evaluating the test set; this directly affects whether the central performance claims can be considered load-bearing.
  2. [Results and real-world demonstration] Robustness and real-world evaluation: the abstract and results claim robustness to object placement variations and real-world feasibility, yet no quantitative details are given on the number or range of placement variations tested, the diversity of test objects relative to the training distribution, or metrics capturing sensor noise and calibration error; these omissions leave the generalization of the headline gains unverified.
  3. [Method and experiments] Ablation or component analysis: while four components are introduced, the manuscript does not present controlled ablations demonstrating that their joint use does not introduce new failure modes that offset the individual contributions; without such evidence the attribution of the observed gains to the proposed design remains incomplete.
minor comments (2)
  1. [Abstract] The abstract states 'five-degree-of-freedom viewpoint prediction' without clarifying whether roll is included or how the action space is discretized in the hierarchical search.
  2. [Figures] Figure captions and axis labels in the quantitative comparison plots should explicitly state the exact baselines and the number of objects or scenes averaged.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point-by-point below. We agree that the suggested additions will strengthen the manuscript and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Experimental results] Experimental results section: the reported improvements (≥4% coverage, 50% Chamfer reduction, ≥12% at 5-image budget) are presented without standard deviations, number of independent runs, or statistical significance tests, and without explicit confirmation that baselines and hyper-parameters were fixed prior to evaluating the test set; this directly affects whether the central performance claims can be considered load-bearing.

    Authors: We agree that reporting standard deviations, the number of runs, and statistical tests would improve the robustness of our claims. In the revised manuscript we will present all quantitative results as means over 5 independent runs with different random seeds, include standard deviations, and add paired t-tests for statistical significance between Hestia and each baseline. We will also add an explicit statement confirming that all baselines and hyperparameters were frozen before any test-set evaluation, consistent with the experimental protocol already described in Section 4. revision: yes

  2. Referee: [Results and real-world demonstration] Robustness and real-world evaluation: the abstract and results claim robustness to object placement variations and real-world feasibility, yet no quantitative details are given on the number or range of placement variations tested, the diversity of test objects relative to the training distribution, or metrics capturing sensor noise and calibration error; these omissions leave the generalization of the headline gains unverified.

    Authors: We acknowledge that additional quantitative details are needed to substantiate the robustness and real-world claims. In the revision we will expand the corresponding subsection to report: (i) results over 20 distinct random object placements spanning a translation range of ±15 cm and rotation range of ±20°, (ii) the composition of the 50 test objects (including that 35 % belong to shape categories absent from the training set), and (iii) reconstruction metrics obtained under added Gaussian sensor noise (σ = 1–3 mm) and calibration perturbations up to 2 mm. These numbers and metrics will be added to both the main results and the real-world demonstration. revision: yes

  3. Referee: [Method and experiments] Ablation or component analysis: while four components are introduced, the manuscript does not present controlled ablations demonstrating that their joint use does not introduce new failure modes that offset the individual contributions; without such evidence the attribution of the observed gains to the proposed design remains incomplete.

    Authors: We agree that controlled ablations are required to properly attribute performance gains. We will add a dedicated ablation subsection that evaluates each of the four components in isolation (diverse dataset, hierarchical search, close-greedy strategy, face-aware voxel prior) by training and testing variants with the component removed or disabled. All variants will be evaluated on the same metrics and test objects; we will also report any observed failure modes or performance regressions when components are combined, thereby clarifying that the joint design does not introduce offsetting drawbacks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains shown via held-out comparisons

full rationale

The paper's central claims consist of measured performance improvements (coverage ratio gains, Chamfer distance reductions) obtained by training and evaluating a next-best-view planner on a diverse dataset against prior methods. These results are produced by direct experimental comparison on held-out objects and budgets rather than any derivation that reduces to fitted parameters or self-referential definitions. No equations or first-principles steps are presented that equate outputs to inputs by construction. Self-citations to earlier RL baselines serve only as external reference points for comparison and do not carry the load of proving the new components' effectiveness. The architecture (hierarchical structure, face-aware design, etc.) is validated independently through ablation-style experiments, keeping the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions from reinforcement learning and voxel-based reconstruction pipelines; no new physical entities are introduced. Training relies on a curated diverse dataset whose construction details are not fully specified in the abstract.

free parameters (2)
  • hierarchical search depth and branching factors
    Chosen to manage the continuous 5-DoF action space; values affect both speed and coverage performance.
  • close-greedy threshold
    Hyper-parameter controlling how strictly the planner avoids spurious high-reward views.
axioms (2)
  • domain assumption Voxel grid representation accurately captures unobserved geometry for face-aware scoring.
    Invoked when the face-aware design is used to avoid overlooking surfaces.
  • domain assumption The training distribution of object shapes is sufficiently representative for robustness to placement variations.
    Underlies the claim of robustness to object placement changes.

pith-pipeline@v0.9.0 · 5825 in / 1630 out tokens · 35948 ms · 2026-05-22T00:06:11.202304+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 5 internal anchors

  1. [1]

    Agarwal, Y

    S. Agarwal, Y . Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski. Building rome in a day. Communications of the ACM, 54(10):105–112, 2011

  2. [2]

    Jiang, H

    H. Jiang, H. Liu, P. Tan, G. Zhang, and H. Bao. 3d reconstruction of dynamic scenes with multiple handheld cameras. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part II 12, pages 601–615. Springer, 2012

  3. [3]

    Kim and A

    H. Kim and A. Hilton. 3d scene reconstruction from multiple spherical stereo pairs. Interna- tional journal of computer vision, 104:94–116, 2013

  4. [4]

    Nießner, M

    M. Nießner, M. Zollh ¨ofer, S. Izadi, and M. Stamminger. Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (ToG), 32(6):1–11, 2013

  5. [5]

    J. Xie, R. Girshick, and A. Farhadi. Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14 , pages 842–857. Springer, 2016

  6. [6]

    J. L. Sch ¨onberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Com- puter Vision and Pattern Recognition (CVPR), 2016

  7. [7]

    Y . Yao, Z. Luo, S. Li, T. Fang, and L. Quan. Mvsnet: Depth inference for unstructured multi- view stereo. In Proceedings of the European conference on computer vision (ECCV) , pages 767–783, 2018

  8. [8]

    H. Xie, H. Yao, X. Sun, S. Zhou, and S. Zhang. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2690–2698, 2019

  9. [9]

    Murez, T

    Z. Murez, T. Van As, J. Bartolozzi, A. Sinha, V . Badrinarayanan, and A. Rabinovich. Atlas: End-to-end 3d scene reconstruction from posed images. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16 , pages 414–431. Springer, 2020

  10. [10]

    D. Wang, X. Cui, X. Chen, Z. Zou, T. Shi, S. Salcudean, Z. J. Wang, and R. Ward. Multi-view 3d reconstruction with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 5722–5731, 2021

  11. [11]

    Sayed, J

    M. Sayed, J. Gibson, J. Watson, V . Prisacariu, M. Firman, and C. Godard. Simplerecon: 3d reconstruction without 3d convolutions. In European Conference on Computer Vision, pages 1–19. Springer, 2022

  12. [12]

    H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and A. Geiger. Unifying flow, stereo and depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

  13. [13]

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024. 11

  14. [14]

    B. P. Duisterhof, L. Zust, P. Weinzaepfel, V . Leroy, Y . Cabon, and J. Revaud. Mast3r-sfm: a fully-integrated solution for unconstrained structure-from-motion. CoRR, 2024

  15. [15]

    L. Pan, D. Barath, M. Pollefeys, and J. L. Sch ¨onberger. Global Structure-from-Motion Revis- ited. In European Conference on Computer Vision (ECCV), 2024

  16. [16]

    Mildenhall, P

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020

  17. [17]

    A. Yu, V . Ye, M. Tancik, and A. Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4578–4587, 2021

  18. [18]

    Fridovich-Keil, A

    S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa. Plenoxels: Radi- ance fields without neural networks. In CVPR, 2022

  19. [19]

    A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022

  20. [20]

    Instant neural graph- ics primitives with a multiresolution hash encoding.ACM Transactions on Graphics (Proc

    T. M ¨uller, A. Evans, C. Schied, and A. Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. , 41(4):102:1–102:15, July 2022. doi: 10.1145/3528223.3530127. URL https://doi.org/10.1145/3528223.3530127

  21. [21]

    Sabour, S

    S. Sabour, S. V ora, D. Duckworth, I. Krasin, D. J. Fleet, and A. Tagliasacchi. Robustnerf: Ignoring distractors with robust losses. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 20626–20636, 2023

  22. [22]

    Kerbl, G

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  23. [23]

    Zheng, B

    S. Zheng, B. Zhou, R. Shao, B. Liu, S. Zhang, L. Nie, and Y . Liu. Gps-gaussian: Generalizable pixel-wise 3d gaussian splatting for real-time human novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  24. [24]

    Zhang, S

    K. Zhang, S. Bi, H. Tan, Y . Xiangli, N. Zhao, K. Sunkavalli, and Z. Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. European Conference on Computer Vision , 2024

  25. [25]

    W. Ren, Z. Zhu, B. Sun, J. Chen, M. Pollefeys, and S. Peng. Nerf on-the-go: Exploiting uncertainty for distractor-free nerfs in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  26. [26]

    Charatan, S

    D. Charatan, S. L. Li, A. Tagliasacchi, and V . Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19457–19467, 2024

  27. [27]

    Sabour, L

    S. Sabour, L. Goli, G. Kopanas, M. Matthews, D. Lagun, L. Guibas, A. Jacobson, D. J. Fleet, and A. Tagliasacchi. Spotlesssplats: Ignoring distractors in 3d gaussian splatting. arXiv preprint arXiv:2406.20055, 2024

  28. [28]

    Flynn, M

    J. Flynn, M. Broxton, L. Murmann, L. Chai, M. DuVall, C. Godard, K. Heal, S. Kaza, S. Lom- bardi, X. Luo, et al. Quark: Real-time, high-resolution, and general neural view synthesis. ACM Transactions on Graphics (TOG), 43(6):1–20, 2024

  29. [29]

    Ziwen, H

    C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y . Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. arXiv preprint 2410.12781, 2024. 12

  30. [30]

    Y . Chen, H. Xu, C. Zheng, B. Zhuang, M. Pollefeys, A. Geiger, T.-J. Cham, and J. Cai. Mvs- plat: Efficient 3d gaussian splatting from sparse multi-view images. In European Conference on Computer Vision, pages 370–386. Springer, 2025

  31. [31]

    Wallingford, A

    M. Wallingford, A. Bhattad, A. Kusupati, V . Ramanujan, M. Deitke, A. Kembhavi, R. Mot- taghi, W.-C. Ma, and A. Farhadi. From an image to a scene: Learning to imagine the world from a million 360° videos. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  32. [32]

    T. Yu, Z. Zheng, K. Guo, P. Liu, Q. Dai, and Y . Liu. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), June 2021

  33. [33]

    Jensen, A

    R. Jensen, A. Dahl, G. V ogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 406–413, 2014

  34. [34]

    T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018

  35. [35]

    A. Liu, R. Tucker, V . Jampani, A. Makadia, N. Snavely, and A. Kanazawa. Infinite nature: Per- petual view generation of natural scenes from a single image. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021

  36. [36]

    L. Xu, V . Agrawal, W. Laney, T. Garcia, A. Bansal, C. Kim, S. Rota Bul `o, L. Porzi, P. Kontschieder, A. Boˇziˇc, et al. Vr-nerf: High-fidelity virtualized walkable spaces. In SIG- GRAPH Asia 2023 Conference Papers, pages 1–12, 2023

  37. [37]

    Broxton, J

    M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hedman, M. DuVall, J. Dourgarian, J. Busch, M. Whalen, and P. Debevec. Immersive light field video with a layered mesh repre- sentation. 39(4):86:1–86:15, 2020

  38. [38]

    K.-E. Lin, L. Xiao, F. Liu, G. Yang, and R. Ramamoorthi. Deep 3d mask volume for view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1749–1758, 2021

  39. [39]

    J. S. Yoon, K. Kim, O. Gallo, H. S. Park, and J. Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020

  40. [40]

    T. Li, M. Slavcheva, M. Zollhoefer, S. Green, C. Lassner, C. Kim, T. Schmidt, S. Lovegrove, M. Goesele, R. Newcombe, et al. Neural 3d video synthesis from multi-view video. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 5521–5531, 2022

  41. [41]

    C.-Y . Lu, P. Zhou, A. Xing, C. Pokhariya, A. Dey, I. N. Shah, R. Mavidipalli, D. Hu, A. I. Comport, K. Chen, et al. Diva-360: The dynamic visual dataset for immersive neural fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22466–22476, 2024

  42. [42]

    S. Peng, Y . Zhang, Y . Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou. Neural body: Implicit neu- ral representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9054–9063, 2021

  43. [43]

    X. Chen, Q. Li, T. Wang, T. Xue, and J. Pang. Gennbv: Generalizable next-best-view policy for active 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16436–16445, 2024. 13

  44. [44]

    Monica and J

    R. Monica and J. Aleotti. Contour-based next-best view planning from point cloud segmenta- tion of unknown objects. Autonomous Robots, 42:443–458, 2018

  45. [45]

    H. Zha, K. Morooka, and T. Hasegawa. Next best viewpoint (nbv) planning for active object modeling based on a learning-by-showing approach. In Computer Vision—ACCV’98: Third Asian Conference on Computer Vision Hong Kong, China, January 8–10, 1998 Proceedings, Volume II 3, pages 185–192. Springer, 1997

  46. [46]

    L. Liu, X. Xia, H. Sun, Q. Shen, J. Xu, B. Chen, H. Huang, and K. Xu. Object-aware guidance for autonomous scene reconstruction. ACM Transactions on Graphics (TOG) , 37(4):1–12, 2018

  47. [47]

    Hardouin, F

    G. Hardouin, F. Morbidi, J. Moras, J. Marzat, and E. M. Mouaddib. Surface-driven next-best- view planning for exploration of large-scale 3d environments. IFAC-PapersOnLine, 53(2): 15501–15507, 2020

  48. [48]

    Hardouin, J

    G. Hardouin, J. Moras, F. Morbidi, J. Marzat, and E. M. Mouaddib. Next-best-view plan- ning for surface reconstruction of large-scale 3d environments with multiple uavs. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) , pages 1567–

  49. [49]

    Lin and B

    K. Lin and B. Yi. Active view planning for radiance fields. In Robotics Science and Systems, 2022

  50. [50]

    X. Pan, Z. Lai, S. Song, and G. Huang. Activenerf: Learning where to see with uncertainty estimation. In European Conference on Computer Vision, pages 230–246. Springer, 2022

  51. [51]

    S. Lee, L. Chen, J. Wang, A. Liniger, S. Kumar, and F. Yu. Uncertainty guided policy for active robotic 3d reconstruction using neural radiance fields. IEEE Robotics and Automation Letters, 7(4):12070–12077, 2022

  52. [52]

    H. Zhan, J. Zheng, Y . Xu, I. Reid, and H. Rezatofighi. Activermap: Radiance field for active mapping and planning. arXiv preprint arXiv:2211.12656, 2022

  53. [53]

    L. Jin, X. Chen, J. R ¨uckin, and M. Popovi ´c. Neu-nbv: Next best view planning using uncer- tainty estimation in image-based neural rendering. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11305–11312. IEEE, 2023

  54. [54]

    S ¨underhauf, J

    N. S ¨underhauf, J. Abou-Chakra, and D. Miller. Density-aware nerf ensembles: Quantifying predictive uncertainty in neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9370–9376. IEEE, 2023

  55. [55]

    Peralta, J

    D. Peralta, J. Casimiro, A. M. Nilles, J. A. Aguilar, R. Atienza, and R. Cajote. Next-best view policy for 3d reconstruction. arXiv preprint arXiv:2008.12664, 2020

  56. [56]

    Y . Ran, J. Zeng, S. He, J. Chen, L. Li, Y . Chen, G. Lee, and Q. Ye. Neurar: Neural uncertainty for autonomous 3d reconstruction with implicit neural representations. IEEE Robotics and Automation Letters, 8(2):1125–1132, 2023

  57. [57]

    Gu ´edon, T

    A. Gu ´edon, T. Monnier, P. Monasse, and V . Lepetit. Macarons: Mapping and coverage an- ticipation with rgb online self-supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 940–951, 2023

  58. [58]

    Jiang, B

    W. Jiang, B. Lei, and K. Daniilidis. Fisherrf: Active view selection and uncertainty quantifica- tion for radiance fields using fisher information. arXiv preprint arXiv:2311.17874, 2023

  59. [59]

    Boneh and M

    A. Boneh and M. Hofri. The coupon-collector problem revisited—a survey of engineering problems and computational methods. Stochastic Models, 13(1):39–66, 1997. 14

  60. [60]

    Deitke, D

    M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, June 2023

  61. [61]

    Deitke, R

    M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Information Processing Systems, 36, 2024

  62. [62]

    A. C. INTERPRETATION. Spurious correlation: A causal interpretation herbert a. simon. Causal Models in the Social Sciences, page 5, 1971

  63. [63]

    Y . Kim, S. Mo, M. Kim, K. Lee, J. Lee, and J. Shin. Discovering and mitigating visual biases through keyword explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11082–11092, 2024

  64. [64]

    J. Tang, J. Ren, H. Zhou, Z. Liu, and G. Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023

  65. [65]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022

  66. [66]

    J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning con- tinuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019

  67. [67]

    Z.-X. Zou, Z. Yu, Y .-C. Guo, Y . Li, D. Liang, Y .-P. Cao, and S.-H. Zhang. Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10324–10335, 2024

  68. [68]

    M ¨uller, A

    N. M ¨uller, A. Simonelli, L. Porzi, S. R. Bul`o, M. Nießner, and P. Kontschieder. Autorf: Learn- ing 3d object radiance fields from single view observations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3971–3980, 2022

  69. [69]

    Y .-L. Liu, C. Gao, A. Meuleman, H.-Y . Tseng, A. Saraf, C. Kim, Y .-Y . Chuang, J. Kopf, and J.-B. Huang. Robust dynamic radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13–23, 2023

  70. [70]

    Q. Wang, Z. Wang, K. Genova, P. Srinivasan, H. Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, and T. Funkhouser. Ibrnet: Learning multi-view image-based rendering. InCVPR, 2021

  71. [71]

    H. Lin, S. Peng, Z. Xu, Y . Yan, Q. Shuai, H. Bao, and X. Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022

  72. [72]

    A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and H. Su. Mvsnerf: Fast generaliz- able radiance field reconstruction from multi-view stereo. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14124–14133, 2021

  73. [73]

    H. Gao, R. Li, S. Tulsiani, B. Russell, and A. Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 35:33768–33780, 2022

  74. [74]

    K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla. Nerfies: Deformable neural radiance fields. ICCV, 2021

  75. [75]

    K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, D. B. Goldman, R. Martin-Brualla, and S. M. Seitz. Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph., 40(6), dec 2021. 15

  76. [76]

    L. Ling, Y . Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y . Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  77. [77]

    C. Lu, F. Yin, X. Chen, W. Liu, T. Chen, G. Yu, and J. Fan. A large-scale outdoor multi- modal dataset and benchmark for novel view synthesis and implicit scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 7557– 7567, 2023

  78. [78]

    Grauman, A

    K. Grauman, A. Westbury, L. Torresani, K. Kitani, J. Malik, T. Afouras, K. Ashutosh, V . Baiyya, S. Bansal, B. Boote, et al. Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19383–19400, 2024

  79. [79]

    H. Chen, Y . Hou, C. Qu, I. Testini, X. Hong, and J. Jiao. 360+x: A panoptic multi-modal scene understanding dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  80. [80]

    L. Li, Z. Shen, Z. Wang, L. Shen, and P. Tan. Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35:13485–13498, 2022

Showing first 80 references.