arxiv: 2406.04301 · v4 · submitted 2024-06-06 · 💻 cs.CV

Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

Pith reviewed 2026-05-24 00:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords neural surface reconstruction · epipolar geometry · sparse views · generalizable reconstruction · cost volume · SDF · monocular depth regularization · feature aggregation

The pith

EpiS reconstructs surfaces from sparse multi-view images by guiding fine-grained epipolar feature aggregation with coarse cost-volume features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that explicitly incorporating epipolar geometry into a neural surface reconstruction pipeline overcomes the geometric ambiguity and information loss that plague cost-volume methods when inputs are limited to a few views. It replaces reliance on simple statistics like mean and variance with guided sampling of features along epipolar lines, fusion through an epipolar transformer, ray-wise aggregation into SDF-aware features, and scale-invariant regularization drawn from a pretrained monocular depth model. A sympathetic reader would care because sparse-view capture is the practical norm for many real scenes, yet existing generalizable approaches produce over-smoothed or incomplete surfaces; the new design promises accurate reconstruction without dense imagery or per-scene optimization. If correct, the approach would make high-fidelity surface modeling feasible from ordinary limited photo sets.
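
To make the information-loss point concrete, here is a minimal NumPy sketch (not the paper's code) of the mean-and-variance summary that cost-volume methods rely on. The function name `cost_volume_stats` and the toy feature values are illustrative assumptions. It shows that two different assignments of features to source views collapse to identical statistics, which is the kind of view-dependent structure such a summary cannot retain.

```python
import numpy as np

def cost_volume_stats(view_feats):
    """Summarize per-view features of shape (V, C) by per-channel mean and variance."""
    view_feats = np.asarray(view_feats, dtype=np.float64)
    return view_feats.mean(axis=0), view_feats.var(axis=0)

# Hypothetical features of one 3D sample as seen from three source views.
feats_a = np.array([[1.0, 0.0], [2.0, 0.5], [3.0, 1.0]])
# The same feature vectors, but assigned to different views.
feats_b = feats_a[[2, 0, 1]]

mean_a, var_a = cost_volume_stats(feats_a)
mean_b, var_b = cost_volume_stats(feats_b)

# The summary is permutation-invariant, so it cannot tell which view saw what.
same_stats = np.allclose(mean_a, mean_b) and np.allclose(var_a, var_b)
```

Any order-invariant statistic behaves this way, so the example says nothing about how much accuracy is lost in practice; that is what the paper's experiments are meant to measure.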

Core claim

The authors present EpiS as a generalizable framework that uses coarse cost-volume features to guide aggregation of fine-grained epipolar features sampled along corresponding epipolar lines across source views. An epipolar transformer fuses the multi-view information, followed by ray-wise aggregation to produce SDF-aware features for surface estimation. A geometry regularization strategy that leverages a pretrained monocular depth model through scale-invariant global and local constraints further mitigates information loss under sparse views.
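
The Figure 2 caption on this page states that the distribution features from the cost volume act as queries, while the epipolar features act as keys and values in cross-attention. The following is a minimal sketch of that arrangement under several assumptions that do not come from the paper: a single attention head, ordinary softmax attention (the page also quotes a linearized attention mechanism, which is not reproduced here), attention taken across source views, random projection matrices, and hypothetical names and dimensions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def guided_epipolar_fusion(stat_feat, epi_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative, not the authors' module).

    stat_feat: (N, Cs) coarse cost-volume feature per ray sample, used as query.
    epi_feats: (N, V, Ce) fine epipolar features, one per source view (keys/values).
    Returns the fused features (N, D) and the attention weights (N, V).
    """
    q = stat_feat @ w_q                                   # (N, D)
    k = epi_feats @ w_k                                   # (N, V, D)
    v = epi_feats @ w_v                                   # (N, V, D)
    scores = np.einsum("nd,nvd->nv", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                       # weights over source views
    fused = np.einsum("nv,nvd->nd", attn, v)
    return fused, attn

# Hypothetical sizes: 8 ray samples, 3 source views.
rng = np.random.default_rng(0)
N, V, Cs, Ce, D = 8, 3, 16, 32, 24
stat_feat = rng.normal(size=(N, Cs))
epi_feats = rng.normal(size=(N, V, Ce))
w_q = rng.normal(size=(Cs, D)) * 0.1
w_k = rng.normal(size=(Ce, D)) * 0.1
w_v = rng.normal(size=(Ce, D)) * 0.1
fused, attn = guided_epipolar_fusion(stat_feat, epi_feats, w_q, w_k, w_v)
```

The design point is that the coarse feature only decides how much each view contributes; the content of the fused feature still comes from the fine-grained epipolar features.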

What carries the argument

Epipolar feature aggregation guided by cost-volume features, which samples and fuses view-dependent geometry along epipolar lines before producing SDF-aware outputs.
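
One way to picture sampling along epipolar lines: points at different depths on a single target-view ray project onto a line in each source view. The sketch below assumes pinhole cameras with the world-to-camera convention x_cam = R x_w + t; the intrinsics, poses, and function names are invented for illustration and are not taken from the paper.

```python
import numpy as np

def project(points_w, K, R, t):
    """Project world points (N, 3) to pixels (N, 2), using x_cam = R @ x_w + t."""
    cam = points_w @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def epipolar_samples(origin, direction, depths, K_src, R_src, t_src):
    """Pixel locations in a source view of samples taken along a target ray."""
    pts = origin[None, :] + depths[:, None] * direction[None, :]
    return project(pts, K_src, R_src, t_src)

# Hypothetical setup: target camera at the world origin, source camera shifted along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
origin = np.zeros(3)
direction = np.array([0.1, -0.05, 1.0])
direction /= np.linalg.norm(direction)
R_src = np.eye(3)
t_src = np.array([-0.5, 0.0, 0.0])   # source camera center sits at x = +0.5
depths = np.linspace(1.0, 4.0, 8)

uv = epipolar_samples(origin, direction, depths, K, R_src, t_src)

# Collinearity check: the 2D cross product of every offset with the first offset is ~0.
offsets = uv[1:] - uv[0]
cross = offsets[:, 0] * offsets[0, 1] - offsets[:, 1] * offsets[0, 0]
```

Features would then be read from the source feature map at the `uv` locations, for example by bilinear interpolation; that step is omitted here.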

If this is right

  • Outperforms state-of-the-art generalizable surface reconstruction methods on DTU and BlendedMVS under sparse-view settings.
  • Maintains strong generalization without per-scene optimization.
  • Reduces over-smoothing by preserving view-dependent geometric structure that simple cost-volume statistics discard.
  • Handles occlusions and geometric ambiguity more effectively through explicit epipolar sampling and depth-based regularization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hybrid cost-volume plus epipolar strategy could transfer to other sparse multi-view tasks such as depth estimation or novel-view synthesis.
  • Similar monocular priors might regularize reconstruction in dynamic or non-rigid scenes where epipolar consistency still holds across frames.
  • The design implies that learned priors aligned with epipolar geometry can substitute for additional views in extremely sparse regimes.

Load-bearing premise

Coarse cost-volume features can reliably guide fine-grained epipolar feature aggregation while a pretrained monocular depth model supplies unbiased scale-invariant constraints that align with multi-view epipolar geometry.

What would settle it

On the DTU dataset using three input views, EpiS produces higher Chamfer distance or lower F-score than prior generalizable cost-volume baselines.
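
Both metrics named here have standard point-set definitions, sketched below. The assumptions are mine: brute-force nearest neighbors, Chamfer distance taken as the mean of accuracy (prediction to ground truth) and completeness (ground truth to prediction), and a user-chosen threshold `tau` for the F-score. Benchmark evaluation scripts typically add steps such as masking and point resampling that are not shown.

```python
import numpy as np

def nearest_dists(a, b):
    """For each point in a (N, 3), the distance to its nearest neighbor in b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(pred, gt):
    """Mean of accuracy (pred -> gt) and completeness (gt -> pred)."""
    accuracy = nearest_dists(pred, gt).mean()
    completeness = nearest_dists(gt, pred).mean()
    return 0.5 * (accuracy + completeness)

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = (nearest_dists(pred, gt) < tau).mean()
    recall = (nearest_dists(gt, pred) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: the prediction is the ground truth shifted by 0.1 along x.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pred = gt + np.array([0.1, 0.0, 0.0])

cd = chamfer_distance(pred, gt)
f_loose = f_score(pred, gt, tau=0.2)
f_tight = f_score(pred, gt, tau=0.05)
```

Lower Chamfer distance and higher F-score are better, so the falsifying outcome described above would be EpiS scoring worse than the baselines on either metric.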

Figures

Figures reproduced from arXiv: 2406.04301 by Kaichen Zhou, Xinhai Chang.

Figure 1: Reconstruction results on the DTU dataset.
Figure 2: Illustration of the Pipeline. Given a ray in the target view, it is projected onto source views to extract the epipolar feature and distribution feature (variance and mean) using a cost volume. Subsequently, the distribution features are utilized as queries, while the epipolar features serve as keys and values for cross-attention transformers, facilitating cross-view epipolar feature fusion. This fused fe…
Figure 3: Visualization of Our Fine-Tuning Strategy Designs.
Figure 4: Visualization results on the DTU dataset.
Figure 5: Reconstruction results on the BlendedMVS dataset.

Original abstract

Reconstructing accurate surfaces from sparse multi-view images remains challenging due to severe geometric ambiguity and occlusions. Existing generalizable neural surface reconstruction methods primarily rely on cost volumes that summarize multi-view features using simple statistics (e.g., mean and variance), which discard critical view-dependent geometric structure and often lead to over-smoothed reconstructions. We propose EpiS, a generalizable neural surface reconstruction framework that explicitly leverages epipolar geometry for sparse-view inputs. Instead of directly regressing geometry from cost-volume statistics, EpiS uses coarse cost-volume features to guide the aggregation of fine-grained epipolar features sampled along corresponding epipolar lines across source views. An epipolar transformer fuses multi-view information, followed by ray-wise aggregation to produce SDF-aware features for surface estimation. To further mitigate information loss under sparse views, we introduce a geometry regularization strategy that leverages a pretrained monocular depth model through scale-invariant global and local constraints. Extensive experiments on DTU and BlendedMVS demonstrate that EpiS significantly outperforms state-of-the-art generalizable surface reconstruction methods under sparse-view settings, while maintaining strong generalization without per-scene optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes EpiS, a generalizable neural surface reconstruction framework for sparse multi-view inputs. It replaces direct regression from cost-volume statistics with coarse cost-volume features guiding aggregation of fine-grained epipolar features sampled along epipolar lines, fused via an epipolar transformer and ray-wise aggregation to produce SDF-aware features. A geometry regularization strategy adds scale-invariant global and local constraints from a pretrained monocular depth model. Experiments on DTU and BlendedMVS report significant outperformance over prior generalizable methods under sparse views without per-scene optimization.

Significance. If the reported gains hold after verification of implementation details and ablations, the explicit use of epipolar geometry for feature aggregation combined with monocular regularization could advance sparse-view surface reconstruction by preserving view-dependent structure that simple cost-volume statistics discard.

major comments (2)
  1. [Abstract] Abstract (geometry regularization strategy paragraph): The central performance claim depends on the monocular depth constraints supplying unbiased signals that align with multi-view epipolar geometry after scale normalization. No analysis or test is described showing that systematic errors in the pretrained model (e.g., in low-texture or view-dependent regions under 3-view DTU/BlendedMVS protocols) do not pull ray-wise SDF features toward inconsistent surfaces, which directly risks undermining the reported gains over cost-volume baselines.
  2. [Method] Method description (epipolar feature aggregation): The claim that coarse cost-volume features reliably guide fine-grained epipolar aggregation is load-bearing for the outperformance result, yet the manuscript provides no quantitative measure (e.g., alignment error or ablation removing the guidance) of how well this guidance functions when the cost volume itself is severely under-constrained by only three views.
minor comments (2)
  1. [Abstract] The abstract and method sections use 'SDF-aware features' without an explicit definition or equation linking the ray-wise aggregation output to the signed distance function used for surface extraction.
  2. [Experiments] Dataset splits, number of views (e.g., exact 3-view protocol), and whether error bars or multiple runs are reported are not mentioned in the provided abstract; these details are needed for reproducibility of the 'significantly outperforms' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting two important aspects of our method that warrant further clarification. We address each major comment below and indicate where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract] Abstract (geometry regularization strategy paragraph): The central performance claim depends on the monocular depth constraints supplying unbiased signals that align with multi-view epipolar geometry after scale normalization. No analysis or test is described showing that systematic errors in the pretrained model (e.g., in low-texture or view-dependent regions under 3-view DTU/BlendedMVS protocols) do not pull ray-wise SDF features toward inconsistent surfaces, which directly risks undermining the reported gains over cost-volume baselines.

    Authors: We agree that an explicit analysis of potential systematic biases in the pretrained monocular depth model under the 3-view protocols would strengthen the paper. The scale-invariant global and local constraints are intended to reduce sensitivity to absolute scale and local inconsistencies, and the reported gains over pure cost-volume baselines on both DTU and BlendedMVS provide indirect evidence that any residual biases do not dominate. Nevertheless, we will add a dedicated paragraph in the revised manuscript discussing known limitations of monocular depth estimators in low-texture and view-dependent regions, together with qualitative visualizations of the depth predictions used during training on the evaluation scenes. revision: partial

  2. Referee: [Method] Method description (epipolar feature aggregation): The claim that coarse cost-volume features reliably guide fine-grained epipolar aggregation is load-bearing for the outperformance result, yet the manuscript provides no quantitative measure (e.g., alignment error or ablation removing the guidance) of how well this guidance functions when the cost volume itself is severely under-constrained by only three views.

    Authors: The guidance mechanism is indeed central. While the current manuscript does not report a direct alignment-error metric between coarse cost-volume features and the sampled epipolar features, the ablation studies already isolate the contribution of the epipolar transformer and ray-wise aggregation. To directly quantify the guidance quality under three-view sparsity, we will add a new ablation that replaces the learned guidance with uniform or random sampling along epipolar lines and report the resulting surface reconstruction metrics on DTU. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external epipolar geometry and pretrained monocular model without self-referential reduction

The paper's central claims rest on standard external components (epipolar geometry for feature aggregation along lines, coarse cost-volume guidance, and scale-invariant constraints from a pretrained monocular depth model) that are not defined in terms of the method's outputs or fitted parameters. No equations, self-citations, or uniqueness theorems are presented that reduce the performance gains or regularization strategy to a fit or renaming of the inputs themselves. The abstract and description treat these as independent priors, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated or derivable from the provided text.

pith-pipeline@v0.9.0 · 5720 in / 1117 out tokens · 25722 ms · 2026-05-24T00:09:03.357112+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    EpiS uses coarse cost-volume features to guide the aggregation of fine-grained epipolar features sampled along corresponding epipolar lines... epipolar transformer fuses multi-view information, followed by ray-wise aggregation to produce SDF-aware features... geometry regularization strategy that leverages a pretrained monocular depth model through scale-invariant global and local constraints (global triplet loss, local gradient loss).

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

    Relation between the paper passage and the cited Recognition theorem.

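    Epipolar & Ray Information Aggregation... Linearized Attention mechanism... Geometry Decoder & Weights Decoder... $L_{\text{global}} = \big((\hat{d}_1 - \hat{d}_s) \times (\tilde{d}_2 - \tilde{d}_s) - (\hat{d}_2 - \hat{d}_s) \times (\tilde{d}_1 - \tilde{d}_s)\big)^2$, $L_{\text{local}} = \Big(1 - \frac{\hat{v} \cdot \tilde{v}}{\|\hat{v}\| \, \|\tilde{v}\|}\Big)^2$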
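
The two regularization terms quoted above (called a global triplet loss and a local gradient loss on this page) can be sketched numerically. Several things are assumptions here: the hatted and tilded depths are taken to be two depth maps being compared (the page does not say which is rendered and which is monocular), the × is read as scalar multiplication, `v` is read as a per-pixel gradient vector, losses are averaged over samples, and all function names are hypothetical. The point of the sketch is to show why the global term is insensitive to an unknown scale and shift.

```python
import numpy as np

def global_triplet_loss(d_hat, d_tilde, i1, i2, i_s):
    """Squared cross term over pixel triplets (1, 2, s), averaged over triplets."""
    a = (d_hat[i1] - d_hat[i_s]) * (d_tilde[i2] - d_tilde[i_s])
    b = (d_hat[i2] - d_hat[i_s]) * (d_tilde[i1] - d_tilde[i_s])
    return np.mean((a - b) ** 2)

def local_gradient_loss(v_hat, v_tilde, eps=1e-12):
    """Squared (1 - cosine similarity) between vectors of shape (N, 2)."""
    dot = (v_hat * v_tilde).sum(axis=-1)
    norm = np.linalg.norm(v_hat, axis=-1) * np.linalg.norm(v_tilde, axis=-1)
    return np.mean((1.0 - dot / (norm + eps)) ** 2)

rng = np.random.default_rng(0)
d_tilde = rng.uniform(1.0, 5.0, size=100)       # toy reference depths
d_hat_affine = 2.5 * d_tilde + 0.7              # same geometry, unknown scale and shift
d_hat_warped = d_tilde ** 2                     # non-affine distortion
i1, i2, i_s = (rng.integers(0, 100, size=50) for _ in range(3))

loss_affine = global_triplet_loss(d_hat_affine, d_tilde, i1, i2, i_s)
loss_warped = global_triplet_loss(d_hat_warped, d_tilde, i1, i2, i_s)

v_tilde = rng.normal(size=(100, 2))
loss_rescaled = local_gradient_loss(3.0 * v_tilde, v_tilde)   # same direction
loss_flipped = local_gradient_loss(-v_tilde, v_tilde)         # opposite direction
```

If one depth map is an affine function of the other, the two products in the global term cancel exactly, so only non-affine disagreement is penalized. The local term depends only on direction, so rescaling a gradient leaves it unchanged.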

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
