pith. sign in

arxiv: 1906.10491 · v1 · pith:6JKQELKXnew · submitted 2019-06-25 · 💻 cs.CV

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Pith reviewed 2026-05-25 16:46 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic 3D reconstructionray potentialsvisibility artifactsQPBO relaxationgraph cutsalpha-expansionvoxel gridreprojection error
0
0 comments X

The pith

Optimizing costs along image rays for the first occupied voxel removes visibility artifacts that unary potentials create in semantic 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard Markov Random Field formulations treat depth and semantic likelihoods as unary potentials over voxels, which fails to enforce correct visibility and produces artifacts such as incorrect labels behind surfaces. It replaces this with an explicit optimization of reprojection error, where each ray contributes a cost based only on the semantic class and depth of its first occupied voxel. The two-label version of this ray-potential problem is converted into a graph-representable form solvable by graph cut under QPBO relaxation, and the multi-label case uses alpha-expansion with the same relaxation at each step. A reader would care because the change yields 3D semantic models without the common visibility errors while running at speeds comparable to prior methods.

Core claim

Formulating dense semantic 3D reconstruction as direct optimization of reprojection error over rays, where the cost for each ray depends on the semantic class and depth of the first occupied voxel, produces solutions free of the visibility artifacts that arise when likelihoods are modelled as unary potentials in a voxel grid MRF. The two-label ray-potential problem is made tractable by a transformation to graph-representable form under QPBO relaxation that permits graph-cut solution; the multi-label version applies alpha-expansion using the identical relaxation inside each move.

What carries the argument

Ray potential defined by the cost on the semantic class and depth of the first occupied voxel along each ray, converted to graph-representable form under QPBO relaxation for graph-cut and alpha-expansion optimization.

If this is right

  • Visibility artifacts that appear when depth and semantic information are encoded as unary potentials are eliminated.
  • The two-label ray-potential problem becomes solvable in polynomial time via graph cut after the QPBO transformation.
  • Multi-label instances remain tractable by repeated alpha-expansion moves that reuse the same relaxed graph construction.
  • Runtime stays comparable to existing unary-potential methods while avoiding their approximation artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ray-cost construction could be applied to non-semantic tasks such as occupancy mapping where visibility constraints dominate.
  • Hybrid pipelines could alternate between the discrete ray optimization and continuous refinement of voxel boundaries.
  • Datasets that annotate both semantics and explicit visibility could serve as direct benchmarks for the claimed artifact reduction.

Load-bearing premise

The conversion of the ray-potential problem into a graph-representable form under QPBO relaxation produces solutions that remove visibility artifacts without introducing new approximation errors.

What would settle it

A controlled comparison on scenes with known ground-truth occlusions that measures whether ray-potential reconstructions assign correct labels behind surfaces while unary-potential baselines do not.

Figures

Figures reproduced from arXiv: 1906.10491 by Christian Haene, Lubor Ladicky, Marc Pollefeys, Nikolay Savinov.

Figure 1
Figure 1. Figure 1: The process of merging for ray of the length 3. Because the optimal assignment of auxiliary variables is the same for every [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The graph construction for ray with 7 voxels. Each vari [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative results on 6 data sets (from top to bottom): Castle [24], South Building [11], Catania [11], Providence [11], CAB [5] and Opera [5]. Our method successfully reconstructed challenging 3D data with high level of detail. Minor errors in the reconstructions were caused by the combination of errors of the semantic classifier, insufficient amount data from certain viewpoints or errors in the depth pr… view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of method with the sate-of-the-art dense volumetric reconstruction [11] on South Building [11] and Castle datasets [24]. Incorporation of true semantic and depth ray likelihoods in the optimization framework led to the corrections of systematic reconstruction artifacts, due to the approximations in the formulation of the problem. As can be seen in the South Building reconstruction - state-of-the… view at source ↗
read the original abstract

Dense semantic 3D reconstruction is typically formulated as a discrete or continuous problem over label assignments in a voxel grid, combining semantic and depth likelihoods in a Markov Random Field framework. The depth and semantic information is incorporated as a unary potential, smoothed by a pairwise regularizer. However, modelling likelihoods as a unary potential does not model the problem correctly leading to various undesirable visibility artifacts. We propose to formulate an optimization problem that directly optimizes the reprojection error of the 3D model with respect to the image estimates, which corresponds to the optimization over rays, where the cost function depends on the semantic class and depth of the first occupied voxel along the ray. The 2-label formulation is made feasible by transforming it into a graph-representable form under QPBO relaxation, solvable using graph cut. The multi-label problem is solved by applying alpha-expansion using the same relaxation in each expansion move. Our method was indeed shown to be feasible in practice, running comparably fast to the competing methods, while not suffering from ray potential approximation artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that unary potentials for semantic and depth likelihoods in voxel-grid MRFs for dense semantic 3D reconstruction produce visibility artifacts. It proposes instead to optimize ray potentials whose cost depends on the semantic class and depth of the first occupied voxel along each ray (directly minimizing reprojection error). The 2-label case is reduced to a graph-representable form via QPBO relaxation and solved by graph cuts; the multi-label case uses alpha-expansion on the same relaxation. The method is reported to run at comparable speed to baselines while avoiding ray-potential approximation artifacts.

Significance. If the QPBO relaxation is shown to be tight for the first-occupied-voxel cost and the recovered labelings match the exact ray costs, the formulation would supply a more faithful discrete model of visibility than standard unary MRFs, with direct implications for semantic reconstruction accuracy.

major comments (1)
  1. [2-label formulation and multi-label extension] The central claim that the QPBO relaxation (and subsequent alpha-expansion) correctly optimizes the first-occupied-voxel ray cost without introducing new approximation errors or leaving visibility artifacts is load-bearing, yet the manuscript supplies no derivation, tightness proof, or bound on the relaxation gap for this specific potential (see the description of the 2-label formulation and the multi-label extension).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of rigorously justifying the QPBO relaxation. We address the single major comment below.

read point-by-point responses
  1. Referee: [2-label formulation and multi-label extension] The central claim that the QPBO relaxation (and subsequent alpha-expansion) correctly optimizes the first-occupied-voxel ray cost without introducing new approximation errors or leaving visibility artifacts is load-bearing, yet the manuscript supplies no derivation, tightness proof, or bound on the relaxation gap for this specific potential (see the description of the 2-label formulation and the multi-label extension).

    Authors: We acknowledge that the current manuscript does not contain an explicit derivation of the QPBO reduction or a formal bound on the relaxation gap. In the revision we will add a dedicated subsection that (i) derives the auxiliary variables and edge weights that encode the first-occupied-voxel cost under the QPBO relaxation, (ii) shows that the resulting pairwise terms remain submodular (hence the relaxation is tight for the 2-label case), and (iii) argues that the same submodularity is preserved inside each alpha-expansion move, so no additional approximation is introduced beyond the standard alpha-expansion guarantee. We will also include a short empirical verification that the recovered labelings match the exact ray costs on small instances where brute-force enumeration is feasible. revision: yes

Circularity Check

0 steps flagged

No circularity: independent modeling change with standard QPBO reduction

full rationale

The paper's central contribution is a reformulation of the objective from unary potentials to explicit ray potentials (cost on semantic class and depth of first occupied voxel). This is presented as a direct modeling correction rather than a derivation from fitted parameters or prior self-citations. The 2-label case is reduced to graph-representable form via QPBO relaxation (a known external technique), and multi-label via alpha-expansion; neither step is shown to reduce by construction to the input data or to a self-citation chain. No equations equate the output to a fitted input, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation remains self-contained against external benchmarks for MRF optimization.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete free parameters, axioms, or invented entities; the method builds on standard discrete optimization primitives whose details are not supplied.

pith-pipeline@v0.9.0 · 5719 in / 1125 out tokens · 35850 ms · 2026-05-25T16:46:50.201612+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    J. Besag. On the statisical analysis of dirty pictures. Journal of the Royal Statistical Society, 1986

  2. [2]

    Boros and P

    E. Boros and P. Hammer. Pseudo-boolean optimization. Dis- crete Applied Mathematics, 2002

  3. [3]

    Boykov, O

    Y . Boykov, O. Veksler, and R. Zabih. Fast approximate en- ergy minimization via graph cuts. Transactions on Pattern Analysis and Machine Intelligence, 2001

  4. [4]

    G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Seg- mentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008

  5. [5]

    Cohen, C

    A. Cohen, C. Zach, S. N. Sinha, and M. Pollefeys. Discover- ing and exploiting 3d symmetries in structure from motion. In Conference on Computer Vision and Pattern Recognition, 2012

  6. [6]

    Cremers and K

    D. Cremers and K. Kolev. Multiview stereo and silhou- ette consistency via convex functionals over convex domains. Transactions on Pattern Analysis and Machine Intelligence, 2011

  7. [7]

    Curless and M

    B. Curless and M. Levoy. A volumetric method for build- ing complex models from range images. In Conference on Computer graphics and interactive techniques, 1996

  8. [8]

    Delaunoy, M

    A. Delaunoy, M. Pollefeys, et al. Photometric bundle adjust- ment for dense multi-view 3d modeling. In Conference on Computer Vision and Pattern Recognition, 2014

  9. [9]

    Freedman and P

    D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In Conference on Computer Vision and Pattern Recognition, 2005

  10. [10]

    A. V . Goldberg, S. Hed, H. Kaplan, R. E. Tarjan, and R. F. Werneck. Maximum flows by incremental breadth-first search. In Proc. ALGO ESA, 2011

  11. [11]

    C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In Conference on Computer Vision and Pattern Recognition , 2013

  12. [12]

    Hernandez, G

    C. Hernandez, G. V ogiatzis, and R. Cipolla. Probabilistic visibility for multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2007

  13. [13]

    B.-s. Kim, P. Kohli, and S. Savarese. 3D scene understanding by voxel-CRF. In International Conference on Computer Vision, 2013. Example Image Depth Map Semantic Segmentation Reconstructed 3D model Figure 3. Qualitative results on 6 data sets (from top to bottom): Castle [24], South Building [11], Catania [11], Providence [11], CAB [5] and Opera [5]. Our ...

  14. [14]

    Kolmogorov and Y

    V . Kolmogorov and Y . Boykov. What metrics can be approx- imated by geo-cuts, or global optimization of length/area and flux. In International Conference on Computer Vision, 2005

  15. [15]

    Ladicky, C

    L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associa- tive hierarchical CRFs for object class image segmentation. In International Conference on Computer Vision, 2009

  16. [16]

    Lempitsky and Y

    V . Lempitsky and Y . Boykov. Global optimization for shape fitting. In Conference on Computer Vision and Pattern Recognition, 2007

  17. [17]

    Liu and D

    S. Liu and D. B. Cooper. Ray markov random fields for image-based 3d modeling: Model and efficient inference. In Conference on Computer Vision and Pattern Recognition, 2010

  18. [18]

    Liu and D

    S. Liu and D. B. Cooper. A complete statistical inverse ray tracing approach to multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2011

  19. [19]

    W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. InProceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, 1987

  20. [20]

    Efficient Minimization of Higher Order Submodular Functions using Monotonic Boolean Functions

    S. Ramalingam, C. Russell, L. Ladicky, and P. H. Torr. Efficient minimization of higher order submodular func- tions using monotonic boolean functions. Arxiv preprint arXiv:1109.2304, 2011

  21. [21]

    S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Conference on Com- puter Vision and Pattern Recognition, 2006

  22. [22]

    Shotton, J

    J. Shotton, J. Winn, C. Rother, and A. Criminisi. Texton- Boost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Euro- pean Conference on Computer Vision, 2006

  23. [23]

    S. N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahe- dral mesh. In International Conference on Computer Vision, 2007

  24. [24]

    Strecha, W

    C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. InConference on Computer Vision and Pattern Recognition, 2008

  25. [25]

    V ogiatzis, P

    G. V ogiatzis, P. H. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In Conference on Computer Vision and Pattern Recognition, 2005

  26. [26]

    H.-H. Vu, R. Keriven, P. Labatut, and J.-P. Pons. Towards high-resolution large-scale multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2009

  27. [27]

    H.-H. Vu, P. Labatut, J.-P. Pons, and R. Keriven. High accu- racy and visibility-consistent dense multiview stereo. Trans- actions on Pattern Analysis and Machine Intelligence, 2012

  28. [28]

    C. Zach. Fast and high quality fusion of depth maps. In 3D Data Processing, Visualization and Transmission, 2008