Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Christian Haene; Lubor Ladicky; Marc Pollefeys; Nikolay Savinov

arxiv: 1906.10491 · v1 · pith:6JKQELKXnew · submitted 2019-06-25 · 💻 cs.CV

Discrete Optimization of Ray Potentials for Semantic 3D Reconstruction

Nikolay Savinov , Lubor Ladicky , Christian Haene , Marc Pollefeys This is my paper

Pith reviewed 2026-05-25 16:46 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic 3D reconstructionray potentialsvisibility artifactsQPBO relaxationgraph cutsalpha-expansionvoxel gridreprojection error

0 comments

The pith

Optimizing costs along image rays for the first occupied voxel removes visibility artifacts that unary potentials create in semantic 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard Markov Random Field formulations treat depth and semantic likelihoods as unary potentials over voxels, which fails to enforce correct visibility and produces artifacts such as incorrect labels behind surfaces. It replaces this with an explicit optimization of reprojection error, where each ray contributes a cost based only on the semantic class and depth of its first occupied voxel. The two-label version of this ray-potential problem is converted into a graph-representable form solvable by graph cut under QPBO relaxation, and the multi-label case uses alpha-expansion with the same relaxation at each step. A reader would care because the change yields 3D semantic models without the common visibility errors while running at speeds comparable to prior methods.

Core claim

Formulating dense semantic 3D reconstruction as direct optimization of reprojection error over rays, where the cost for each ray depends on the semantic class and depth of the first occupied voxel, produces solutions free of the visibility artifacts that arise when likelihoods are modelled as unary potentials in a voxel grid MRF. The two-label ray-potential problem is made tractable by a transformation to graph-representable form under QPBO relaxation that permits graph-cut solution; the multi-label version applies alpha-expansion using the identical relaxation inside each move.

What carries the argument

Ray potential defined by the cost on the semantic class and depth of the first occupied voxel along each ray, converted to graph-representable form under QPBO relaxation for graph-cut and alpha-expansion optimization.

If this is right

Visibility artifacts that appear when depth and semantic information are encoded as unary potentials are eliminated.
The two-label ray-potential problem becomes solvable in polynomial time via graph cut after the QPBO transformation.
Multi-label instances remain tractable by repeated alpha-expansion moves that reuse the same relaxed graph construction.
Runtime stays comparable to existing unary-potential methods while avoiding their approximation artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ray-cost construction could be applied to non-semantic tasks such as occupancy mapping where visibility constraints dominate.
Hybrid pipelines could alternate between the discrete ray optimization and continuous refinement of voxel boundaries.
Datasets that annotate both semantics and explicit visibility could serve as direct benchmarks for the claimed artifact reduction.

Load-bearing premise

The conversion of the ray-potential problem into a graph-representable form under QPBO relaxation produces solutions that remove visibility artifacts without introducing new approximation errors.

What would settle it

A controlled comparison on scenes with known ground-truth occlusions that measures whether ray-potential reconstructions assign correct labels behind surfaces while unary-potential baselines do not.

Figures

Figures reproduced from arXiv: 1906.10491 by Christian Haene, Lubor Ladicky, Marc Pollefeys, Nikolay Savinov.

**Figure 1.** Figure 1: The process of merging for ray of the length 3. Because the optimal assignment of auxiliary variables is the same for every [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: The graph construction for ray with 7 voxels. Each vari [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative results on 6 data sets (from top to bottom): Castle [24], South Building [11], Catania [11], Providence [11], CAB [5] and Opera [5]. Our method successfully reconstructed challenging 3D data with high level of detail. Minor errors in the reconstructions were caused by the combination of errors of the semantic classifier, insufficient amount data from certain viewpoints or errors in the depth pr… view at source ↗

**Figure 4.** Figure 4: Comparison of method with the sate-of-the-art dense volumetric reconstruction [11] on South Building [11] and Castle datasets [24]. Incorporation of true semantic and depth ray likelihoods in the optimization framework led to the corrections of systematic reconstruction artifacts, due to the approximations in the formulation of the problem. As can be seen in the South Building reconstruction - state-of-the… view at source ↗

read the original abstract

Dense semantic 3D reconstruction is typically formulated as a discrete or continuous problem over label assignments in a voxel grid, combining semantic and depth likelihoods in a Markov Random Field framework. The depth and semantic information is incorporated as a unary potential, smoothed by a pairwise regularizer. However, modelling likelihoods as a unary potential does not model the problem correctly leading to various undesirable visibility artifacts. We propose to formulate an optimization problem that directly optimizes the reprojection error of the 3D model with respect to the image estimates, which corresponds to the optimization over rays, where the cost function depends on the semantic class and depth of the first occupied voxel along the ray. The 2-label formulation is made feasible by transforming it into a graph-representable form under QPBO relaxation, solvable using graph cut. The multi-label problem is solved by applying alpha-expansion using the same relaxation in each expansion move. Our method was indeed shown to be feasible in practice, running comparably fast to the competing methods, while not suffering from ray potential approximation artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper replaces unary MRFs with ray potentials that cost the first occupied voxel to remove visibility artifacts, solved via QPBO relaxation and alpha-expansion.

read the letter

The main point is that standard unary potentials in voxel MRFs for semantic 3D reconstruction fail to model visibility correctly, since they do not tie the cost to the first surface hit along each ray. This paper reformulates the objective as direct optimization of reprojection error over rays, where the cost depends on the semantic class and depth of the first occupied voxel. That modeling shift is the core new element. They make the binary case tractable by converting it to a graph-representable energy under QPBO relaxation, then solve the multi-label version with alpha-expansion using the same relaxation in each move. The abstract states the method runs at speeds comparable to prior work and avoids the ray-potential artifacts. That is a concrete, targeted improvement over the unary baseline they criticize. The formulation itself looks like a genuine modeling change rather than a routine tweak. The soft spot is the QPBO relaxation. QPBO handles non-submodular terms but can leave variables unlabeled or return energies above the true minimum, so it is not guaranteed to recover the exact first-voxel costs. The stress-test concern is reasonable here; the abstract offers no proof of tightness for this potential and no error analysis. The claim that the method eliminates artifacts therefore rests on the practical results, which are only asserted rather than shown in the provided text. The multi-label alpha-expansion step adds another approximation layer. This work is aimed at computer-vision researchers who already use graph cuts or MRFs for dense semantic reconstruction. Anyone extending those methods to handle visibility more accurately would get direct value from the ray formulation. It deserves a serious referee because the problem it identifies is real and the proposed solver is feasible on paper, even if the relaxation's reliability needs checking in the full experiments and derivations.

Referee Report

1 major / 0 minor

Summary. The paper claims that unary potentials for semantic and depth likelihoods in voxel-grid MRFs for dense semantic 3D reconstruction produce visibility artifacts. It proposes instead to optimize ray potentials whose cost depends on the semantic class and depth of the first occupied voxel along each ray (directly minimizing reprojection error). The 2-label case is reduced to a graph-representable form via QPBO relaxation and solved by graph cuts; the multi-label case uses alpha-expansion on the same relaxation. The method is reported to run at comparable speed to baselines while avoiding ray-potential approximation artifacts.

Significance. If the QPBO relaxation is shown to be tight for the first-occupied-voxel cost and the recovered labelings match the exact ray costs, the formulation would supply a more faithful discrete model of visibility than standard unary MRFs, with direct implications for semantic reconstruction accuracy.

major comments (1)

[2-label formulation and multi-label extension] The central claim that the QPBO relaxation (and subsequent alpha-expansion) correctly optimizes the first-occupied-voxel ray cost without introducing new approximation errors or leaving visibility artifacts is load-bearing, yet the manuscript supplies no derivation, tightness proof, or bound on the relaxation gap for this specific potential (see the description of the 2-label formulation and the multi-label extension).

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of rigorously justifying the QPBO relaxation. We address the single major comment below.

read point-by-point responses

Referee: [2-label formulation and multi-label extension] The central claim that the QPBO relaxation (and subsequent alpha-expansion) correctly optimizes the first-occupied-voxel ray cost without introducing new approximation errors or leaving visibility artifacts is load-bearing, yet the manuscript supplies no derivation, tightness proof, or bound on the relaxation gap for this specific potential (see the description of the 2-label formulation and the multi-label extension).

Authors: We acknowledge that the current manuscript does not contain an explicit derivation of the QPBO reduction or a formal bound on the relaxation gap. In the revision we will add a dedicated subsection that (i) derives the auxiliary variables and edge weights that encode the first-occupied-voxel cost under the QPBO relaxation, (ii) shows that the resulting pairwise terms remain submodular (hence the relaxation is tight for the 2-label case), and (iii) argues that the same submodularity is preserved inside each alpha-expansion move, so no additional approximation is introduced beyond the standard alpha-expansion guarantee. We will also include a short empirical verification that the recovered labelings match the exact ray costs on small instances where brute-force enumeration is feasible. revision: yes

Circularity Check

0 steps flagged

No circularity: independent modeling change with standard QPBO reduction

full rationale

The paper's central contribution is a reformulation of the objective from unary potentials to explicit ray potentials (cost on semantic class and depth of first occupied voxel). This is presented as a direct modeling correction rather than a derivation from fitted parameters or prior self-citations. The 2-label case is reduced to graph-representable form via QPBO relaxation (a known external technique), and multi-label via alpha-expansion; neither step is shown to reduce by construction to the input data or to a self-citation chain. No equations equate the output to a fitted input, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation remains self-contained against external benchmarks for MRF optimization.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete free parameters, axioms, or invented entities; the method builds on standard discrete optimization primitives whose details are not supplied.

pith-pipeline@v0.9.0 · 5719 in / 1125 out tokens · 35850 ms · 2026-05-25T16:46:50.201612+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

[1]

J. Besag. On the statisical analysis of dirty pictures. Journal of the Royal Statistical Society, 1986

work page 1986
[2]

Boros and P

E. Boros and P. Hammer. Pseudo-boolean optimization. Dis- crete Applied Mathematics, 2002

work page 2002
[3]

Boykov, O

Y . Boykov, O. Veksler, and R. Zabih. Fast approximate en- ergy minimization via graph cuts. Transactions on Pattern Analysis and Machine Intelligence, 2001

work page 2001
[4]

G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Seg- mentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008

work page 2008
[5]

Cohen, C

A. Cohen, C. Zach, S. N. Sinha, and M. Pollefeys. Discover- ing and exploiting 3d symmetries in structure from motion. In Conference on Computer Vision and Pattern Recognition, 2012

work page 2012
[6]

Cremers and K

D. Cremers and K. Kolev. Multiview stereo and silhou- ette consistency via convex functionals over convex domains. Transactions on Pattern Analysis and Machine Intelligence, 2011

work page 2011
[7]

Curless and M

B. Curless and M. Levoy. A volumetric method for build- ing complex models from range images. In Conference on Computer graphics and interactive techniques, 1996

work page 1996
[8]

Delaunoy, M

A. Delaunoy, M. Pollefeys, et al. Photometric bundle adjust- ment for dense multi-view 3d modeling. In Conference on Computer Vision and Pattern Recognition, 2014

work page 2014
[9]

Freedman and P

D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In Conference on Computer Vision and Pattern Recognition, 2005

work page 2005
[10]

A. V . Goldberg, S. Hed, H. Kaplan, R. E. Tarjan, and R. F. Werneck. Maximum ﬂows by incremental breadth-ﬁrst search. In Proc. ALGO ESA, 2011

work page 2011
[11]

C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In Conference on Computer Vision and Pattern Recognition , 2013

work page 2013
[12]

Hernandez, G

C. Hernandez, G. V ogiatzis, and R. Cipolla. Probabilistic visibility for multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2007

work page 2007
[13]

B.-s. Kim, P. Kohli, and S. Savarese. 3D scene understanding by voxel-CRF. In International Conference on Computer Vision, 2013. Example Image Depth Map Semantic Segmentation Reconstructed 3D model Figure 3. Qualitative results on 6 data sets (from top to bottom): Castle [24], South Building [11], Catania [11], Providence [11], CAB [5] and Opera [5]. Our ...

work page 2013
[14]

Kolmogorov and Y

V . Kolmogorov and Y . Boykov. What metrics can be approx- imated by geo-cuts, or global optimization of length/area and ﬂux. In International Conference on Computer Vision, 2005

work page 2005
[15]

Ladicky, C

L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associa- tive hierarchical CRFs for object class image segmentation. In International Conference on Computer Vision, 2009

work page 2009
[16]

Lempitsky and Y

V . Lempitsky and Y . Boykov. Global optimization for shape ﬁtting. In Conference on Computer Vision and Pattern Recognition, 2007

work page 2007
[17]

Liu and D

S. Liu and D. B. Cooper. Ray markov random ﬁelds for image-based 3d modeling: Model and efﬁcient inference. In Conference on Computer Vision and Pattern Recognition, 2010

work page 2010
[18]

Liu and D

S. Liu and D. B. Cooper. A complete statistical inverse ray tracing approach to multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2011

work page 2011
[19]

W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. InProceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, 1987

work page 1987
[20]

Efficient Minimization of Higher Order Submodular Functions using Monotonic Boolean Functions

S. Ramalingam, C. Russell, L. Ladicky, and P. H. Torr. Efﬁcient minimization of higher order submodular func- tions using monotonic boolean functions. Arxiv preprint arXiv:1109.2304, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011
[21]

S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Conference on Com- puter Vision and Pattern Recognition, 2006

work page 2006
[22]

Shotton, J

J. Shotton, J. Winn, C. Rother, and A. Criminisi. Texton- Boost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Euro- pean Conference on Computer Vision, 2006

work page 2006
[23]

S. N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahe- dral mesh. In International Conference on Computer Vision, 2007

work page 2007
[24]

Strecha, W

C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. InConference on Computer Vision and Pattern Recognition, 2008

work page 2008
[25]

V ogiatzis, P

G. V ogiatzis, P. H. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In Conference on Computer Vision and Pattern Recognition, 2005

work page 2005
[26]

H.-H. Vu, R. Keriven, P. Labatut, and J.-P. Pons. Towards high-resolution large-scale multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2009

work page 2009
[27]

H.-H. Vu, P. Labatut, J.-P. Pons, and R. Keriven. High accu- racy and visibility-consistent dense multiview stereo. Trans- actions on Pattern Analysis and Machine Intelligence, 2012

work page 2012
[28]

C. Zach. Fast and high quality fusion of depth maps. In 3D Data Processing, Visualization and Transmission, 2008

work page 2008

[1] [1]

J. Besag. On the statisical analysis of dirty pictures. Journal of the Royal Statistical Society, 1986

work page 1986

[2] [2]

Boros and P

E. Boros and P. Hammer. Pseudo-boolean optimization. Dis- crete Applied Mathematics, 2002

work page 2002

[3] [3]

Boykov, O

Y . Boykov, O. Veksler, and R. Zabih. Fast approximate en- ergy minimization via graph cuts. Transactions on Pattern Analysis and Machine Intelligence, 2001

work page 2001

[4] [4]

G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Seg- mentation and recognition using structure from motion point clouds. In European Conference on Computer Vision, 2008

work page 2008

[5] [5]

Cohen, C

A. Cohen, C. Zach, S. N. Sinha, and M. Pollefeys. Discover- ing and exploiting 3d symmetries in structure from motion. In Conference on Computer Vision and Pattern Recognition, 2012

work page 2012

[6] [6]

Cremers and K

D. Cremers and K. Kolev. Multiview stereo and silhou- ette consistency via convex functionals over convex domains. Transactions on Pattern Analysis and Machine Intelligence, 2011

work page 2011

[7] [7]

Curless and M

B. Curless and M. Levoy. A volumetric method for build- ing complex models from range images. In Conference on Computer graphics and interactive techniques, 1996

work page 1996

[8] [8]

Delaunoy, M

A. Delaunoy, M. Pollefeys, et al. Photometric bundle adjust- ment for dense multi-view 3d modeling. In Conference on Computer Vision and Pattern Recognition, 2014

work page 2014

[9] [9]

Freedman and P

D. Freedman and P. Drineas. Energy minimization via graph cuts: Settling what is possible. In Conference on Computer Vision and Pattern Recognition, 2005

work page 2005

[10] [10]

A. V . Goldberg, S. Hed, H. Kaplan, R. E. Tarjan, and R. F. Werneck. Maximum ﬂows by incremental breadth-ﬁrst search. In Proc. ALGO ESA, 2011

work page 2011

[11] [11]

C. Hane, C. Zach, A. Cohen, R. Angst, and M. Pollefeys. Joint 3D scene reconstruction and class segmentation. In Conference on Computer Vision and Pattern Recognition , 2013

work page 2013

[12] [12]

Hernandez, G

C. Hernandez, G. V ogiatzis, and R. Cipolla. Probabilistic visibility for multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2007

work page 2007

[13] [13]

B.-s. Kim, P. Kohli, and S. Savarese. 3D scene understanding by voxel-CRF. In International Conference on Computer Vision, 2013. Example Image Depth Map Semantic Segmentation Reconstructed 3D model Figure 3. Qualitative results on 6 data sets (from top to bottom): Castle [24], South Building [11], Catania [11], Providence [11], CAB [5] and Opera [5]. Our ...

work page 2013

[14] [14]

Kolmogorov and Y

V . Kolmogorov and Y . Boykov. What metrics can be approx- imated by geo-cuts, or global optimization of length/area and ﬂux. In International Conference on Computer Vision, 2005

work page 2005

[15] [15]

Ladicky, C

L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associa- tive hierarchical CRFs for object class image segmentation. In International Conference on Computer Vision, 2009

work page 2009

[16] [16]

Lempitsky and Y

V . Lempitsky and Y . Boykov. Global optimization for shape ﬁtting. In Conference on Computer Vision and Pattern Recognition, 2007

work page 2007

[17] [17]

Liu and D

S. Liu and D. B. Cooper. Ray markov random ﬁelds for image-based 3d modeling: Model and efﬁcient inference. In Conference on Computer Vision and Pattern Recognition, 2010

work page 2010

[18] [18]

Liu and D

S. Liu and D. B. Cooper. A complete statistical inverse ray tracing approach to multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2011

work page 2011

[19] [19]

W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. InProceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, 1987

work page 1987

[20] [20]

Efficient Minimization of Higher Order Submodular Functions using Monotonic Boolean Functions

S. Ramalingam, C. Russell, L. Ladicky, and P. H. Torr. Efﬁcient minimization of higher order submodular func- tions using monotonic boolean functions. Arxiv preprint arXiv:1109.2304, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011

[21] [21]

S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In Conference on Com- puter Vision and Pattern Recognition, 2006

work page 2006

[22] [22]

Shotton, J

J. Shotton, J. Winn, C. Rother, and A. Criminisi. Texton- Boost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Euro- pean Conference on Computer Vision, 2006

work page 2006

[23] [23]

S. N. Sinha, P. Mordohai, and M. Pollefeys. Multi-view stereo via graph cuts on the dual of an adaptive tetrahe- dral mesh. In International Conference on Computer Vision, 2007

work page 2007

[24] [24]

Strecha, W

C. Strecha, W. von Hansen, L. Van Gool, P. Fua, and U. Thoennessen. On benchmarking camera calibration and multi-view stereo for high resolution imagery. InConference on Computer Vision and Pattern Recognition, 2008

work page 2008

[25] [25]

V ogiatzis, P

G. V ogiatzis, P. H. Torr, and R. Cipolla. Multi-view stereo via volumetric graph-cuts. In Conference on Computer Vision and Pattern Recognition, 2005

work page 2005

[26] [26]

H.-H. Vu, R. Keriven, P. Labatut, and J.-P. Pons. Towards high-resolution large-scale multi-view stereo. In Conference on Computer Vision and Pattern Recognition, 2009

work page 2009

[27] [27]

H.-H. Vu, P. Labatut, J.-P. Pons, and R. Keriven. High accu- racy and visibility-consistent dense multiview stereo. Trans- actions on Pattern Analysis and Machine Intelligence, 2012

work page 2012

[28] [28]

C. Zach. Fast and high quality fusion of depth maps. In 3D Data Processing, Visualization and Transmission, 2008

work page 2008