Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

Aleksander Plocharski; Maciej Janicki; Przemyslaw Musialski

arXiv: 2604.09260 · v1 · submitted 2026-04-10 · cs.CV · cs.GR · cs.LG

Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3

keywords facade parsing · structural regularity · alignment loss · YOLOv8 · object detection · geometric priors · bounding boxes · CMP dataset

The pith

Augmenting YOLOv8 training with an alignment loss yields structurally regular facade parsings from imperfect images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard object detectors identify windows, doors and other facade elements one by one, so the resulting layout often shows misaligned boxes that break when fed into 3D reconstruction software. The paper adds a single lightweight term to the YOLOv8 loss that penalizes arrangements departing from a regular grid. This term is active only while the network learns; the final detector runs exactly as before. A reader would care because the change supplies the geometric order needed for automatic building models without hand-written rules or extra cleanup steps.

Core claim

By augmenting the YOLOv8 training objective with a custom lightweight alignment loss, the method encourages grid-consistent arrangements of bounding boxes during training. This regularization injects geometric priors that improve structural regularity, correcting alignment errors caused by perspective and occlusion while preserving a controllable trade-off with standard detection accuracy on the CMP dataset.

What carries the argument

A custom lightweight alignment loss added to YOLOv8 training that penalizes deviations from grid-consistent bounding-box placements.
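
As a rough sketch of how such a term could enter training, assume a plain weighted sum. The abstract does not give the formulation, so this is an assumption rather than the authors' stated objective. W is the alignment weight named in the Figure 3 caption; the symbols for the detection and alignment terms are labels introduced here for illustration.

```latex
\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{det}} \;+\; W \cdot \mathcal{L}_{\text{align}}
```

Under this assumed form, W = 0 recovers standard YOLOv8 training, and larger W shifts the balance from detection accuracy toward structural regularity, which is one way to read the "controllable trade-off" in the claim.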

Load-bearing premise

That pushing bounding boxes toward grid consistency through an extra loss term during training will produce coherent facade structures useful for procedural reconstruction without any explicit architectural rules or post-processing.

What would settle it

Training the model on the CMP dataset, then measuring alignment error on held-out images and finding no reduction relative to plain YOLOv8, or finding that accuracy drops become uncontrollable, would show the added loss does not deliver the claimed structural gains.

Figures

Figures reproduced from arXiv: 2604.09260 by Aleksander Plocharski, Maciej Janicki, Przemyslaw Musialski.

**Figure 1.** Our loss term corrects errors in facade parsing resulting from image imperfections. It corrects facade elements' sizes and positions and is able to reconstruct elements that are partially obstructed.

**Figure 2.** Example of a pair of bounding boxes meeting the required conditions (for the y-axis): coordinate differences are below the threshold and the boxes do not overlap. A pair qualifies when:

• both bounding boxes are of the same class,
• the bounding boxes do not overlap (IoU ≈ 0),
• the absolute difference between the x1 coordinates is below threshold T,
• the absolute difference between the x2 coordinates is below threshold T.

Threshold T is defined in pixels.
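
The pair conditions quoted with Figure 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's loss. The function names `iou` and `column_alignment_penalty`, the default `T=10.0`, and the choice to penalize the mean of `|Δx1| + |Δx2|` over qualifying pairs are all assumptions made here. An actual training loss would need a differentiable tensor implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def column_alignment_penalty(boxes, classes, T=10.0):
    """Mean |dx1| + |dx2| over pairs meeting the four Figure 2 conditions.

    Assumed penalty form (hypothetical): the source only states the pair
    conditions, not how qualifying pairs are penalized.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            if classes[i] != classes[j]:      # condition 1: same class
                continue
            if iou(a, b) > 0.0:               # condition 2: no overlap
                continue
            dx1, dx2 = abs(a[0] - b[0]), abs(a[2] - b[2])
            if dx1 < T and dx2 < T:           # conditions 3 and 4: edges within T
                total += dx1 + dx2
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


# Two stacked windows that are nearly aligned, plus one far to the right.
boxes = [(100, 50, 140, 90), (103, 120, 142, 160), (300, 50, 340, 90)]
classes = [0, 0, 0]
penalty = column_alignment_penalty(boxes, classes)
```

Only the first two boxes form a qualifying pair here, so the third box contributes nothing. That is the role of the threshold: it limits the penalty to boxes that already look like members of the same column.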

**Figure 3.** Analysis of structural regularity. (a) Mean Squared Error (MSE) of the rank-k approximation for windows; lower curves indicate better low-rank approximation (higher regularity). (b) Relative SVD-based regularity score for varying alignment weights W and thresholds T (baseline = 100%; lower is better). (c) The trade-off between structural regularity (SVD metric, lower is better) and detection accuracy (mAP, …
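
To make the rank-k idea in the Figure 3 caption concrete, the sketch below computes the MSE between a matrix and its best rank-k approximation using a truncated SVD. It assumes the matrix is a binary occupancy mask of window cells, which is a choice made here for illustration. The caption does not say what matrix the paper decomposes, and the name `rank_k_mse` is hypothetical.

```python
import numpy as np


def rank_k_mse(M, k):
    """MSE between M and its best rank-k approximation (truncated SVD)."""
    M = np.asarray(M, dtype=float)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # keep only the k largest singular values
    return float(np.mean((M - M_k) ** 2))


# A perfectly regular window grid is an outer product, hence rank 1.
rows = np.array([0, 1, 1, 0, 1, 1, 0, 1, 1, 0])
cols = np.array([0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0])
regular = np.outer(rows, cols).astype(float)

# Shift the windows in one row by one cell to break the grid.
irregular = regular.copy()
irregular[1, :] = np.roll(irregular[1, :], 1)

mse_regular = rank_k_mse(regular, 1)
mse_irregular = rank_k_mse(irregular, 1)
```

In this toy setup the regular grid is reconstructed essentially exactly at k = 1, while the perturbed grid leaves a residual. This matches the caption's reading that lower curves indicate higher regularity.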

**Figure 4.** Qualitative results mosaic. Each row shows the Ground Truth (left), Baseline prediction (middle), and our Aligned prediction (right). The alignment loss (Ours) effectively restores grid-like regularity.

… or heights within aligned pairs. In facades with mixed window types within the same row or column, this can lead to unintended stretching or shrinking of bounding boxes. Second, the effectiveness of the al…

Original abstract

Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The alignment loss is a straightforward regularization trick for YOLOv8 on facades, but the abstract gives no numbers or loss details and the perspective handling looks underspecified.

The paper adds a custom alignment loss to YOLOv8 training that pushes bounding boxes toward grid-consistent layouts on facade images. This is the core new piece: a domain-specific regularizer that leaves inference untouched and aims to produce parses better suited for procedural 3D reconstruction. The authors correctly flag that independent per-element detection often yields incoherent results on real photos with perspective and occlusion, and the loss is presented as a lightweight way to inject geometric structure during training. That focus on a practical downstream need is the part that lands cleanly. The claim of a controllable accuracy trade-off is also reasonable in principle for this kind of targeted tweak.

The soft spots are more noticeable. The abstract supplies no quantitative metrics, no ablation on the loss weight, and no description of the loss formulation itself, so it is impossible to tell whether the reported structural gains are statistically reliable or just visual. The stress-test concern holds up on the given description: ground-truth boxes on CMP follow projected geometry, yet nothing in the method indicates the loss is conditioned on camera parameters or local planes. A global 2D grid penalty in image space could therefore pull predictions away from accurate localizations on angled views rather than truly correcting them. If the full paper does not show viewpoint-stratified results or an adaptive mechanism, the improvement may be narrower than stated.

This is aimed at researchers already working on facade parsing or urban reconstruction pipelines who might want a simple training-time prior. A reader using YOLOv8 for similar tasks could get a usable idea from the loss design, provided the experiments are fleshed out. The work is coherent enough on its own terms to deserve peer review; the referee can ask for the missing numbers, loss math, and perspective analysis.

Referee Report

2 major / 1 minor

Summary. The paper augments the YOLOv8 training objective with a lightweight alignment loss that encourages grid-consistent bounding-box arrangements for facade elements. This is claimed to inject geometric priors that correct perspective- and occlusion-induced alignment errors on the CMP dataset while preserving a controllable trade-off with standard detection accuracy and leaving the inference pipeline unchanged.

Significance. If the claimed improvements and controllable trade-off are demonstrated with rigorous metrics and ablations, the approach would provide a simple regularization technique for injecting structural priors into off-the-shelf detectors without post-processing or explicit architectural rules, which could benefit downstream procedural reconstruction pipelines in computer vision.

major comments (2)

[Abstract] The central claim that the method 'successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion' is unsupported by any quantitative metrics, ablation results, or implementation details of the alignment loss (e.g., its formulation, weighting hyperparameter, or how it is combined with the YOLOv8 objective). This absence makes it impossible to verify whether the trade-off is controllable or whether the structural gains are statistically meaningful.
[Abstract / Method description] The alignment loss penalizes deviation from grid-consistent bounding-box arrangements in image space, yet the manuscript provides no indication that the loss is conditioned on estimated camera parameters, vanishing points, or local facade planes. In perspective views, CMP ground-truth annotations follow projected (non-grid) geometry; without such conditioning the loss risks systematically biasing predictions toward idealized frontal grids, undermining the claim that it corrects rather than introduces misalignment.
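
The perspective concern in the second comment can be illustrated with a toy projection. The sketch below is not drawn from the paper or from CMP. The projection model, the `tilt` parameter, the pixel scale, and the threshold value `T = 10.0` are all hypothetical choices made to show how left-edge coordinates in one facade column drift apart when vertical lines converge in the image.

```python
import numpy as np


def project_x(X, Y, tilt, scale=100.0):
    """Pixel x-coordinate of facade-plane point (X, Y) under a toy perspective map.

    Hypothetical model: tilt > 0 mimics a camera pitched upward, so vertical
    facade lines converge with height; tilt = 0 is a frontal view.
    """
    return scale * X / (1.0 + tilt * Y)


def column_gaps(tilt, X=2.0, n_floors=4):
    """Absolute x1 differences between vertically adjacent windows in one column."""
    x1 = np.array([project_x(X, floor, tilt) for floor in range(n_floors)])
    return np.abs(np.diff(x1))


T = 10.0  # pixel threshold, illustrative value only

gap_frontal = column_gaps(tilt=0.0)   # left edges coincide in the image
gap_mild = column_gaps(tilt=0.02)     # true offsets of a few pixels, all below T
gap_strong = column_gaps(tilt=0.1)    # true offsets above T, so no pair forms

for name, gaps in [("frontal", gap_frontal), ("mild", gap_mild), ("strong", gap_strong)]:
    print(name, np.round(gaps, 2), "all pairs within T:", bool((gaps < T).all()))
```

In the mild case the true left edges differ by a few pixels yet still fall under the threshold, so a penalty on those differences would pull predictions toward each other and away from the projected positions. In the strong case no pair qualifies and such a term would stay inactive. Whether either situation arises in practice depends on the actual loss and data, which the abstract does not settle.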

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., mAP delta and a structural regularity metric) to allow readers to gauge the magnitude of the reported improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, proposing revisions where the presentation can be strengthened without altering the core technical contributions.

Referee: [Abstract] The central claim that the method 'successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion' is unsupported by any quantitative metrics, ablation results, or implementation details of the alignment loss (e.g., its formulation, weighting hyperparameter, or how it is combined with the YOLOv8 objective). This absence makes it impossible to verify whether the trade-off is controllable or whether the structural gains are statistically meaningful.

Authors: We agree that the abstract's brevity omits key supporting details. The full manuscript details the alignment loss formulation (Section 3.2), its weighting and integration with the YOLOv8 objective (Section 3.3), and provides quantitative metrics, ablations, and trade-off analysis on the CMP dataset (Section 4), including alignment regularity scores and mAP preservation. We will revise the abstract to concisely reference these quantitative improvements and direct readers to the relevant sections for verification.
Revision: yes

Referee: [Abstract / Method description] The alignment loss penalizes deviation from grid-consistent bounding-box arrangements in image space, yet the manuscript provides no indication that the loss is conditioned on estimated camera parameters, vanishing points, or local facade planes. In perspective views, CMP ground-truth annotations follow projected (non-grid) geometry; without such conditioning the loss risks systematically biasing predictions toward idealized frontal grids, undermining the claim that it corrects rather than introduces misalignment.

Authors: The loss is intentionally lightweight and operates in image space to regularize toward structural patterns observed directly in the CMP training annotations, which already encode perspective projections. It does not enforce idealized frontal grids but penalizes deviations from locally consistent arrangements present in the data. We will revise the method section to explicitly clarify this design rationale, add discussion of why explicit camera conditioning is not required, and include supporting qualitative results and quantitative comparisons demonstrating correction of perspective-induced errors.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity; method is an independent regularization addition

The paper augments the YOLOv8 objective with an external alignment loss that encourages grid-consistent bounding-box arrangements during training. This is a standard added regularization term rather than any derivation in which a quantity is obtained from its own fitted parameters or reduced by construction to the inputs. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would make the central claim equivalent to its own premises. The reported improvements on the CMP dataset are presented as empirical outcomes of the training process, leaving the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method implicitly assumes that facade elements should form regular grids and that this prior can be injected via a differentiable loss without side effects on detection.
