
arXiv: 2604.09260 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.GR · cs.LG

Beyond Segmentation: Structurally Informed Facade Parsing from Imperfect Images

Pith reviewed 2026-05-10 17:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.LG
keywords facade parsing · structural regularity · alignment loss · YOLOv8 · object detection · geometric priors · bounding boxes · CMP dataset

The pith

Augmenting YOLOv8 training with an alignment loss yields structurally regular facade parsings from imperfect images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard object detectors identify windows, doors and other facade elements one by one, so the resulting layout often contains misaligned boxes and lacks the structural coherence that downstream 3D reconstruction needs. The paper adds a single lightweight term to the YOLOv8 loss that penalizes arrangements departing from a regular grid. This term is active only during training; the inference pipeline is unchanged. A reader would care because the change supplies the geometric order needed for automatic building models without hand-written rules or extra cleanup steps.

Core claim

By augmenting the YOLOv8 training objective with a custom lightweight alignment loss, the method encourages grid-consistent arrangements of bounding boxes during training. This regularization injects geometric priors that improve structural regularity, correcting alignment errors caused by perspective and occlusion while preserving a controllable trade-off with standard detection accuracy on the CMP dataset.

What carries the argument

A custom lightweight alignment loss added to YOLOv8 training that penalizes deviations from grid-consistent bounding-box placements.
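
The combined objective can be sketched as a weighted sum. This is an assumed form, not the paper's stated formula, which the abstract does not give. The weight symbol $W$ is taken from the Figure 3 caption; $\mathcal{L}_{\text{det}}$ (the standard YOLOv8 detection loss) and $\mathcal{L}_{\text{align}}$ (the alignment penalty) are names introduced here for illustration.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{det}} + W \, \mathcal{L}_{\text{align}}, \qquad W \ge 0
```

Under this reading, $W = 0$ recovers plain YOLOv8 training, and increasing $W$ trades detection accuracy for regularity, which would be one way to obtain the controllable trade-off the paper describes.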

Load-bearing premise

That pushing bounding boxes toward grid consistency through an extra loss term during training will produce coherent facade structures useful for procedural reconstruction without any explicit architectural rules or post-processing.

What would settle it

Training the model on the CMP dataset, then measuring alignment error on held-out images and finding no reduction relative to plain YOLOv8, or finding that accuracy drops become uncontrollable, would show the added loss does not deliver the claimed structural gains.

Figures

Figures reproduced from arXiv: 2604.09260 by Aleksander Plocharski, Maciej Janicki, Przemyslaw Musialski.

Figure 1: Our loss term corrects errors in facade parsing resulting from image imperfections. It corrects facade elements' sizes and positions and is able to reconstruct elements that are partially obstructed.
Figure 2: Example of a pair of bounding boxes meeting the required conditions (for y-axis): coordinate differences are below threshold and the boxes do not overlap.
• both bounding boxes are of the same class,
• bounding boxes do not overlap (IoU ≈ 0),
• the absolute difference between x1 coordinates is below threshold T,
• the absolute difference between x2 coordinates is below threshold T.
Threshold T (defined in pixels) i…
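
The four pairing conditions listed with Figure 2 can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box layout, the `iou_eps` tolerance, and the squared-difference penalty are all assumptions, and the code is not wired into YOLOv8 training.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def column_pairs(boxes, classes, T, iou_eps=1e-6):
    """Return index pairs (i, j), i < j, that satisfy all four conditions."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if classes[i] != classes[j]:
                continue  # must be the same class
            if iou(boxes[i], boxes[j]) > iou_eps:
                continue  # must not overlap (IoU close to 0)
            if abs(boxes[i][0] - boxes[j][0]) >= T:
                continue  # x1 difference must be below T
            if abs(boxes[i][2] - boxes[j][2]) >= T:
                continue  # x2 difference must be below T
            pairs.append((i, j))
    return pairs


def alignment_penalty(boxes, pairs):
    """Assumed penalty: mean squared x1/x2 mismatch over the selected pairs."""
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        total += (boxes[i][0] - boxes[j][0]) ** 2
        total += (boxes[i][2] - boxes[j][2]) ** 2
    return total / len(pairs)


# Hypothetical detections: (x1, y1, x2, y2) in pixels.
boxes = [
    (10.0, 10.0, 50.0, 60.0),    # window at the top of a column
    (13.0, 100.0, 52.0, 150.0),  # window below it, slightly shifted
    (200.0, 10.0, 240.0, 60.0),  # window in a different column
    (11.0, 200.0, 51.0, 260.0),  # door under the first column
]
classes = [0, 0, 0, 1]  # 0 = window, 1 = door (labels are illustrative)

pairs = column_pairs(boxes, classes, T=5.0)
penalty = alignment_penalty(boxes, pairs)
```

Only the two stacked windows qualify here: the third window is too far away in x, and the door belongs to a different class. A row-wise variant would presumably compare y1 and y2 instead.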
Figure 3: Analysis of structural regularity. (a) Mean Squared Error (MSE) of the rank-k approximation for windows; lower curves indicate better low-rank approximation (higher regularity). (b) Relative SVD-based regularity score for varying alignment weights W and thresholds T (baseline = 100%; lower is better). (c) The trade-off between structural regularity (SVD metric, lower is better) and detection accuracy (mAP,…
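
The rank-k MSE in panel (a) can be illustrated with a truncated SVD. The sketch below assumes the metric is computed on a binary occupancy mask rasterized from the window boxes. The caption does not confirm that choice, and the helper names `rasterize` and `rank_k_mse` are hypothetical.

```python
import numpy as np


def rasterize(boxes, height, width):
    """Paint integer boxes (x1, y1, x2, y2) into a binary occupancy mask."""
    mask = np.zeros((height, width))
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 1.0
    return mask


def rank_k_mse(mask, k):
    """MSE between a matrix and its best rank-k approximation (truncated SVD)."""
    u, s, vt = np.linalg.svd(mask, full_matrices=False)
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    return float(np.mean((mask - approx) ** 2))


# A perfectly regular 2x2 grid of windows on a 16x16 canvas.
regular = rasterize(
    [(2, 2, 6, 6), (10, 2, 14, 6), (2, 10, 6, 14), (10, 10, 14, 14)], 16, 16
)
# The same layout with the bottom-right window shifted right by one pixel.
shifted = rasterize(
    [(2, 2, 6, 6), (10, 2, 14, 6), (2, 10, 6, 14), (11, 10, 15, 14)], 16, 16
)

mse_regular = rank_k_mse(regular, k=1)
mse_shifted = rank_k_mse(shifted, k=1)
```

A perfectly aligned grid is the outer product of a row indicator and a column indicator, so it has rank 1 and its rank-1 error is numerically zero. Shifting a single window raises the rank and therefore the error, which matches the caption's reading that lower curves indicate higher regularity.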
Figure 4: Qualitative results mosaic. Each row shows the Ground Truth (left), Baseline prediction (middle), and our Aligned prediction (right). The alignment loss (Ours) effectively restores grid-like regularity.
… or heights within aligned pairs. In facades with mixed window types within the same row or column, this can lead to unintended stretching or shrinking of bounding boxes. Second, the effectiveness of the al…
Original abstract

Standard object detectors typically treat architectural elements independently, often resulting in facade parsings that lack the structural coherence required for downstream procedural reconstruction. We address this limitation by augmenting the YOLOv8 training objective with a custom lightweight alignment loss. This regularization encourages grid-consistent arrangements of bounding boxes during training, effectively injecting geometric priors without altering the standard inference pipeline. Experiments on the CMP dataset demonstrate that our method successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion while maintaining a controllable trade-off with standard detection accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper augments the YOLOv8 training objective with a lightweight alignment loss that encourages grid-consistent bounding-box arrangements for facade elements. This is claimed to inject geometric priors that correct perspective- and occlusion-induced alignment errors on the CMP dataset while preserving a controllable trade-off with standard detection accuracy and leaving the inference pipeline unchanged.

Significance. If the claimed improvements and controllable trade-off are demonstrated with rigorous metrics and ablations, the approach would provide a simple regularization technique for injecting structural priors into off-the-shelf detectors without post-processing or explicit architectural rules, which could benefit downstream procedural reconstruction pipelines in computer vision.

major comments (2)
  1. [Abstract] The central claim that the method 'successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion' is unsupported by any quantitative metrics, ablation results, or implementation details of the alignment loss (e.g., its formulation, weighting hyperparameter, or how it is combined with the YOLOv8 objective). This absence makes it impossible to verify whether the trade-off is controllable or whether the structural gains are statistically meaningful.
  2. [Abstract / Method description] The alignment loss penalizes deviation from grid-consistent bounding-box arrangements in image space, yet the manuscript provides no indication that the loss is conditioned on estimated camera parameters, vanishing points, or local facade planes. In perspective views, CMP ground-truth annotations follow projected (non-grid) geometry; without such conditioning the loss risks systematically biasing predictions toward idealized frontal grids, undermining the claim that it corrects rather than introduces misalignment.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., mAP delta and a structural regularity metric) to allow readers to gauge the magnitude of the reported improvement.
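
The second major comment can be made concrete with a small numerical sketch. It assumes a simple projective map with a single vertical perspective term `q`, a toy stand-in for a tilted camera rather than anything taken from the paper. The values of `q` and the coordinates are hypothetical.

```python
def project(x, y, q):
    """Apply the homography [[1, 0, 0], [0, 1, 0], [0, q, 1]] to a point."""
    w = 1.0 + q * y
    return x / w, y / w


q = 0.001  # hypothetical perspective strength

# Left edges of two windows in the same column of the facade plane (x = 100).
top_x, _ = project(100.0, 0.0, q)
bottom_x, _ = project(100.0, 200.0, q)

# Horizontal drift between edges that are perfectly aligned on the facade.
drift = abs(top_x - bottom_x)
```

With these numbers the two left edges, which coincide on the facade, end up roughly 16.7 pixels apart in the image. A loss that pulls their x1 values together in image space would then move predictions away from the projected ground truth, which is the bias the referee describes.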

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, proposing revisions where the presentation can be strengthened without altering the core technical contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method 'successfully improves structural regularity, correcting alignment errors caused by perspective and occlusion' is unsupported by any quantitative metrics, ablation results, or implementation details of the alignment loss (e.g., its formulation, weighting hyperparameter, or how it is combined with the YOLOv8 objective). This absence makes it impossible to verify whether the trade-off is controllable or whether the structural gains are statistically meaningful.

    Authors: We agree that the abstract's brevity omits key supporting details. The full manuscript details the alignment loss formulation (Section 3.2), its weighting and integration with the YOLOv8 objective (Section 3.3), and provides quantitative metrics, ablations, and trade-off analysis on the CMP dataset (Section 4), including alignment regularity scores and mAP preservation. We will revise the abstract to concisely reference these quantitative improvements and direct readers to the relevant sections for verification. revision: yes

  2. Referee: [Abstract / Method description] The alignment loss penalizes deviation from grid-consistent bounding-box arrangements in image space, yet the manuscript provides no indication that the loss is conditioned on estimated camera parameters, vanishing points, or local facade planes. In perspective views, CMP ground-truth annotations follow projected (non-grid) geometry; without such conditioning the loss risks systematically biasing predictions toward idealized frontal grids, undermining the claim that it corrects rather than introduces misalignment.

    Authors: The loss is intentionally lightweight and operates in image space to regularize toward structural patterns observed directly in the CMP training annotations, which already encode perspective projections. It does not enforce idealized frontal grids but penalizes deviations from locally consistent arrangements present in the data. We will revise the method section to explicitly clarify this design rationale, add discussion of why explicit camera conditioning is not required, and include supporting qualitative results and quantitative comparisons demonstrating correction of perspective-induced errors. revision: partial

Circularity Check

0 steps flagged

No significant circularity; method is an independent regularization addition

Full rationale

The paper augments the YOLOv8 objective with an external alignment loss that encourages grid-consistent bounding-box arrangements during training. This is a standard added regularization term rather than any derivation in which a quantity is obtained from its own fitted parameters or reduced by construction to the inputs. No equations, self-citations, or uniqueness theorems are invoked in the provided text that would make the central claim equivalent to its own premises. The reported improvements on the CMP dataset are presented as empirical outcomes of the training process, leaving the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The method implicitly assumes that facade elements should form regular grids and that this prior can be injected via a differentiable loss without side effects on detection.

pith-pipeline@v0.9.0 · 5386 in / 1172 out tokens · 30754 ms · 2026-05-10T17:53:50.649147+00:00 · methodology

