GATA2Floor: Graph attention for floor counting in street-view facades
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 07:23 UTC · model grok-4.3
The pith
Graph attention networks count building floors from street-view facades by assigning windows to latent levels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GATA2Floor predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns detected facade elements to latent floor slots. The model is built on a graph whose nodes are window and door detections and whose edges incorporate a vertical prior; multi-head GATv2 layers propagate information across this structure to produce both the scalar count and the per-element floor assignments.
What carries the argument
GATA2Floor, a multi-head GATv2 network that uses learnable cross-attention queries to assign facade graph nodes to latent floor slots while predicting the total floor count.
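The soft assignment of graph nodes to latent floor slots can be illustrated with a small sketch. The source does not give the attention equations, so this assumes a standard scaled dot-product score between each node feature and each slot query, followed by a softmax over slots so that every node receives a probability distribution across floors. The function name `soft_assign`, the toy features, and the hand-fixed queries are all hypothetical; in a trained model the queries would be learned parameters.

```python
import math

def soft_assign(node_feats, slot_queries):
    """Return an N x K matrix of slot probabilities (each row sums to 1).

    Assumption: scaled dot-product scores with a softmax over slots.
    This is an illustrative form, not the paper's confirmed formulation.
    """
    d = len(slot_queries[0])
    scale = math.sqrt(d)
    assignments = []
    for h in node_feats:
        # One score per slot: <h, q> / sqrt(d)
        scores = [sum(a * b for a, b in zip(h, q)) / scale for q in slot_queries]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        assignments.append([e / z for e in exps])
    return assignments

# Toy inputs: 3 nodes, 2 slots, feature dimension 2 (all values hypothetical).
nodes = [[4.0, 0.0], [3.5, 0.5], [0.0, 4.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]

probs = soft_assign(nodes, queries)
# Hard assignment: the most probable slot for each node.
hard = [row.index(max(row)) for row in probs]
print(hard)
```

Keeping the probabilities rather than only the argmax is what makes the output inspectable: a node with nearly equal mass on two slots signals an ambiguous floor boundary.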
Load-bearing premise
Modeling facades as graphs with a vertical prior on edges plus GATv2 attention will reliably capture floor structure even in irregular or occluded real-world images.
What would settle it
A set of street-view facades containing irregular or occluded window patterns on which the model produces floor-count errors larger than those of a simple vertical sorting baseline would falsify the claim.
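The "simple vertical sorting baseline" is not specified in the source. One plausible reading, sketched here purely as an assumption, sorts detections by vertical center and opens a new floor whenever the gap between consecutive centers exceeds a fraction of the median box height. The function name, the box format, and the `gap_ratio` default are illustrative choices, not values from the paper.

```python
from statistics import median

def count_floors_by_sorting(boxes, gap_ratio=0.6):
    """Estimate floor count from (x_min, y_min, x_max, y_max) boxes.

    Hypothetical baseline: sort by vertical center and start a new floor
    whenever two consecutive centers are farther apart than
    gap_ratio * median box height.
    """
    if not boxes:
        return 0
    centers = sorted((y0 + y1) / 2.0 for _, y0, _, y1 in boxes)
    heights = [y1 - y0 for _, y0, _, y1 in boxes]
    threshold = gap_ratio * median(heights)
    floors = 1
    for prev, cur in zip(centers, centers[1:]):
        if cur - prev > threshold:
            floors += 1
    return floors

# Two rows of windows, three per row (made-up pixel coordinates).
example = [
    (10, 10, 30, 40), (50, 12, 70, 42), (90, 9, 110, 39),
    (10, 80, 30, 110), (50, 82, 70, 112), (90, 79, 110, 109),
]
print(count_floors_by_sorting(example))  # prints 2
```

A baseline of this kind has no relational reasoning at all, which is why it serves as a useful floor for comparison: if the graph model cannot beat it on irregular or occluded facades, the robustness claim is in doubt.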
Original abstract
Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GATA2Floor, a multi-head GATv2-based graph attention model that represents building facades as graphs over window/door detections with a vertical prior on edges. It predicts the global floor count while using learnable cross-attention queries to softly assign detections to latent floor slots for interpretability and robustness to irregular designs. A label-free proposal mechanism based on self-supervised features and vision-language scoring is introduced to address the lack of annotated data, demonstrating the utility of relational graph reasoning for facade understanding.
Significance. If the central claims hold, the work would advance automated facade analysis for urban analytics, energy assessment, and emergency planning by showing how graph attention with vertical priors and cross-attention queries can yield both accurate counts and interpretable floor assignments. The label-free self-supervised component is a clear strength that could broaden applicability where labeled data is scarce.
major comments (3)
- [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.
- [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.
- [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.
minor comments (1)
- [Method] Notation for the number of latent floor slots and attention heads should be introduced explicitly with their default values or ranges.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's significance and for the constructive major comments. We address each point below and will revise the manuscript to improve clarity, detail, and verifiability.
- Referee: [Abstract] Abstract: the approach is described but no performance numbers, error analysis, ablation studies, or validation details are supplied, so it is impossible to verify whether the graph construction, GATv2 attention, and cross-attention queries actually support the floor-counting and assignment claims.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will update the abstract to report the primary performance metrics (floor-count accuracy and soft-assignment precision on the evaluated datasets) along with a concise statement of the validation protocol. This change will make the claims immediately verifiable while remaining within length constraints.
  Revision: yes
- Referee: [Method] Method section (graph construction): the vertical prior is invoked but its precise definition (e.g., how y-coordinate differences are turned into edge weights or adjacency) is not given; this is load-bearing because missing or spurious detections under occlusion would sever vertical connections and break both count prediction and slot assignment.
  Authors: The referee correctly identifies that the vertical prior requires an explicit definition. We will revise the method section to state that nodes are connected when their normalized vertical distance is below threshold τ, with edge weights w_ij = exp(−|y_i − y_j|/σ). The revised text will also include the concrete values of τ and σ used in experiments and explain how GATv2 multi-head attention combined with the cross-attention queries enables information propagation even when some vertical edges are absent due to occlusion.
  Revision: yes
- Referee: [Experiments] Experiments: no ablation on detection failure modes (occlusion, shadows, irregular spacing) is reported, yet the central claim of robustness to irregular designs rests on the assumption that the graph plus GATv2 plus cross-attention queries can recover from such failures.
  Authors: We acknowledge that a targeted analysis of detection failure modes would strengthen the robustness claim. Our current experiments already evaluate the model on real-world facades containing occlusions and irregular spacing, where it outperforms non-graph baselines. For the revision we will add a dedicated subsection presenting both qualitative examples of challenging cases and quantitative results on synthetically perturbed detections (random removal of 15–25% of nodes to simulate occlusion), thereby directly illustrating the contribution of the graph structure and cross-attention queries.
  Revision: partial
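The edge rule stated in the simulated response to the [Method] comment (connect two nodes when their normalized vertical distance is below τ, with weight w_ij = exp(−|y_i − y_j|/σ)) and the node-removal perturbation proposed for the [Experiments] comment can be sketched together. The values of `tau` and `sigma` below are placeholders, since the source gives no concrete numbers, and the function names and the seeded sampling scheme are assumptions made for illustration.

```python
import math
import random

def build_vertical_graph(y_centers, tau=0.15, sigma=0.05):
    """Return {(i, j): w_ij} for pairs whose vertical distance is below tau.

    y_centers are assumed to be normalized to [0, 1] by image height.
    tau and sigma are placeholder values, not the ones used in the paper.
    """
    edges = {}
    n = len(y_centers)
    for i in range(n):
        for j in range(i + 1, n):
            dy = abs(y_centers[i] - y_centers[j])
            if dy < tau:
                w = math.exp(-dy / sigma)
                edges[(i, j)] = w
                edges[(j, i)] = w  # undirected graph: store both directions
    return edges

def drop_nodes(y_centers, fraction, seed=0):
    """Simulate occlusion by removing a random fraction of detections."""
    rng = random.Random(seed)
    keep = len(y_centers) - round(fraction * len(y_centers))
    kept = sorted(rng.sample(range(len(y_centers)), keep))
    return [y_centers[i] for i in kept]

# Hypothetical normalized vertical centers: three floors with 3, 2, and 1 elements.
ys = [0.20, 0.22, 0.21, 0.55, 0.56, 0.90]
graph = build_vertical_graph(ys)
occluded = drop_nodes(ys, fraction=0.2)
occluded_graph = build_vertical_graph(occluded)
print(len(graph), len(occluded), len(occluded_graph))
```

Comparing `graph` with `occluded_graph` shows directly how many vertical connections a given removal rate severs, which is the failure mode the referee asks the ablation to quantify.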
Circularity Check
No significant circularity detected
The paper constructs facade graphs from detections, applies a vertical prior on edges, and uses standard GATv2 attention plus learnable cross-attention queries to predict floor count and assign elements to slots. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain; the components are independent applications of established graph attention techniques without tautological renaming or imported uniqueness theorems from the same authors. The approach remains self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of attention heads and latent floor slots
axioms (1)
- domain assumption: Facades can be represented as graphs with a vertical prior on edges between window/door detections
invented entities (1)
- latent floor slots (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: model each facade as a graph over window/door detections with a vertical prior on edges... GATv2... learnable cross-attention queries... softly assigns elements to latent floor slots
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: vertical bias mask... dy_norm... τ_vertical = α_outlier × μ_top-k
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] GATA2Floor: Graph attention for floor counting in street-view facades (arXiv, 2026)
  INTRODUCTION Street view imagery (SVI) offers a valuable resource in building facades with multiple potential applications (energy estimation, construction cost/style prediction etc.), where accurate building-level information is critical. Estimating floors, however, requires reasoning over spatially arranged elements (windows/doors) rather than tre...
- [2] PROPOSED METHODOLOGY The proposed GATA2Floor operates on precomputed window and door bounding boxes obtained either from a supervised detector or from a lightweight label-free proposal mechanism (Section 2.5) when annotations are unavailable, and builds a graph over those boxes rather than on the raw image. Concretely, given a set of N element detections ...
- [3] [Figure residue: "Floor counting and Assignment (GATA2Floor)" panel. Recoverable labels: build vertical-aware graph; input graph features; input embedding + positional encoding; residual GAT block (GATv2Conv, GraphNorm, LeakyReLU, Dropout, LayerNorm, FFN); vertical attention; multi-head cross-attention with floor queries; assignment head; global mean pool + global features; counting head; confidence head.]
- [4] [Figure residue: "Facade element detection (Pretrained detector OR Label-free proposal)" panel. Recoverable labels: supervised detector (Mask R-CNN or YOLO) producing detections; label-free proposal producing proposals via grayscale input, edge extraction, edge map, spatial variance map, coherence map, saliency map, DINOv2 dense patch embedding, GMM, and VLM-prompted crops. Caption truncated in source: "Fig. 1. ..."]
- [5] EXPERIMENTS AND RESULTS 3.1. Datasets We use multiple common labeled datasets in the facade detection field like the Amsterdam Facade, ECP, eTRIMS, and ParisArtDecoFacades [13, 14]. We perform manual labeling for the floor-level ground truth generation. 3.2. Graph-based representation We first evaluate the proposed graph-based representation before th...
- [6] CONCLUSION This work models facades as vertical-aware graphs over window/door detections and introduces GATA2Floor, a multi-head GATv2 architecture that jointly performs global floor counting and soft element-to-floor assignment. Extensive experiments across public and a large unlabeled datasets show that GATA2Floor outperforms clustering-based basel...
- [7] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in 29th Annual Conference on Neural Information Processing Systems (NeurIPS), 2015, pp. 91–99.
- [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.
- [9] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.
- [10] G. Sezen, M. Çakır, M. E. Atik, and Z. Duran, “Deep learning-based door and window detection from building façade,” in The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), 2022, vol. XLIII-B4-2022, pp. 315–320.
- [11] F. Pan, S. Jeon, B. Wang, F. Mckenna, and S. X. Yu, “Zero-shot building attribute extraction from large-scale vision and language models,” in IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 8632–8641.
- [12] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2009.
- [13] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018.
- [14] S. Brody, U. Alon, and E. Yahav, “How attentive are graph attention networks?,” in International Conference on Learning Representations (ICLR), 2022.
- [15] M. Wu, W. Zeng, and C.-W. Fu, “Floorlevel-net: Recognizing floor-level lines with height-attention-guided multi-task learning,” IEEE Transactions on Image Processing, vol. 30, pp. 6686–6699, 2021.
- [16] F. Moubayed, R. Becker, and J. Blankenbach, “Geodata-based number of floor estimation for urban residential buildings as an input parameter for energy modelling,” Geo-spatial Information Science, vol. 0, pp. 1–27, 2025.
- [17] H. Li, Z. Yuan, G. Dax, G. Kong, H. Fan, A. Zipf, and M. Werner, “Semi-supervised learning from street-view images and openstreetmap for automatic building height estimation,” arXiv preprint arXiv:2307.02574, 2023.
- [18] Y. Sun, S. Chen, Y. Tian, and X. X. Zhu, “Building floor number estimation from crowdsourced street-level images: Munich dataset and baseline method,” arXiv preprint arXiv:2505.18021, 2025.
- [19] F. Korč and W. Förstner, “eTRIMS image database for interpreting images of man-made scenes,” Tech. Rep. TR-IGG-P-2009-01, Dept. of Photogrammetry, University of Bonn, 2009.
- [20] R. Gadde, R. Marlet, and N. Paragios, “Learning grammars for architecture-specific facade parsing,” International Journal of Computer Vision, vol. 117, no. 3, pp. 290–316, 2016.
- [21] Ross Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
- [22] I. Sobel and G. Feldman, “A 3×3 isotropic gradient operator for image processing,” in Pattern Classification and Scene Analysis, pp. 271–272, 1973.
- [23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al., “Dinov2: Learning robust visual features without supervision,” Transactions on Machine Learning Research Journal, 2024.
- [24] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
- [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever, “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning (ICML). PMLR, 2021, pp. 8748–8763.
- [26] OpenAI, “GPT-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
- [27] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” International Conference on Learning Representations (ICLR), 2019.
- [28] D. J. Dobson, “Floor count from street view imagery using learning-based façade parsing,” Master’s thesis, TU Delft, 2023.
- [29] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.