
arxiv: 2604.21102 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI

Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery

Pith reviewed 2026-05-10 00:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords building condition assessment · street view imagery · large language models · knowledge distillation · human-AI alignment · Google Street View · multimodal models · housing attributes

The pith

Fine-tuning a multimodal large language model on a modest set of human labels produces building condition scores from street-view images that align with the human consensus as well as, or better than, individual human raters do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that fine-tunes Gemma 3 27B on a small set of human-labeled street-view images to score building conditions nationwide. The resulting scores match human mean opinion scores more closely than many individual raters do. The same capabilities are then distilled into much smaller models that run several times faster while keeping similar accuracy. The method also extends to rating many additional housing and neighborhood attributes. A dashboard presents the outputs for practical use by homeowners and planners.

Core claim

By fine-tuning Gemma 3 27B on a modest human-labeled dataset, the approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. Knowledge distillation transfers these capabilities to a smaller Gemma 3 4B model with a 3x speedup and further to CNN and transformer architectures with 30x speed gains while retaining close performance.
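
To make the comparison concrete, here is a minimal sketch of how a model and individual raters could each be scored against the MOS with SRCC and PLCC. The ratings, the model outputs, and the 1-5 scale are all made up for illustration. The abstract also does not say whether a rater's own vote is excluded from the MOS they are compared against; this sketch keeps it in, which is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical ratings: rows are raters, columns are images.
ratings = [
    [1, 2, 3, 4, 5, 2],
    [2, 2, 4, 4, 5, 1],
    [1, 3, 3, 5, 4, 2],
]
# Mean opinion score per image: the average over raters.
mos = [sum(col) / len(col) for col in zip(*ratings)]
# Made-up model outputs for the same six images.
model_scores = [1.2, 2.4, 3.3, 4.4, 4.8, 1.6]

model_srcc = spearman(model_scores, mos)
model_plcc = pearson(model_scores, mos)
rater_srcc = [spearman(r, mos) for r in ratings]
rater_plcc = [pearson(r, mos) for r in ratings]
```

The paper's claim corresponds to `model_srcc` and `model_plcc` exceeding the entries of `rater_srcc` and `rater_plcc`; whether that holds depends on the real data, not on these toy numbers.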

What carries the argument

Fine-tuned multimodal LLM (Gemma 3 27B) that scores building conditions directly from Google Street View imagery, followed by knowledge distillation to smaller models.
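
The abstract does not say which distillation objective or data the authors use. As one common reading, the sketch below has a small student regress directly on the teacher's scores for unlabeled inputs (response-based distillation). The teacher function, the three-number feature vectors, and the linear student are hypothetical stand-ins; a real pipeline would feed images to Gemma 3 27B and train a neural student.

```python
import random

def teacher_score(features):
    """Hypothetical stand-in for the fine-tuned large model.
    A real teacher would consume pixels, not a short feature list."""
    w = [0.5, -1.0, 2.0]
    return 3.0 + sum(wi * fi for wi, fi in zip(w, features))

def distill(unlabeled, teacher, epochs=200, lr=0.05):
    """Fit a linear student to the teacher's outputs.
    Uses per-sample gradient steps on the squared error."""
    dim = len(unlabeled[0])
    weights, bias = [0.0] * dim, 0.0
    # Pseudo-labels come from the teacher, so no human labels are needed here.
    targets = [teacher(x) for x in unlabeled]
    for _ in range(epochs):
        for x, t in zip(unlabeled, targets):
            pred = bias + sum(w * xi for w, xi in zip(weights, x))
            err = pred - t
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

random.seed(0)
# Hypothetical pool of unlabeled inputs, each reduced to three features.
pool = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
student_w, student_b = distill(pool, teacher_score)
```

Because the toy teacher is itself linear, the student can recover it almost exactly. With a real 27B teacher the student only approximates it, which is why the size of the performance drop matters.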

If this is right

  • Large-scale nationwide building condition maps become feasible with only modest additional human labeling.
  • Multiple built-environment and housing attributes beyond overall condition can be assessed in the same pipeline.
  • Faster distilled models enable real-time or batch processing of millions of street-view images.
  • A visualization dashboard integrates the automated scores for direct use by homeowners and analysts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar fine-tuning on local labels could adapt the method to street-view data in other countries.
  • The outputs could be combined with property records to study links between visible condition and housing values.
  • Repeated assessments over time might track neighborhood change or the effects of local policies.
  • The approach could serve as a low-cost screening tool for housing quality studies that previously required field visits.

Load-bearing premise

The modest human-labeled dataset captures the full range of US building conditions so the model generalizes to new street-view images without geographic or architectural bias.

What would settle it

Systematic deviation between the model's scores and fresh human mean opinion scores on a large sample of street-view images from regions or building types absent from the original training labels.
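
One way such a check could be run, sketched under assumptions: group fresh images by region or building type, then compare the mean signed difference between model score and MOS in each group. The group names, the scores, and the 0.5 tolerance below are hypothetical.

```python
from collections import defaultdict

def signed_error_by_group(records):
    """Mean (model score - MOS) per group.
    A value far from 0 suggests systematic over- or under-scoring."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, model_score, mos in records:
        sums[group] += model_score - mos
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

def flag_groups(bias_by_group, tolerance):
    """Groups whose mean signed error exceeds the tolerance in magnitude."""
    return sorted(g for g, b in bias_by_group.items() if abs(b) > tolerance)

# Hypothetical records: (group, model score, fresh human MOS).
records = [
    ("seen_region", 3.1, 3.0),
    ("seen_region", 2.0, 2.1),
    ("unseen_region", 4.0, 3.0),
    ("unseen_region", 3.5, 2.5),
]
bias = signed_error_by_group(records)
flagged = flag_groups(bias, tolerance=0.5)
```

A signed error is used rather than an absolute one because the question is whether the model drifts consistently in one direction for a group, not merely whether it is noisy there.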

Original abstract

We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs' capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. This paper claims to introduce a novel framework for nationwide US building condition assessment using Google Street View imagery and multimodal LLMs. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, it achieves strong correlation with human mean opinion scores (MOS), outperforming individual raters on SRCC and PLCC. Knowledge distillation is applied to create faster models (Gemma 3 4B with 3x speedup, and CNN/transformer models with 30x speedup) while maintaining comparable performance. The work also includes a human-AI alignment study for various built environment and housing attributes and a visualization dashboard.

Significance. Should the results prove robust upon verification of the dataset and generalization, the significance is high for the field of computer vision applied to urban informatics and housing assessment. It demonstrates the potential of LLMs for subjective visual assessment tasks with limited labeling, and the distillation strategy addresses practical deployment concerns for large-scale applications. The human-AI alignment investigation is a valuable addition for understanding model capabilities beyond simple metrics.

major comments (3)
  1. [Abstract] The abstract states that the approach 'achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC'. However, no specific numerical values for these metrics, no details on the number of human raters or their agreement levels, and no information on the dataset size or composition are provided, which are essential to substantiate the outperformance claim and interpret its reliability.
  2. [Methods and Experimental Setup] The central claim of applicability for 'large-scale building condition assessment' nationwide relies on the assumption that the modest human-labeled dataset is representative of US building diversity. The manuscript does not report the number of images, geographic coverage (cities/states), building type distribution, or any cross-region validation (e.g., training on one set of states and testing on others). This omission directly impacts the validity of the generalization claim.
  3. [Results] While knowledge distillation to EfficientNetV2-M and SwinV2-B is presented as achieving close performance with 30x speed gain, there is no ablation study or detailed comparison table showing the exact performance drop, inference times on specific hardware, or error analysis across different building conditions.
minor comments (2)
  1. [Abstract] The phrase 'modest human-labeled dataset' is imprecise; the exact size (e.g., number of labeled images) should be stated explicitly for context.
  2. [Overall] Consider including a limitations section discussing potential biases in GSV imagery (e.g., temporal, weather, or regional coverage issues) and how they affect the assessments.
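
The cross-region validation requested in major comment 2 could be set up roughly as follows: hold out every image from one state, train on the rest, and repeat for each state. This is a sketch only; the `state` field, the image IDs, and the sample records are hypothetical, and the manuscript reports no such split.

```python
def leave_one_group_out(samples, group_of):
    """Yield (held_out_group, train, test), holding out each group in turn."""
    groups = sorted({group_of(s) for s in samples})
    for held_out in groups:
        train = [s for s in samples if group_of(s) != held_out]
        test = [s for s in samples if group_of(s) == held_out]
        yield held_out, train, test

# Hypothetical labeled images tagged with the state they were captured in.
samples = [
    {"image_id": "a", "state": "IN"},
    {"image_id": "b", "state": "IN"},
    {"image_id": "c", "state": "OH"},
    {"image_id": "d", "state": "TX"},
    {"image_id": "e", "state": "TX"},
]
splits = list(leave_one_group_out(samples, lambda s: s["state"]))
```

Splitting by state rather than by image keeps nearby, visually similar buildings from landing on both sides of the split, which would otherwise inflate the apparent generalization.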

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has identified key areas where additional details and clarifications will strengthen the manuscript. We address each major comment point by point below, indicating the revisions we will make.

  1. Referee: [Abstract] The abstract states that the approach 'achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC'. However, no specific numerical values for these metrics, no details on the number of human raters or their agreement levels, and no information on the dataset size or composition are provided, which are essential to substantiate the outperformance claim and interpret its reliability.

    Authors: We agree that the abstract would be more informative with these supporting details. The specific SRCC and PLCC values, number of human raters, their agreement levels, and dataset size/composition are reported in the main text (Sections 3 and 4). We will revise the abstract to incorporate the key numerical results and dataset characteristics to better substantiate the claims.
    Revision: yes

  2. Referee: [Methods and Experimental Setup] The central claim of applicability for 'large-scale building condition assessment' nationwide relies on the assumption that the modest human-labeled dataset is representative of US building diversity. The manuscript does not report the number of images, geographic coverage (cities/states), building type distribution, or any cross-region validation (e.g., training on one set of states and testing on others). This omission directly impacts the validity of the generalization claim.

    Authors: We acknowledge this point on the need for explicit reporting to support the generalization claim. We will add a dedicated subsection to the Methods describing the dataset statistics, including the number of images, geographic coverage by cities and states, and building type distribution. We did not perform cross-region validation experiments due to the modest size of the labeled dataset; we will discuss this as a limitation in the revised manuscript and its implications for nationwide applicability.
    Revision: partial

  3. Referee: [Results] While knowledge distillation to EfficientNetV2-M and SwinV2-B is presented as achieving close performance with 30x speed gain, there is no ablation study or detailed comparison table showing the exact performance drop, inference times on specific hardware, or error analysis across different building conditions.

    Authors: We agree that additional details on the distillation results would improve the manuscript. We will expand the Results section with a detailed comparison table reporting exact performance metrics, performance drops, inference times on specific hardware, and error analysis across building condition categories. A brief ablation study on distillation parameters will also be included.
    Revision: yes
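
For the inference-time column of such a comparison table, a measurement harness might look like the sketch below. The two model functions are hypothetical stand-ins that only burn CPU time; they are not the paper's models, and the numbers they produce say nothing about the reported 3x or 30x figures.

```python
import time
from statistics import median

def median_latency(predict, images, repeats=5):
    """Median wall-clock seconds per image over several full passes."""
    per_image = []
    for _ in range(repeats):
        start = time.perf_counter()
        for image in images:
            predict(image)
        per_image.append((time.perf_counter() - start) / len(images))
    return median(per_image)

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# Hypothetical stand-ins for a large teacher and a small student.
def slow_model(image):
    return sum(i * i for i in range(20000)) % 5

def fast_model(image):
    return sum(i * i for i in range(500)) % 5

images = list(range(20))  # placeholder inputs
ratio = speedup(
    median_latency(slow_model, images),
    median_latency(fast_model, images),
)
```

A real benchmark would also fix the hardware, batch size, and numeric precision, and discard warm-up passes, since each of these can shift the ratio.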

Circularity Check

0 steps flagged

No circularity: performance metrics computed against independent external human MOS benchmark


The paper fine-tunes Gemma 3 27B on human-labeled data and reports SRCC/PLCC alignment to an external mean opinion score (MOS) benchmark, with the claim that the model outperforms individual raters relative to that same MOS. This evaluation structure uses an independent human-derived reference that is not defined in terms of the model's fitted parameters or outputs. No equations, self-referential definitions, or load-bearing self-citations are present in the abstract that would reduce the reported metrics to quantities constructed from the training process itself. The derivation chain therefore remains non-circular and relies on standard supervised evaluation against external labels.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of a modest human-labeled dataset for nationwide generalization and on standard assumptions that fine-tuning and distillation preserve task performance; no new physical entities or ad-hoc constants are introduced.

axioms (2)
  • domain assumption: A modest human-labeled dataset drawn from Google Street View is sufficiently representative of US building conditions to allow reliable fine-tuning and generalization to unseen imagery.
    Implicit in the claim of nationwide applicability and strong alignment with human MOS.
  • domain assumption: Knowledge distillation from the 27B model to the 4B model and to CNN/transformer architectures preserves the essential assessment capability.
    Required for the efficiency claims.

pith-pipeline@v0.9.0 · 5536 in / 1533 out tokens · 37186 ms · 2026-05-10T00:07:33.779655+00:00 · methodology

