A Zero-Shot Multi-Agent Framework for Human-Building Interaction via Programmatic Reasoning
Pith reviewed 2026-06-27 10:27 UTC · model grok-4.3
The pith
A hierarchical multi-agent framework uses a Doorman for query decomposition and coding agents that emit executable Python scripts to deliver accurate building analytics from natural language without fine-tuning or RAG.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that semantic routing combined with programmatic reasoning lets LLMs handle human-building interaction reliably in a zero-shot setting. Coding agents generate executable Python scripts for exact calculations, which avoids embedding domain knowledge directly in base models or relying on retrieval-augmented generation. The paper reports accurate responses on real data from more than 200 buildings for diverse stakeholders and applications.
What carries the argument
The Doorman mechanism for task decomposition together with specialized coding agents that output executable Python scripts for arithmetic and building analytics.
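As a rough illustration of the routing step, the sketch below splits a query into sub-tasks and assigns each one to a coding agent. Keyword matching stands in for the paper's semantic (LLM-based) routing, and the agent names, keyword lists, and the `doorman` function are assumptions made for this example, not details taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical agent names and keyword lists; the paper's actual routing
# is semantic (LLM-based) and is not specified in this summary.
ROUTES = {
    "energy_agent": ("energy", "kwh", "electricity", "consumption"),
    "comfort_agent": ("temperature", "humidity", "co2", "comfort"),
}


@dataclass
class SubTask:
    agent: str  # which coding agent should handle this piece
    text: str   # the sub-question passed to that agent


def doorman(query: str) -> list[SubTask]:
    """Split a query on ' and ' and route each part by keyword match."""
    tasks = []
    for part in query.lower().split(" and "):
        part = part.strip(" ?.")
        agent = next(
            (name for name, words in ROUTES.items()
             if any(word in part for word in words)),
            "fallback_agent",  # nothing matched: hand off to a generic agent
        )
        tasks.append(SubTask(agent=agent, text=part))
    return tasks


tasks = doorman(
    "What was the electricity use last week and is the temperature on floor 3 normal?"
)
```

The point of the design is that the router only decides who answers; it never computes anything itself.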
If this is right
- The system supplies accurate and contextual responses to stakeholders ranging from tenants to building managers.
- It supports multiple building system applications on data from more than 200 commercial buildings.
- Programmatic reasoning via generated scripts replaces standard RAG for technical precision.
- Natural language understanding is decoupled from domain analytics so that no single model needs to hold both.
- Zero-shot operation works across varying LLM alignment characteristics without per-domain retraining.
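To make the programmatic-reasoning idea concrete, here is a minimal sketch of executing a generated script and reading back its printed result, so the arithmetic is done by the interpreter rather than by the language model. The script text, the readings, and the `run_generated` helper are hypothetical. In-process `exec` is used only for brevity and is not a sandbox; the summary does not say how the paper isolates execution.

```python
import contextlib
import io

# Stand-in for a script a coding agent might emit; the readings are made up.
GENERATED_SCRIPT = """
readings_kwh = [120.5, 98.25, 101.0, 110.25]
print(sum(readings_kwh) / len(readings_kwh))
"""


def run_generated(script: str) -> str:
    """Execute script text in a fresh namespace and return what it printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(script, {"__name__": "__generated__"})
    return buffer.getvalue().strip()


mean_kwh = float(run_generated(GENERATED_SCRIPT))
```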
Where Pith is reading between the lines
- The same routing-plus-script pattern could apply to other data-heavy domains that mix natural language with precise calculations, such as energy-grid operations or facility maintenance logs.
- Real-time sensor feeds could be wired directly into the script execution step to support live queries rather than static datasets.
- Error rates might drop further if the coding agents were allowed to chain multiple short scripts instead of one monolithic output per query.
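The chaining idea in the last point can be sketched as follows: several short snippets run in order and share one namespace, so a failure can be traced to a single step. This is an illustration of the speculation above, not something the paper describes; the snippets, the data, and `run_chain` are all assumptions.

```python
# Hypothetical three-step chain; each snippet is short enough to check alone.
STEPS = [
    "hourly_kwh = [4.0, 6.0, 5.0, 9.0]",
    "total_kwh = sum(hourly_kwh)",
    "peak_share = max(hourly_kwh) / total_kwh",
]


def run_chain(steps):
    """Execute snippets in order, sharing one namespace between them."""
    namespace = {}
    for index, snippet in enumerate(steps):
        try:
            exec(snippet, namespace)
        except Exception as exc:
            # Report which step failed instead of losing the whole answer.
            raise RuntimeError(f"step {index} failed: {exc}") from exc
    namespace.pop("__builtins__", None)  # exec adds this; drop it from results
    return namespace


result = run_chain(STEPS)
```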
Load-bearing premise
Routing natural language queries to coding agents that emit executable Python scripts will yield reliable technical accuracy without fine-tuning or direct domain-knowledge embedding in the models.
What would settle it
A set of building queries where the generated Python scripts return demonstrably incorrect numerical results or fail to interpret user intent on the 200-building dataset.
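One way such a test could be organized is sketched below: each script result is compared with an independently computed reference value, and any mismatch or missing result is flagged. The case ids, the numbers, and the tolerance are illustrative assumptions, not data from the paper.

```python
import math

# Hypothetical records: (query id, value returned by the generated script,
# independently computed reference value). The numbers are illustrative.
CASES = [
    ("avg_kwh_bldg_17", 107.5, 107.5),
    ("peak_co2_bldg_03", 812.0, 815.0),
    ("total_gas_bldg_42", None, 56.2),  # script crashed or gave no number
]


def find_failures(cases, rel_tol=1e-3):
    """Return the ids whose script result is missing or outside tolerance."""
    failures = []
    for case_id, got, expected in cases:
        if got is None or not math.isclose(got, expected, rel_tol=rel_tol):
            failures.append(case_id)
    return failures


failures = find_failures(CASES)
```

A non-empty `failures` list on the real dataset would be the kind of evidence that counts against the central claim.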
Original abstract
Large Language Models (LLMs) offer opportunities to enhance Human-Building Interaction (HBI) by enabling more direct interactions through intuitive interfaces to complex building systems. These systems can be characterized by the vast amounts of data across multiple formats, the lack of non-confidential and generalizable information, and the requirement of domain expertise for interpretation. Applying LLMs to domain-specific tasks like HBI presents additional challenges. Limited training data makes traditional fine-tuning approaches impractical. Meanwhile, the opacity of LLM training data requires careful integration of domain knowledge to ensure reliability. Additionally, different LLMs exhibit varying alignment characteristics, suggesting that achieving both natural interaction and technical accuracy requires a multi-agent approach. These challenges highlight the need for innovative approaches to adapt LLMs for specialized domains while maintaining accuracy and user engagement. In this paper, we develop a hierarchical multi-agent framework that utilizes semantic routing and programmatic reasoning to decouple natural language understanding from building analytics. Instead of standard RAG approaches, our system employs a "Doorman" mechanism for task decomposition and specialized coding agents that generate executable Python scripts for precise arithmetic. We validate this framework on a dataset from more than 200 commercial buildings. Results demonstrate the effectiveness in providing accurate and contextual responses for diverse users, from tenants to building managers, across various building system applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a hierarchical multi-agent LLM framework for Human-Building Interaction that uses a 'Doorman' semantic router for query decomposition and specialized coding agents that emit executable Python scripts for building analytics computations. It claims this zero-shot approach decouples natural language handling from technical accuracy, avoiding fine-tuning and RAG limitations, and validates the system on data from more than 200 commercial buildings to demonstrate accurate, contextual responses for users ranging from tenants to building managers across building system applications.
Significance. If the quantitative results hold, the separation of routing from programmatic reasoning offers a practical route to reliable domain-specific LLM use in data-scarce settings like building management, where direct fine-tuning is impractical. The design choice to generate executable code rather than rely on LLM arithmetic is a clear strength that could generalize to other technical domains.
major comments (2)
- [Abstract and §4 (Results)] The effectiveness claim for >200 buildings is stated without any reported metrics (accuracy, error rates, success rates), baselines, statistical analysis, or exclusion criteria. This absence is load-bearing because the central contribution is the framework's reliability; without these numbers the claim cannot be evaluated.
- [§3 (Methods)] The description of the coding agents and Python script execution does not specify how domain knowledge (e.g., building metadata schemas, sensor units, or safety constraints) is injected into the generated code, or how runtime errors are handled and reported back to the user.
minor comments (2)
- [§3] Notation for the Doorman routing logic is introduced without a formal definition or pseudocode; a diagram or algorithm box would improve clarity.
- [Abstract and §4] The abstract mentions 'various building system applications' but the results section does not enumerate which applications were tested or provide per-application breakdowns.
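The per-application and per-stakeholder breakdown that the comments above ask for could be tabulated along these lines. This is only a sketch of the reporting format: the log entries, application names, and roles are invented for illustration, and no such results appear in the material reviewed here.

```python
from collections import defaultdict

# Hypothetical evaluation log: (application, user role, answered correctly).
LOG = [
    ("hvac", "tenant", True),
    ("hvac", "manager", True),
    ("hvac", "manager", False),
    ("energy", "tenant", True),
    ("energy", "manager", True),
]


def success_rates(log, key_index):
    """Group outcomes by one column and return {group: (correct, total, rate)}."""
    counts = defaultdict(lambda: [0, 0])
    for row in log:
        counts[row[key_index]][0] += int(row[2])  # correct answers
        counts[row[key_index]][1] += 1            # all answers
    return {key: (c, n, c / n) for key, (c, n) in counts.items()}


by_application = success_rates(LOG, key_index=0)
by_role = success_rates(LOG, key_index=1)
```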
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to provide the requested quantitative details and methodological clarifications.
- Referee: [Abstract and §4 (Results)] The effectiveness claim for >200 buildings is stated without any reported metrics (accuracy, error rates, success rates), baselines, statistical analysis, or exclusion criteria. This absence is load-bearing because the central contribution is the framework's reliability; without these numbers the claim cannot be evaluated.
  Authors: We agree that the absence of quantitative metrics limits evaluability of the reliability claims. The current manuscript states validation on data from more than 200 buildings but does not report accuracy, error rates, baselines, statistical analysis, or exclusion criteria. In the revised version we will expand §4 with these metrics, including success rates across user types and building systems, baseline comparisons, statistical tests, and explicit exclusion criteria.
  Revision: yes
- Referee: [§3 (Methods)] The description of the coding agents and Python script execution does not specify how domain knowledge (e.g., building metadata schemas, sensor units, or safety constraints) is injected into the generated code or how runtime errors are handled and reported back to the user.
  Authors: We acknowledge the need for greater specificity. The revised §3 will explicitly describe how domain knowledge is injected via structured prompts containing building metadata schemas, standardized sensor units, and safety constraints. It will also detail the runtime error handling process, in which execution errors are captured, returned to the coding agents for correction through iterative prompting, and only then surfaced to the user with explanatory context.
  Revision: yes
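The error-handling loop described in the second response can be sketched as below: a script is executed, any exception text is fed back to the coding agent, and the failure is surfaced only after the retries run out. Everything named here is an assumption for illustration: `stub_agent` stands in for an LLM call, `CONTEXT` for the structured prompt, and the retry limit is arbitrary. In-process `exec` is used for brevity and is not a sandbox.

```python
import traceback

# Hypothetical prompt context; the field names are illustrative only.
CONTEXT = {"schema": {"zone_temp": "float"}, "units": {"zone_temp": "degC"}}


def stub_agent(context, question, error=None):
    """Stand-in for an LLM coding agent: the first draft has a bug,
    and the retry (which sees the error text) returns a fixed script."""
    if error is None:
        return "result = sum(temps) / len(temp)"  # NameError: 'temp'
    return "result = sum(temps) / len(temps)"


def answer_with_retries(agent, context, question, data, max_attempts=3):
    """Run agent-written scripts, feeding each error back until one succeeds."""
    error = None
    for attempt in range(1, max_attempts + 1):
        script = agent(context, question, error)
        namespace = dict(data)  # fresh copy so failed runs leave no state
        try:
            exec(script, namespace)
            return {"ok": True, "value": namespace["result"], "attempts": attempt}
        except Exception:
            # Keep only the last traceback line for the next prompt.
            error = traceback.format_exc().strip().splitlines()[-1]
    return {"ok": False, "error": error, "attempts": max_attempts}


outcome = answer_with_retries(
    stub_agent, CONTEXT, "Average zone temperature?", {"temps": [21.0, 23.0]}
)
```

Here the first draft fails, the second succeeds, and the user would only see an error if all attempts failed.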
Circularity Check
No significant circularity
The paper presents an architectural description of a hierarchical multi-agent LLM framework (Doorman routing plus specialized coding agents emitting Python scripts) for human-building interaction, validated on a >200-building dataset. No equations, fitted parameters, self-citations, or derivation chains appear in the abstract or described content. The central claim of effectiveness is framed as an empirical outcome of the proposed system rather than a result reduced by construction to its own inputs or prior self-referential work. This is a standard non-circular framework proposal.
A Survey Instrument and Response Data
The complete survey instrument (including all questions and answer options) and the anonymized user response dataset are available at the following links: Survey instrume...