pith. sign in

arxiv: 2606.11354 · v1 · pith:GI5QFFE3new · submitted 2026-06-09 · 💻 cs.ET

A Zero-Shot Multi-Agent Framework for Human-Building Interaction via Programmatic Reasoning

Pith reviewed 2026-06-27 10:27 UTC · model grok-4.3

classification 💻 cs.ET
keywords multi-agent frameworkhuman-building interactionprogrammatic reasoningzero-shotsemantic routingbuilding analyticsLLM agents
0
0 comments X

The pith

A hierarchical multi-agent framework uses a Doorman for query decomposition and coding agents that emit executable Python scripts to deliver accurate building analytics from natural language without fine-tuning or RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a zero-shot multi-agent system for human-building interaction that separates intuitive language handling from precise technical calculations in complex building data. A top-level Doorman agent breaks down user questions, then routes them to specialized coding agents that write and run Python scripts for arithmetic and analysis. This is tested on data from more than 200 commercial buildings and produces accurate, contextual answers for users from tenants to managers across multiple building systems. A sympathetic reader would care because building systems hold large, opaque datasets that normally require scarce domain experts, and this setup aims to make those data queryable through ordinary language.

Core claim

The central claim is that semantic routing combined with programmatic reasoning lets LLMs handle human-building interaction reliably in a zero-shot setting by generating executable Python scripts for exact calculations, thereby avoiding the need to embed domain knowledge directly in base models or rely on retrieval-augmented generation, and this produces accurate responses on real data from over 200 buildings for diverse stakeholders and applications.

What carries the argument

The Doorman mechanism for task decomposition together with specialized coding agents that output executable Python scripts for arithmetic and building analytics.

If this is right

  • The system supplies accurate and contextual responses to stakeholders ranging from tenants to building managers.
  • It supports multiple building system applications on data from more than 200 commercial buildings.
  • Programmatic reasoning via generated scripts replaces standard RAG for technical precision.
  • Natural language understanding is decoupled from domain analytics so that no single model needs to hold both.
  • Zero-shot operation works across varying LLM alignment characteristics without per-domain retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same routing-plus-script pattern could apply to other data-heavy domains that mix natural language with precise calculations, such as energy-grid operations or facility maintenance logs.
  • Real-time sensor feeds could be wired directly into the script execution step to support live queries rather than static datasets.
  • Error rates might drop further if the coding agents were allowed to chain multiple short scripts instead of one monolithic output per query.

Load-bearing premise

Routing natural language queries to coding agents that emit executable Python scripts will yield reliable technical accuracy without fine-tuning or direct domain-knowledge embedding in the models.

What would settle it

A set of building queries where the generated Python scripts return demonstrably incorrect numerical results or fail to interpret user intent on the 200-building dataset.

read the original abstract

Large Language Model (LLM) offers opportunities to enhance Human-Building Interaction (HBI) by enabling more direct interactions through intuitive interfaces to complex building systems. These systems can be characterized by the vast amounts of data across multiple formats, the lack of nonconfidential and generalizable information, and the requirement of domain expertise for interpretation. Applying LLMs to domain-specific tasks like HBI presents additional challenges. Limited training data makes traditional fine-tuning approaches impractical. Meanwhile, the opacity of LLM training data requires careful integration of domain knowledge to ensure reliability. Additionally, different LLMs exhibit varying alignment characteristics, suggesting that achieving both natural interaction and technical accuracy requires a multi-agent approach. These challenges highlight the need for innovative approaches to adapt LLMs for specialized domains while maintaining accuracy and user engagement. In this paper, we develop a hierarchical multi-agent framework that utilizes semantic routing and programmatic reasoning to decouple natural language understanding from building analytics. Instead of standard RAG approaches, our system employs a "Doorman" mechanism for task decomposition and specialized coding agents that generate executable Python scripts for precise arithmetic. We validate this framework on a dataset from more than 200 commercial buildings. Results demonstrate the effectiveness in providing accurate and contextual responses for diverse users, including stakeholders, from tenants to building managers, across various building system applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a hierarchical multi-agent LLM framework for Human-Building Interaction that uses a 'Doorman' semantic router for query decomposition and specialized coding agents that emit executable Python scripts for building analytics computations. It claims this zero-shot approach decouples natural language handling from technical accuracy, avoiding fine-tuning and RAG limitations, and validates the system on data from more than 200 commercial buildings to demonstrate accurate, contextual responses for users ranging from tenants to building managers across building system applications.

Significance. If the quantitative results hold, the separation of routing from programmatic reasoning offers a practical route to reliable domain-specific LLM use in data-scarce settings like building management, where direct fine-tuning is impractical. The design choice to generate executable code rather than rely on LLM arithmetic is a clear strength that could generalize to other technical domains.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): The effectiveness claim for >200 buildings is stated without any reported metrics (accuracy, error rates, success rates), baselines, statistical analysis, or exclusion criteria. This absence is load-bearing because the central contribution is the framework's reliability; without these numbers the claim cannot be evaluated.
  2. [§3] §3 (Methods): The description of the coding agents and Python script execution does not specify how domain knowledge (e.g., building metadata schemas, sensor units, or safety constraints) is injected into the generated code or how runtime errors are handled and reported back to the user.
minor comments (2)
  1. [§3] Notation for the Doorman routing logic is introduced without a formal definition or pseudocode; a diagram or algorithm box would improve clarity.
  2. [Abstract and §4] The abstract mentions 'various building system applications' but the results section does not enumerate which applications were tested or provide per-application breakdowns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and will revise the manuscript to provide the requested quantitative details and methodological clarifications.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): The effectiveness claim for >200 buildings is stated without any reported metrics (accuracy, error rates, success rates), baselines, statistical analysis, or exclusion criteria. This absence is load-bearing because the central contribution is the framework's reliability; without these numbers the claim cannot be evaluated.

    Authors: We agree that the absence of quantitative metrics limits evaluability of the reliability claims. The current manuscript states validation on data from more than 200 buildings but does not report accuracy, error rates, baselines, statistical analysis, or exclusion criteria. In the revised version we will expand §4 with these metrics, including success rates across user types and building systems, baseline comparisons, statistical tests, and explicit exclusion criteria. revision: yes

  2. Referee: [§3] §3 (Methods): The description of the coding agents and Python script execution does not specify how domain knowledge (e.g., building metadata schemas, sensor units, or safety constraints) is injected into the generated code or how runtime errors are handled and reported back to the user.

    Authors: We acknowledge the need for greater specificity. The revised §3 will explicitly describe how domain knowledge is injected via structured prompts containing building metadata schemas, standardized sensor units, and safety constraints. It will also detail the runtime error handling process, in which execution errors are captured, returned to the coding agents for correction through iterative prompting, and only then surfaced to the user with explanatory context. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an architectural description of a hierarchical multi-agent LLM framework (Doorman routing plus specialized coding agents emitting Python scripts) for human-building interaction, validated on a >200-building dataset. No equations, fitted parameters, self-citations, or derivation chains appear in the abstract or described content. The central claim of effectiveness is framed as an empirical outcome of the proposed system rather than a result reduced by construction to its own inputs or prior self-referential work. This is a standard non-circular framework proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5767 in / 947 out tokens · 31128 ms · 2026-06-27T10:27:47.554228+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    It’s about time: A comparison of canadian and american time–activity patterns

    LEECH, J. A., NELSON, W. C., BURNETT, R. T., AARON, S., and RAIZENNE, M. E., 2002. “It’s about time: A comparison of canadian and american time–activity patterns”.Journal of Exposure Science & Environmental Epidemiology,12(6), p. 427–432

  2. [2]

    A review of select human-building interfaces and their relationship to human behavior, energy use and occupant comfort

    Day, J. K., McIlvennie, C., Brackley, C., Tarantini, M., Piselli, C., Hahn, J., O’Brien, W., Rajus, V . S., De Simone, M., Kjærgaard, M. B., et al., 2020. “A review of select human-building interfaces and their relationship to human behavior, energy use and occupant comfort”.Building and environment,178, p. 106920

  3. [3]

    S., Churchill, E

    Alavi, H. S., Churchill, E. F., Wiberg, M., Lalanne, D., Dalsgaard, P., Fatah gen Schieck, A., and Rogers, Y ., 2019. Introduction to human-building interaction (hbi) interfac- ing hci with architecture and urban design

  4. [4]

    The field of human building interaction for convergent research and innovation for intelligent built environments

    Becerik-Gerber, B., Lucas, G., Aryal, A., Awada, M., Berg´es, M., Billington, S., Boric-Lubecke, O., Ghahra- mani, A., Heydarian, A., H ¨oelscher, C., et al., 2022. “The field of human building interaction for convergent research and innovation for intelligent built environments”.Scien- tific Reports,12(1), p. 22092

  5. [5]

    I., 2022

    Messner, J. I., 2022. The lifecycle of a building project. Accessed: 2024-09-02

  6. [6]

    Modeling and simulation of energy-related human-building interac- tion: A systematic review

    Norouziasl, S., Jafari, A., and Zhu, Y ., 2021. “Modeling and simulation of energy-related human-building interac- tion: A systematic review”.Journal of Building Engineer- ing,44, p. 102928

  7. [7]

    Human-building interaction for indoor environmental con- trol: Evolution of technology and future prospects

    Kim, H., Kang, H., Choi, H., Jung, D., and Hong, T., 2023. “Human-building interaction for indoor environmental con- trol: Evolution of technology and future prospects”.Au- tomation in Construction,152, p. 104938

  8. [8]

    Ten questions concerning human-building interaction research for improving the quality of life

    Becerik-Gerber, B., Lucas, G., Aryal, A., Awada, M., Berg´es, M., Billington, S. L., Boric-Lubecke, O., Ghahra- mani, A., Heydarian, A., Jazizadeh, F., et al., 2022. “Ten questions concerning human-building interaction research for improving the quality of life”.Building and Environ- ment,226, p. 109681

  9. [9]

    Bosch building solutions - history of building automation.https: //www.boschbuildingsolutions

    Bosch, 2023. Bosch building solutions - history of building automation.https: //www.boschbuildingsolutions. com/xc/en/news-and-stories/ history-of-building-automation/. Accessed: 2023-05-23

  10. [10]

    Nantum ai

    Nantum AI, 2024. Nantum ai. Accessed: 2024-06-22

  11. [11]

    Design and applica- tions of an iot architecture for data-driven smart building operations and experimentation

    Malkawi, A., Ervin, S., Han, X., Chen, E. X., Lim, S., Am- panavos, S., and Howard, P., 2023. “Design and applica- tions of an iot architecture for data-driven smart building operations and experimentation”.Energy and Buildings, 295, p. 113291

  12. [12]

    The foundation for a smarter home

    Apple, 2024. The foundation for a smarter home. Accessed: 2024-09-02

  13. [13]

    Indoor envi- ronmental wellness index (iew-index): Towards intelligent building systems automation and optimization

    Wang, Y ., Shen, G., and Mehmani, A., 2024. “Indoor envi- ronmental wellness index (iew-index): Towards intelligent building systems automation and optimization”.Building and Environment,247, p. 111039

  14. [14]

    Word2Vec

    Church, K. W., 2017. “Word2Vec”.Natural Language En- gineering,23(1), Jan., pp. 155–162

  15. [15]

    Glove: Global Vectors for Word Representation

    Pennington, J., Socher, R., and Manning, C., 2014. “Glove: Global Vectors for Word Representation”. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computa- tional Linguistics, pp. 1532–1543

  16. [16]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2019. BERT: Pre-training of Deep Bidirec- tional Transformers for Language Understanding, May. arXiv:1810.04805

  17. [17]

    A Survey of Large Language Models

    Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y ., Min, Y ., Zhang, B., Zhang, J., Dong, Z., et al., 2023. “A survey of large language models”.arXiv preprint arXiv:2303.18223

  18. [18]

    Gong, L., Wang, S., Elhoushi, M., and Cheung, A.,

  19. [19]

    Evaluation of LLMs on syntax-aware code fill- in-the-middle tasks

    “Evaluation of LLMs on syntax-aware code fill- in-the-middle tasks”. In Proceedings of the 41st Interna- tional Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, eds., V ol. 235 ofProceedings of Machine Learning Research, PMLR, pp. 15907–15928

  20. [20]

    Gorilla: Large Language Model Connected with Massive APIs

    Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E., 2023. “Gorilla: Large language model connected with massive apis”.arXiv preprint arXiv:2305.15334

  21. [21]

    SceneCraft: An LLM agent for synthesizing 3D scenes as blender code

    Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y ., Ross, D. A., 9 Copyright © 2026 by ASME Schmid, C., and Fathi, A., 2024. “SceneCraft: An LLM agent for synthesizing 3D scenes as blender code”. In Pro- ceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenk...

  22. [22]

    Interpreting and improving large language models in arithmetic calculation

    Zhang, W., Wan, C., Zhang, Y ., Cheung, Y .-M., Tian, X., Shen, X., and Ye, J., 2024. “Interpreting and improving large language models in arithmetic calculation”. In Pro- ceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, eds., V ol. 235 ofProcee...

  23. [23]

    Continual learning of large language models: A comprehensive sur- vey

    Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y ., Wang, Z., Ebrahimi, S., and Wang, H., 2024. Continual learning of large language models: A comprehensive sur- vey

  24. [24]

    A., 2020

    Gururangan, S., Marasovi ´c, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A., 2020. Don’t stop pretraining: Adapt language models to domains and tasks

  25. [25]

    Gsm-symbolic: Under- standing the limitations of mathematical reasoning in large language models

    Mirzadeh, I., Alizadeh, K., Shahrokhi, H., Tuzel, O., Ben- gio, S., and Farajtabar, M., 2024. Gsm-symbolic: Under- standing the limitations of mathematical reasoning in large language models

  26. [26]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V ., Goyal, N., K ¨uttler, H., Lewis, M., Yih, W.-t., Rockt¨aschel, T., et al., 2020. “Retrieval-augmented generation for knowledge-intensive nlp tasks”.Advances in Neural In- formation Processing Systems,33, pp. 9459–9474

  27. [27]

    A survey on llm-based multi-agent systems: workflow, infras- tructure, and challenges

    Li, X., Wang, S., Zeng, S., Wu, Y ., and Yang, Y ., 2024. “A survey on llm-based multi-agent systems: workflow, infras- tructure, and challenges”.Vicinagearth,1(1), p. 9

  28. [28]

    APT: Adap- tive pruning and tuning pretrained language models for effi- cient training and inference

    Zhao, B., Hajishirzi, H., and Cao, Q., 2024. “APT: Adap- tive pruning and tuning pretrained language models for effi- cient training and inference”. In Proceedings of the 41st In- ternational Conference on Machine Learning, R. Salakhut- dinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, eds., V ol. 235 ofProceedings of Ma- ...

  29. [29]

    Li, C., Liang, J., Zeng, A., Chen, X., Hausman, K., Sadigh, D., Levine, S., Fei-Fei, L., Xia, F., and Ichter, B.,

  30. [30]

    Chain of code: Reasoning with a language model- augmented code emulator

    “Chain of code: Reasoning with a language model- augmented code emulator”. In Proceedings of the 41st In- ternational Conference on Machine Learning, R. Salakhut- dinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, eds., V ol. 235 ofProceedings of Ma- chine Learning Research, PMLR, pp. 28259–28277

  31. [31]

    Improving factuality and reasoning in lan- guage models through multiagent debate

    Du, Y ., Li, S., Torralba, A., Tenenbaum, J. B., and Mor- datch, I., 2024. “Improving factuality and reasoning in lan- guage models through multiagent debate”. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org

  32. [32]

    Agent instructs large language models to be general zero-shot reasoners

    Crispino, N., Montgomery, K., Zeng, F., Song, D., and Wang, C., 2024. “Agent instructs large language models to be general zero-shot reasoners”. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, JMLR.org

  33. [33]

    Transformer machine learning language model for auto- alignment of long-term and short-term plans in construc- tion

    Amer, F., Jung, Y ., and Golparvar-Fard, M., 2021. “Transformer machine learning language model for auto- alignment of long-term and short-term plans in construc- tion”.Automation in Construction,132, p. 103929

  34. [34]

    Investigating the use of chatgpt for the scheduling of con- struction projects

    Prieto, S. A., Mengiste, E. T., and Garc´ıa de Soto, B., 2023. “Investigating the use of chatgpt for the scheduling of con- struction projects”.Buildings,13(4), p. 857

  35. [35]

    Gpt models in con- struction industry: Opportunities, limitations, and a use case validation

    Saka, A., Taiwo, R., Saka, N., Salami, B. A., Ajayi, S., Akande, K., and Kazemi, H., 2023. “Gpt models in con- struction industry: Opportunities, limitations, and a use case validation”.Developments in the Built Environment, p. 100300

  36. [36]

    Leveraging chatgpt to aid construction hazard recognition and support safety education and training

    Uddin, S. J., Albert, A., Ovid, A., and Alsharef, A., 2023. “Leveraging chatgpt to aid construction hazard recognition and support safety education and training”.Sustainability, 15(9), p. 7121

  37. [37]

    Text2bim: Generating building models using a large language model-based multi-agent framework

    Du, C., Esser, S., Nousias, S., and Borrmann, A., 2024. “Text2bim: Generating building models using a large language model-based multi-agent framework”.arXiv preprint arXiv:2408.08054

  38. [38]

    Llm-funcmapper: Function identification for interpreting complex clauses in building codes via llm

    Zheng, Z., Chen, K.-Y ., Cao, X.-Y ., Lu, X.-Z., and Lin, J.- R., 2023. “Llm-funcmapper: Function identification for interpreting complex clauses in building codes via llm”. arXiv preprint arXiv:2308.08728

  39. [39]

    Automated building information modeling compliance check through a large language model combined with deep learning and ontology

    Chen, N., Lin, X., Jiang, H., and An, Y ., 2024. “Automated building information modeling compliance check through a large language model combined with deep learning and ontology”.Buildings,14(7), p. 1983

  40. [40]

    BIM-GPT: A prompt-based virtual assistant framework for bim information retrieval.arXiv preprint arXiv:2304.09333, 2023

    Zheng, J., and Fischer, M., 2023. “Bim-gpt: a prompt- based virtual assistant framework for bim information re- trieval”.arXiv preprint arXiv:2304.09333

  41. [41]

    Hotgpt: How to make software documentation more useful with a large language model?

    Su, Y ., Wan, C., Sethi, U., Lu, S., Musuvathi, M., and Nath, S., 2023. “Hotgpt: How to make software documentation more useful with a large language model?”. In Proceedings of the 19th Workshop on Hot Topics in Operating Systems, pp. 87–93

  42. [42]

    Advancing build- ing energy modeling with large language models: Explo- ration and case studies

    Zhang, L., Chen, Z., and Ford, V ., 2024. “Advancing build- ing energy modeling with large language models: Explo- ration and case studies”.arXiv preprint arXiv:2402.09579

  43. [43]

    Eplus- llm: A large language model-based computing platform for automated building energy modeling

    Jiang, G., Ma, Z., Zhang, L., and Chen, J., 2024. “Eplus- llm: A large language model-based computing platform for automated building energy modeling”.Applied Energy, 367, p. 123431

  44. [44]

    Semantic enrichment 10 Copyright © 2026 by ASME for bim-based building energy performance simulations us- ing semantic textual similarity and fine-tuning multilingual llm

    Forth, K., and Borrmann, A., 2024. “Semantic enrichment 10 Copyright © 2026 by ASME for bim-based building energy performance simulations us- ing semantic textual similarity and fine-tuning multilingual llm”.Journal of Building Engineering,95, p. 110312

  45. [45]

    Using large language models for the interpretation of building regulations

    Fuchs, S., Witbrock, M., Dimyadi, J., and Amor, R., 2024. “Using large language models for the interpretation of building regulations”.arXiv preprint arXiv:2407.21060

  46. [46]

    Exploring automated en- ergy optimization with unstructured building data: A multi- agent based framework leveraging large language models

    Xiao, T., and Xu, P., 2024. “Exploring automated en- ergy optimization with unstructured building data: A multi- agent based framework leveraging large language models”. Energy and Buildings, p. 114691

  47. [47]

    An llm- based digital twin for optimizing human-in-the loop sys- tems

    Yang, H., Siew, M., and Joe-Wong, C., 2024. “An llm- based digital twin for optimizing human-in-the loop sys- tems”.arXiv preprint arXiv:2403.16809. A Survey Instrument and Response Data The complete survey instrument (including all questions and answer options) and the anonymized user response dataset are available at the following links: Survey instrume...