
arxiv: 2606.25139 · v1 · pith:YWVGKEJYnew · submitted 2026-06-23 · 📡 eess.SY · cs.SY

Buildrix: An Open Platform for Sharing and Benchmarking Agentic AI Skills in Building Engineering

Pith reviewed 2026-06-25 21:40 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords: agentic AI · building engineering · benchmarking platform · open source · skill sharing · reproducible evaluation · Python package · agent harness

The pith

Buildrix is an open platform that packages agentic AI skills for building engineering into reusable, expert-verifiable units with standardized benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Buildrix to move agentic AI applications in building engineering from isolated demonstrations to shared, reusable capabilities. It combines a Python command-line package for developing and managing skills and test cases, a web-based Hub for challenges and results, and a local harness for executing workflows with external tools. Skills are stored as self-contained packages that include instructions, scripts, and dependencies, while domain experts can verify test cases and designate them as golden standards. A sympathetic reader cares because this structure promises transparent, reproducible evaluation instead of repeated one-off implementations.

Core claim

Buildrix supplies an open, community-driven platform consisting of a Python command-line package, a web-based Hub, and a local agent harness that together allow standardized skills to be developed, published, installed, executed, and evaluated through expert-verified quantitative test cases promoted to golden benchmarks for building engineering tasks.

What carries the argument

The standardized self-contained skill package that bundles task instructions, executable scripts, dependencies, and resources, managed across the Python package, web Hub, and local harness for validation and execution.
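As a rough illustration of what "self-contained" could mean in practice, the sketch below checks that a skill folder carries the pieces named above. The file names `SKILL.md` and `config.yaml` are taken from the paper's Figure 2 caption; the `scripts/` folder name and the `check_skill_package` helper are assumptions made for this example and are not the Buildrix package's actual API.

```python
from pathlib import Path
import tempfile

# File names from the paper's Figure 2 caption.
REQUIRED_FILES = ("SKILL.md", "config.yaml")
# Assumed folder name for executable scripts (not confirmed by the source).
ASSUMED_SCRIPT_DIR = "scripts"


def check_skill_package(folder: Path) -> list:
    """Return a list of problems; an empty list means the folder looks complete."""
    problems = []
    for name in REQUIRED_FILES:
        if not (folder / name).is_file():
            problems.append(f"missing file: {name}")
    script_dir = folder / ASSUMED_SCRIPT_DIR
    if not script_dir.is_dir() or not any(script_dir.iterdir()):
        problems.append(f"no executable scripts under {ASSUMED_SCRIPT_DIR}/")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skill = Path(tmp) / "resstock-building-generation"
        skill.mkdir()
        (skill / "SKILL.md").write_text("---\nname: demo\n---\nInstructions.\n")
        # Reports the missing config.yaml and the missing scripts folder.
        print(check_skill_package(skill))
```

A check of this kind would naturally run before publishing, so that an incomplete package is rejected locally rather than after upload.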

If this is right

  • Skills become reusable packages that developers can publish, install, and manage through the Python command-line package.
  • The web Hub organizes open challenges, collects reviews, and displays benchmark results across skills.
  • Expert-verified quantitative test cases are promoted to golden standards that support consistent, reproducible evaluation.
  • The local harness enables agents to discover skills, provision external tools, and run multi-step building engineering workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use could accumulate a library of verified skills that reduce duplication when automating building control or design tasks.
  • The same packaging and verification model could be applied to agentic AI in related engineering domains if the components prove workable.
  • Public challenges on the Hub could surface which agent architectures reliably handle specific building workflow problems.

Load-bearing premise

The described components will be implemented and domain experts will adopt them to verify test cases and generate reproducible benchmarks.

What would settle it

No functional Python package, web Hub, or local harness is released, or no community skills with expert-verified golden test cases appear for benchmarking.

Figures

Figures reproduced from arXiv: 2606.25139 by Bing Dong, Zixin Jiang.

Figure 1: Overall architecture and workflow of Buildrix. Contributors create and submit challenges, skills, and test cases through the Buildrix package; the Buildrix Hub organizes, reviews, and benchmarks submitted artifacts; and end users install selected skills and execute them through an agent harness for real-world building engineering tasks.
Figure 2: Anatomy of the resstock-building-generation skill package. A skill is a self-contained folder. SKILL.md carries a short discovery prompt in its YAML frontmatter, which is loaded at startup to tell the agent when to use the skill, and a Markdown instruction body, which is loaded only after the skill is selected to tell the agent how to use it. config.yaml declares the external toolchain (EnergyPlus, OpenStu…
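The two-stage loading described in this caption can be sketched as follows. This is a minimal illustration, not the Buildrix harness: the frontmatter keys `name` and `description`, the `SkillDoc` type, and both helper functions are assumptions, and the parser handles only flat `key: value` lines rather than full YAML.

```python
from dataclasses import dataclass


@dataclass
class SkillDoc:
    frontmatter: dict  # short discovery metadata, loaded at startup
    body: str          # full instructions, loaded only after selection


def split_skill_md(text: str) -> SkillDoc:
    """Split a SKILL.md string into frontmatter and Markdown body.

    Simplification: only flat `key: value` lines are parsed, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return SkillDoc({}, text)
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return SkillDoc({}, text)
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return SkillDoc(meta, "\n".join(lines[end + 1:]).strip())


def startup_context(docs: list) -> str:
    """Only the discovery prompts enter the agent's context at startup."""
    return "\n".join(
        f"{d.frontmatter.get('name', '?')}: {d.frontmatter.get('description', '')}"
        for d in docs
    )


# Hypothetical file content; the description text is a placeholder.
EXAMPLE = """---
name: resstock-building-generation
description: Placeholder one-line discovery prompt.
---
# Instructions
Step-by-step guidance for the agent goes here.
"""

doc = split_skill_md(EXAMPLE)
print(startup_context([doc]))  # what the agent sees before selecting a skill
print(doc.body)                # what is added once the skill is selected
```

Keeping the body out of the startup context is what lets the library grow without the prompt growing at the same rate.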
Figure 3: End-to-end retrofit analysis for an eight-building sample of the Onondaga County, NY residential stock, all values from EnergyPlus. (a) Syracuse 2024 hourly dry-bulb temperature over the summer, with the five identified heat-wave events shaded and peak temperatures labelled. (b) Whole-building annual site energy, baseline vs. retrofit, with per-building percentage savings. (c) Peak cooling demand during he…
Figure 4: Automatic code revision for schema mismatch.
Figure 5: Context efficiency and compositional scalability of the skill library. (a) Per-skill token usage versus full-body and script token footprints. (b) Startup context growth versus loading all at startup, together with the theoretical number of composable workflows 2^N − 1 as the library size increases.
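The workflow count in the Figure 5 caption, read here as 2^N − 1, is the number of non-empty subsets of a library of N skills. The sketch below checks the closed form against brute-force enumeration. Treating a workflow as an unordered subset is this example's interpretation of the formula, and the skill names are placeholders.

```python
from itertools import combinations


def count_nonempty_subsets(n_skills: int) -> int:
    """Closed form: each skill is either in or out, minus the empty set."""
    return 2 ** n_skills - 1


def enumerate_nonempty_subsets(skills: list) -> list:
    """Brute-force enumeration, practical only for small libraries."""
    return [c for r in range(1, len(skills) + 1) for c in combinations(skills, r)]


skills = ["skill-a", "skill-b", "skill-c"]  # placeholder names
assert len(enumerate_nonempty_subsets(skills)) == count_nonempty_subsets(len(skills))
print(count_nonempty_subsets(3), count_nonempty_subsets(10))  # 7 1023
```

The exponential growth is the reason the count is called theoretical: the number of possible combinations quickly exceeds what any agent could meaningfully choose between.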
Original abstract

Agentic AI offers significant potential to automate complex building-engineering workflows. However, most existing applications remain isolated proof-of-concept demonstrations and lack reusable domain capabilities, human-verified evaluation cases, and standardized benchmarking infrastructure. This study presents Buildrix, an open, community-driven platform for developing, sharing, executing, and evaluating agentic AI skills for building engineering. Buildrix integrates three components: a Python command-line package for developing, validating, publishing, installing, and managing skills and test cases; a web-based Hub for organizing open challenges, reusable skills, test cases, reviews, and benchmark results; and a local agent harness that supports skill discovery, external toolchain provisioning, progressive context loading, and multi-step workflow execution. Buildrix skills are organized as standardized, self-contained packages containing task instructions, executable scripts, dependencies, and supporting resources. Quantitative test cases can be verified by domain experts and promoted to golden test cases for reproducible benchmark evaluation. Buildrix provides an open foundation for reusable capability development, transparent evaluation, and community-driven advancement of agentic AI in building engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Buildrix as an open, community-driven platform for agentic AI skills in building engineering. It integrates a Python command-line package for skill and test-case management, a web-based Hub for challenges and benchmarks, and a local agent harness for execution. Skills are standardized self-contained packages, with quantitative test cases verifiable by domain experts as golden cases for reproducible evaluation. The central claim is that Buildrix supplies an open foundation for reusable capability development, transparent evaluation, and community advancement.

Significance. If the described components were implemented, adopted, and used to produce verified benchmarks, the platform could address the isolation of current proof-of-concept agentic AI applications in the domain. The manuscript, however, contains only high-level descriptions of intended functionality with no code, artifacts, execution traces, benchmark numbers, or adoption data, so any significance remains prospective.

Major comments (2)
  1. [Abstract] The claim that Buildrix 'provides an open foundation for reusable capability development, transparent evaluation, and community-driven advancement' is not supported by any implementation details, source code, example skill packages, test-case artifacts, or usage metrics. The three components (Python package, web Hub, local harness) are described at the level of intended workflows only.
  2. [Abstract] The assertion that 'quantitative test cases can be verified by domain experts and promoted to golden test cases for reproducible benchmark evaluation' rests on the unverified premise that the platform will be built and used; no mechanism, example workflow, or verification process is demonstrated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. The paper introduces Buildrix as a platform design to address fragmentation in agentic AI for building engineering, with the three components described through their architecture and workflows. We respond point-by-point to the major comments below, noting where revisions can clarify the scope of the current contribution.

Point-by-point responses
  1. Referee: [Abstract] The claim that Buildrix 'provides an open foundation for reusable capability development, transparent evaluation, and community-driven advancement' is not supported by any implementation details, source code, example skill packages, test-case artifacts, or usage metrics. The three components (Python package, web Hub, local harness) are described at the level of intended workflows only.

    Authors: We agree that the manuscript presents the platform at the level of design and intended workflows rather than including embedded source code, example packages, or usage metrics. This is consistent with system-description papers that focus on architecture to enable community adoption. The standardized skill package format, Hub organization, and harness capabilities are specified in sufficient detail to define the foundation. We will revise the abstract and add a short implementation-status paragraph to make this scope explicit. revision: partial

  2. Referee: [Abstract] The assertion that 'quantitative test cases can be verified by domain experts and promoted to golden test cases for reproducible benchmark evaluation' rests on the unverified premise that the platform will be built and used; no mechanism, example workflow, or verification process is demonstrated.

    Authors: The verification mechanism is described as part of the skill-package structure and the Hub's review workflow: quantitative test cases are bundled with skills, domain experts can review them via the web interface, and accepted cases become golden references for benchmark runs executed by the local harness. While the manuscript does not include a live example workflow, the design specifies how this process operates. We will revise the abstract to frame this as a defined capability of the platform rather than a demonstrated outcome. revision: partial
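The review-then-benchmark flow that the authors describe can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's implementation: the `QuantCase` type, the `promote` and `benchmark` helpers, the relative-tolerance comparison, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
import math


@dataclass
class QuantCase:
    name: str
    expected: float                    # reference value for the quantity
    rel_tol: float                     # allowed relative deviation
    verified_by: Optional[str] = None  # reviewer id once an expert signs off

    @property
    def is_golden(self) -> bool:
        return self.verified_by is not None


def promote(case: QuantCase, reviewer: str) -> QuantCase:
    """Expert review turns a submitted case into a golden case."""
    return QuantCase(case.name, case.expected, case.rel_tol, verified_by=reviewer)


def benchmark(cases: list, results: dict) -> dict:
    """Score only golden cases; unverified ones are skipped."""
    return {
        c.name: math.isclose(results[c.name], c.expected, rel_tol=c.rel_tol)
        for c in cases
        if c.is_golden and c.name in results
    }


draft = QuantCase("annual_site_energy", expected=100.0, rel_tol=0.05)  # made-up values
golden = promote(draft, reviewer="reviewer-1")
print(benchmark([draft], {"annual_site_energy": 103.0}))   # {} (not yet verified)
print(benchmark([golden], {"annual_site_energy": 103.0}))  # {'annual_site_energy': True}
```

Separating promotion from scoring keeps unreviewed cases out of benchmark results while still letting contributors submit them.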

Circularity Check

0 steps flagged

No circularity: platform description without derivations or predictions

Full rationale

The paper is a descriptive announcement of a proposed platform (Python package, web Hub, local harness) with no equations, no quantitative predictions, no fitted parameters, and no derivation chain of any kind. The central claim is a statement of intended functionality and community benefit rather than a result obtained from internal logic or self-referential steps. No self-citations, ansatzes, or uniqueness theorems appear in any load-bearing role. This is a normal non-finding for a platform paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical claims, fitted parameters, axioms, or new postulated entities are present in the provided text.

pith-pipeline@v0.9.1-grok · 5718 in / 1065 out tokens · 28473 ms · 2026-06-25T21:40:12.865973+00:00 · methodology

