arxiv: 2606.22402 · v1 · pith:JEW367K5new · submitted 2026-06-21 · 💻 cs.SE · cs.AI · cs.CL · cs.LG

Reinforcement learning to improve large language model-based automated code compliance systems

Pith reviewed 2026-06-26 10:08 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL · cs.LG
keywords automated code compliance · large language models · reinforcement learning · code skeletons · building regulations · supervised fine-tuning · group relative policy optimization · GRPO

The pith

P4IR combines supervised fine-tuning with group relative policy optimization to produce more accurate code skeletons for automated building code compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents P4IR as a two-stage process that first uses supervised fine-tuning to embed domain knowledge about building regulations into an LLM, then applies Group Relative Policy Optimization to refine the high-level code skeletons it generates. This yields measurable gains: tree edit distance drops by up to 23.8 percent and token-level Levenshtein distance by up to 38.6 percent versus SFT-only baselines. In zero-shot use the resulting model also surpasses several frontier LLMs that rely on few-shot prompting, and the GRPO stage produces a small but statistically significant drop in false positives. The authors position the method as a route to more reliable LLM-based automated code compliance systems.

Core claim

P4IR demonstrates that applying Group Relative Policy Optimization after supervised fine-tuning lets an LLM generate higher-fidelity intermediate code skeletons for building regulation compliance, cutting structural and token-level edit distances relative to SFT alone while also lowering false positives and outperforming leading LLMs in zero-shot settings.

What carries the argument

The P4IR two-stage framework: supervised fine-tuning to instill domain knowledge, followed by Group Relative Policy Optimization (GRPO) to optimize the generated code skeletons directly against domain-specific objectives.
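
To make the second stage more concrete, the sketch below shows the group-relative advantage that gives GRPO its name: several completions are sampled for the same prompt, each is scored, and each score is standardized against the other scores in its group. The reward values are hypothetical placeholders, since this summary does not specify P4IR's reward function, and the choice of population standard deviation and the `eps` constant are assumptions of the sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scores for completions sampled from the SAME prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four skeletons sampled for one regulation clause.
# Higher is better, e.g. a negated, normalized distance to the reference.
rewards = [-0.40, -0.10, -0.25, -0.10]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.2f}  advantage={a:+.3f}")
```

Completions that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. No separate value network is needed, because the group itself acts as the baseline.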

If this is right

  • Lower tree edit and Levenshtein distances produce code skeletons that better preserve both structure and semantics of the target regulations.
  • Zero-shot performance exceeding few-shot frontier models reduces the need for prompt engineering when deploying the system.
  • The GRPO stage's reduction in false positives improves the precision of the downstream compliance checker.
  • Direct optimization for domain objectives via GRPO offers a general recipe for improving LLM outputs on other structured generation tasks in regulatory domains.
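
As an illustration of the token-level metric named in the first bullet, here is a minimal sketch of Levenshtein distance computed over tokens rather than characters. The regular-expression tokenizer is an assumption made for the example; the paper's actual tokenization and any normalization are not described in this summary.

```python
import re


def tokenize(code):
    """Split code into identifiers, numbers, two-character comparison
    operators, and single symbols. This tokenizer is a hypothetical
    stand-in, not the one used in the paper."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[<>=!]=|\S", code)


def token_levenshtein(a, b):
    """Minimum number of token insertions, deletions, and substitutions
    that turn token list a into token list b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


reference = tokenize("if d <= 60:")
generated = tokenize("if d < 45:")
print(token_levenshtein(generated, reference))
```

Working on tokens means that renaming one long identifier counts as a single edit, whereas a character-level distance would penalize it in proportion to its length.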

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same SFT-plus-GRPO pipeline could be tested on generating full executable compliance rules rather than skeletons alone.
  • Integration of the improved skeletons into an actual design-checking engine would provide a direct test of whether the distance metrics correlate with practical compliance accuracy.
  • The method may transfer to other code-generation settings where an intermediate structured representation must match a regulatory or specification template.
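
The second bullet's proposed test, whether the distance metrics track practical compliance accuracy, could be approached with a rank correlation. The sketch below computes Spearman's coefficient in plain Python with average ranks for ties. The two input lists are hypothetical: one distance and one downstream accuracy value per evaluated skeleton, neither of which is reported in this summary.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of the ranks).
    Raises ZeroDivisionError if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: edit distance per skeleton and the accuracy of the
# compliance checks that used that skeleton.
distances = [3, 7, 12, 20]
accuracies = [0.95, 0.90, 0.70, 0.40]
print(spearman(distances, accuracies))
```

A strongly negative coefficient would support using the distance metrics as proxies; a coefficient near zero would suggest that they do not predict downstream accuracy.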

Load-bearing premise

Reductions in edit distance between generated code skeletons and reference skeletons will produce higher accuracy when those skeletons are used to check actual building designs against real regulations.

What would settle it

An end-to-end evaluation that runs the generated skeletons on a held-out set of real building designs and regulations and measures the rate of correct compliance decisions against a human-annotated ground truth.
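
A minimal sketch of how such an evaluation could be scored, assuming each design-regulation pair yields a boolean decision where `True` means "flagged as non-compliant". That labelling convention, the function name, and the sample data are assumptions for illustration, not details from the paper.

```python
def compliance_metrics(predicted, truth):
    """Compare predicted compliance decisions with human annotations.

    predicted, truth: equal-length lists of booleans, where True means
    the design was flagged as non-compliant (an assumed convention).
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    tn = sum(not p and not t for p, t in zip(predicted, truth))
    fn = sum(not p and t for p, t in zip(predicted, truth))
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else None,
        # Share of truly compliant designs that were wrongly flagged.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else None,
    }


# Hypothetical decisions for four design-regulation pairs.
predicted = [True, False, True, False]
truth = [True, False, False, False]
print(compliance_metrics(predicted, truth))
```

Reporting the false-positive rate alongside accuracy would also make the paper's claimed false-positive reduction directly comparable across the SFT-only and SFT-plus-GRPO models.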

Figures

Figures reproduced from arXiv: 2606.22402 by Jack Wei Lun Shi, Justin K.W. Yeoh, Leong Hien Poh, Minghao Dang, Wawan Solihin.

Figure 1: Extracting the code skeleton from the rule interpretation dataset.
Figure 2: Overall pipeline of the proposed framework.
Figure 3: Reward system for the proposed framework.
Figure 4: Visualization of (a) tree edit distance and (b) token-level Levenshtein distance measures
Figure 5: Prompts relating to (a) zero-shot setting in the proposed framework and (b) few-shot setting
Figure 6: (a) Average tree edit distance and (b) average token-level Levenshtein distance. Lower ...
Figure 7: Smoothed average logit entropy during generation.
Figure 8: (a) CodeBERTScore Precision and (b) CodeBERTScore Recall.
Figure 9: Example pertaining to travel distance, showing (a) the reference, (b) the SFT-generated, ...
Figure 10: Example pertaining to one-way escape arrangements, showing (a) the reference, (b) the ...
Figure 11: Example pertaining to underground pedestrian linkages, showing (a) the reference, (b) the ...
Figure 12: Example pertaining to shadow area calculation, showing (a) the reference, (b) the SFT ...
Original abstract

Large language model (LLM)-based approaches for automated code compliance (ACC) of building regulations are prone to generating incorrect and hallucinated computer-processable rules. This paper introduces P4IR, a two-stage framework that uses supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to improve the accuracy of the generated intermediate representations in the form of high-level code skeletons. The framework achieved reductions of up to 23.8% and 38.6% in tree edit distance and token-level Levenshtein distance respectively, relative to the SFT baselines. Comparative analysis demonstrates that this approach in a zero-shot setting outperforms leading LLMs in both code structure and semantics, specifically Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7, evaluated via few-shot prompting. Additionally, the GRPO stage produced a small yet statistically significant reduction in false positives. By combining SFT with GRPO to optimize directly for domain-specific objectives, this approach offers a path toward more accurate and reliable LLM-based ACC systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces P4IR, a two-stage framework that applies supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to refine high-level code skeletons for LLM-based automated code compliance (ACC) checking of building regulations. It reports up to 23.8% and 38.6% reductions in tree edit distance and token-level Levenshtein distance versus SFT baselines, zero-shot outperformance over Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 (evaluated few-shot), and a small but statistically significant false-positive reduction after GRPO.

Significance. If the proxy improvements in skeleton fidelity demonstrably increase end-to-end compliance accuracy on real building designs, the work would provide a concrete route to more reliable LLM-based ACC by directly optimizing for domain-specific structural and semantic objectives via GRPO. The absence of that linkage currently confines the contribution to intermediate metrics.

major comments (2)
  1. [Abstract and Evaluation] Abstract and Evaluation section: The central claim is that P4IR improves LLM-based ACC systems. This requires that lower tree edit distance and Levenshtein distance on generated skeletons produce higher accuracy when the skeletons are used to check actual building designs against regulations. The manuscript reports only skeleton-reference comparisons plus an unspecified false-positive reduction; no end-to-end evaluation on real designs is supplied, leaving the headline performance numbers without demonstrated connection to the stated goal.
  2. [Methods / Experimental setup] Methods / Experimental setup: The abstract states concrete percentage reductions and statistical significance, yet supplies no information on dataset size, exclusion rules, baseline implementation details, or how the distance metrics were computed. These omissions prevent verification that the reported numbers support the comparative claims against SFT and other LLMs.
minor comments (2)
  1. [Abstract] Abstract: Model names such as 'GPT-5.2' and 'Sonnet 4.5' are non-standard; clarify the exact versions or checkpoints used.
  2. [Evaluation] Notation: 'token-level Levenshtein distance' and 'tree edit distance' should be defined with explicit formulas or references on first use to ensure reproducibility.
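
For reference, the standard textbook definitions that the second minor comment asks for can be written as follows. These are the usual unit-cost formulations; whether the paper uses different edit costs, a normalization by length, or a particular tree edit distance algorithm is not stated in this summary and is left open here.

```latex
% Token-level Levenshtein distance between token sequences
% a = (a_1, ..., a_m) and b = (b_1, ..., b_n), with unit costs.
\[
D_{i,0} = i, \qquad D_{0,j} = j,
\]
\[
D_{i,j} = \min
\begin{cases}
D_{i-1,j} + 1 & \text{(delete } a_i\text{)} \\
D_{i,j-1} + 1 & \text{(insert } b_j\text{)} \\
D_{i-1,j-1} + [a_i \neq b_j] & \text{(substitute or match)}
\end{cases}
\qquad \mathrm{Lev}(a, b) = D_{m,n}.
\]

% Tree edit distance between ordered labeled trees T_1 and T_2:
% the minimum total cost over all edit scripts S that transform
% T_1 into T_2, where each operation is a node insertion, a node
% deletion, or a node relabeling with cost \gamma.
\[
\mathrm{TED}(T_1, T_2) = \min_{S \,:\, T_1 \to T_2} \; \sum_{s \in S} \gamma(s).
\]
```

Here $[a_i \neq b_j]$ equals 1 when the tokens differ and 0 otherwise, and with unit costs $\gamma(s) = 1$ for every operation.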

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

  1. Referee: [Abstract and Evaluation] Abstract and Evaluation section: The central claim is that P4IR improves LLM-based ACC systems. This requires that lower tree edit distance and Levenshtein distance on generated skeletons produce higher accuracy when the skeletons are used to check actual building designs against regulations. The manuscript reports only skeleton-reference comparisons plus an unspecified false-positive reduction; no end-to-end evaluation on real designs is supplied, leaving the headline performance numbers without demonstrated connection to the stated goal.

    Authors: The manuscript's contribution centers on improving the quality of high-level code skeletons as an intermediate representation, which the abstract and introduction identify as a key source of hallucinations in LLM-based ACC. The GRPO stage is explicitly optimized for tree edit distance and token-level Levenshtein distance to these skeletons, and the reported false-positive reduction supplies limited but statistically significant evidence of a downstream effect. We acknowledge that a full end-to-end evaluation on complete building designs would provide a stronger link to final compliance accuracy; however, the current scope demonstrates that direct optimization of domain-specific structural and semantic objectives via GRPO yields measurable gains over SFT and frontier LLMs on the skeleton task itself. The headline claims are therefore tied to skeleton fidelity rather than end-to-end accuracy, which is stated as an intended future direction. revision: no

  2. Referee: [Methods / Experimental setup] Methods / Experimental setup: The abstract states concrete percentage reductions and statistical significance, yet supplies no information on dataset size, exclusion rules, baseline implementation details, or how the distance metrics were computed. These omissions prevent verification that the reported numbers support the comparative claims against SFT and other LLMs.

    Authors: We agree that the current manuscript lacks sufficient methodological detail for full reproducibility. In the revised version we will expand the Experimental Setup and Evaluation sections to report dataset size and any exclusion criteria, precise baseline prompting and fine-tuning configurations, and the exact implementations used to compute tree edit distance and token-level Levenshtein distance (including any normalization or tokenization steps). revision: yes
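
As an illustration of the kind of implementation detail the authors promise to report, the sketch below computes a unit-cost edit distance between ordered labeled trees using the classic rightmost-root forest recursion with memoization. It is not the authors' implementation. The tuple encoding `(label, children)` and the example node labels are hypothetical, and the memoized recursion is chosen for brevity rather than efficiency on large trees.

```python
from functools import lru_cache


def size(tree):
    """Number of nodes in a tree encoded as (label, (child, ...))."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)   # insert every remaining node
    if not g:
        return sum(size(t) for t in f)   # delete every remaining node
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    # Delete the root of the rightmost tree in f; its children move up.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the root of the rightmost tree in g.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    relabel = (forest_distance(v_children, w_children)
               + forest_distance(f[:-1], g[:-1])
               + (0 if v_label == w_label else 1))
    return min(delete, insert, relabel)


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


# Hypothetical skeleton trees that differ in one node label.
reference = ("check", (("get_distance", ()), ("compare", ())))
generated = ("check", (("get_distance", ()), ("measure", ())))
print(tree_edit_distance(generated, reference))
```

Children must be tuples so that forests are hashable and can be cached. If the skeletons are written in a language with an available parser, the trees could instead be derived from parsed syntax trees.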

Circularity Check

0 steps flagged

No circularity; purely empirical comparisons with no derivations or self-referential predictions

The paper reports measured reductions in tree edit distance and Levenshtein distance from SFT+GRPO versus baselines, plus comparisons to other LLMs, all obtained through direct experimental evaluation on held-out data. No equations, parameter fitting presented as prediction, uniqueness theorems, or self-citation chains appear in the provided text or abstract. The central claims rest on observable performance deltas against external references rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the provided text.
