Reinforcement learning to improve large language model-based automated code compliance systems

Jack Wei Lun Shi; Justin K.W. Yeoh; Leong Hien Poh; Minghao Dang; Wawan Solihin

arxiv: 2606.22402 · v1 · pith:JEW367K5new · submitted 2026-06-21 · cs.SE · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-06-26 10:08 UTC · model grok-4.3

classification cs.SE · cs.AI · cs.CL · cs.LG

keywords automated code compliance · large language models · reinforcement learning · code skeletons · building regulations · supervised fine-tuning · group relative policy optimization · GRPO

The pith

P4IR combines supervised fine-tuning with group relative policy optimization to produce more accurate code skeletons for automated building code compliance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents P4IR as a two-stage process that first uses supervised fine-tuning to embed domain knowledge about building regulations into an LLM, then applies Group Relative Policy Optimization to refine the high-level code skeletons it generates. This yields measurable gains: tree edit distance drops by up to 23.8 percent and token-level Levenshtein distance by up to 38.6 percent versus SFT-only baselines. In zero-shot use the resulting model also surpasses several frontier LLMs that rely on few-shot prompting, and the GRPO stage produces a small but statistically significant drop in false positives. The authors position the method as a route to more reliable LLM-based automated code compliance systems.

Core claim

P4IR demonstrates that applying Group Relative Policy Optimization after supervised fine-tuning lets an LLM generate higher-fidelity intermediate code skeletons for building regulation compliance, cutting structural and token-level edit distances relative to SFT alone while also lowering false positives and outperforming leading LLMs in zero-shot settings.

What carries the argument

The P4IR two-stage framework: supervised fine-tuning to instill domain knowledge, followed by Group Relative Policy Optimization (GRPO) to optimize the generated code skeletons directly against domain-specific objectives.
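
The group-relative step in GRPO can be illustrated with a minimal Python sketch. It assumes the commonly described GRPO formulation, in which several completions are sampled for one prompt and each reward is normalized by the mean and standard deviation of its own group. The function names, the `distance_reward` mapping, and the sample distances are hypothetical; the paper's actual reward system (Figure 3) is not reproduced on this page.

```python
from statistics import mean, pstdev


def distance_reward(distance, scale=10.0):
    """Hypothetical reward: a smaller edit distance gives a larger reward in (0, 1]."""
    return 1.0 / (1.0 + distance / scale)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical edit distances for four skeletons sampled from the same prompt.
distances = [2, 5, 5, 12]
rewards = [distance_reward(d) for d in distances]
advantages = group_relative_advantages(rewards)

for d, a in zip(distances, advantages):
    print(f"distance={d:>2}  advantage={a:+.3f}")
```

Because the advantages are centered within the group, completions closer to the reference than their siblings get a positive signal and the rest get a negative one, without a separately trained value model.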

If this is right

Lower tree edit and Levenshtein distances produce code skeletons that better preserve both structure and semantics of the target regulations.
Zero-shot performance exceeding few-shot frontier models reduces the need for prompt engineering when deploying the system.
The GRPO stage's reduction in false positives improves the precision of the downstream compliance checker.
Direct optimization for domain objectives via GRPO offers a general recipe for improving LLM outputs on other structured generation tasks in regulatory domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SFT-plus-GRPO pipeline could be tested on generating full executable compliance rules rather than skeletons alone.
Integration of the improved skeletons into an actual design-checking engine would provide a direct test of whether the distance metrics correlate with practical compliance accuracy.
The method may transfer to other code-generation settings where an intermediate structured representation must match a regulatory or specification template.

Load-bearing premise

Reductions in edit distance between generated code skeletons and reference skeletons will produce higher accuracy when those skeletons are used to check actual building designs against real regulations.

What would settle it

An end-to-end evaluation that runs the generated skeletons on a held-out set of real building designs and regulations and measures the rate of correct compliance decisions against a human-annotated ground truth.
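
A minimal sketch of how such an end-to-end evaluation could be scored is given below. It assumes each compliance decision is a boolean where `True` means the checker flagged a violation, so a false positive is a flag raised on a design the annotators judged compliant. The function name and the sample decisions are hypothetical and do not come from the paper.

```python
def compliance_metrics(predicted, ground_truth):
    """Compare checker decisions with human-annotated decisions.

    Both arguments are equal-length lists of booleans; True means
    "violation flagged" (an assumption made for this sketch).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")

    pairs = list(zip(predicted, ground_truth))
    correct = sum(p == g for p, g in pairs)
    false_positives = sum(p and not g for p, g in pairs)
    negatives = sum(not g for _, g in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    false_positive_rate = false_positives / negatives if negatives else 0.0
    return {"accuracy": accuracy, "false_positive_rate": false_positive_rate}


# Hypothetical decisions for four design/regulation pairs.
predicted = [True, False, True, False]
ground_truth = [True, False, False, False]
metrics = compliance_metrics(predicted, ground_truth)
print(metrics)
```

Reporting these numbers alongside the edit distances would show directly whether skeletons that are closer to the reference also produce more correct compliance decisions.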

Figures

Figures reproduced from arXiv: 2606.22402 by Jack Wei Lun Shi, Justin K.W. Yeoh, Leong Hien Poh, Minghao Dang, Wawan Solihin.

**Figure 2.** Overall pipeline of the proposed framework.

**Figure 3.** Reward system for the proposed framework.

**Figure 4.** Visualization of (a) tree edit distance and (b) token-level Levenshtein distance measures

**Figure 5.** Prompts relating to (a) zero-shot setting in the proposed framework and (b) few-shot setting

**Figure 6.** (a) Average tree edit distance and (b) average token-level Levenshtein distance. Lower

**Figure 7.** Smoothed average logit entropy during generation.

**Figure 8.** (a) CodeBERTScore Precision and (b) CodeBERTScore Recall.

**Figure 9.** Example pertaining to travel distance, showing (a) the reference, (b) the SFT-generated,

**Figure 10.** Example pertaining to one-way escape arrangements, showing (a) the reference, (b) the

**Figure 11.** Example pertaining to underground pedestrian linkages, showing (a) the reference, (b) the

**Figure 12.** Example pertaining to shadow area calculation, showing (a) the reference, (b) the SFT

Original abstract

Large language model (LLM)-based approaches for automated code compliance (ACC) of building regulations are prone to generating incorrect and hallucinated computer-processable rules. This paper introduces P4IR, a two-stage framework that uses supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to improve the accuracy of the generated intermediate representations in the form of high-level code skeletons. The framework achieved reductions of up to 23.8% and 38.6% in tree edit distance and token-level Levenshtein distance respectively, relative to the SFT baselines. Comparative analysis demonstrates that this approach in a zero-shot setting outperforms leading LLMs in both code structure and semantics, specifically Claude Opus and Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7, evaluated via few-shot prompting. Additionally, the GRPO stage produced a small yet statistically significant reduction in false positives. By combining SFT with GRPO to optimize directly for domain-specific objectives, this approach offers a path toward more accurate and reliable LLM-based ACC systems.
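
The percentage reductions quoted in the abstract are relative to the SFT baseline. A small sketch of that arithmetic follows; the function name and the two distance values are invented purely to illustrate the calculation and are not the paper's data.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Illustrative values only: a baseline average distance of 10.0 falling to 7.62.
reduction = relative_reduction(10.0, 7.62)
print(f"{reduction:.1f}% lower than the baseline")
```

A relative figure of this kind says nothing about the absolute size of the distances, which is one reason the underlying averages matter when interpreting the headline numbers.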

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

P4IR adds GRPO after SFT to cut edit distances on building-code skeletons by up to 23.8 percent (tree edit distance) and 38.6 percent (token-level Levenshtein distance), but never checks whether those skeletons actually improve compliance checks on real designs.

The paper's concrete contribution is the P4IR pipeline: supervised fine-tuning followed by Group Relative Policy Optimization to refine high-level code skeletons for building regulations. It reports up to 23.8 percent lower tree edit distance and 38.6 percent lower token Levenshtein distance versus the SFT-only baseline, plus a small statistically significant drop in false positives. In zero-shot use the resulting model also beats several frontier LLMs that were given few-shot prompts.

The GRPO step is the part that is new in this domain. Most prior LLM work on code compliance stops at supervised fine-tuning or prompt engineering; optimizing the policy directly against a domain-specific distance on the intermediate representation is a reasonable next step and the numbers show it moves the needle on the chosen proxies.

The main weakness is that the claimed benefit for automated code compliance systems rests on an untested assumption. Lower edit distance to reference skeletons is treated as evidence of better compliance checking, yet the paper supplies no end-to-end evaluation on actual building designs against real regulations. The false-positive reduction is measured only on the skeleton task. Without that link, the headline percentages do not yet demonstrate improved ACC accuracy.

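Dataset size, exclusion criteria, and exact baseline implementations are also not described in enough detail to let a reader reproduce or stress-test the distance calculations. Comparing the fine-tuned model in a zero-shot setting against frontier models given few-shot prompts is a further, minor fairness issue, since the two sides are evaluated under different prompting conditions.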

This is useful reading for anyone already working on LLM pipelines for regulated engineering domains who wants to try policy optimization on intermediate artifacts. It is not yet strong enough to change practice. I would send it to peer review so the proxy-metric gap and reproducibility details can be addressed, but I would not cite it until the end-to-end results appear.

Referee Report

2 major / 2 minor

Summary. The paper introduces P4IR, a two-stage framework that applies supervised fine-tuning (SFT) to instill domain knowledge in an LLM, followed by Group Relative Policy Optimization (GRPO) to refine high-level code skeletons for LLM-based automated code compliance (ACC) checking of building regulations. It reports up to 23.8% and 38.6% reductions in tree edit distance and token-level Levenshtein distance versus SFT baselines, zero-shot outperformance over Claude Opus/Sonnet 4.5, GPT-5.2, Qwen-3-Max, and GLM-4.7 (evaluated few-shot), and a small but statistically significant false-positive reduction after GRPO.

Significance. If the proxy improvements in skeleton fidelity demonstrably increase end-to-end compliance accuracy on real building designs, the work would provide a concrete route to more reliable LLM-based ACC by directly optimizing for domain-specific structural and semantic objectives via GRPO. The absence of that linkage currently confines the contribution to intermediate metrics.

major comments (2)

[Abstract and Evaluation] The central claim is that P4IR improves LLM-based ACC systems. This requires that lower tree edit distance and Levenshtein distance on generated skeletons produce higher accuracy when the skeletons are used to check actual building designs against regulations. The manuscript reports only skeleton-reference comparisons plus an unspecified false-positive reduction; no end-to-end evaluation on real designs is supplied, leaving the headline performance numbers without demonstrated connection to the stated goal.
[Methods / Experimental setup] The abstract states concrete percentage reductions and statistical significance, yet supplies no information on dataset size, exclusion rules, baseline implementation details, or how the distance metrics were computed. These omissions prevent verification that the reported numbers support the comparative claims against SFT and other LLMs.

minor comments (2)

[Abstract] Model names such as 'GPT-5.2' and 'Sonnet 4.5' are non-standard; clarify the exact versions or checkpoints used.
[Evaluation] Notation: 'token-level Levenshtein distance' and 'tree edit distance' should be defined with explicit formulas or references on first use to ensure reproducibility.
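
For reference, one common definition of token-level Levenshtein distance is the minimum number of token insertions, deletions, and substitutions needed to turn one token sequence into another. The sketch below assumes whitespace tokenization, which may differ from the tokenizer and any normalization the authors used; the function name and the sample skeleton strings are illustrative only.

```python
def token_levenshtein(reference, candidate):
    """Edit distance between two strings, counted in whitespace-separated tokens."""
    ref = reference.split()
    cand = candidate.split()

    # prev[j] holds the distance between the processed prefix of ref and cand[:j].
    prev = list(range(len(cand) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, cand_tok in enumerate(cand, start=1):
            substitution_cost = 0 if ref_tok == cand_tok else 1
            curr.append(min(
                prev[j] + 1,                      # delete ref_tok
                curr[j - 1] + 1,                  # insert cand_tok
                prev[j - 1] + substitution_cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical skeleton lines that differ in a single token.
reference = "if travel_distance > max_distance : fail"
candidate = "if travel_distance >= max_distance : fail"
print(token_levenshtein(reference, candidate))
```

Stating the tokenizer and whether the distance is normalized by sequence length would make reported values comparable across papers.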

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below.

Referee: [Abstract and Evaluation] The central claim is that P4IR improves LLM-based ACC systems. This requires that lower tree edit distance and Levenshtein distance on generated skeletons produce higher accuracy when the skeletons are used to check actual building designs against regulations. The manuscript reports only skeleton-reference comparisons plus an unspecified false-positive reduction; no end-to-end evaluation on real designs is supplied, leaving the headline performance numbers without demonstrated connection to the stated goal.

Authors: The manuscript's contribution centers on improving the quality of high-level code skeletons as an intermediate representation, which the abstract and introduction identify as a key source of hallucinations in LLM-based ACC. The GRPO stage is explicitly optimized for tree edit distance and token-level Levenshtein distance to these skeletons, and the reported false-positive reduction supplies limited but statistically significant evidence of a downstream effect. We acknowledge that a full end-to-end evaluation on complete building designs would provide a stronger link to final compliance accuracy; however, the current scope demonstrates that direct optimization of domain-specific structural and semantic objectives via GRPO yields measurable gains over SFT and frontier LLMs on the skeleton task itself. The headline claims are therefore tied to skeleton fidelity rather than end-to-end accuracy, which is stated as an intended future direction.

Revision: no

Referee: [Methods / Experimental setup] The abstract states concrete percentage reductions and statistical significance, yet supplies no information on dataset size, exclusion rules, baseline implementation details, or how the distance metrics were computed. These omissions prevent verification that the reported numbers support the comparative claims against SFT and other LLMs.

Authors: We agree that the current manuscript lacks sufficient methodological detail for full reproducibility. In the revised version we will expand the Experimental Setup and Evaluation sections to report dataset size and any exclusion criteria, precise baseline prompting and fine-tuning configurations, and the exact implementations used to compute tree edit distance and token-level Levenshtein distance (including any normalization or tokenization steps).

Revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical comparisons with no derivations or self-referential predictions

The paper reports measured reductions in tree edit distance and Levenshtein distance from SFT+GRPO versus baselines, plus comparisons to other LLMs, all obtained through direct experimental evaluation on held-out data. No equations, parameter fitting presented as prediction, uniqueness theorems, or self-citation chains appear in the provided text or abstract. The central claims rest on observable performance deltas against external references rather than any reduction to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.1-grok · 5755 in / 1185 out tokens · 26658 ms · 2026-06-26T10:08:38.601848+00:00 · methodology


[1] [1]

https://www1.bca.gov.sg/ about-us/news-and-publications/media-releases/2025/01/23/ construction-demand-to-remain-strong-for-2025(accessed August 4, 2025)

Building and Construction Authority (BCA), Construction demand to remain strong for 2025, BCA (2025). https://www1.bca.gov.sg/ about-us/news-and-publications/media-releases/2025/01/23/ construction-demand-to-remain-strong-for-2025(accessed August 4, 2025)

2025

[2] [2]

Zheng, J

Z. Zheng, J. Han, K.-Y . Chen, X.-Y . Cao, X.-Z. Lu, J.-R. Lin, Translating regulatory clauses into executable codes for building design checking via large language model driven function matching and composing, Eng. Appl. Artif. Intell. 163 (2026) 112823. https://doi.org/10. 1016/j.engappai.2025.112823

arXiv 2026

[3] [3]

Zhang, N

R. Zhang, N. El-Gohary, Hierarchical representation and deep learning–based method for auto- matically transforming textual building codes into semantic computable requirements, J. Com- put. Civ. Eng. 36 (2022) 04022022. https://doi.org/10.1061/(ASCE)CP.1943-5487. 0001014

work page doi:10.1061/(asce)cp.1943-5487 2022

[4] [4]

Fuchs, J

S. Fuchs, J. Dimyadi, M. Witbrock, R. Amor, Intermediate representations to improve the semantic parsing of building regulations, Adv. Eng. Inform. 62 (2024) 102735. https://doi. org/10.1016/j.aei.2024.102735

work page doi:10.1016/j.aei.2024.102735 2024

[5] [5]

Zhang, How can ChatGPT help in automated building code compliance checking?, Int

J. Zhang, How can ChatGPT help in automated building code compliance checking?, Int. Symp. Autom. Robot. Constr. ISARC Proc. 2023 Proceedings of the 40th ISARC, Chennai, India (2023) 63–70.https://doi.org/10.22260/ISARC2023/0011

work page doi:10.22260/isarc2023/0011 2023

[6] [6]

Fuchs, M

S. Fuchs, M. Witbrock, J. Dimyadi, R. Amor, Using large language models for the interpretation of building regulations, (2024).https://doi.org/10.48550/arXiv.2407.21060

work page doi:10.48550/arxiv.2407.21060 2024

[7] [7]

S. Lin, J. Hilton, O. Evans, TruthfulQA: measuring how models mimic human falsehoods, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proc. 60th Annu. Meet. Assoc. Comput. Linguist. V ol. 1 Long Pap., Association for Computational Linguistics, Dublin, Ireland, 2022: pp. 3214–3252.https://doi.org/10.18653/v1/2022.acl-long.229

work page doi:10.18653/v1/2022.acl-long.229 2022

[8] [8]

Huang, W

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions, ACM Trans. Inf. Syst. 43 (2025) 1–55. https://doi.org/10.1145/ 3703155

2025

[9] [9]

S. Fuchs, M. Witbrock, J. Dimyadi, R. Amor, Neural semantic parsing of building regulations for compliance checking, IOP Conf. Ser. Earth Environ. Sci. 1101 (2022) 092022. https://doi.org/10.1088/1755-1315/1101/9/092022

[10] [10]

K. Tian, E. Mitchell, H. Yao, C. Manning, C. Finn, Fine-tuning language models for factuality, in: NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023. https://openreview.net/forum?id=kEK08VdSO5 (accessed November 10, 2025)

[11] [11]

P. Roit, J. Ferret, L. Shani, R. Aharoni, G. Cideron, R. Dadashi, M. Geist, S. Girgin, L. Hussenot, O. Keller, N. Momchev, S. Ramos Garea, P. Stanczyk, N. Vieillard, O. Bachem, G. Elidan, A. Hassidim, O. Pietquin, I. Szpektor, Factually consistent summarization via reinforcement learning with textual entailment feedback, in: A. Rogers, J. Boyd-Graber, N. ...

work page doi:10.18653/v1/2023.acl-long.344 2023

[12] [12]

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y.K. Li, Y. Wu, D. Guo, DeepSeekMath: pushing the limits of mathematical reasoning in open language models, (2024). https://doi.org/10.48550/arXiv.2402.03300

[13] [13]

J.W.L. Shi, W. Solihin, J.K.W. Yeoh, Fine-tuning a large language model for automated code compliance of building regulations, Adv. Eng. Inform. 68 (2025) 103676. https://doi.org/10.1016/j.aei.2025.103676

[14] [15]

C. Eastman, J. Lee, Y. Jeong, J. Lee, Automatic rule-based checking of building designs, Autom. Constr. 18 (2009) 1011–1033. https://doi.org/10.1016/j.autcon.2009.07.002

[15] [16]

E. Hjelseth, N. Nisbet, Capturing normative constraints by use of semantic mark-up RASE methodology, Proc. CIB W78-W102 Conf. (2011) pp. 1–10. https://itc.scix.net/pdfs/w78-2011-Paper-45.pdf

[16] [17]

I. Fitkau, T. Hartmann, An ontology-based approach of automatic compliance checking for structural fire safety requirements, Adv. Eng. Inform. 59 (2024) 102314. https://doi.org/10.1016/j.aei.2023.102314

[17] [18]

J.-K. Lee, C.M. Eastman, Y.C. Lee, Implementation of a BIM domain-specific language for the building environment rule and analysis, J. Intell. Robot. Syst. 79 (2015) 507–522. https://doi.org/10.1007/s10846-014-0117-7

[18] [19]

H. Kim, J.-K. Lee, J. Shin, J. Choi, Visual language approach to representing KBimCode-based Korea building code sentences for automated rule checking, J. Comput. Des. Eng. 6 (2019) 143–148. https://doi.org/10.1016/j.jcde.2018.08.002

[19] [20]

J. Zhang, N.M. El-Gohary, Semantic NLP-based information extraction from construction regulatory documents for automated compliance checking, J. Comput. Civ. Eng. 30 (2016) 04015014. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000346

[20] [21]

J. Song, J.-K. Lee, J. Choi, I. Kim, Deep learning-based extraction of predicate-argument structure (PAS) in building design rule sentences, J. Comput. Des. Eng. 7 (2020) 563–576. https://doi.org/10.1093/jcde/qwaa046

[21] [22]

D. Guo, E. Onstein, A.D.L. Rosa, A semantic approach for automated rule compliance checking in construction industry, IEEE Access 9 (2021) 129648–129660. https://doi.org/10.1109/ACCESS.2021.3108226

[22] [23]

Z. Zheng, Y.-C. Zhou, X.-Z. Lu, J.-R. Lin, Knowledge-informed semantic alignment and rule interpretation for automated compliance checking, Autom. Constr. 142 (2022) 104524. https://doi.org/10.1016/j.autcon.2022.104524

[23] [24]

J. Peng, X. Liu, Automated code compliance checking research based on BIM and knowledge graph, Sci. Rep. 13 (2023) 7065. https://doi.org/10.1038/s41598-023-34342-1

[24] [25]

P. Zhou, N. El-Gohary, Ontology-based automated information extraction from building energy conservation codes, Autom. Constr. 74 (2017) 103–117. https://doi.org/10.1016/j.autcon.2016.09.004

[25] [26]

J. Zhang, N.M. El-Gohary, Integrating semantic NLP and logic reasoning into a unified system for fully-automated code checking, Autom. Constr. 73 (2017) 45–57. https://doi.org/10.1016/j.autcon.2016.08.027

[26] [27]

J. Wu, X. Xue, J. Zhang, Invariant signature, logic reasoning, and semantic natural language processing (NLP)-based automated building code compliance checking (I-SNACC) framework, J. Inf. Technol. Constr. 28 (2023) 1–18. https://doi.org/10.36680/j.itcon.2023.001

[27] [28]

N. Wang, R.R.A. Issa, C.J. Anumba, NLP-based query-answering system for information extraction from building information models, J. Comput. Civ. Eng. 36 (2022) 04022004. https://doi.org/10.1061/(ASCE)CP.1943-5487.0001019

[28] [29]

O. Okonkwo, Leveraging word embeddings and transformers to extract semantics from building regulations text, 11th Linked Data Archit. Constr. Workshop, June 11–15, 2023, Matera, Italy (2023). https://ceur-ws.org/Vol-3633/paper14.pdf

[29] [30]

Z. Zheng, Y.-C. Zhou, K.-Y. Chen, X.-Z. Lu, Z.-T. She, J.-R. Lin, A text classification-based approach for evaluating and enhancing the machine interpretability of building codes, Eng. Appl. Artif. Intell. 127 (2024) 107207. https://doi.org/10.1016/j.engappai.2023.107207

[30] [31]

S. Iranmanesh, H. Saadany, E. Vakaj, LLM-assisted graph-RAG information extraction from IFC data, in: Proceedings of the 2025 European Conference on Computing in Construction, Porto, Portugal, 2025: pp. 263–270. https://doi.org/10.35490/EC3.2025.366

[31] [32]

X. Xue, J. Zhang, Y. Chen, Question-answering framework for building codes using fine-tuned and distilled pre-trained transformer models, Autom. Constr. 168 (2024) 105730. https://doi.org/10.1016/j.autcon.2024.105730

[32] [33]

B. Zhong, W. He, Z. Huang, P.E.D. Love, J. Tang, H. Luo, A building regulation question answering system: a deep learning methodology, Adv. Eng. Inform. 46 (2020) 101195. https://doi.org/10.1016/j.aei.2020.101195

[33] [34]

H. Ying, R. Sacks, From automatic to autonomous: a large language model-driven approach for generic building compliance checking, Proc. 41st Int. Conf. CIB W78 Marrakech Moroc. 2-3 Oct. ISSN 2706-6568 (2024). http://itc.scix.net/paper/w78-2024-59

[34] [35]

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C.L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, (2022). https://doi.org/10.48550/arXiv.2203.02155

[35] [36]

Y. Zhai, H. Zhang, Y. Lei, Y. Yu, K. Xu, D. Feng, B. Ding, H. Wang, Uncertainty-penalized reinforcement learning from human feedback with diverse reward LoRA ensembles, (2023). https://doi.org/10.48550/arXiv.2401.00243

[36] [37]

D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z.F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. Gu...

work page doi:10.1038/s41586-025-09422-z 2025

[37] [38]

Y . Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, R. Dufour, BioMistral: a collection of open-source pretrained large language models for medical domains, (2024). https://doi.org/10.48550/arXiv.2402.10373

[38] [39]

B. Lefort, E. Benhamou, J.-J. Ohana, D. Saltiel, B. Guez, Optimizing performance: how compact models match or exceed GPT’s classification capabilities through fine-tuning, (2024). https://doi.org/10.48550/arXiv.2409.11408

[39] [40]

Y. Lai, J. Zhong, M. Li, S. Zhao, X. Yang, Med-R1: reinforcement learning for generalizable medical reasoning in vision-language models, (2025). https://doi.org/10.48550/arXiv.2503.13939

[40] [41]

A.P. Gema, A. Hägele, R. Chen, A. Arditi, J. Goldman-Wetzler, K. Fraser-Taliente, H. Sleight, L. Petrini, J. Michael, B. Alex, P. Minervini, Y. Chen, J. Benton, E. Perez, Inverse scaling in test-time compute, (2025). https://doi.org/10.48550/arXiv.2507.14417

[41] [42]

R. Zhang, N. El-Gohary, Clustering-based approach for building code computability analysis, J. Comput. Civ. Eng. 35 (2021) 04021021. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000967

[42] [43]

X. Xue, J. Zhang, Regulatory information transformation ruleset expansion to support automated building code compliance checking, Autom. Constr. 138 (2022) 104230. https://doi.org/10.1016/j.autcon.2022.104230

[43] [44]

N.N. Minh, A. Baker, C. Neo, A.G. Roush, A. Kirsch, R. Shwartz-Ziv, Turning up the heat: min-p sampling for creative and coherent LLM outputs, in: 2024. https://openreview.net/forum?id=FBkpCyujtS (accessed January 26, 2026)

[44] [45]

M. Uhm, J. Kim, S. Ahn, H. Jeong, H. Kim, Effectiveness of retrieval augmented generation-based large language models for generating construction safety information, Autom. Constr. 170 (2025) 105926. https://doi.org/10.1016/j.autcon.2024.105926

[45] [46]

J.W.L. Shi, M. Dang, W. Solihin, J.K.W. Yeoh, LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance, in: The 21st International Conference on Computing in Civil and Building Engineering (ICCCBE 2026), Taipei, Taiwan, 2026. https://doi.org/10.48550/arXiv.2604.15589