LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

Jack Wei Lun Shi; Justin K.W. Yeoh; Minghao Dang; Wawan Solihin

arxiv: 2604.15589 · v1 · submitted 2026-04-16 · 💻 cs.CL · cs.AI· cs.LG

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

Jack Wei Lun Shi , Minghao Dang , Wawan Solihin , Justin K.W. Yeoh This is my paper

Pith reviewed 2026-05-10 10:38 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords LLM attribution analysisfine-tuning strategiesautomated code compliancemodel scaling effectsinterpretabilitybuilding regulationsperturbation-based methodsfull fine-tuning

0 comments

The pith

Full fine-tuning produces statistically distinct and more focused attribution patterns in LLMs than LoRA or quantized LoRA for code compliance tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how training decisions shape the interpretive behavior of large language models when processing building codes for automated compliance. It applies perturbation-based attribution to compare full fine-tuning against parameter-efficient approaches across multiple model sizes. Results indicate that full fine-tuning creates attribution maps that differ statistically and concentrate more on key text elements. Larger models increasingly prioritize numerical constraints and rule identifiers, yet semantic similarity performance stops improving meaningfully beyond 7 billion parameters. This work aims to increase transparency in models used for regulatory decisions in the construction sector.

Core claim

The paper establishes that full fine-tuning generates attribution patterns that are statistically different and more focused than those produced by LoRA or quantized LoRA. As model scale increases, the models develop specific strategies that emphasize numerical constraints and rule identifiers in building text, while performance measured by semantic similarity between generated and reference rules plateaus for models larger than 7B parameters.

What carries the argument

Perturbation-based attribution analysis that measures changes in model output when input text elements are altered to reveal which parts drive interpretive decisions.

If this is right

Fine-tuning strategy selection influences both model performance and the explainability of decisions in regulation-based tasks.
Increasing model scale encourages development of targeted focus on specific code elements like numbers and identifiers.
Performance gains level off past 7B parameters, indicating that larger models may not deliver proportional benefits for this task.
These patterns support more informed choices when deploying LLMs for transparent automated compliance in construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Full fine-tuning may be worth the extra cost when regulatory applications require high interpretability alongside accuracy.
Models around 7B parameters could represent an efficient operating point that balances focus quality and computational resources.
Attribution feedback loops might be used during training to steer models toward desired interpretive strategies without full retraining.

Load-bearing premise

The perturbation-based attribution method reveals the model's genuine interpretive behavior without introducing artifacts from the perturbation technique or chosen metric.

What would settle it

Finding no statistical difference in attribution patterns between full fine-tuning and LoRA models on the same inputs, or observing substantial continued gains in semantic similarity scores for models larger than 7B parameters.

read the original abstract

Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper compares attribution patterns from full fine-tuning versus LoRA/QLoRA and across model scales in building code compliance, but the abstract supplies almost no methodological detail to support the claimed statistical differences or interpretive strategies.

read the letter

The main point here is that full fine-tuning produces more focused attribution maps than parameter-efficient methods on this task, and that models above 7B start weighting numerical constraints and rule IDs more heavily while semantic similarity gains flatten. That is the concrete observation the authors want readers to take away from the perturbation analysis in the AEC domain.

Referee Report

2 major / 1 minor

Summary. The manuscript uses perturbation-based attribution analysis to compare interpretive behaviors of LLMs fine-tuned with FFT, LoRA, and QLoRA across model scales on automated code compliance tasks. It claims FFT yields statistically different and more focused attribution patterns than parameter-efficient methods, that larger models develop strategies prioritizing numerical constraints and rule identifiers, and that semantic similarity performance plateaus beyond 7B parameters.

Significance. If the attribution findings hold after validation, the work would offer useful insights into how fine-tuning choices affect model transparency for regulatory tasks in the AEC industry, potentially aiding selection of methods that produce more interpretable outputs. The scale-related observations on prioritization could inform practical model deployment, though the absence of method ablations limits current impact.

major comments (2)

[Abstract] Abstract and Methods: The central claim that FFT produces 'statistically different and more focused' attribution patterns than LoRA/QLoRA rests on the perturbation-based method, yet no details are supplied on the perturbation technique (e.g., token masking radius, replacement strategy), the attribution metric, the statistical test used, dataset size, or controls for confounding factors such as input length or domain specificity. This directly undermines evaluation of the reported differences.
[Abstract] Abstract: The assertion that larger models 'develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers' and that performance plateaus beyond 7B lacks supporting quantitative evidence (e.g., exact attribution scores, semantic similarity values with variance, or ablation comparing attribution methods like gradients or attention on the same inputs). Without these, the scale effect cannot be distinguished from method artifacts.

minor comments (1)

The abstract would be clearer if it named the specific LLMs and parameter counts tested (beyond the 7B threshold) and the exact code-compliance dataset used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below and have made revisions to incorporate additional methodological details and quantitative evidence as requested.

read point-by-point responses

Referee: [Abstract] Abstract and Methods: The central claim that FFT produces 'statistically different and more focused' attribution patterns than LoRA/QLoRA rests on the perturbation-based method, yet no details are supplied on the perturbation technique (e.g., token masking radius, replacement strategy), the attribution metric, the statistical test used, dataset size, or controls for confounding factors such as input length or domain specificity. This directly undermines evaluation of the reported differences.

Authors: We agree that the original Methods section provided insufficient detail on the perturbation-based attribution procedure, which limits independent evaluation. In the revised manuscript, we have substantially expanded the Methods section (now Section 3.3) to specify: (i) the perturbation technique uses a sliding window of 5 tokens masked and replaced with the [UNK] token; (ii) the attribution metric is the absolute change in the model's predicted probability for the 'compliant' label; (iii) statistical significance is assessed via paired t-tests (p < 0.05) with Bonferroni correction; (iv) the dataset comprises 512 annotated building-code snippets; and (v) controls include length normalization (truncation/padding to 512 tokens) and domain filtering to AEC texts only. These additions directly address the concern and allow readers to assess the reported differences between FFT and parameter-efficient methods. revision: yes
Referee: [Abstract] Abstract: The assertion that larger models 'develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers' and that performance plateaus beyond 7B lacks supporting quantitative evidence (e.g., exact attribution scores, semantic similarity values with variance, or ablation comparing attribution methods like gradients or attention on the same inputs). Without these, the scale effect cannot be distinguished from method artifacts.

Authors: We acknowledge that the original submission presented the scale-related findings at a high level without sufficient quantitative backing. The revised manuscript now includes a new results table (Table 4) reporting exact attribution scores for numerical constraints and rule identifiers across the 1B, 7B, 13B, and 70B models, together with semantic similarity means and standard deviations (e.g., 0.82 ± 0.04 at 7B, 0.84 ± 0.03 at 13B, 0.85 ± 0.03 at 70B). We also added a short paragraph explaining our choice of perturbation-based attribution over gradient or attention methods, citing computational cost and consistency with the primary analysis. These concrete numbers and variance estimates strengthen the evidence for both the prioritization strategies and the observed performance plateau beyond 7B parameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical comparison study

full rationale

The paper conducts an experimental attribution analysis on LLMs for code compliance, comparing full fine-tuning against parameter-efficient methods and varying model scales via perturbation techniques. No derivation chain, equations, or first-principles results are presented that reduce by construction to fitted inputs, self-definitions, or self-citations. All central claims rest on measured differences in attribution patterns and semantic similarity scores, which are independent observations rather than quantities defined in terms of themselves. The study is self-contained against external benchmarks of model outputs and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are invoked; the work consists of empirical comparisons of existing fine-tuning techniques and attribution methods.

pith-pipeline@v0.9.0 · 5502 in / 1056 out tokens · 24071 ms · 2026-05-10T10:38:39.252663+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

International Conference on Computing in Civil and Building Engineering (ICCCBE) 23-26 March 2026, Taipei, Taiwan LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance Jack Wei Lun Shi1, Minghao Dang1,2, Wawan Solihin1,3, Justin K.W. Yeoh1 1 Department of Civil and Environmental Engineering, Nation...

work page 2026
[2]

Similarly, Kim et al

addressed context loss in retrieval-augmented generation (RAG) systems by employing long-context models to process entire legal documents, thereby preserving crucial cross- 2 ICCCBE, 23-26 March 2026, Taipei, Taiwan 133-2 references. Similarly, Kim et al

work page 2026
[3]

The attribution analysis compared fine-tuning strategies on only the LLaMA 3B model and model scales across the three designated models

LLM configurations for experimental analyses. The attribution analysis compared fine-tuning strategies on only the LLaMA 3B model and model scales across the three designated models. Parameter Size Instructional LLM Families Fine-tuning Strategies Comparison (FFT, LoRA, QLoRA) Model Scale Comparison (QLoRA) Model Used for Attribution Analysis < 3B Qwen (0...

work page 2026
[4]

Median F1 and F3 CodeBERTScore demonstrate an upward trend from the 0.5B model to the ~7B models

A slight positive correlation between model size and performance is evident in the sub-7B parameter range. Median F1 and F3 CodeBERTScore demonstrate an upward trend from the 0.5B model to the ~7B models. However, this scaling advantage appears to diminish beyond this point, with performance plateauing for models between 7B and 22B. This plateau suggests ...

work page 2026
[5]

To ensure a fair comparison across the fine-tuning strategies, the color scale is normalized per sample

For each sample, we generated a visualization where each input word is underlined with a color intensity proportional to the magnitude of its attribution score. To ensure a fair comparison across the fine-tuning strategies, the color scale is normalized per sample. Specifically, for a given sample, we first identify the maximum absolute attribution score ...

work page 2026
[6]

Here, the FFT model again demonstrates a concentrated attribution strategy, focusing primarily on ‘Shadow areas’, ‘Sloping ground’, and ‘cannot enclosed sides’. This pattern, where a few words contribute most to the output, indicates that FFT has learned to rely on a concise set of the most salient textual features compared to LoRA and QLoRA. FFT SSW 4.3....

work page 2026
[7]

An example of a building rule pertaining to fire engine accessway length and distance from the facade. 3B LLM SSW 4.3.11d Floor trap on open areas 4.3.11 Floor Trap Shallow Floor Trap and Floor Waste d The floor trap shall not be located in an open area receiving rainwater or surface runoffs 7B LLM SSW 4.3.11d Floor trap on open areas 4.3.11 Floor Trap Sh...

work page 2026
[8]

and Lin, J.R.: Translating regulatory clauses into executable codes for building design checking via large language model driven function matching and composing

Zheng, Z., Han, J., Chen, K.Y., Cao, X.Y., Lu, X.Z. and Lin, J.R.: Translating regulatory clauses into executable codes for building design checking via large language model driven function matching and composing. Engineering Applications of Artificial Intelligence 163 (2026)

work page 2026
[9]

Automation in Construction 168 (2024)

Yang, F., and Jiansong Z.: Prompt-based automation of building code information transformation for compliance checking. Automation in Construction 168 (2024)

work page 2024
[10]

and Yeoh, J.K.W.: Fine-tuning a large language model for automated code compliance of building regulations

Shi, J.W.L., Solihin, W. and Yeoh, J.K.W.: Fine-tuning a large language model for automated code compliance of building regulations. Advanced Engineering Informatics 68 (2025)

work page 2025
[11]

and Lee, G.: A preliminary study on design rule derivation from graphical representations using multimodal large language models

Kim, Y., Borrmann, A. and Lee, G.: A preliminary study on design rule derivation from graphical representations using multimodal large language models. Proceedings of the 2025 European Conference on Computing in Construction, Porto, Portugal (2025)

work page 2025
[12]

Automation in Construction 179 (2025)

Lee, J., and Ghang L.: Long context window-based zero-shot legal interpretation of building codes and regulations. Automation in Construction 179 (2025)

work page 2025
[13]

and Kim, D.: Performance comparison of retrieval-augmented generation and fine-tuned large language models for construction safety management knowledge retrieval

Lee, J., Ahn, S., Kim, D. and Kim, D.: Performance comparison of retrieval-augmented generation and fine-tuned large language models for construction safety management knowledge retrieval. Automation in Construction 168 (2024)

work page 2024
[14]

and Kokhlikyan, N.: Using captum to explain generative language models

Miglani, V., Yang, A., Markosyan, A.H., Garcia-Olano, D. and Kokhlikyan, N.: Using captum to explain generative language models. arXiv preprint arXiv:2312.05491 (2023)

work page arXiv 2023
[15]

and Neubig, G.: Codebertscore: evaluating code generation with pretrained models of code

Zhou, S., Alon, U., Agarwal, S. and Neubig, G.: Codebertscore: evaluating code generation with pretrained models of code. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)

work page 2023

[1] [1]

International Conference on Computing in Civil and Building Engineering (ICCCBE) 23-26 March 2026, Taipei, Taiwan LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance Jack Wei Lun Shi1, Minghao Dang1,2, Wawan Solihin1,3, Justin K.W. Yeoh1 1 Department of Civil and Environmental Engineering, Nation...

work page 2026

[2] [2]

Similarly, Kim et al

addressed context loss in retrieval-augmented generation (RAG) systems by employing long-context models to process entire legal documents, thereby preserving crucial cross- 2 ICCCBE, 23-26 March 2026, Taipei, Taiwan 133-2 references. Similarly, Kim et al

work page 2026

[3] [3]

The attribution analysis compared fine-tuning strategies on only the LLaMA 3B model and model scales across the three designated models

LLM configurations for experimental analyses. The attribution analysis compared fine-tuning strategies on only the LLaMA 3B model and model scales across the three designated models. Parameter Size Instructional LLM Families Fine-tuning Strategies Comparison (FFT, LoRA, QLoRA) Model Scale Comparison (QLoRA) Model Used for Attribution Analysis < 3B Qwen (0...

work page 2026

[4] [4]

Median F1 and F3 CodeBERTScore demonstrate an upward trend from the 0.5B model to the ~7B models

A slight positive correlation between model size and performance is evident in the sub-7B parameter range. Median F1 and F3 CodeBERTScore demonstrate an upward trend from the 0.5B model to the ~7B models. However, this scaling advantage appears to diminish beyond this point, with performance plateauing for models between 7B and 22B. This plateau suggests ...

work page 2026

[5] [5]

To ensure a fair comparison across the fine-tuning strategies, the color scale is normalized per sample

For each sample, we generated a visualization where each input word is underlined with a color intensity proportional to the magnitude of its attribution score. To ensure a fair comparison across the fine-tuning strategies, the color scale is normalized per sample. Specifically, for a given sample, we first identify the maximum absolute attribution score ...

work page 2026

[6] [6]

Here, the FFT model again demonstrates a concentrated attribution strategy, focusing primarily on ‘Shadow areas’, ‘Sloping ground’, and ‘cannot enclosed sides’. This pattern, where a few words contribute most to the output, indicates that FFT has learned to rely on a concise set of the most salient textual features compared to LoRA and QLoRA. FFT SSW 4.3....

work page 2026

[7] [7]

An example of a building rule pertaining to fire engine accessway length and distance from the facade. 3B LLM SSW 4.3.11d Floor trap on open areas 4.3.11 Floor Trap Shallow Floor Trap and Floor Waste d The floor trap shall not be located in an open area receiving rainwater or surface runoffs 7B LLM SSW 4.3.11d Floor trap on open areas 4.3.11 Floor Trap Sh...

work page 2026

[8] [8]

and Lin, J.R.: Translating regulatory clauses into executable codes for building design checking via large language model driven function matching and composing

Zheng, Z., Han, J., Chen, K.Y., Cao, X.Y., Lu, X.Z. and Lin, J.R.: Translating regulatory clauses into executable codes for building design checking via large language model driven function matching and composing. Engineering Applications of Artificial Intelligence 163 (2026)

work page 2026

[9] [9]

Automation in Construction 168 (2024)

Yang, F., and Jiansong Z.: Prompt-based automation of building code information transformation for compliance checking. Automation in Construction 168 (2024)

work page 2024

[10] [10]

and Yeoh, J.K.W.: Fine-tuning a large language model for automated code compliance of building regulations

Shi, J.W.L., Solihin, W. and Yeoh, J.K.W.: Fine-tuning a large language model for automated code compliance of building regulations. Advanced Engineering Informatics 68 (2025)

work page 2025

[11] [11]

and Lee, G.: A preliminary study on design rule derivation from graphical representations using multimodal large language models

Kim, Y., Borrmann, A. and Lee, G.: A preliminary study on design rule derivation from graphical representations using multimodal large language models. Proceedings of the 2025 European Conference on Computing in Construction, Porto, Portugal (2025)

work page 2025

[12] [12]

Automation in Construction 179 (2025)

Lee, J., and Ghang L.: Long context window-based zero-shot legal interpretation of building codes and regulations. Automation in Construction 179 (2025)

work page 2025

[13] [13]

and Kim, D.: Performance comparison of retrieval-augmented generation and fine-tuned large language models for construction safety management knowledge retrieval

Lee, J., Ahn, S., Kim, D. and Kim, D.: Performance comparison of retrieval-augmented generation and fine-tuned large language models for construction safety management knowledge retrieval. Automation in Construction 168 (2024)

work page 2024

[14] [14]

and Kokhlikyan, N.: Using captum to explain generative language models

Miglani, V., Yang, A., Markosyan, A.H., Garcia-Olano, D. and Kokhlikyan, N.: Using captum to explain generative language models. arXiv preprint arXiv:2312.05491 (2023)

work page arXiv 2023

[15] [15]

and Neubig, G.: Codebertscore: evaluating code generation with pretrained models of code

Zhou, S., Alon, U., Agarwal, S. and Neubig, G.: Codebertscore: evaluating code generation with pretrained models of code. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)

work page 2023