Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs
Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3
The pith
A small window of recent prior decisions improves LLM-generated Architecture Decision Records more than full history or larger models alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that context-aware prompting substantially improves ADR generation fidelity over no-context and full-history baselines, and that a small recency window of typically three to five prior records delivers the best quality-efficiency balance across model families. Retrieval-augmented selection yields only marginal gains and no statistically significant advantage in typical linear workflows. Together these results support the claim that context engineering, rather than model scale, is the dominant factor in effective ADR automation.
What carries the argument
The five context selection strategies (no context, All-history, First-K, Last-K, and RAFG) evaluated against a validated corpus of sequential ADRs drawn from 750 open-source repositories.
If this is right
- Tool builders should default to a recency window of three to five prior records for typical linear ADR sequences.
- Retrieval-based fallbacks should be reserved for non-sequential or cross-cutting decision scenarios rather than used as the primary strategy.
- Increasing model scale alone will not close the quality gap created by poor context selection.
- Automation of ADR writing becomes practical when context is limited to recent records, lowering both token cost and latency.
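The recommended default, a Last-K recency window, is simple enough to sketch. The function and field names below are illustrative assumptions, not the paper's implementation; only the 3-5 record window size comes from the reported results.

```python
# Hedged sketch of a Last-K ("recency window") context selector.
# `prior_adrs` is assumed to be a list of ADR texts in chronological order.

def select_context(prior_adrs, k=4):
    """Return the k most recent prior ADRs (oldest first) as prompt context.

    k defaults to 4, inside the 3-5 window the paper reports as optimal.
    """
    return prior_adrs[-k:]

def build_prompt(task, prior_adrs, k=4):
    # Concatenate the recency window ahead of the generation task.
    context = "\n\n---\n\n".join(select_context(prior_adrs, k))
    return f"Prior decisions:\n{context}\n\nDraft the next ADR for: {task}"
```

Because selection is a plain slice, token cost is bounded by the window size regardless of how long the decision history grows, which is where the latency and cost savings come from.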
Where Pith is reading between the lines
- Teams maintaining parallel decision threads across components may still need retrieval methods even if recency works for sequential flows.
- The same recency principle could be tested on other sequential artifacts such as commit messages or architecture diagrams.
- If the recency advantage persists, low-cost local models paired with recent context could replace larger cloud models for routine ADR generation.
Load-bearing premise
The curated set of sequential ADRs from 750 open-source repositories represents real-world ADR authoring patterns, and the chosen automatic metrics accurately reflect generation fidelity and usefulness.
What would settle it
A replication study on a fresh corpus of closed-source or differently structured repositories where a larger model with no context or full history produces equal or higher fidelity scores than a smaller model given a 3-5 record recency window.
Original abstract
Architecture Decision Records (ADRs) play a critical role in preserving the rationale behind system design, yet their creation and maintenance are often neglected due to the associated authoring overhead. This paper investigates whether Large Language Models (LLMs) can mitigate this burden and, more importantly, how different strategies for presenting historical ADRs as context influence generation quality. We curate and validate a large corpus of sequential ADRs drawn from 750 open-source repositories and systematically evaluate five context selection strategies (no context, All-history, First-K, Last-K, and RAFG) across multiple model families. Our results show that context-aware prompting substantially improves ADR generation fidelity, with a small recency window (typically 3-5 prior records) providing the best balance between quality and efficiency. Retrieval-based context selection yields marginal gains primarily in non-sequential or cross-cutting decision scenarios, while offering no statistically significant advantage in typical linear ADR workflows. Overall, our findings demonstrate that context engineering, rather than model scale alone, is the dominant factor in effective ADR automation, and we outline practical defaults for tool builders along with targeted retrieval fallbacks for complex architectural settings.
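The retrieval-based strategy (RAFG, which the abstract does not expand) differs from the recency baselines in that it ranks prior records by relevance to the new decision rather than by position. A minimal stdlib-only sketch of that idea, using bag-of-words cosine similarity as a stand-in scorer (an assumption; the paper's retrieval mechanism may be embedding-based):

```python
import math
from collections import Counter

def _bow(text):
    # Bag-of-words term counts over whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    va, vb = _bow(a), _bow(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_context(query, prior_adrs, k=4):
    # Rank all prior ADRs by similarity to the query, keep the top k,
    # then restore chronological order so the prompt still reads as history.
    ranked = sorted(range(len(prior_adrs)),
                    key=lambda i: cosine(query, prior_adrs[i]),
                    reverse=True)[:k]
    return [prior_adrs[i] for i in sorted(ranked)]
```

The extra ranking pass is what the paper finds unnecessary in linear workflows, where the most relevant records and the most recent ones largely coincide.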
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper curates a corpus of sequential ADRs from 750 open-source repositories and evaluates five context-selection strategies (no context, All-history, First-K, Last-K, RAFG) across multiple LLM families. It reports that context-aware prompting improves generation fidelity, a small recency window (3-5 records) offers the best quality-efficiency trade-off, retrieval yields only marginal gains outside linear workflows, and context engineering dominates model scale.
Significance. If the empirical results hold under human validation, the work provides practical defaults for LLM-based ADR tooling in software engineering and identifies when retrieval fallbacks are warranted, directly addressing documentation overhead in large codebases.
Major comments (2)
- [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.
- [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.
Minor comments (2)
- [Abstract and §2] Abstract and §2: expand the acronym RAFG on first use and briefly motivate why it was chosen over other retrieval variants.
- [Results tables] Results tables: report exact p-values, confidence intervals, and model-size controls when claiming context effects exceed scale effects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.
Authors: We acknowledge that our evaluation relies exclusively on automatic metrics (BLEU, ROUGE, and BERTScore) for measuring generation fidelity, following standard practices in LLM-based text generation research. While these metrics do not directly correlate with human judgments of rationale usefulness, the consistent performance patterns observed across multiple model families and the large-scale corpus provide empirical support for the dominance of context engineering over model scale. In the revised manuscript, we will expand the Evaluation and Limitations sections to explicitly discuss the known limitations of automatic metrics, reference relevant literature on their correlation with human assessments, and note that future work should include human validation studies. revision: partial
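To make the referee's "surface-level metrics" concern concrete: metrics like ROUGE-L reward token overlap with a reference, not rationale quality. The following is a simplified stdlib-only reimplementation of ROUGE-L (longest-common-subsequence F1 over tokens) for intuition, not the scorer used in the paper:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """LCS-based F1: high only when candidate shares long in-order token runs
    with the reference, regardless of whether the rationale is sound."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)
```

A candidate that copies the reference's boilerplate but inverts the decision can still score well, which is exactly the gap a human-correlation study would expose.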
-
Referee: [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.
Authors: We will revise §3 to include additional details on the corpus curation methodology, specifically the heuristics and filtering steps used to identify sequential ADRs from the 750 repositories and the exclusion criteria applied to non-sequential or cross-cutting decisions to maintain focus on linear workflows. In the Results section, we will report the precise statistical tests employed (e.g., paired t-tests with Bonferroni correction), associated p-values, and effect sizes (Cohen's d) to substantiate the finding of no statistically significant advantage for retrieval-augmented methods in linear settings. revision: yes
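The paired analysis the authors promise can be sketched in a few lines: per-item score differences between two context strategies, the paired t statistic, and Cohen's d for paired samples. This is illustrative only; the paper's exact procedure and its Bonferroni-corrected thresholds may differ.

```python
import math
from statistics import mean, stdev

def paired_stats(scores_a, scores_b):
    """Paired t statistic (df = n - 1) and Cohen's d for paired samples,
    computed from per-item score differences between two strategies."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)                      # sample std. dev. of differences
    t = mean(diffs) / (sd / math.sqrt(n))  # paired t statistic
    d = mean(diffs) / sd                   # Cohen's d (paired)
    return t, d
```

Note that t = d·√n by construction, which is why effect sizes must be reported alongside p-values: with a large corpus, a negligible d can still clear a significance threshold.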
Circularity Check
No significant circularity: purely empirical evaluation on external corpus
Full rationale
The paper reports results from curating a corpus of sequential ADRs from 750 open-source repositories and running controlled prompting experiments across five context strategies and multiple model families. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps. All claims rest on observed differences in automatic metrics between strategies, which are directly comparable to the input corpus and therefore externally falsifiable. This is the standard non-circular outcome for an empirical benchmarking study.
Reference graph
Works this paper leans on
- [1] Philippe Kruchten. Documentation of Software Architecture from a Knowledge Management Perspective – Design Representation, pages 39–57. Springer Berlin Heidelberg, 2009. ISBN 9783642023743. doi: 10.1007/978-3-642-02374-3_3
- [2] Michael Nygard. Documenting architecture decisions. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions, 2011. Accessed: 2025-12-03
- [3] Olaf Zimmermann. Architectural decisions as reusable design assets. IEEE Software, 28(1):64–69, January 2011. ISSN 0740-7459. doi: 10.1109/ms.2011.3
- [4] Rafael Capilla, Anton Jansen, Antony Tang, Paris Avgeriou, and Muhammad Ali Babar. 10 years of software architecture knowledge management: Practice and future. Journal of Systems and Software, 116:191–205, June 2016. ISSN 0164-1212. doi: 10.1016/j.jss.2015.08.054
- [5] Georg Buchgeher, Stefan Schöberl, Verena Geist, Bernhard Dorninger, Philipp Haindl, and Rainer Weinreich. Using architecture decision records in open source projects—an MSR study on GitHub. IEEE Access, 11:63725–63740, 2023. ISSN 2169-3536. doi: 10.1109/access.2023.3287654
- [6] Muhammet Kürşat Görmez, Murat Yılmaz, and Paul M. Clarke. Large Language Models for Software Engineering: A Systematic Mapping Study, pages 64–79. Springer Nature Switzerland, 2024. ISBN 9783031711398. doi: 10.1007/978-3-031-71139-8_5
- [7] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review, 2024. URL https://arxiv.org/abs/2308.10620
- [8] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can LLMs generate architectural design decisions? – an exploratory empirical study. In 2024 IEEE 21st International Conference on Software Architecture (ICSA), pages 79–89. IEEE, June 2024. doi: 10.1109/icsa59870.2024.00016
- [9] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can LLMs generate architectural design decisions? – an exploratory empirical study, 2024. URL https://arxiv.org/abs/2403.01709
- [10] Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, and Lei Ma. Understanding the design decisions of retrieval-augmented generation systems, 2025. URL https://arxiv.org/abs/2411.19463
- [11] Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL https://arxiv.org/abs/2507.13334
- [12] A. Jansen and J. Bosch. Software architecture as a set of architectural design decisions. In 5th Working IEEE/IFIP Conference on Software Architecture (WICSA'05), pages 109–120, 2005. doi: 10.1109/WICSA.2005.61
- [13] Xinran Yu, Chun Li, Minxue Pan, and Xuandong Li. DroidCoder: Enhanced Android code completion with context-enriched retrieval-augmented generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 681–693. ACM, October 2024. doi: 10.1145/3691620.3695063
- [14] Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, and Anne Koziolek. LiSSA: Toward generic traceability link recovery through retrieval-augmented generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1396–1408. IEEE, April 2025. doi: 10.1109/icse55347.2025.00186
- [15] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models, 2022. URL https://arxiv.org/abs/2208.03299
- [16] Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In Encyclopedia of Software Engineering, pages 528–532. Wiley, 1994
- [17] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning and long context. arXiv preprint arXiv:2507.06261, 2025
- [18] Zhipu AI. GLM-4.6: Advanced agentic and reasoning model. Technical report, Z.ai Developers, 2025. URL https://docs.z.ai/glm-4-6
- [19] Aishwarya Kamath, Johan Ferret, Shreya Pathak, et al. Gemma 3: Open models for advanced reasoning and multimodal tasks. arXiv preprint arXiv:2503.19786, 2025
- [20] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2020. URL https://arxiv.org/abs/1904.09675
- [21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. October 2002. doi: 10.3115/1073083.1073135
- [22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out, 2004
- [23] Alon Lavie and Abhaya Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. pages 228–231, July 2007
- [24] Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, and Davide Taibi. Generative AI for software architecture: applications, challenges, and future directions, 2025. URL https://arxiv.org/abs/2503.13310
- [25] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Leveraging generative AI for architecture knowledge management. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 163–166. IEEE, 2024
- [26] J. Andrés Díaz-Pace, Antonela Tommasel, and Rafael Capilla. Helping novice architects to make quality design decisions using an LLM-based assistant. In European Conference on Software Architecture, pages 324–332. Springer, 2024
- [27] Mohamed Soliman and Jan Keim. Do large language models contain software architectural knowledge?: An exploratory case study with GPT. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 13–24. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00012
- [29] Tobias Eisenreich, Sandro Speth, and Stefan Wagner. From requirements to architecture: An AI-based journey to semi-automatically generate software architectures. In Proceedings of the 1st International Workshop on Designing Software (Designing '24), pages 52–55, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400705632. doi: 10.114...
- [30] Jasmin Jahić and Ashkan Sami. State of practice: LLMs in software engineering and software architecture. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 311–318. IEEE, 2024
- [31] Dominik Fuchß, Haoyu Liu, Tobias Hey, Jan Keim, and Anne Koziolek. Enabling architecture traceability by LLM-based architecture component name extraction. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 1–12, 2025. doi: 10.1109/ICSA65012.2025.00011
- [32] Shrikara Arun, Meghana Tedla, and Karthik Vaidhyanathan. LLMs for generation of architectural components: An exploratory empirical study in the serverless world. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 25–36. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00013