Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs
Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3
The pith
A small window of recent prior decisions improves LLM-generated Architecture Decision Records more than full history or larger models alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that context-aware prompting substantially improves ADR generation fidelity over no-context and full-history baselines, and that a small recency window of typically three to five prior records delivers the best quality-efficiency balance across model families. Retrieval-augmented selection yields only marginal gains and no statistically significant advantage in typical linear workflows. Together these results support the claim that context engineering, rather than model scale, is the dominant factor in effective ADR automation.
What carries the argument
The five context selection strategies (no context, All-history, First-K, Last-K, and RAFG) evaluated against a validated corpus of sequential ADRs drawn from 750 open-source repositories.
If this is right
- Tool builders should default to a recency window of three to five prior records for typical linear ADR sequences.
- Retrieval-based fallbacks should be reserved for non-sequential or cross-cutting decision scenarios rather than used as the primary strategy.
- Increasing model scale alone will not close the quality gap created by poor context selection.
- Automation of ADR writing becomes practical when context is limited to recent records, lowering both token cost and latency.
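The recommended default, a Last-K recency window, is simple enough to sketch. The function and field names below are illustrative assumptions, not the paper's implementation; only the 3-5 record window size comes from the reported results.

```python
# Hedged sketch of a Last-K ("recency window") context selector.
# `prior_adrs` is assumed to be a list of ADR texts in chronological order.

def select_context(prior_adrs, k=4):
    """Return the k most recent prior ADRs (oldest first) as prompt context.

    k defaults to 4, inside the 3-5 window the paper reports as optimal.
    """
    return prior_adrs[-k:]

def build_prompt(task, prior_adrs, k=4):
    # Concatenate the recency window ahead of the generation task.
    context = "\n\n---\n\n".join(select_context(prior_adrs, k))
    return f"Prior decisions:\n{context}\n\nDraft the next ADR for: {task}"
```

Because selection is a plain slice, token cost is bounded by the window size regardless of how long the decision history grows, which is where the latency and cost savings come from.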
Where Pith is reading between the lines
- Teams maintaining parallel decision threads across components may still need retrieval methods even if recency works for sequential flows.
- The same recency principle could be tested on other sequential artifacts such as commit messages or architecture diagrams.
- If the recency advantage persists, low-cost local models paired with recent context could replace larger cloud models for routine ADR generation.
Load-bearing premise
The curated set of sequential ADRs from 750 open-source repositories represents real-world ADR authoring patterns, and the chosen automatic metrics accurately reflect generation fidelity and usefulness.
What would settle it
A replication study on a fresh corpus of closed-source or differently structured repositories where a larger model with no context or full history produces equal or higher fidelity scores than a smaller model given a 3-5 record recency window.
Original abstract
Architecture Decision Records (ADRs) play a critical role in preserving the rationale behind system design, yet their creation and maintenance are often neglected due to the associated authoring overhead. This paper investigates whether Large Language Models (LLMs) can mitigate this burden and, more importantly, how different strategies for presenting historical ADRs as context influence generation quality. We curate and validate a large corpus of sequential ADRs drawn from 750 open-source repositories and systematically evaluate five context selection strategies (no context, All-history, First-K, Last-K, and RAFG) across multiple model families. Our results show that context-aware prompting substantially improves ADR generation fidelity, with a small recency window (typically 3-5 prior records) providing the best balance between quality and efficiency. Retrieval-based context selection yields marginal gains primarily in non-sequential or cross-cutting decision scenarios, while offering no statistically significant advantage in typical linear ADR workflows. Overall, our findings demonstrate that context engineering, rather than model scale alone, is the dominant factor in effective ADR automation, and we outline practical defaults for tool builders along with targeted retrieval fallbacks for complex architectural settings.
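The retrieval-based strategy (RAFG, which the abstract does not expand) differs from the recency baselines in that it ranks prior records by relevance to the new decision rather than by position. A minimal stdlib-only sketch of that idea, using bag-of-words cosine similarity as a stand-in scorer (an assumption; the paper's retrieval mechanism may be embedding-based):

```python
import math
from collections import Counter

def _bow(text):
    # Bag-of-words term counts over whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    va, vb = _bow(a), _bow(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_context(query, prior_adrs, k=4):
    # Rank all prior ADRs by similarity to the query, keep the top k,
    # then restore chronological order so the prompt still reads as history.
    ranked = sorted(range(len(prior_adrs)),
                    key=lambda i: cosine(query, prior_adrs[i]),
                    reverse=True)[:k]
    return [prior_adrs[i] for i in sorted(ranked)]
```

The extra ranking pass is what the paper finds unnecessary in linear workflows, where the most relevant records and the most recent ones largely coincide.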
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper curates a corpus of sequential ADRs from 750 open-source repositories and evaluates five context-selection strategies (no context, All-history, First-K, Last-K, RAFG) across multiple LLM families. It reports that context-aware prompting improves generation fidelity, a small recency window (3-5 records) offers the best quality-efficiency trade-off, retrieval yields only marginal gains outside linear workflows, and context engineering dominates model scale.
Significance. If the empirical results hold under human validation, the work provides practical defaults for LLM-based ADR tooling in software engineering and identifies when retrieval fallbacks are warranted, directly addressing documentation overhead in large codebases.
Major comments (2)
- [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.
- [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.
Minor comments (2)
- [Abstract and §2] Abstract and §2: expand the acronym RAFG on first use and briefly motivate why it was chosen over other retrieval variants.
- [Results tables] Results tables: report exact p-values, confidence intervals, and model-size controls when claiming context effects exceed scale effects.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.
Authors: We acknowledge that our evaluation relies exclusively on automatic metrics (BLEU, ROUGE, and BERTScore) for measuring generation fidelity, following standard practices in LLM-based text generation research. While these metrics do not directly correlate with human judgments of rationale usefulness, the consistent performance patterns observed across multiple model families and the large-scale corpus provide empirical support for the dominance of context engineering over model scale. In the revised manuscript, we will expand the Evaluation and Limitations sections to explicitly discuss the known limitations of automatic metrics, reference relevant literature on their correlation with human assessments, and note that future work should include human validation studies. revision: partial
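To make the referee's "surface-level metrics" concern concrete: metrics like ROUGE-L reward token overlap with a reference, not rationale quality. The following is a simplified stdlib-only reimplementation of ROUGE-L (longest-common-subsequence F1 over tokens) for intuition, not the scorer used in the paper:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """LCS-based F1: high only when candidate shares long in-order token runs
    with the reference, regardless of whether the rationale is sound."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)
```

A candidate that copies the reference's boilerplate but inverts the decision can still score well, which is exactly the gap a human-correlation study would expose.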
-
Referee: [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.
Authors: We will revise §3 to include additional details on the corpus curation methodology, specifically the heuristics and filtering steps used to identify sequential ADRs from the 750 repositories and the exclusion criteria applied to non-sequential or cross-cutting decisions to maintain focus on linear workflows. In the Results section, we will report the precise statistical tests employed (e.g., paired t-tests with Bonferroni correction), associated p-values, and effect sizes (Cohen's d) to substantiate the finding of no statistically significant advantage for retrieval-augmented methods in linear settings. revision: yes
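The paired analysis the authors promise can be sketched in a few lines: per-item score differences between two context strategies, the paired t statistic, and Cohen's d for paired samples. This is illustrative only; the paper's exact procedure and its Bonferroni-corrected thresholds may differ.

```python
import math
from statistics import mean, stdev

def paired_stats(scores_a, scores_b):
    """Paired t statistic (df = n - 1) and Cohen's d for paired samples,
    computed from per-item score differences between two strategies."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)                      # sample std. dev. of differences
    t = mean(diffs) / (sd / math.sqrt(n))  # paired t statistic
    d = mean(diffs) / sd                   # Cohen's d (paired)
    return t, d
```

Note that t = d·√n by construction, which is why effect sizes must be reported alongside p-values: with a large corpus, a negligible d can still clear a significance threshold.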
Circularity Check
No significant circularity: purely empirical evaluation on external corpus
Full rationale
The paper reports results from curating a corpus of sequential ADRs from 750 open-source repositories and running controlled prompting experiments across five context strategies and multiple model families. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps. All claims rest on observed differences in automatic metrics between strategies, which are directly comparable to the input corpus and therefore externally falsifiable. This is the standard non-circular outcome for an empirical benchmarking study.
Reference graph
Works this paper leans on
- [1] Philippe Kruchten. Documentation of Software Architecture from a Knowledge Management Perspective – Design Representation, pages 39–57. Springer Berlin Heidelberg, 2009. ISBN 9783642023743. doi: 10.1007/978-3-642-02374-3_3
- [2] Michael Nygard. Documenting architecture decisions. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions, 2011. Accessed: 2025-12-03
- [3] Olaf Zimmermann. Architectural decisions as reusable design assets. IEEE Software, 28(1):64–69, January 2011. ISSN 0740-7459. doi: 10.1109/ms.2011.3
- [4] Rafael Capilla, Anton Jansen, Antony Tang, Paris Avgeriou, and Muhammad Ali Babar. 10 years of software architecture knowledge management: Practice and future. Journal of Systems and Software, 116:191–205, June 2016. ISSN 0164-1212. doi: 10.1016/j.jss.2015.08.054
- [5] Georg Buchgeher, Stefan Schöberl, Verena Geist, Bernhard Dorninger, Philipp Haindl, and Rainer Weinreich. Using architecture decision records in open source projects—an MSR study on GitHub. IEEE Access, 11:63725–63740, 2023. ISSN 2169-3536. doi: 10.1109/access.2023.3287654
- [6] Muhammet Kürşat Görmez, Murat Yılmaz, and Paul M. Clarke. Large Language Models for Software Engineering: A Systematic Mapping Study, pages 64–79. Springer Nature Switzerland, 2024. ISBN 9783031711398. doi: 10.1007/978-3-031-71139-8_5
- [7] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review, 2024. URL https://arxiv.org/abs/2308.10620
- [8] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can LLMs generate architectural design decisions? – an exploratory empirical study. In 2024 IEEE 21st International Conference on Software Architecture (ICSA), pages 79–89. IEEE, June 2024. doi: 10.1109/icsa59870.2024.00016
- [9] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can LLMs generate architectural design decisions? – an exploratory empirical study, 2024. URL https://arxiv.org/abs/2403.01709
- [10] Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, and Lei Ma. Understanding the design decisions of retrieval-augmented generation systems, 2025. URL https://arxiv.org/abs/2411.19463
- [11] Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL https://arxiv.org/abs/2507.13334
- [12] A. Jansen and J. Bosch. Software architecture as a set of architectural design decisions. In 5th Working IEEE/IFIP Conference on Software Architecture (WICSA'05), pages 109–120, 2005. doi: 10.1109/WICSA.2005.61
- [13] Xinran Yu, Chun Li, Minxue Pan, and Xuandong Li. DroidCoder: Enhanced Android code completion with context-enriched retrieval-augmented generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 681–693. ACM, October 2024. doi: 10.1145/3691620.3695063
- [14] Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, and Anne Koziolek. LiSSA: Toward generic traceability link recovery through retrieval-augmented generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1396–1408. IEEE, April 2025. doi: 10.1109/icse55347.2025.00186
- [15] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models, 2022. URL https://arxiv.org/abs/2208.03299
- [16] Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In Encyclopedia of Software Engineering, pages 528–532. Wiley, 1994
- [17] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning and long context. arXiv preprint arXiv:2507.06261, 2025
- [18] Zhipu AI. GLM-4.6: Advanced agentic and reasoning model. Technical report, Z.ai Developers, 2025. URL https://docs.z.ai/glm-4-6
- [19] Aishwarya Kamath, Johan Ferret, Shreya Pathak, et al. Gemma 3: Open models for advanced reasoning and multimodal tasks. arXiv preprint arXiv:2503.19786, 2025
- [20] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT, 2020. URL https://arxiv.org/abs/1904.09675
- [21] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. October 2002. doi: 10.3115/1073083.1073135
- [22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out, 2004
- [23] Alon Lavie and Abhaya Agarwal. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. pages 228–231, July 2007
- [24] Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, and Davide Taibi. Generative AI for software architecture: applications, challenges, and future directions, 2025. URL https://arxiv.org/abs/2503.13310
- [25] Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Leveraging generative AI for architecture knowledge management. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 163–166. IEEE, 2024
- [26] J. Andrés Díaz-Pace, Antonela Tommasel, and Rafael Capilla. Helping novice architects to make quality design decisions using an LLM-based assistant. In European Conference on Software Architecture, pages 324–332. Springer, 2024
- [27] Mohamed Soliman and Jan Keim. Do large language models contain software architectural knowledge?: An exploratory case study with GPT. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 13–24. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00012
- [29] Tobias Eisenreich, Sandro Speth, and Stefan Wagner. From requirements to architecture: An AI-based journey to semi-automatically generate software architectures. In Proceedings of the 1st International Workshop on Designing Software (Designing '24), pages 52–55, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400705632. doi: 10.114...
- [30] Jasmin Jahić and Ashkan Sami. State of practice: LLMs in software engineering and software architecture. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 311–318. IEEE, 2024
- [31] Dominik Fuchß, Haoyu Liu, Tobias Hey, Jan Keim, and Anne Koziolek. Enabling architecture traceability by LLM-based architecture component name extraction. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 1–12, 2025. doi: 10.1109/ICSA65012.2025.00011
- [32] Shrikara Arun, Meghana Tedla, and Karthik Vaidhyanathan. LLMs for generation of architectural components: An exploratory empirical study in the serverless world. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 25–36. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00013