pith. machine review for the scientific record.

arxiv: 2604.03826 · v2 · submitted 2026-04-04 · 💻 cs.SE


Context Matters: Evaluating Context Strategies for Automated ADR Generation Using LLMs


Pith reviewed 2026-05-13 17:02 UTC · model grok-4.3

classification 💻 cs.SE
keywords Architecture Decision Records, Large Language Models, Context Selection, Automated Documentation, Software Architecture, Prompt Engineering, Retrieval-Augmented Generation

The pith

A small window of recent prior decisions improves LLM-generated Architecture Decision Records more than full history or larger models alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether different ways of supplying historical ADRs as context to LLMs can reduce the effort of writing these design-rationale documents. It finds that context-aware strategies, especially feeding just the last three to five records, raise generation fidelity over baselines that use nothing or the entire past. Retrieval methods add only small gains and mainly in non-linear cases. The work concludes that careful choice of what context to include matters more for quality than simply using a bigger model. This matters for teams that want to keep design decisions documented without extra manual work.

Core claim

The paper establishes that context-aware prompting substantially improves ADR generation fidelity over no-context and full-history baselines, and that a small recency window of typically three to five prior records delivers the best quality-efficiency balance across model families. Retrieval-augmented selection yields only marginal gains and no statistically significant advantage in typical linear workflows. The paper therefore claims that context engineering, rather than model scale, is the dominant factor in effective ADR automation.

What carries the argument

The five context-selection strategies (no context, All-history, First-K, Last-K, and RAFG), evaluated against a validated corpus of sequential ADRs drawn from 750 open-source repositories.
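
The five strategies reduce to simple selectors over an ordered ADR history. The sketch below is an editorial illustration, not the paper's implementation: the history list, the value of K, and the token-Jaccard stand-in for RAFG's retriever are all assumptions.

```python
# Illustrative sketch of the five context-selection strategies.
# Not the paper's code: the ADR history, K, and the Jaccard
# similarity standing in for RAFG's retriever are assumptions.

def select_context(strategy, history, query=None, k=3):
    """Return the prior ADRs to place in the generation prompt."""
    if strategy == "no-context":
        return []
    if strategy == "all-history":
        return list(history)
    if strategy == "first-k":
        return history[:k]
    if strategy == "last-k":
        return history[-k:]
    if strategy == "rafg":
        # Retrieval-augmented: rank prior records by similarity to
        # the new decision's context (here, plain token Jaccard).
        def jaccard(adr):
            a, b = set(adr.lower().split()), set(query.lower().split())
            return len(a & b) / (len(a | b) or 1)
        return sorted(history, key=jaccard, reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy}")

history = [f"ADR-{i}: decision {i}" for i in range(1, 11)]
print(select_context("last-k", history, k=3))  # the 3 most recent records
```

Under the paper's headline result, `last-k` with K in 3-5 would be the default and `rafg` the fallback for non-linear decision threads.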

If this is right

  • Tool builders should default to a recency window of three to five prior records for typical linear ADR sequences.
  • Retrieval-based fallbacks should be reserved for non-sequential or cross-cutting decision scenarios rather than used as the primary strategy.
  • Increasing model scale alone will not close the quality gap created by poor context selection.
  • Automation of ADR writing becomes practical when context is limited to recent records, lowering both token cost and latency.
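
The token-cost point in the last bullet is easy to make concrete. A back-of-the-envelope sketch, where the ~4 characters-per-token ratio and the 25-record history are assumptions rather than figures from the paper:

```python
# Rough prompt-size comparison: Last-3 window vs. full history.
# The 4-chars-per-token heuristic is an assumption, not measured.

def approx_tokens(texts):
    return sum(len(t) // 4 for t in texts)

# 25 hypothetical prior ADRs of a few hundred words each
history = [("ADR %d: context, decision, consequences... " % i) * 20
           for i in range(1, 26)]

full = approx_tokens(history)
window = approx_tokens(history[-3:])
print(f"full history ~{full} tokens, last-3 window ~{window} tokens")
```

With records of similar length, the recency window costs a small fraction of the full-history prompt, which is where the latency and token savings come from.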

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams maintaining parallel decision threads across components may still need retrieval methods even if recency works for sequential flows.
  • The same recency principle could be tested on other sequential artifacts such as commit messages or architecture diagrams.
  • If the recency advantage persists, low-cost local models paired with recent context could replace larger cloud models for routine ADR generation.

Load-bearing premise

The curated set of sequential ADRs from 750 open-source repositories represents real-world ADR authoring patterns and the chosen automatic metrics accurately reflect generation fidelity and usefulness.

What would settle it

A replication study on a fresh corpus of closed-source or differently structured repositories where a larger model with no context or full history produces equal or higher fidelity scores than a smaller model given a 3-5 record recency window.

Figures

Figures reproduced from arXiv: 2604.03826 by Aviral Gupta, Daniel Feitosa, Karthik Vaidhyanathan, Rudra Dhar.

Figure 1. Example of an Architecture Decision Record (ADR).
Figure 2. Study Design.
Figure 3. ADR Frequency Distribution.
Figure 5. Comparison of model performance across the context strategies (K=3).
Figure 6. Longitudinal analysis of generation fidelity across …
original abstract

Architecture Decision Records (ADRs) play a critical role in preserving the rationale behind system design, yet their creation and maintenance are often neglected due to the associated authoring overhead. This paper investigates whether Large Language Models (LLMs) can mitigate this burden and, more importantly, how different strategies for presenting historical ADRs as context influence generation quality. We curate and validate a large corpus of sequential ADRs drawn from 750 open-source repositories and systematically evaluate five context selection strategies (no context, All-history, First-K, Last-K, and RAFG) across multiple model families. Our results show that context-aware prompting substantially improves ADR generation fidelity, with a small recency window (typically 3-5 prior records) providing the best balance between quality and efficiency. Retrieval-based context selection yields marginal gains primarily in non-sequential or cross-cutting decision scenarios, while offering no statistically significant advantage in typical linear ADR workflows. Overall, our findings demonstrate that context engineering, rather than model scale alone, is the dominant factor in effective ADR automation, and we outline practical defaults for tool builders along with targeted retrieval fallbacks for complex architectural settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper curates a corpus of sequential ADRs from 750 open-source repositories and evaluates five context-selection strategies (no context, All-history, First-K, Last-K, RAFG) across multiple LLM families. It reports that context-aware prompting improves generation fidelity, a small recency window (3-5 records) offers the best quality-efficiency trade-off, retrieval yields only marginal gains outside linear workflows, and context engineering dominates model scale.

Significance. If the empirical results hold under human validation, the work provides practical defaults for LLM-based ADR tooling in software engineering and identifies when retrieval fallbacks are warranted, directly addressing documentation overhead in large codebases.

major comments (2)
  1. [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.
  2. [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.
minor comments (2)
  1. [Abstract and §2] Abstract and §2: expand the acronym RAFG on first use and briefly motivate why it was chosen over other retrieval variants.
  2. [Results tables] Results tables: report exact p-values, confidence intervals, and model-size controls when claiming context effects exceed scale effects.
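
For readers unfamiliar with the automatic fidelity metrics the referee questions: ROUGE-1 F1 reduces to unigram overlap between a generated record and its reference. A minimal pure-Python sketch, illustrative only; real evaluations use established packages with stemming and multi-reference support:

```python
# Minimal ROUGE-1 F1: unigram precision/recall between a candidate
# and a reference text. Illustrative sketch, not the paper's
# evaluation harness.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("use postgres for persistence",
                      "we will use postgres for data persistence"), 3))
# → 0.727
```

The referee's point is that high unigram overlap like this need not mean the generated record captures the right rationale or trade-offs.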

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: [Evaluation Metrics] Evaluation section: the central claim that context engineering is the dominant factor rests on automatic metrics for 'generation fidelity,' yet no correlation with human judgments of rationale usefulness or decision-trade-off capture is reported; this leaves open whether surface-level metrics (e.g., n-gram overlap) actually support the dominance conclusion over model scale.

    Authors: We acknowledge that our evaluation relies exclusively on automatic metrics (BLEU, ROUGE, and BERTScore) for measuring generation fidelity, following standard practices in LLM-based text generation research. While these metrics do not directly correlate with human judgments of rationale usefulness, the consistent performance patterns observed across multiple model families and the large-scale corpus provide empirical support for the dominance of context engineering over model scale. In the revised manuscript, we will expand the Evaluation and Limitations sections to explicitly discuss the known limitations of automatic metrics, reference relevant literature on their correlation with human assessments, and note that future work should include human validation studies. revision: partial

  2. Referee: [§3 and Results] §3 (Corpus Curation) and Results: the 750-repo sequential corpus is presented as representative, but the manuscript provides insufficient detail on how non-sequential or cross-cutting decisions were sampled or on the exact statistical tests and effect sizes supporting 'no statistically significant advantage' for retrieval in linear workflows.

    Authors: We will revise §3 to include additional details on the corpus curation methodology, specifically the heuristics and filtering steps used to identify sequential ADRs from the 750 repositories and the exclusion criteria applied to non-sequential or cross-cutting decisions to maintain focus on linear workflows. In the Results section, we will report the precise statistical tests employed (e.g., paired t-tests with Bonferroni correction), associated p-values, and effect sizes (Cohen's d) to substantiate the finding of no statistically significant advantage for retrieval-augmented methods in linear settings. revision: yes
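
The paired t statistic and Cohen's d promised in this response can be computed with nothing beyond the standard library; the fidelity scores below are invented solely for illustration:

```python
# Paired t statistic and Cohen's d (d_z) from per-item score
# differences. Scores are hypothetical; a real analysis would add
# a t-distribution p-value (e.g. scipy.stats.ttest_rel) and a
# Bonferroni correction across the strategy comparisons.
import math

def paired_t_and_d(a, b):
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    return mean / (sd / math.sqrt(n)), mean / sd

last_k = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73]  # hypothetical fidelity
rafg   = [0.70, 0.69, 0.72, 0.70, 0.68, 0.74]  # hypothetical fidelity
t_stat, cohens_d = paired_t_and_d(last_k, rafg)
print(f"t = {t_stat:.2f}, d = {cohens_d:.2f}")
```

A small t and d like the ones this toy comparison produces is exactly the pattern behind "no statistically significant advantage" for retrieval in linear workflows.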

Circularity Check

0 steps flagged

No significant circularity: purely empirical evaluation on external corpus

full rationale

The paper reports results from curating a corpus of sequential ADRs from 750 open-source repositories and running controlled prompting experiments across five context strategies and multiple model families. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps. All claims rest on observed differences in automatic metrics between strategies, which are directly comparable to the input corpus and therefore externally falsifiable. This is the standard non-circular outcome for an empirical benchmarking study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the representativeness of the open-source ADR corpus and the validity of the automatic quality metrics used to score generated records; no free parameters, domain axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5507 in / 1057 out tokens · 35076 ms · 2026-05-13T17:02:48.713448+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 4 internal anchors

  1. [1]

    Documentation of Software Architecture from a Knowledge Management Perspective – Design Representation

    Philippe Kruchten. Documentation of Software Architecture from a Knowledge Management Perspective – Design Representation, pages 39–57. Springer Berlin Heidelberg, 2009. ISBN 9783642023743. doi: 10.1007/978-3-642-02374-3_3

  2. [2]

    Documenting architecture decisions

    Michael Nygard. Documenting architecture decisions. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions, 2011. Accessed: 2025-12-03

  3. [3]

    Architectural decisions as reusable design assets

    Olaf Zimmermann. Architectural decisions as reusable design assets. IEEE Software, 28(1):64–69, January 2011. ISSN 0740-7459. doi: 10.1109/ms.2011.3

  4. [4]

    10 years of software architecture knowledge management: Practice and future

    Rafael Capilla, Anton Jansen, Antony Tang, Paris Avgeriou, and Muhammad Ali Babar. 10 years of software architecture knowledge management: Practice and future. Journal of Systems and Software, 116:191–205, June 2016. ISSN 0164-1212. doi: 10.1016/j.jss.2015.08.054

  5. [5]

    Using architecture decision records in open source projects—an msr study on github

    Georg Buchgeher, Stefan Schöberl, Verena Geist, Bernhard Dorninger, Philipp Haindl, and Rainer Weinreich. Using architecture decision records in open source projects—an msr study on github. IEEE Access, 11:63725–63740, 2023. ISSN 2169-3536. doi: 10.1109/access.2023.3287654

  6. [6]

    Large Language Models for Software Engineering: A Systematic Mapping Study

    Muhammet Kürşat Görmez, Murat Yılmaz, and Paul M. Clarke. Large Language Models for Software Engineering: A Systematic Mapping Study, pages 64–79. Springer Nature Switzerland, 2024. ISBN 9783031711398. doi: 10.1007/978-3-031-71139-8_5

  7. [7]

    Large language models for software engineering: A systematic literature review, 2024

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review, 2024. URL https://arxiv.org/abs/2308.10620

  8. [8]

    Can llms generate architectural design decisions? - an exploratory empirical study

    Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can llms generate architectural design decisions? - an exploratory empirical study. In 2024 IEEE 21st International Conference on Software Architecture (ICSA), pages 79–89. IEEE, June 2024. doi: 10.1109/icsa59870.2024.00016

  9. [9]

    Can llms generate architectural design decisions? - an exploratory empirical study, 2024

    Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Can llms generate architectural design decisions? - an exploratory empirical study, 2024. URL https://arxiv.org/abs/2403.01709

  10. [10]

    Understanding the design decisions of retrieval-augmented generation systems, 2025

    Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, and Lei Ma. Understanding the design decisions of retrieval-augmented generation systems, 2025. URL https://arxiv.org/abs/2411.19463

  11. [11]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL https://arxiv.org/abs/2507.13334

  12. [12]

    Software architecture as a set of architectural design decisions

    A. Jansen and J. Bosch. Software architecture as a set of architectural design decisions. In 5th Working IEEE/IFIP Conference on Software Architecture (WICSA'05), pages 109–120, 2005. doi: 10.1109/WICSA.2005.61

  13. [13]

    Droidcoder: Enhanced android code completion with context-enriched retrieval-augmented generation

    Xinran Yu, Chun Li, Minxue Pan, and Xuandong Li. Droidcoder: Enhanced android code completion with context-enriched retrieval-augmented generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 681–693. ACM, October 2024. doi: 10.1145/3691620.3695063

  14. [14]

    Lissa: Toward generic traceability link recovery through retrieval-augmented generation

    Dominik Fuchß, Tobias Hey, Jan Keim, Haoyu Liu, Niklas Ewald, Tobias Thirolf, and Anne Koziolek. Lissa: Toward generic traceability link recovery through retrieval-augmented generation. In 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE), pages 1396–1408. IEEE, April 2025. doi: 10.1109/icse55347.2025.00186

  15. [15]

    Atlas: Few-shot learning with retrieval augmented language models

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models, 2022. URL https://arxiv.org/abs/2208.03299

  16. [16]

    The goal question metric approach

    Victor R. Basili, Gianluigi Caldiera, and H. Dieter Rombach. The goal question metric approach. In Encyclopedia of Software Engineering, pages 528–532. Wiley, 1994

  17. [17]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning and long context. arXiv preprint arXiv:2507.06261, 2025

  18. [18]

    Glm-4.6: Advanced agentic and reasoning model

    Zhipu AI. Glm-4.6: Advanced agentic and reasoning model. Technical report, Z.ai Developers, 2025. URL https://docs.z.ai/glm-4-6

  19. [19]

    Gemma 3 Technical Report

    Aishwarya Kamath, Johan Ferret, Shreya Pathak, et al. Gemma 3: Open models for advanced reasoning and multimodal tasks. arXiv preprint arXiv:2503.19786, 2025

  20. [20]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert, 2020. URL https://arxiv.org/abs/1904.09675

  21. [21]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. October 2002. doi: 10.3115/1073083.1073135

  22. [22]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Proceedings of the ACL Workshop: Text Summarization Branches Out 2004, page 10, 01 2004

  23. [23]

    Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments

    Alon Lavie and Abhaya Agarwal. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. pages 228–231, 07 2007

  24. [24]

    Generative ai for software architecture

    Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, and Davide Taibi. Generative ai for software architecture. applications, challenges, and future directions, 2025. URL https://arxiv.org/abs/2503.13310

  25. [25]

    Leveraging generative ai for architecture knowledge management

    Rudra Dhar, Karthik Vaidhyanathan, and Vasudeva Varma. Leveraging generative ai for architecture knowledge management. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 163–166. IEEE, 2024

  26. [26]

    Helping novice architects to make quality design decisions using an llm-based assistant

    J Andrés Díaz-Pace, Antonela Tommasel, and Rafael Capilla. Helping novice architects to make quality design decisions using an llm-based assistant. In European Conference on Software Architecture, pages 324–332. Springer, 2024

  27. [27]

    Do large language models contain software architectural knowledge?: An exploratory case study with gpt

    Mohamed Soliman and Jan Keim. Do large language models contain software architectural knowledge?: An exploratory case study with gpt. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 13–24. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00012

  29. [29]

    From requirements to architecture: An ai-based journey to semi-automatically generate software archi- tectures

    Tobias Eisenreich, Sandro Speth, and Stefan Wagner. From requirements to architecture: An ai-based journey to semi-automatically generate software archi- tectures. InProceedings of the 1st International Workshop on Designing Software, Designing ’24, page 52–55, New York, NY, USA, 2024. Association for Com- puting Machinery. ISBN 9798400705632. doi: 10.114...

  30. [30]

    State of practice: Llms in software engineering and software architecture

    Jasmin Jahić and Ashkan Sami. State of practice: Llms in software engineering and software architecture. In 2024 IEEE 21st International Conference on Software Architecture Companion (ICSA-C), pages 311–318. IEEE, 2024

  31. [31]

    Enabling architecture traceability by llm-based architecture component name extraction

    Dominik Fuchß, Haoyu Liu, Tobias Hey, Jan Keim, and Anne Koziolek. Enabling architecture traceability by llm-based architecture component name extraction. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 1–12, 2025. doi: 10.1109/ICSA65012.2025.00011

  32. [32]

    Llms for generation of architectural components: An exploratory empirical study in the serverless world

    Shrikara Arun, Meghana Tedla, and Karthik Vaidhyanathan. Llms for generation of architectural components: An exploratory empirical study in the serverless world. In 2025 IEEE 22nd International Conference on Software Architecture (ICSA), pages 25–36. IEEE, March 2025. doi: 10.1109/icsa65012.2025.00013