Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings
Pith reviewed 2026-05-10 00:09 UTC · model grok-4.3
The pith
Embedding machine-readable requirements and architectural artifacts stabilizes AI agent behavior and reduces implementation drift in software development.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Shift-Up reinterprets executable requirements, architectural models, and architecture decision records as machine-readable structural guardrails that stabilize generative AI agent behavior, reduce implementation drift from intended designs, and redirect human effort toward design and validation in AI-native software development.
What carries the argument
The Shift-Up framework, which converts traditional software engineering artifacts into machine-readable guardrails to guide and constrain generative AI agents during implementation.
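As an illustration of what a machine-readable guardrail might look like, the sketch below encodes one ADR as a forbidden-dependency rule and checks an agent's generated component graph against it. The `DependencyRule` class, the `find_violations` helper, the ADR identifier, and the component names are all hypothetical; the paper summary does not specify an encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyRule:
    """One ADR-derived constraint: `source` must not depend on `forbidden`."""
    adr_id: str
    source: str
    forbidden: str


def find_violations(rules, dependencies):
    """Return (adr_id, source, target) for every dependency that breaks a rule.

    `dependencies` maps a component name to the set of components it uses.
    """
    violations = []
    for rule in rules:
        if rule.forbidden in dependencies.get(rule.source, set()):
            violations.append((rule.adr_id, rule.source, rule.forbidden))
    return violations


# Hypothetical example: ADR-003 says the UI must not access the database directly.
rules = [DependencyRule("ADR-003", "ui", "database")]
generated = {"ui": {"api", "database"}, "api": {"database"}}
print(find_violations(rules, generated))  # [('ADR-003', 'ui', 'database')]
```

A check of this kind could run after each agent iteration, so that a violated decision record is reported to the agent or the human reviewer instead of passing silently.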
Load-bearing premise
That an exploratory evaluation of a single web application, based on preliminary comparisons, is sufficient to indicate effectiveness across AI-native development in general.
What would settle it
The central claim would be falsified by a larger controlled study, spanning multiple projects and agent types, that finds neither a measurable reduction in implementation drift nor an increase in behavioral stability when the embedded artifacts are used instead of unstructured prompting.
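One way such a study could test for a drift reduction is a paired comparison: each project and agent combination is built once with unstructured prompting and once with the guardrail artifacts, and the drift counts are compared. The sketch below uses an exact one-sided sign-flip permutation test; the function name and the choice of test are assumptions made for illustration, not part of the paper.

```python
from itertools import product


def paired_permutation_p_value(baseline, guarded):
    """One-sided exact sign-flip test: is drift lower with guardrails?

    `baseline[i]` and `guarded[i]` are drift counts for the same project
    and agent, without and with the guardrail artifacts respectively.
    """
    if len(baseline) != len(guarded) or not baseline:
        raise ValueError("need equally sized, non-empty paired samples")
    diffs = [b - g for b, g in zip(baseline, guarded)]
    observed = sum(diffs)
    # Under the null hypothesis each paired difference is equally likely
    # to carry either sign, so every sign assignment is enumerated.
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed:
            as_extreme += 1
    return as_extreme / total
```

The enumeration visits 2^n sign assignments, so it is only practical for a small number of pairs; a larger study would sample the assignments instead. A large p-value across many projects and agent types would correspond to the "no measurable reduction" outcome described above.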
Original abstract
Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation. While vibe coding promises rapid prototyping, it often suffers from architectural drift, limited traceability, and reduced maintainability. Applying the design science research (DSR) methodology, this paper proposes Shift-Up, a framework that reinterprets established software engineering practices, like executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs), as structural guardrails for GenAI-native development. Preliminary findings from our exploratory evaluation compare unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application. These findings indicate that embedding machine-readable requirements and architectural artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation activities. The results suggest that traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies design science research (DSR) to propose the Shift-Up framework, which reinterprets executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs) as machine-readable guardrails for GenAI-native software development. It reports preliminary findings from an exploratory evaluation that compares unstructured 'vibe coding', structured prompt engineering, and the Shift-Up approach while building one web application, claiming that embedding these artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation.
Significance. If the stabilization and drift-reduction effects hold under more rigorous conditions, the work could provide a practical bridge between established software engineering artifacts and agent-driven development, improving traceability and maintainability in AI-assisted projects. The explicit reuse of BDD, C4, and ADRs as control mechanisms is a concrete contribution that merits further investigation.
major comments (2)
- [Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.
- [Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.
minor comments (2)
- [Introduction] The term 'vibe coding' is introduced without a formal definition or reference to prior usage, which reduces precision for readers outside the immediate community.
- [Methodology] The DSR methodology is invoked but the paper does not explicitly map the six DSR activities (problem identification, definition of objectives, design and development, demonstration, evaluation, communication) to the reported steps, making the research process harder to replicate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the evaluation is exploratory and that the central claims require clearer operational definitions and metrics. We respond to each major comment below and indicate the revisions planned.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.
Authors: We acknowledge that the current evaluation is limited to a single web application and relies on qualitative observations rather than quantitative metrics or statistical controls. This was intentional for an initial exploratory study under the DSR methodology. We will revise the evaluation section to define and report quantitative metrics, including deviation counts from ADRs and C4 models as well as agent adherence rates. We will also add an explicit limitations subsection noting the absence of statistical controls and the single-case nature of the study. These changes will allow readers to better assess the strength of the observed differences.
revision: partial
-
Referee: [Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.
Authors: We agree that operational definitions are needed to support the claims. In the revised §3 we will define 'stability' as the consistency of generated code and artifacts with the provided BDD scenarios, C4 models, and ADRs across development iterations. 'Drift' will be defined as the accumulation of inconsistencies between the implemented system and the guardrail artifacts. We will also describe a measurement protocol based on systematic code review against the artifacts and logging of deviations. These additions will ground the causal assertions in explicit criteria.
revision: yes
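The definitions proposed in the responses above could be operationalized roughly as follows. This is a minimal sketch under one possible reading: each iteration's review is logged as a mapping from a guardrail check (a BDD scenario, a C4 relation, an ADR rule) to pass or fail, drift is the number of checks currently failing, and stability is the adherence rate per iteration. The function names and check identifiers are hypothetical.

```python
def adherence_rate(checks):
    """Fraction of guardrail checks that pass in one iteration."""
    if not checks:
        raise ValueError("no guardrail checks supplied")
    return sum(1 for passed in checks.values() if passed) / len(checks)


def drift_series(iterations):
    """Number of outstanding deviations logged at each iteration."""
    return [
        sum(1 for passed in checks.values() if not passed)
        for checks in iterations
    ]


# Hypothetical review log for three development iterations.
log = [
    {"bdd:login": True, "c4:ui->api": True, "adr:003": True},
    {"bdd:login": True, "c4:ui->api": False, "adr:003": True},
    {"bdd:login": False, "c4:ui->api": False, "adr:003": True},
]
print(drift_series(log))  # [0, 1, 2]
print([round(adherence_rate(checks), 2) for checks in log])  # [1.0, 0.67, 0.33]
```

Because an unresolved inconsistency keeps failing in later iterations, the series rises as deviations accumulate and falls only when they are fixed. Counting distinct deviations ever logged would be an equally defensible alternative.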
Circularity Check
No circularity: framework reinterprets external practices via standard DSR with independent exploratory evaluation
The paper applies the established Design Science Research (DSR) methodology to reinterpret well-known external artifacts (executable requirements via BDD, C4 modeling, and ADRs) as guardrails for GenAI development. The central claim, that embedding these artifacts stabilizes agent behavior and reduces drift, is presented as an observation from a preliminary single-application comparison rather than a mathematical derivation, fitted parameter, or self-referential definition. The paper contains no equations, no predictions reduce to inputs by construction, and no load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked to justify the framework. The evaluation summary stands as an independent (if limited) empirical step, separate from the framework's definitional content.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Design science research methodology is suitable for creating and evaluating the Shift-Up framework
- domain assumption: Traditional software engineering artifacts like BDD, C4, and ADRs can function as effective machine-readable guardrails for GenAI agents
invented entities (1)
- Shift-Up framework: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Belzner, L., Gabor, T., and Wirsing, M. Large language model assisted software engineering: prospects, challenges, and a case study. In International conference on bridging the gap between AI and reality (2023), Springer, pp. 355–374
-
[2]
Brown, S. Visualising software architecture with the C4 model: Context, containers, components, and code, 2018. https://leanpub.com/visualising-software-architecture
-
[3]
Fowler, K. R., and Silver, C. L. Developing and Managing Embedded Systems and Products - Methods, Techniques, Tools, Processes, and Teamwork. Elsevier, 2015
-
[4]
Fuchs, N. E., Schwertel, U., and Schwitter, R. Attempto controlled english - not just another logic specification language. In Logic-Based Program Synthesis and Transformation (Manchester, UK, June 1999), P. Flener, Ed., no. 1559 in Lecture Notes in Computer Science, Eighth International Workshop LOPSTR’98, Springer
- [5]
-
[6]
Hevner, A. The duality of science: Knowledge in information systems research. Journal of Information Technology 36 (08 2020), 026839622094571
-
[7]
Kaulgud, V., Saxena, A., Podder, S., Sharma, V. S., and Dinakar, C. Shifting testing beyond the deployment boundary. In Proceedings of the International Workshop on Continuous Software Evolution and Delivery (New York, NY, USA, 2016), CSED ’16, Association for Computing Machinery, pp. 30–33
-
[8]
Mikkonen, T., and Taivalsaari, A. Software reuse in the generative ai era: From cargo cult towards ai native software engineering. In Proceedings of Internetware’25 (2025)
-
[9]
Nguyen-Duc, A., Abrahamsson, P., and Khomh, F. Generative AI for effective software development. Springer, 2024
-
[10]
Nooper, D. Secure Software Development Life Cycle Processes. Technical AD1180035, Carnegie Mellon University, Sept. 2022
-
[11]
Nygard, M. Documenting architecture decisions. https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions, 2011. Accessed: 2025-09-17
-
[12]
Ozkaya, I. A paradigm shift in automating software engineering tasks: Bots. IEEE Software 39, 5 (Sep. 2022), 4–8
-
[13]
Pattyn, F., and Goetz, P. The Vibe Coding Trap: How AI Accelerates Delivery - and Quietly Breaks Responsibility. The Vibe Coding Series. AI Ventures Press, Jan. 2026
-
[14]
Russo, D. Navigating the complexity of generative ai adoption in software engineering. ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–50
-
[15]
Russo, D., Baltes, S., van Berkel, N., Avgeriou, P., Calefato, F., Cabrero-Daniel, B., Catolino, G., Cito, J., Ernst, N., Fritz, T., et al. Generative ai in software engineering must be human-centered: The copenhagen manifesto. Journal of Systems and Software 216 (2024), 112115
-
[16]
Schoormann, T., Möller, F., Chandra Kruse, L., and Otto, B. Baustein—a design tool for configuring and representing design research. Information Systems Journal 34, 6 (2024), 1871–1901
[17] Smith, L. Shift-left testing. Dr. Dobb’s J. 26, 9 (Sept. 2001), 56–ff
-
[17]
Solis, C., and Wang, X. A study of the characteristics of behaviour driven development. In 2011 37th EUROMICRO Conference on Software Engineering and Advanced Applications (Aug 2011), pp. 383–387
-
[18]
Stray, V., Hanssen, G. K., Barbala, A., Šmite, D., and Stol, K.-J. What is generative ai good for? introduction to the special issue on generative ai in software engineering, 2025
-
[19]
Thomas, D. R. A general inductive approach for analyzing qualitative evaluation data. American Journal of Evaluation 27, 2 (2006), 237–246
-
[20]
Tuunanen, T., Winter, R., and Brocke, J. v. Dealing with complexity in design science research: A methodology using design echelons. Management Information Systems Quarterly 48, 2 (06 2024), 427–458
-
[21]
Zimmermann, O., Wegmann, L., Koziolek, H., and Goldschmidt, T. Architectural decision guidance across projects - problem space modeling, decision backlog management and cloud computing knowledge. In 2015 12th Working IEEE/IFIP Conference on Software Architecture (May 2015), pp. 85–94