Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings

François Christophe; Konsta Kalliokoski; Liisa Rannikko; Petrus Lipsanen; Tommi Mikkonen; Vlad Stirbu

arxiv: 2604.20436 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI


Pith reviewed 2026-05-10 00:09 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords AI-native software development, generative AI guardrails, implementation drift, executable requirements, architectural modeling, software engineering artifacts, agent behavior stabilization

The pith

Embedding machine-readable requirements and architectural artifacts stabilizes AI agent behavior and reduces implementation drift in software development.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Shift-Up, a framework that adapts established software engineering practices such as executable requirements, architectural modeling, and decision records for use as guardrails with generative AI agents. In a preliminary comparison of unstructured coding, prompt engineering, and the guarded approach on a web application project, the authors report that providing agents with these structured artifacts leads to more consistent implementation. This matters because current AI-driven development often produces code that drifts from the intended architecture and becomes difficult to maintain; if the guardrails hold up, developers could focus on higher-level decisions instead of repeated fixes.

Core claim

Shift-Up reinterprets executable requirements, architectural models, and architecture decision records as machine-readable structural guardrails that stabilize generative AI agent behavior, reduce implementation drift from intended designs, and redirect human effort toward design and validation in AI-native software development.

What carries the argument

The Shift-Up framework, which converts traditional software engineering artifacts into machine-readable guardrails to guide and constrain generative AI agents during implementation.
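To make "machine-readable guardrail" concrete, the sketch below encodes one architecture decision as data and checks a source file against it. It illustrates the general idea only and is not the paper's tooling: the record `ADR_003`, its `forbidden_imports` field, and the component names `handlers` and `repository` are hypothetical, and the check covers only top-level import names.

```python
import ast

# Hypothetical ADR expressed as data (not a format taken from the paper).
ADR_003 = {
    "id": "ADR-003",
    "decision": "Handlers access storage only through the repository module",
    "forbidden_imports": {"handlers": {"sqlite3"}},
}


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as "from . import x" have no module name.
            found.add(node.module.split(".")[0])
    return found


def violations(adr: dict, component: str, source: str) -> list[str]:
    """List imports in `source` that the ADR forbids for `component`."""
    forbidden = adr["forbidden_imports"].get(component, set())
    return sorted(imported_modules(source) & forbidden)


generated = "import sqlite3\nfrom repository import get_user\n"
print(violations(ADR_003, "handlers", generated))  # ['sqlite3']
```

A check of this kind could run after each agent-generated change, so that a decision record acts as a constraint the agent's output is tested against and not only as documentation.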

Load-bearing premise

That results from an exploratory evaluation on a single web application using preliminary comparisons are sufficient to indicate effectiveness across general AI-native development.

What would settle it

A larger controlled study across multiple projects and agent types that finds no measurable reduction in implementation drift or increase in behavioral stability when using the embedded artifacts compared to unstructured prompting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.20436 by François Christophe, Konsta Kalliokoski, Liisa Rannikko, Petrus Lipsanen, Tommi Mikkonen, Vlad Stirbu.

**Figure 2.** Shift-Up workflow: (a) requirements and architectural grounding, and (b) GenAI-assisted implementation.

Original abstract

Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation. While vibe coding promises rapid prototyping, it often suffers from architectural drift, limited traceability, and reduced maintainability. Applying the design science research (DSR) methodology, this paper proposes Shift-Up, a framework that reinterprets established software engineering practices, like executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs), as structural guardrails for GenAI-native development. Preliminary findings from our exploratory evaluation compare unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application. These findings indicate that embedding machine-readable requirements and architectural artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation activities. The results suggest that traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.
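As an illustration of what an "executable requirement" in the BDD sense can look like, here is a minimal sketch in plain Python. It is an assumption-laden toy, not the paper's setup: the scenario text, the step patterns, and the task-list domain are all hypothetical, and a real project would more likely use an established BDD tool than this hand-rolled runner.

```python
import re
from typing import Callable

# Registry of (pattern, step function) pairs; all names here are hypothetical.
STEPS: list[tuple[re.Pattern, Callable]] = []


def step(pattern: str):
    """Register a function as the implementation of a scenario step."""
    def register(fn: Callable) -> Callable:
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register


@step(r"an empty task list")
def given_empty(ctx):
    ctx["tasks"] = []


@step(r'the user adds "(.+)"')
def when_add(ctx, title):
    ctx["tasks"].append(title)


@step(r"the list contains (\d+) tasks?")
def then_count(ctx, n):
    assert len(ctx["tasks"]) == int(n), "task count does not match requirement"


def run(scenario: str) -> dict:
    """Execute each Given/When/Then line against the registered steps."""
    ctx: dict = {}
    for raw in scenario.strip().splitlines():
        line = raw.strip()
        if not line or line.startswith("Scenario:"):
            continue
        text = re.sub(r"^(Given|When|Then|And)\s+", "", line)
        for pattern, fn in STEPS:
            match = pattern.fullmatch(text)
            if match:
                fn(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {line}")
    return ctx


SCENARIO = """
Scenario: Adding a task
  Given an empty task list
  When the user adds "Write report"
  Then the list contains 1 task
"""

print(run(SCENARIO)["tasks"])  # ['Write report']
```

The requirement text stays readable to humans while the runner turns it into a pass/fail check, which is the property that lets such scenarios serve as a constraint on generated code.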

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Shift-Up repurposes BDD, C4, and ADRs as guardrails for AI agents, but the single-app exploratory comparison lacks the metrics and controls needed to back the stabilization claims.


The key takeaway is that Shift-Up gives a structured way to use BDD, C4 models, and ADRs to guide AI agents and cut down on implementation drift, but the evidence from their single-app test is still too light to confirm it works reliably.

The new part is the specific framing of these old practices as guardrails in an AI-native setting. They apply design science research to combine them into one framework and test it against plain vibe coding and structured prompts. That combination feels fresh even if the pieces are familiar, and it directly tackles the maintainability issues that come up when agents generate code without enough constraints.

What they do well is keep the proposal grounded in real software engineering artifacts that teams already produce. The preliminary results suggest that having machine-readable requirements and decisions helps the agents stay on track and lets humans focus more on design.

The soft spots are in the evaluation. They compared the three approaches on one web application and saw better outcomes with Shift-Up, but there's no detail on how they measured drift, no repeated trials, and no controls for things like prompt quality or agent settings. Without those, it's tough to say the framework is what made the difference rather than just better upfront work.

This is for people building or studying AI tools for software development who care about long-term code quality. A reader who wants practical ideas for controlling GenAI output would find it useful, even with the limitations. It deserves a serious referee because the idea is actionable and the problem matters now. I'd recommend sending it to peer review, with notes to expand the metrics and perhaps add more cases or a clearer methodology section.

Referee Report

2 major / 2 minor

Summary. The paper applies design science research (DSR) to propose the Shift-Up framework, which reinterprets executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs) as machine-readable guardrails for GenAI-native software development. It reports preliminary findings from an exploratory evaluation that compares unstructured 'vibe coding', structured prompt engineering, and the Shift-Up approach while building one web application, claiming that embedding these artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation.

Significance. If the stabilization and drift-reduction effects hold under more rigorous conditions, the work could provide a practical bridge between established software engineering artifacts and agent-driven development, improving traceability and maintainability in AI-assisted projects. The explicit reuse of BDD, C4, and ADRs as control mechanisms is a concrete contribution that merits further investigation.

major comments (2)

[Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.
[Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.

minor comments (2)

[Introduction] The term 'vibe coding' is introduced without a formal definition or reference to prior usage, which reduces precision for readers outside the immediate community.
[Methodology] The DSR methodology is invoked but the paper does not explicitly map the six DSR activities (problem identification, definition of objectives, design and development, demonstration, evaluation, communication) to the reported steps, making the research process harder to replicate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the evaluation is exploratory and that the central claims require clearer operational definitions and metrics. We respond to each major comment below and indicate the revisions planned.


Referee: [Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.

Authors: We acknowledge that the current evaluation is limited to a single web application and relies on qualitative observations rather than quantitative metrics or statistical controls. This was intentional for an initial exploratory study under the DSR methodology. We will revise the evaluation section to define and report quantitative metrics, including deviation counts from ADRs and C4 models as well as agent adherence rates. We will also add an explicit limitations subsection noting the absence of statistical controls and the single-case nature of the study. These changes will allow readers to better assess the strength of the observed differences.

revision: partial

Referee: [Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.

Authors: We agree that operational definitions are needed to support the claims. In the revised §3 we will define 'stability' as the consistency of generated code and artifacts with the provided BDD scenarios, C4 models, and ADRs across development iterations. 'Drift' will be defined as the accumulation of inconsistencies between the implemented system and the guardrail artifacts. We will also describe a measurement protocol based on systematic code review against the artifacts and logging of deviations. These additions will ground the causal assertions in explicit criteria.

revision: yes
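The definitions in this simulated response can be sketched as simple bookkeeping over a deviation log. The code below is one possible reading under stated assumptions, not a protocol from the paper: the `Deviation` record, the 0-based iteration index, and the function names `cumulative_drift` and `adherence_rate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    """One logged inconsistency between the code and a guardrail artifact."""
    iteration: int  # 0-based development iteration in which it was found
    artifact: str   # which artifact was violated: "BDD", "C4", or "ADR"
    note: str       # reviewer's description of the inconsistency


def cumulative_drift(deviations: list[Deviation], iterations: int) -> list[int]:
    """Drift as the running total of logged deviations after each iteration."""
    per_iteration = [0] * iterations
    for d in deviations:
        per_iteration[d.iteration] += 1
    running, totals = 0, []
    for count in per_iteration:
        running += count
        totals.append(running)
    return totals


def adherence_rate(checks_passed: int, checks_total: int) -> float:
    """Stability proxy: share of artifact checks the generated code satisfies."""
    if checks_total == 0:
        return 1.0  # nothing to violate, so treat as fully adherent
    return checks_passed / checks_total


log = [
    Deviation(0, "ADR", "handler queries the database directly"),
    Deviation(2, "C4", "new container not present in the model"),
    Deviation(2, "BDD", "scenario for empty input fails"),
]
print(cumulative_drift(log, 3))   # [1, 1, 3]
print(adherence_rate(9, 10))      # 0.9
```

Reporting such numbers per approach and per iteration would give the comparison of vibe coding, prompt engineering, and Shift-Up the quantitative footing the referee asks for, though the counts still depend on how consistently reviewers log deviations.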

Circularity Check

0 steps flagged

No circularity: framework reinterprets external practices via standard DSR with independent exploratory evaluation


The paper applies the established Design Science Research (DSR) methodology to reinterpret well-known external artifacts (executable requirements via BDD, C4 modeling, and ADRs) as guardrails for GenAI development. The central claim—that embedding these artifacts stabilizes agent behavior and reduces drift—is presented as an observation from a preliminary single-application comparison rather than a mathematical derivation, fitted parameter, or self-referential definition. No equations exist, no predictions reduce to inputs by construction, and no load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked to justify the framework. The evaluation summary stands as an independent (if limited) empirical step separate from the framework's definitional content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The proposal relies on the assumption that design science research is an appropriate method for framework development and that traditional SE artifacts can be directly repurposed without loss of effectiveness in AI contexts.

axioms (2)

[domain assumption] Design science research methodology is suitable for creating and evaluating the Shift-Up framework. The paper states that it applies DSR to propose and preliminarily evaluate the framework.
[domain assumption] Traditional software engineering artifacts like BDD, C4, and ADRs can function as effective machine-readable guardrails for GenAI agents. The core reinterpretation is presented without additional justification in the abstract.

invented entities (1)

[Shift-Up framework] No independent evidence. Purpose: to provide structural guardrails for GenAI-native software development. Newly proposed construct combining existing practices.


