
arxiv: 2604.20436 · v1 · submitted 2026-04-22 · 💻 cs.SE · cs.AI

Shift-Up: A Framework for Software Engineering Guardrails in AI-native Software Development -- Initial Findings

Pith reviewed 2026-05-10 00:09 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords AI-native software development · generative AI guardrails · implementation drift · executable requirements · architectural modeling · software engineering artifacts · agent behavior stabilization

The pith

Embedding machine-readable requirements and architectural artifacts stabilizes AI agent behavior and reduces implementation drift in software development.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Shift-Up, a framework that adapts established software engineering practices such as executable requirements, architectural modeling, and decision records for use as guardrails with generative AI agents. Through a preliminary comparison of unstructured coding, prompt engineering, and the guarded approach on a single web application project, the work reports that providing agents with these structured artifacts leads to more consistent implementation. This matters because current AI-driven development often produces code that drifts from the intended architecture and becomes difficult to maintain. Guardrails that curb this drift could let developers focus on higher-level decisions instead of repeated fixes.

Core claim

Shift-Up reinterprets executable requirements, architectural models, and architecture decision records as machine-readable structural guardrails that stabilize generative AI agent behavior, reduce implementation drift from intended designs, and redirect human effort toward design and validation in AI-native software development.
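To make "executable requirement" concrete, here is a minimal sketch of a Given/When/Then scenario that runs against an implementation. It is an illustration only, not the paper's tooling: the `Scenario` class, the `run_scenarios` helper, and the `add_task` function are hypothetical names invented for this example, and a real project would more likely use an established BDD tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """One Given/When/Then requirement expressed as plain callables."""
    name: str
    given: Callable[[], Dict]     # builds the starting state
    when: Callable[[Dict], Dict]  # performs the action under test
    then: Callable[[Dict], bool]  # checks the expected outcome


def run_scenarios(scenarios: List[Scenario]) -> Dict[str, bool]:
    """Execute every scenario and report pass/fail by scenario name."""
    return {s.name: s.then(s.when(s.given())) for s in scenarios}


# Hypothetical implementation under test, standing in for agent-written code.
def add_task(state: Dict, title: str) -> Dict:
    return {**state, "tasks": state["tasks"] + [title]}


scenarios = [
    Scenario(
        name="adding a task stores it in the list",
        given=lambda: {"tasks": []},
        when=lambda state: add_task(state, "write report"),
        then=lambda state: state["tasks"] == ["write report"],
    ),
]
print(run_scenarios(scenarios))
```

Because the requirement is data plus a check, the same scenario can be handed to an agent as a specification and rerun after each generated change.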

What carries the argument

The Shift-Up framework, which converts traditional software engineering artifacts into machine-readable guardrails to guide and constrain generative AI agents during implementation.
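One way such a guardrail could work in practice is an automated conformance check of generated code against an architectural model. The sketch below assumes a simplified C4-style model reduced to allowed dependencies between components. The component names, the `ALLOWED_DEPENDENCIES` table, and the `find_violations` function are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical C4-style component model: each component maps to the set of
# components it is allowed to depend on. Names are illustrative only.
ALLOWED_DEPENDENCIES: Dict[str, Set[str]] = {
    "web_ui": {"api"},
    "api": {"database"},
    "database": set(),
}


def find_violations(
    allowed: Dict[str, Set[str]],
    observed: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return observed (source, target) dependencies the model does not allow."""
    violations = []
    for source, target in observed:
        # Unknown source components have no allowed targets at all.
        if target not in allowed.get(source, set()):
            violations.append((source, target))
    return violations


# Hypothetical dependencies extracted from agent-generated code.
observed_edges = [("web_ui", "api"), ("web_ui", "database"), ("api", "database")]
print(find_violations(ALLOWED_DEPENDENCIES, observed_edges))
```

In this example the direct `web_ui` to `database` edge is reported, which is the kind of deviation an agent could be asked to repair before the change is accepted.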

Load-bearing premise

That results from an exploratory evaluation on a single web application using preliminary comparisons are sufficient to indicate effectiveness across general AI-native development.

What would settle it

A larger controlled study across multiple projects and agent types that finds no measurable reduction in implementation drift or increase in behavioral stability when using the embedded artifacts compared to unstructured prompting would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.20436 by François Christophe, Konsta Kalliokoski, Liisa Rannikko, Petrus Lipsanen, Tommi Mikkonen, Vlad Stirbu.

Figure 1: AI-native software development with Shift-Up.
Figure 2: Shift-Up workflow: (a) requirements and architectural grounding, and (b) GenAI-assisted implementation.

Original abstract

Generative AI (GenAI) is reshaping software engineering by shifting development from manual coding toward agent-driven implementation. While vibe coding promises rapid prototyping, it often suffers from architectural drift, limited traceability, and reduced maintainability. Applying the design science research (DSR) methodology, this paper proposes Shift-Up, a framework that reinterprets established software engineering practices, like executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs), as structural guardrails for GenAI-native development. Preliminary findings from our exploratory evaluation compare unstructured vibe coding, structured prompt engineering, and the Shift-Up approach in the development of a web application. These findings indicate that embedding machine-readable requirements and architectural artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation activities. The results suggest that traditional software engineering artifacts can serve as effective control mechanisms in AI-assisted development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper applies design science research (DSR) to propose the Shift-Up framework, which reinterprets executable requirements (BDD), architectural modeling (C4), and architecture decision records (ADRs) as machine-readable guardrails for GenAI-native software development. It reports preliminary findings from an exploratory evaluation that compares unstructured 'vibe coding', structured prompt engineering, and the Shift-Up approach while building one web application, claiming that embedding these artifacts stabilizes agent behavior, reduces implementation drift, and shifts human effort toward higher-level design and validation.

Significance. If the stabilization and drift-reduction effects hold under more rigorous conditions, the work could provide a practical bridge between established software engineering artifacts and agent-driven development, improving traceability and maintainability in AI-assisted projects. The explicit reuse of BDD, C4, and ADRs as control mechanisms is a concrete contribution that merits further investigation.

major comments (2)
  1. [Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.
  2. [Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.
minor comments (2)
  1. [Introduction] The term 'vibe coding' is introduced without a formal definition or reference to prior usage, which reduces precision for readers outside the immediate community.
  2. [Methodology] The DSR methodology is invoked but the paper does not explicitly map the six DSR activities (problem identification, definition of objectives, design and development, demonstration, evaluation, communication) to the reported steps, making the research process harder to replicate.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the evaluation is exploratory and that the central claims require clearer operational definitions and metrics. We respond to each major comment below and indicate the revisions planned.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section (preliminary findings): The comparison of the three approaches on a single web application reports no quantitative metrics for implementation drift (e.g., deviation counts from ADRs or C4 models), agent adherence rates, or statistical controls for variables such as prompt skill or temperature; without these, the observed differences cannot be confidently attributed to the guardrails rather than case-specific factors.

    Authors: We acknowledge that the current evaluation is limited to a single web application and relies on qualitative observations rather than quantitative metrics or statistical controls. This was intentional for an initial exploratory study under the DSR methodology. We will revise the evaluation section to define and report quantitative metrics, including deviation counts from ADRs and C4 models as well as agent adherence rates. We will also add an explicit limitations subsection noting the absence of statistical controls and the single-case nature of the study. These changes will allow readers to better assess the strength of the observed differences.

    revision: partial

  2. Referee: [Framework Description] §3 (framework description): The claim that Shift-Up 'stabilizes agent behavior' is presented as a direct outcome of embedding the artifacts, yet the manuscript provides no operational definition or measurement protocol for 'stability' or 'drift,' leaving the central causal assertion unsupported by the reported data.

    Authors: We agree that operational definitions are needed to support the claims. In the revised §3 we will define 'stability' as the consistency of generated code and artifacts with the provided BDD scenarios, C4 models, and ADRs across development iterations. 'Drift' will be defined as the accumulation of inconsistencies between the implemented system and the guardrail artifacts. We will also describe a measurement protocol based on systematic code review against the artifacts and logging of deviations. These additions will ground the causal assertions in explicit criteria.

    revision: yes
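
The definitions proposed in this response could be turned into numbers roughly as follows. This is a sketch under stated assumptions, not the authors' protocol: it assumes each iteration's review logs how many artifact checks were performed and how many revealed a deviation, it treats drift as the running total of deviations (ignoring later fixes), and it treats adherence as the share of checks passed. The `IterationLog` schema and both functions are hypothetical, and the sample numbers are placeholders rather than results.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List


@dataclass
class IterationLog:
    """Review outcome for one development iteration (hypothetical schema)."""
    checks: int      # number of artifact checks performed in the review
    deviations: int  # checks where the code contradicted an artifact


def cumulative_drift(logs: List[IterationLog]) -> List[int]:
    """Drift as the running total of logged deviations across iterations."""
    return list(accumulate(log.deviations for log in logs))


def adherence_rate(log: IterationLog) -> float:
    """Share of checks in one iteration that the generated code satisfied."""
    if log.checks == 0:
        raise ValueError("an iteration needs at least one check")
    return (log.checks - log.deviations) / log.checks


# Placeholder numbers for illustration only; not results from the paper.
logs = [IterationLog(10, 1), IterationLog(10, 0), IterationLog(8, 2)]
print(cumulative_drift(logs))
print([adherence_rate(log) for log in logs])
```

Reporting both series per condition (vibe coding, prompt engineering, Shift-Up) would give the kind of deviation counts and adherence rates the referee asks for.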

Circularity Check

0 steps flagged

No circularity: framework reinterprets external practices via standard DSR with independent exploratory evaluation

full rationale

The paper applies the established Design Science Research (DSR) methodology to reinterpret well-known external artifacts (executable requirements via BDD, C4 modeling, and ADRs) as guardrails for GenAI development. The central claim—that embedding these artifacts stabilizes agent behavior and reduces drift—is presented as an observation from a preliminary single-application comparison rather than a mathematical derivation, fitted parameter, or self-referential definition. No equations exist, no predictions reduce to inputs by construction, and no load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked to justify the framework. The evaluation summary stands as an independent (if limited) empirical step separate from the framework's definitional content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The proposal relies on the assumption that design science research is an appropriate method for framework development and that traditional SE artifacts can be directly repurposed without loss of effectiveness in AI contexts.

axioms (2)
  • domain assumption: Design science research methodology is suitable for creating and evaluating the Shift-Up framework
    Paper states it applies DSR to propose and preliminarily evaluate the framework.
  • domain assumption: Traditional software engineering artifacts like BDD, C4, and ADRs can function as effective machine-readable guardrails for GenAI agents
    Core reinterpretation presented without additional justification in the abstract.
invented entities (1)
  • Shift-Up framework (no independent evidence)
    purpose: To provide structural guardrails for GenAI-native software development
    Newly proposed construct combining existing practices

pith-pipeline@v0.9.0 · 5481 in / 1406 out tokens · 44907 ms · 2026-05-10T00:09:09.645376+00:00 · methodology

