All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application

Ali Hassaan Mughal (Independent Researcher); Muhammad Bilal (Technical University of Munich)

arXiv: 2606.22475 · v1 · pith:QZGYOF5Anew · submitted 2026-06-21 · 💻 cs.SE · cs.AI · cs.LG

Pith reviewed 2026-06-26 09:59 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG

keywords software testing · bug fix analysis · integration seams · LLM web applications · multi-market software · test coverage gaps · production defects · end-to-end verification

The pith

In an LLM-integrated rental app, 44 percent of bug fixes addressed defects that escaped through four seams invisible to component unit tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks a production web application combining large language model output, multi-market internationalization, and browser-driven front ends over external data. Despite an automated test suite that grew to 1,553 cases and passed continuously, user-facing defects still reached production. The authors examined every one of the 252 bug-fix commits and sorted each by the boundary, or seam, through which the defect had escaped detection. They found that 44 percent of the fixes landed in four seams that ordinary component-level unit tests cannot observe: the live browser runtime, the non-default market, the end-to-end user flow, and the whole-system level. One defect shipped twice because a fix lacked a guard at its seam.

Core claim

A continuously passing suite of 1,553 automated tests did not prevent defects from reaching users of a live, multi-market, LLM-powered rental-search assistant. Classification of all 252 bug-fix commits showed that 44 percent of the fixes targeted defects that escaped through four seams component unit tests cannot reach: the live browser runtime, the non-default market, the end-to-end flow, and the whole-system level. A single fix that omitted a guard at its seam allowed the same defect to ship twice.
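The recurrence described above suggests pairing each fix with a check at the seam where the defect escaped. The following is a minimal sketch for the non-default-market seam only. The function `format_rent`, the market codes, and the currency table are all hypothetical and do not come from the paper; the point is that the guard runs the same assertion for every configured market rather than only the default one.

```python
# Hypothetical market table; codes and currency symbols are illustrative only.
MARKETS = {
    "us": {"symbol": "$", "prefix": True},
    "de": {"symbol": "€", "prefix": False},
    "pk": {"symbol": "Rs", "prefix": True},
}
DEFAULT_MARKET = "us"


def format_rent(amount: int, market: str = DEFAULT_MARKET) -> str:
    """Render a monthly rent for one market (hypothetical function under test)."""
    cfg = MARKETS[market]
    number = f"{amount:,}"
    return f"{cfg['symbol']}{number}" if cfg["prefix"] else f"{number} {cfg['symbol']}"


def check_all_markets() -> list:
    """Seam guard: exercise every market, not only the default one.

    Returns the markets whose output lacks that market's own currency symbol.
    """
    failures = []
    for market, cfg in MARKETS.items():
        rendered = format_rent(1250, market)
        if cfg["symbol"] not in rendered:
            failures.append(market)
    return failures
```

A test that only calls `format_rent(1250)` exercises the default market and would stay green if a non-default entry regressed; looping over `MARKETS` is what places the check at the seam.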

What carries the argument

The four-seam framework, which sorts each bug-fix commit by the boundary through which its defect escaped automated detection: live browser runtime, non-default market, end-to-end flow, and whole-system level.

If this is right

Component unit tests alone leave a large share of the defect surface unmonitored in this class of application (about 44 percent of fixes in the reported project).
A fix placed without a guard at its seam allows the same defect to recur in production.
The seam carrying the largest share of fixes is the one a team should instrument first.
Practices that add targeted checks at each seam reduced escaped defects in the reported project.
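One way to act on the third point is to count labeled fixes per seam and look at the largest bucket. This is a sketch under assumptions: the label strings are invented for illustration, and the sample data is made up, since the per-seam counts are not given on this page.

```python
from collections import Counter

# The four seams named in the paper; the label spellings are assumptions.
SEAMS = ("browser_runtime", "non_default_market", "end_to_end_flow", "whole_system")


def dominant_seam(labels):
    """Return (seam, count, share_of_all_fixes) for the most frequent seam label.

    `labels` holds one label per bug-fix commit. Labels outside SEAMS count
    toward the denominator but cannot be returned as the dominant seam.
    """
    counts = Counter(label for label in labels if label in SEAMS)
    if not counts:
        return None
    seam, count = counts.most_common(1)[0]
    return seam, count, count / len(labels)


if __name__ == "__main__":
    # Made-up labels for ten commits, purely to show the call.
    sample = ["component", "non_default_market", "browser_runtime",
              "non_default_market", "component", "end_to_end_flow",
              "component", "non_default_market", "whole_system", "component"]
    print(dominant_seam(sample))
```

Keeping the non-seam fixes in the denominator makes the returned share comparable to the paper's way of reporting a fraction of all fixes.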

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar seam analysis could be applied to other LLM-augmented or externally dependent systems to locate their dominant escape routes.
Teams could build lightweight scripts that scan commit messages and test files to suggest which seam a new fix belongs to.
The measured 44 percent figure supplies a baseline for comparing test effectiveness across projects that share the same three hard-to-test ingredients.
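The commit-scanning idea above could start as a plain keyword heuristic. The sketch below is not a rubric from the paper: the hint patterns and seam label names are assumptions, and the output is meant as a suggestion for a human reviewer rather than a classification.

```python
import re

# Hypothetical keyword hints per seam; a real project would tune these
# against its own commit history.
SEAM_HINTS = {
    "browser_runtime": [r"\bbrowser\b", r"\bdom\b", r"\bhydrat", r"\bplaywright\b"],
    "non_default_market": [r"\blocale\b", r"\bi18n\b", r"\bmarket\b", r"\bcurrency\b"],
    "end_to_end_flow": [r"\be2e\b", r"\bend-to-end\b", r"\buser flow\b"],
    "whole_system": [r"\bdeploy", r"\bconfig\b", r"\benv(ironment)?\b", r"\bintegration\b"],
}


def suggest_seam(commit_message, changed_paths=()):
    """Rank seams by how many hint patterns match the message and file paths."""
    text = " ".join([commit_message, *changed_paths]).lower()
    scores = {
        seam: sum(1 for pattern in patterns if re.search(pattern, text))
        for seam, patterns in SEAM_HINTS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Only seams with at least one hit are returned; an empty list means
    # the heuristic has no suggestion.
    return [(seam, score) for seam, score in ranked if score > 0]
```

For example, `suggest_seam("fix: currency symbol wrong for de locale", ["src/i18n/format.ts"])` ranks the non-default-market seam first because three of its hints match.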

Load-bearing premise

The manual classification of the 252 bug-fix commits into the four seam categories accurately captures the boundary through which each defect escaped the automated test suite.
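Because the headline share is a direct count, its sensitivity to labeling choices is simple arithmetic. The sketch below lists which whole-number counts out of 252 round to 44 percent and shows how far the share moves if some commits were relabeled. The count of 111 used in the note afterwards is an assumption for illustration; the exact count is not stated on this page.

```python
TOTAL_FIXES = 252  # bug-fix commits reported in the paper


def counts_rounding_to(percent, total=TOTAL_FIXES):
    """Whole-number counts whose share of `total` rounds to `percent`."""
    return [c for c in range(total + 1) if round(100 * c / total) == percent]


def share_after_relabel(count, moved, total=TOTAL_FIXES):
    """Share in percent after `moved` commits join (positive) or leave (negative) the seams."""
    return 100 * (count + moved) / total
```

Each commit is worth about 0.4 percentage points (100 / 252), so under the assumed count of 111, relabeling ten commits out of the four seams would lower the share from about 44 to about 40 percent.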

What would settle it

Independent reclassification of the same 252 commits by reviewers unaware of the original labels yields a materially different distribution across the four seams.
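A reclassification exercise of this kind is commonly scored with a chance-corrected agreement statistic. The following is a minimal pure-Python sketch of Cohen's kappa for two raters. The paper reports no such statistic, and the label values passed in are whatever seam names a team chooses.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance if each rater labeled independently
    # with their own marginal label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 would indicate that the seam assignments are stable across raters; a value near 0 would mean the agreement is no better than chance, which is the outcome that would undercut the reported distribution.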

Original abstract

Modern web applications increasingly combine three ingredients that are hard to test: output from large language models, multi-market internationalization, and browser-driven front-ends over external data sources. We report on a production rental-search assistant whose automated suite grew to 1,553 test cases in six weeks. The suite passed continuously, yet user-facing defects continued to reach production. We studied all 252 bug-fix commits in the project and classified each by the boundary, or seam, it escaped through. About 44 percent of the fixes fall in four seams that component-level unit tests cannot observe: the live browser runtime, the non-default market, the end-to-end flow, and the whole-system level. A fix without a guard at the seam let one defect ship twice. We present the four-seam framework, the measured defect distribution, and the practices we adopted, including a simple way for a team to find the seam that carries the most fixes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

One-project case study counts 44% of fixes escaping through four seams in an LLM web app, but the manual commit classification has no reported validation.

The main takeaway is that in this rental-search app the authors tracked 252 bug-fix commits and found 44% landed in four seams that component unit tests miss: live browser runtime, non-default market, end-to-end flow, and whole-system level. One defect shipped twice because no guard was placed at the seam.

The paper does a straightforward job of reporting numbers from a real production system whose automated suite reached 1,553 tests in six weeks yet still let defects reach users. The seam framework is presented as a practical way for teams to locate where their tests are blind, and the authors include the practices they adopted after the analysis.

The soft spot is the classification step itself. The 44% figure and the four categories rest on manual review of commit messages and diffs. The abstract gives no pre-defined rubric, no inter-rater reliability number, and no blinding procedure. Because the seam definitions appear derived from the same data, interpretive choices could shift the distribution. The single-project scope is a second limit, though the authors treat the work as lessons from one case rather than a broad claim.

This is for practitioners shipping LLM-integrated, multi-market browser applications and for researchers who collect empirical testing data. Readers who want measured defect distributions in hard-to-test production code will get something concrete.

The paper shows clear engagement with the testing problem in this setting. It deserves a serious referee because the empirical measurement is timely even if the method needs more documentation.

Recommendation: send it for peer review and ask specifically for the classification criteria and any reproducibility steps they took.

Referee Report

1 major / 0 minor

Summary. The paper presents a case study of a production rental-search web application integrating LLMs, multi-market internationalization, and browser front-ends. An automated test suite grew to 1,553 cases yet defects reached production; analysis of all 252 bug-fix commits classified ~44% as escaping through four seams (live browser runtime, non-default market, end-to-end flow, whole-system level) invisible to component-level unit tests. The authors introduce a four-seam framework, report the defect distribution, note one defect shipping twice without a seam guard, and describe practices including a method to identify the highest-impact seam.

Significance. If the classification is reproducible and unbiased, the work supplies concrete empirical data on defect escape paths in LLM-integrated, multi-market web systems and demonstrates that component testing alone is insufficient. The four-seam framework and the simple seam-identification practice are actionable for practitioners; the single-project commit analysis is a strength when the classification criteria are made transparent.

major comments (1)

[Classification / Methods] Classification section (methods describing the 252 commits): no pre-defined coding rubric, inter-rater reliability statistic, blinding procedure, or exclusion rules are supplied for mapping commits to the four seams. Because the seam definitions are derived from the same data and the 44% figure is a direct count from this assignment, the central claim that these seams are the primary escape routes cannot be assessed without evidence that the classification is stable and not post-hoc.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on methodological transparency. We address the concern about the commit classification below and will revise the manuscript to improve clarity.

Referee: [Classification / Methods] Classification section (methods describing the 252 commits): no pre-defined coding rubric, inter-rater reliability statistic, blinding procedure, or exclusion rules are supplied for mapping commits to the four seams. Because the seam definitions are derived from the same data and the 44% figure is a direct count from this assignment, the central claim that these seams are the primary escape routes cannot be assessed without evidence that the classification is stable and not post-hoc.

Authors: We agree that the original manuscript provides insufficient detail on how the 252 commits were mapped to seams. The classification was performed solely by the first author, who had complete project context including commit messages, diffs, and issue trackers. Seams were identified iteratively during analysis rather than from a pre-existing rubric; the four seams emerged from grouping fixes by the boundary at which they escaped component tests. No blinding or multiple raters were involved, as this is a single-project case study. In the revision we will add a dedicated methods subsection that: (1) states the single-rater nature explicitly, (2) supplies the explicit criteria and decision rules used for each seam with two concrete commit examples per seam, and (3) notes that no commits were excluded. We will also clarify that the 44% figure and the framework are observations from this specific system rather than a general claim of primacy. These changes will allow readers to evaluate the process even though inter-rater reliability cannot be reported.

Revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper presents an empirical report based on direct counts from classifying 252 bug-fix commits into four seam categories. The 44% figure is obtained by straightforward aggregation of these manual assignments, with no intervening equations, fitted parameters, predictions derived from subsets of the data, or self-citations that carry the central claim. No self-definitional loops, ansatzes smuggled in via citation, or renamings of known results appear; the distribution is simply the observed output of the classification process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical case-study report with no mathematical model. No free parameters, axioms, or invented entities are introduced.

