All Green, Still Broken: Real-Flow Verification Lessons from an LLM-Integrated, Multi-Market Web Application
Pith reviewed 2026-06-26 09:59 UTC · model grok-4.3
The pith
In an LLM-integrated rental-search app, about 44 percent of bug fixes addressed defects that escaped through four seams invisible to component-level unit tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A continuously passing suite of 1,553 automated tests did not prevent defects from reaching users of a live, multi-market, LLM-powered rental-search assistant. Classifying all 252 bug-fix commits showed that about 44 percent of the fixes targeted defects that escaped through four seams that component unit tests cannot reach: the live browser runtime, the non-default market, the end-to-end flow, and the whole-system level. In one case, a fix that omitted a guard at its seam allowed the same defect to ship twice.
What carries the argument
The four-seam framework, which sorts each bug-fix commit by the boundary through which its defect escaped automated detection: live browser runtime, non-default market, end-to-end flow, and whole-system level.
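The bookkeeping behind the framework can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the `Seam` enum, the `OTHER` catch-all category, the `seam_report` helper, and the toy labels are all assumptions introduced here, and the toy counts are not the paper's data.

```python
from collections import Counter
from enum import Enum


class Seam(Enum):
    """The four escape boundaries named in the framework, plus a catch-all."""
    BROWSER_RUNTIME = "live browser runtime"
    NON_DEFAULT_MARKET = "non-default market"
    END_TO_END_FLOW = "end-to-end flow"
    WHOLE_SYSTEM = "whole-system level"
    OTHER = "reachable by component tests"  # assumption: not a label from the paper


def seam_report(labels):
    """Return (share of fixes inside the four seams, seams ranked by fix count)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labelled commits")
    escaped = total - counts[Seam.OTHER]
    ranked = [(seam, n) for seam, n in counts.most_common() if seam is not Seam.OTHER]
    return escaped / total, ranked


# Toy labels for illustration only; these are NOT the paper's 252 commits.
toy_labels = (
    [Seam.OTHER] * 5
    + [Seam.NON_DEFAULT_MARKET] * 3
    + [Seam.BROWSER_RUNTIME] * 1
    + [Seam.END_TO_END_FLOW] * 1
)
share, ranked = seam_report(toy_labels)
print(f"{share:.0%} of toy fixes fall in the four seams")
print("largest seam:", ranked[0][0].value)
```

With one label per bug-fix commit, the same function yields both the headline share and the ordering of seams by fix count.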
If this is right
- Component unit tests alone leave a large part of the defect surface unmonitored in this class of application: the seams behind about 44 percent of the fixes in the reported project.
- A fix placed without a guard at its seam allows the same defect to recur in production.
- The seam carrying the largest share of fixes is the one a team should instrument first.
- Practices that add targeted checks at each seam reduced escaped defects in the reported project.
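What a guard at a seam might look like can be sketched for the non-default-market seam. Everything below is hypothetical: the `MARKETS` table, the two formatter functions, and `failing_markets` are invented for illustration and do not come from the project. The point is only that a check restricted to the default market passes for the buggy formatter, while a check that iterates over every market does not.

```python
# Hypothetical per-market configuration; names and values are illustrative only.
MARKETS = {
    "de": {"symbol": "€"},
    "uk": {"symbol": "£"},
    "us": {"symbol": "$"},
}
DEFAULT_MARKET = "de"


def format_rent_buggy(amount, market=DEFAULT_MARKET):
    """Buggy formatter: ignores the market and always uses the default symbol."""
    return f"€{amount:,.0f}"


def format_rent_fixed(amount, market=DEFAULT_MARKET):
    """Fixed formatter: looks up the symbol for the requested market."""
    return f"{MARKETS[market]['symbol']}{amount:,.0f}"


def failing_markets(formatter):
    """Seam guard: run the same check in every market, not only the default."""
    return [
        market
        for market, cfg in MARKETS.items()
        if not formatter(1200, market).startswith(cfg["symbol"])
    ]


# A default-market-only check cannot tell the two formatters apart.
print(format_rent_buggy(1200) == format_rent_fixed(1200))
# The guard reports the markets where each formatter misbehaves.
print(failing_markets(format_rent_buggy))
print(failing_markets(format_rent_fixed))
```

Keeping such a guard in the suite alongside the fix is what would stop the same defect from shipping a second time at that seam.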
Where Pith is reading between the lines
- Similar seam analysis could be applied to other LLM-augmented or externally dependent systems to locate their dominant escape routes.
- Teams could build lightweight scripts that scan commit messages and test files to suggest which seam a new fix belongs to.
- The measured 44 percent figure supplies a baseline for comparing test effectiveness across projects that share the same three hard-to-test ingredients.
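A lightweight suggestion script of the kind mentioned above might start from keyword matching on commit messages. The sketch below is an assumption-laden starting point: the `SEAM_HINTS` keyword lists and the `suggest_seam` function are hypothetical, and a real team would need to tune the hints against its own history and review every suggestion by hand.

```python
import re

# Hypothetical keyword hints per seam; not derived from the paper's commits.
SEAM_HINTS = {
    "live browser runtime": ["browser", "dom", "hydration", "safari", "viewport"],
    "non-default market": ["market", "locale", "i18n", "currency", "translation"],
    "end-to-end flow": ["e2e", "flow", "redirect", "navigation", "session"],
    "whole-system level": ["deploy", "timeout", "rate limit", "integration", "config"],
}


def suggest_seam(commit_message):
    """Return the seam whose keywords match most often, or None if none match."""
    text = commit_message.lower()
    scores = {
        seam: sum(
            1 for kw in keywords if re.search(rf"\b{re.escape(kw)}\b", text)
        )
        for seam, keywords in SEAM_HINTS.items()
    }
    best = max(scores, key=scores.get)  # ties resolve to the first seam listed
    return best if scores[best] > 0 else None


print(suggest_seam("fix: currency symbol wrong in UK market"))
print(suggest_seam("refactor: rename helper"))
```

Returning `None` when nothing matches keeps the script from forcing a seam onto fixes that component tests could have caught.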
Load-bearing premise
The manual classification of the 252 bug-fix commits into the four seam categories accurately captures the boundary through which each defect escaped the automated test suite.
What would settle it
Independent reclassification of the same 252 commits by reviewers unaware of the original labels yields a materially different distribution across the four seams.
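One conventional way to quantify how far an independent reclassification departs from the original labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The paper does not report this statistic; the sketch below only shows how a second rater's labels could be compared, and the example label lists are toy data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for illustration only.
original = ["market", "market", "browser", "flow"]
print(cohens_kappa(original, original))
print(cohens_kappa(["x", "x", "y", "y"], ["x", "y", "x", "y"]))
```

A kappa near 1 would indicate that the seam assignment is stable across raters; a value near 0 would indicate agreement no better than chance.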
Original abstract
Modern web applications increasingly combine three ingredients that are hard to test: output from large language models, multi-market internationalization, and browser-driven front-ends over external data sources. We report on a production rental-search assistant whose automated suite grew to 1,553 test cases in six weeks. The suite passed continuously, yet user-facing defects continued to reach production. We studied all 252 bug-fix commits in the project and classified each by the boundary, or seam, it escaped through. About 44 percent of the fixes fall in four seams that component-level unit tests cannot observe: the live browser runtime, the non-default market, the end-to-end flow, and the whole-system level. A fix without a guard at the seam let one defect ship twice. We present the four-seam framework, the measured defect distribution, and the practices we adopted, including a simple way for a team to find the seam that carries the most fixes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a case study of a production rental-search web application integrating LLMs, multi-market internationalization, and browser front-ends. An automated test suite grew to 1,553 cases yet defects reached production; analysis of all 252 bug-fix commits classified ~44% as escaping through four seams (live browser runtime, non-default market, end-to-end flow, whole-system level) invisible to component-level unit tests. The authors introduce a four-seam framework, report the defect distribution, note one defect shipping twice without a seam guard, and describe practices including a method to identify the highest-impact seam.
Significance. If the classification is reproducible and unbiased, the work supplies concrete empirical data on defect escape paths in LLM-integrated, multi-market web systems and demonstrates that component testing alone is insufficient. The four-seam framework and the simple seam-identification practice are actionable for practitioners; the single-project commit analysis is a strength when the classification criteria are made transparent.
Major comments (1)
- [Classification / Methods] Classification section (methods describing the 252 commits): no pre-defined coding rubric, inter-rater reliability statistic, blinding procedure, or exclusion rules are supplied for mapping commits to the four seams. Because the seam definitions are derived from the same data and the 44% figure is a direct count from this assignment, the central claim that these seams are the primary escape routes cannot be assessed without evidence that the classification is stable and not post-hoc.
Simulated Author's Rebuttal
We thank the referee for the constructive comment on methodological transparency. We address the concern about the commit classification below and will revise the manuscript to improve clarity.
Point-by-point responses
- Referee: [Classification / Methods] Classification section (methods describing the 252 commits): no pre-defined coding rubric, inter-rater reliability statistic, blinding procedure, or exclusion rules are supplied for mapping commits to the four seams. Because the seam definitions are derived from the same data and the 44% figure is a direct count from this assignment, the central claim that these seams are the primary escape routes cannot be assessed without evidence that the classification is stable and not post-hoc.
  Authors: We agree that the original manuscript provides insufficient detail on how the 252 commits were mapped to seams. The classification was performed solely by the first author, who had complete project context including commit messages, diffs, and issue trackers. Seams were identified iteratively during analysis rather than from a pre-existing rubric; the four seams emerged from grouping fixes by the boundary at which they escaped component tests. No blinding or multiple raters were involved, as this is a single-project case study. In the revision we will add a dedicated methods subsection that: (1) states the single-rater nature explicitly, (2) supplies the explicit criteria and decision rules used for each seam with two concrete commit examples per seam, and (3) notes that no commits were excluded. We will also clarify that the 44% figure and the framework are observations from this specific system rather than a general claim of primacy. These changes will allow readers to evaluate the process even though inter-rater reliability cannot be reported.
  Revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents an empirical report based on direct counts from classifying 252 bug-fix commits into four seam categories. The 44% figure is obtained by straightforward aggregation of these manual assignments with no intervening equations, fitted parameters, predictions derived from subsets of the data, or self-citations that bear the central claim. No self-definitional loops, ansatzes smuggled via citation, or renamings of known results appear; the distribution is simply the observed output of the classification process itself.
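As a rough sense of scale for that aggregation, and assuming the rounded figure of about 44 percent applies to all 252 commits (the exact count is not stated on this page), the arithmetic is:

```latex
0.44 \times 252 = 110.88 \approx 111 \ \text{commits}
```

Because the percentage is rounded, the true count could differ from this estimate by a commit or two.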
IEEE Software, 2026. All Green, Still Broken.
Muhammad Bilal is an AI and Digitalization Consultant in the German industrial sector. He holds a Master of ...