Fragility of Layout-Based and Visual GUI Test Scripts: An Assessment Study on a Hybrid Mobile Application
Pith reviewed 2026-05-24 19:37 UTC · model grok-4.3
The pith
Layout-based GUI test methods for a hybrid mobile app needed changes in 20% of cases and visual methods in 30%, with each release affecting 3-4% of tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.
What carries the argument
Co-evolution observation of test methods with successive GUI releases to quantify the fraction requiring modification under layout-based and visual locators.
If this is right
- Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications.
- Each release can induce fragilities in 3-4% of the test methods.
- Principal causes for fragilities can be identified from the changes observed in the hybrid application.
- Guidelines for developers can be deduced from the identified fragility causes.
Where Pith is reading between the lines
- The relative fragility difference between layout-based and visual approaches may guide tool selection for similar apps.
- The causes identified here could inform test design patterns that reduce updates when GUIs change.
- If the rates scale with app size, large production suites would face substantial cumulative maintenance.
Load-bearing premise
The small test suite evaluated is representative of typical GUI test suites for hybrid mobile applications.
What would settle it
A study of a larger hybrid mobile app or different test suite showing modification rates below 10% across multiple releases would indicate the observed fragility levels do not hold generally.
Figures
read the original abstract
Context: Albeit different approaches exist for automated GUI testing of hybrid mobile applications, the practice appears to be not so commonly adopted by developers. A possible reason for such a low diffusion can be the fragility of the techniques, i.e. the frequent need for maintaining test cases when the GUI of the app is changed. Goal: In this paper, we perform an assessment of the maintenance needed by test cases for a hybrid mobile app, and the related fragility causes. Methods: We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. Results: We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Conclusion: Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical assessment of GUI test fragility for one hybrid mobile application. A small test suite is evaluated with a layout-based tool (Appium) and a visual tool (EyeAutomate) across app releases; the study finds that 20% of layout-based test methods and 30% of visual test methods required at least one modification, with each release inducing fragility in 3-4% of methods. Principal fragility causes are identified and guidelines for developers are deduced.
Significance. If the reported rates and causes are reliable, the work supplies concrete empirical data on maintenance costs for hybrid-app GUI testing and identifies actionable fragility sources. This could inform testing-tool choice and test-design practices, but the single-app, small-suite design restricts extrapolation to larger suites or other frameworks.
major comments (3)
- [Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.
- [Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.
- [Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.
minor comments (2)
- [Abstract] Abstract: the phrase 'a small test suite' should be replaced by the actual counts once they are added to the Methods section.
- [Introduction] The paper could cite additional recent studies on GUI-test maintenance for mobile or hybrid apps to better situate its contribution.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our empirical assessment of GUI test fragility. We will revise the manuscript to supply the requested contextual details and to qualify our conclusions to match the scope of the single-app case study. Point-by-point responses follow.
read point-by-point responses
-
Referee: [Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.
Authors: We agree these details should be explicit. The full paper describes the test suite size and the sequence of releases examined; we will add the exact totals for test methods and releases to the abstract, Methods, and Results sections and will state that the percentages are descriptive statistics from this case study, with no inferential statistics applied given the small sample. revision: yes
-
Referee: [Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.
Authors: We accept that the original wording overreaches the evidence. We will revise the Conclusion to limit the claim to the observed rates in the studied hybrid application and will add an explicit statement that generalization to large suites or other apps requires further work. revision: yes
-
Referee: [Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.
Authors: We will add a frequency breakdown of fragility causes to the Results section. The study recorded whether a modification was required rather than measuring effort (time or change size); we will note this scope limitation. We will expand the Methods description of how the small test suite was constructed and the criteria used for selecting the test methods. revision: partial
Circularity Check
Empirical reporting with no derivation or fitting
full rationale
This is a purely observational empirical study that reports measured fragility percentages (20% Layout-based, 30% Visual modified at least once; 3-4% per release) from direct inspection of one small test suite on a single hybrid app. No equations, parameter fitting, predictions, or first-principles derivations exist that could reduce to inputs by construction. The conclusion's generalization to large applications is an interpretive statement, not a circular reduction via self-definition or self-citation chains. No load-bearing steps match any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The test methods' modifications accurately measure fragility caused by GUI changes.
Reference graph
Works this paper leans on
-
[1]
Emil Alégroth, Zebao Gao, Rafael Oliveira, and Atif Memon. 2015. Conceptualiza- tion and evaluation of component-based testing unified with visual gui testing: an empirical study. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on . IEEE, 1–10
work page 2015
-
[2]
Emil Alégroth, Arvid Karlsson, and Alexander Radway. 2018. Continuous Inte- gration and Visual GUI Testing: Benefits and Drawbacks in Industrial Practice. In Software Testing, Verification and Validation (ICST), 2018 IEEE 11th International Conference on. IEEE, 172–181
work page 2018
-
[3]
Luca Ardito, Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2019. Espresso vs. EyeAutomate: An Experiment for the Comparison of Two Gen- erations of Android GUI Testing. In Proceedings of the Evaluation and Assess- ment on Software Engineering (EASE ’19) . ACM, New York, NY, USA, 13–22. https://doi.org/10.1145/3319008.3319022
-
[4]
L. Ardito, R. Coppola, M. Torchiano, and E. Alegroth. 2018. Towards Automated Translation between Generations of GUI-based Tests for Mobile Devices. In Pro- ceedings of INTUITESTBEDS 2018, joint Workshop of the 4th International Workshop on User Interface Test Automation, and 8th Workshop on TESting Techniques for event BasED Software. ACM
work page 2018
-
[5]
Stefan Bosnic, Ištvan Papp, and Sebastian Novak. 2016. The development of hybrid mobile applications with Apache Cordova. In2016 24th Telecommunications Forum (TELFOR). IEEE, 1–4
work page 2016
-
[6]
Andreas Bruns, Andreas Kornstadt, and Dennis Wichmann. 2009. Web application tests with selenium. IEEE software 26, 5 (2009)
work page 2009
-
[7]
Victor R Basili-Gianluigi Caldiera and H Dieter Rombach. 1994. Goal question metric paradigm. Encyclopedia of software engineering 1 (1994), 528–532
work page 1994
-
[8]
2012.Experimentation in software engineering
Claes Claes, Wohlin, R Runeson, Per, H Höst, Martin, CO Ohlsson, Magnus, R Regnell, Björn, and Anders Wesslén. 2012.Experimentation in software engineering. Springer
work page 2012
-
[9]
Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2017. Scripted GUI Testing of Android Apps: A Study on Diffusion, Evolution and Fragility. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering. ACM, 22–32
work page 2017
-
[10]
R. Coppola, M. Morisio, and M. Torchiano. 2018. Maintenance of Android Widget- Based GUI Testing: A Taxonomy of Test Case Modification Causes. In 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 151–158. https://doi.org/10.1109/ICSTW.2018.00044
-
[11]
doi:10.1109/TNNLS.2018.2869225
R. Coppola, M. Morisio, and M. Torchiano. 2018. Mobile GUI Testing Fragility: A Study on Open-Source Android Applications. IEEE Transactions on Reliability (2018), 1–24. https://doi.org/10.1109/TR.2018.2869227
-
[12]
Riccardo Coppola, Maurizio Morisio, Marco Torchiano, and Luca Ardito. 2019. Scripted GUI testing of Android open-source apps: evolution of test code and fragility causes. Empirical Software Engineering (18 May 2019). https://doi.org/ 10.1007/s10664-019-09722-9
-
[13]
Isabelle Dalmasso, Soumya Kanti Datta, Christian Bonnet, and Navid Nikaein
-
[14]
In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC)
Survey, comparison and evaluation of cross platform mobile application development tools. In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC). IEEE, 323–328
work page 2013
-
[15]
Ronald Jabangwe, Henry Edison, and Anh Nguyen Duc. 2018. Software engineer- ing process models for mobile app development: a systematic literature review. Journal of Systems and Software 145 (2018), 98–111
work page 2018
-
[16]
Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. 2013. Real challenges in mobile app development. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on . IEEE, 15–24
work page 2013
-
[17]
P. S. Kochhar, F. Thung, N. Nagappan, T. Zimmermann, and D. Lo. 2015. Un- derstanding the Test Automation Culture of App Developers. In2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST) . 1–10. https://doi.org/10.1109/ICST.2015.7102609
-
[18]
Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. 2018. Pesto: Automated migration of DOM-based Web tests towards the visual approach. Software Testing, Verification And Reliability 28, 4 (2018), e1665
work page 2018
-
[19]
Mario Linares-Vásquez, Kevin Moran, and Denys Poshyvanyk. 2017. Continuous, evolutionary and large-scale: A new perspective for automated mobile app testing. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on. IEEE, 399–410
work page 2017
-
[20]
Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empir- ical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering . ACM, 643–653
work page 2014
-
[21]
I. Malavolta, S. Ruberto, T. Soru, and V. Terragni. 2015. End Users’ Perception of Hybrid Mobile Apps in the Google Play Store. In 2015 IEEE International Conference on Mobile Services . 25–32. https://doi.org/10.1109/MobServ.2015.14
-
[22]
Gaurang Shah, Prayag Shah, and Rishikesh Muchhala. 2014. Software testing automation using appium. International Journal of Current Engineering and Technology 4, 5 (2014), 3528–3531
work page 2014
-
[23]
Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd annual ACM symposium on User interface software and technology . ACM, 183–192
work page 2009
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.