Fragility of Layout-Based and Visual GUI Test Scripts: An Assessment Study on a Hybrid Mobile Application

Luca Ardito; Marco Torchiano; Riccardo Coppola

arxiv: 1907.08164 · v2 · pith:RWAXGQ4Unew · submitted 2019-07-18 · 💻 cs.SE

Fragility of Layout-Based and Visual GUI Test Scripts: An Assessment Study on a Hybrid Mobile Application

Riccardo Coppola , Luca Ardito , Marco Torchiano This is my paper

Pith reviewed 2026-05-24 19:37 UTC · model grok-4.3

classification 💻 cs.SE

keywords GUI testingtest fragilityhybrid mobile applicationslayout-based testingvisual testingsoftware maintenanceautomated testing

0 comments

The pith

Layout-based GUI test methods for a hybrid mobile app needed changes in 20% of cases and visual methods in 30%, with each release affecting 3-4% of tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how test scripts for a hybrid mobile application's GUI must be updated as the app evolves through releases. It compares layout-based scripts, which locate elements by structure, against visual scripts, which match screen images, in a small test suite. The observed modification rates indicate that test maintenance can become a recurring cost. This matters because frequent updates may discourage developers from relying on automated GUI testing for mobile apps. The study also lists specific fragility causes and derives guidelines to address them.

Core claim

We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.

What carries the argument

Co-evolution observation of test methods with successive GUI releases to quantify the fraction requiring modification under layout-based and visual locators.

If this is right

Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications.
Each release can induce fragilities in 3-4% of the test methods.
Principal causes for fragilities can be identified from the changes observed in the hybrid application.
Guidelines for developers can be deduced from the identified fragility causes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The relative fragility difference between layout-based and visual approaches may guide tool selection for similar apps.
The causes identified here could inform test design patterns that reduce updates when GUIs change.
If the rates scale with app size, large production suites would face substantial cumulative maintenance.

Load-bearing premise

The small test suite evaluated is representative of typical GUI test suites for hybrid mobile applications.

What would settle it

A study of a larger hybrid mobile app or different test suite showing modification rates below 10% across multiple releases would indicate the observed fragility levels do not hold generally.

Figures

Figures reproduced from arXiv: 1907.08164 by Luca Ardito, Marco Torchiano, Riccardo Coppola.

**Figure 2.** Figure 2: Defect detected by Visual test case T3 2.4 Experimental Procedure The main usage scenarios of the app were exercised with fifteen different test cases, listed in table 3. Test scripts were not defined for usage scenarios producing output that rapidly changes over time, based on time/date or device location (e.g., News Feed, Lectures Timetable, Public Transportation, Available Rooms). Several usage scenari… view at source ↗

**Figure 3.** Figure 3: Non-fragile and fragile Layout-based and visual [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Examples of Pure Graphic Fragilities (a) Visual TC11, rel. 1.5.6 (b) Visual TC13, rel. 1.4.0 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of Widget Arrangement Fragilities [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

Context: Albeit different approaches exist for automated GUI testing of hybrid mobile applications, the practice appears to be not so commonly adopted by developers. A possible reason for such a low diffusion can be the fragility of the techniques, i.e. the frequent need for maintaining test cases when the GUI of the app is changed. Goal: In this paper, we perform an assessment of the maintenance needed by test cases for a hybrid mobile app, and the related fragility causes. Methods: We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. Results: We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Conclusion: Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reports fragility rates (20% layout, 30% visual) from one small test suite on a single hybrid app, but the jump to maintenance costs in large apps rests on an untested assumption of representativeness.

read the letter

The core takeaway is that this study tracked a small GUI test suite for one hybrid mobile app across releases, using Appium for layout-based tests and EyeAutomate for visual ones, and logged that 20% of the layout methods and 30% of the visual ones needed edits at least once, with each release touching 3-4% of them. They also note some causes and offer guidelines. That gives a concrete data point on how often these scripts break when the GUI changes, which is new for hybrid apps based on the abstract. The work is straightforward empirical observation without circular fitting or invented entities, and the percentages come directly from their co-evolution tracking. Credit for shipping actual numbers instead of just opinions. The main limitation is scale. Everything rests on one app and a small suite, so the conclusion that fragility induces relevant efforts in large applications lacks direct support—no comparisons to other apps, bigger suites, or different frameworks. The stress-test concern about generalization holds up from the abstract alone. This is useful for people already working on mobile GUI test maintenance who need a starting benchmark, but it won't shift the field much on its own. A serious referee could ask for more validation or clearer limits on the claims, and the paper is coherent enough on its own terms to deserve that step rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper reports an empirical assessment of GUI test fragility for one hybrid mobile application. A small test suite is evaluated with a layout-based tool (Appium) and a visual tool (EyeAutomate) across app releases; the study finds that 20% of layout-based test methods and 30% of visual test methods required at least one modification, with each release inducing fragility in 3-4% of methods. Principal fragility causes are identified and guidelines for developers are deduced.

Significance. If the reported rates and causes are reliable, the work supplies concrete empirical data on maintenance costs for hybrid-app GUI testing and identifies actionable fragility sources. This could inform testing-tool choice and test-design practices, but the single-app, small-suite design restricts extrapolation to larger suites or other frameworks.

major comments (3)

[Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.
[Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.
[Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.

minor comments (2)

[Abstract] Abstract: the phrase 'a small test suite' should be replaced by the actual counts once they are added to the Methods section.
[Introduction] The paper could cite additional recent studies on GUI-test maintenance for mobile or hybrid apps to better situate its contribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our empirical assessment of GUI test fragility. We will revise the manuscript to supply the requested contextual details and to qualify our conclusions to match the scope of the single-app case study. Point-by-point responses follow.

read point-by-point responses

Referee: [Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.

Authors: We agree these details should be explicit. The full paper describes the test suite size and the sequence of releases examined; we will add the exact totals for test methods and releases to the abstract, Methods, and Results sections and will state that the percentages are descriptive statistics from this case study, with no inferential statistics applied given the small sample. revision: yes
Referee: [Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.

Authors: We accept that the original wording overreaches the evidence. We will revise the Conclusion to limit the claim to the observed rates in the studied hybrid application and will add an explicit statement that generalization to large suites or other apps requires further work. revision: yes
Referee: [Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.

Authors: We will add a frequency breakdown of fragility causes to the Results section. The study recorded whether a modification was required rather than measuring effort (time or change size); we will note this scope limitation. We will expand the Methods description of how the small test suite was constructed and the criteria used for selecting the test methods. revision: partial

Circularity Check

0 steps flagged

Empirical reporting with no derivation or fitting

full rationale

This is a purely observational empirical study that reports measured fragility percentages (20% Layout-based, 30% Visual modified at least once; 3-4% per release) from direct inspection of one small test suite on a single hybrid app. No equations, parameter fitting, predictions, or first-principles derivations exist that could reduce to inputs by construction. The conclusion's generalization to large applications is an interpretive statement, not a circular reduction via self-definition or self-citation chains. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard assumptions in empirical software engineering studies about representativeness and measurement validity.

axioms (1)

domain assumption The test methods' modifications accurately measure fragility caused by GUI changes.
This is the core measurement assumption in the study.

pith-pipeline@v0.9.0 · 5739 in / 1028 out tokens · 27197 ms · 2026-05-24T19:37:50.312869+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Emil Alégroth, Zebao Gao, Rafael Oliveira, and Atif Memon. 2015. Conceptualiza- tion and evaluation of component-based testing unified with visual gui testing: an empirical study. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on . IEEE, 1–10

work page 2015
[2]

Emil Alégroth, Arvid Karlsson, and Alexander Radway. 2018. Continuous Inte- gration and Visual GUI Testing: Benefits and Drawbacks in Industrial Practice. In Software Testing, Verification and Validation (ICST), 2018 IEEE 11th International Conference on. IEEE, 172–181

work page 2018
[3]

Luca Ardito, Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2019. Espresso vs. EyeAutomate: An Experiment for the Comparison of Two Gen- erations of Android GUI Testing. In Proceedings of the Evaluation and Assess- ment on Software Engineering (EASE ’19) . ACM, New York, NY, USA, 13–22. https://doi.org/10.1145/3319008.3319022

work page doi:10.1145/3319008.3319022 2019
[4]

Ardito, R

L. Ardito, R. Coppola, M. Torchiano, and E. Alegroth. 2018. Towards Automated Translation between Generations of GUI-based Tests for Mobile Devices. In Pro- ceedings of INTUITESTBEDS 2018, joint Workshop of the 4th International Workshop on User Interface Test Automation, and 8th Workshop on TESting Techniques for event BasED Software. ACM

work page 2018
[5]

Stefan Bosnic, Ištvan Papp, and Sebastian Novak. 2016. The development of hybrid mobile applications with Apache Cordova. In2016 24th Telecommunications Forum (TELFOR). IEEE, 1–4

work page 2016
[6]

Andreas Bruns, Andreas Kornstadt, and Dennis Wichmann. 2009. Web application tests with selenium. IEEE software 26, 5 (2009)

work page 2009
[7]

Victor R Basili-Gianluigi Caldiera and H Dieter Rombach. 1994. Goal question metric paradigm. Encyclopedia of software engineering 1 (1994), 528–532

work page 1994
[8]

2012.Experimentation in software engineering

Claes Claes, Wohlin, R Runeson, Per, H Höst, Martin, CO Ohlsson, Magnus, R Regnell, Björn, and Anders Wesslén. 2012.Experimentation in software engineering. Springer

work page 2012
[9]

Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2017. Scripted GUI Testing of Android Apps: A Study on Diffusion, Evolution and Fragility. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering. ACM, 22–32

work page 2017
[10]

Coppola, M

R. Coppola, M. Morisio, and M. Torchiano. 2018. Maintenance of Android Widget- Based GUI Testing: A Taxonomy of Test Case Modification Causes. In 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 151–158. https://doi.org/10.1109/ICSTW.2018.00044

work page doi:10.1109/icstw.2018.00044 2018
[11]

doi:10.1109/TNNLS.2018.2869225

R. Coppola, M. Morisio, and M. Torchiano. 2018. Mobile GUI Testing Fragility: A Study on Open-Source Android Applications. IEEE Transactions on Reliability (2018), 1–24. https://doi.org/10.1109/TR.2018.2869227

work page doi:10.1109/tr.2018.2869227 2018
[12]

Riccardo Coppola, Maurizio Morisio, Marco Torchiano, and Luca Ardito. 2019. Scripted GUI testing of Android open-source apps: evolution of test code and fragility causes. Empirical Software Engineering (18 May 2019). https://doi.org/ 10.1007/s10664-019-09722-9

work page doi:10.1007/s10664-019-09722-9 2019
[13]

Isabelle Dalmasso, Soumya Kanti Datta, Christian Bonnet, and Navid Nikaein

work page
[14]

In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC)

Survey, comparison and evaluation of cross platform mobile application development tools. In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC). IEEE, 323–328

work page 2013
[15]

Ronald Jabangwe, Henry Edison, and Anh Nguyen Duc. 2018. Software engineer- ing process models for mobile app development: a systematic literature review. Journal of Systems and Software 145 (2018), 98–111

work page 2018
[16]

Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. 2013. Real challenges in mobile app development. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on . IEEE, 15–24

work page 2013
[17]

P. S. Kochhar, F. Thung, N. Nagappan, T. Zimmermann, and D. Lo. 2015. Un- derstanding the Test Automation Culture of App Developers. In2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST) . 1–10. https://doi.org/10.1109/ICST.2015.7102609

work page doi:10.1109/icst.2015.7102609 2015
[18]

Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. 2018. Pesto: Automated migration of DOM-based Web tests towards the visual approach. Software Testing, Verification And Reliability 28, 4 (2018), e1665

work page 2018
[19]

Mario Linares-Vásquez, Kevin Moran, and Denys Poshyvanyk. 2017. Continuous, evolutionary and large-scale: A new perspective for automated mobile app testing. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on. IEEE, 399–410

work page 2017
[20]

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empir- ical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering . ACM, 643–653

work page 2014
[21]

Malavolta, S

I. Malavolta, S. Ruberto, T. Soru, and V. Terragni. 2015. End Users’ Perception of Hybrid Mobile Apps in the Google Play Store. In 2015 IEEE International Conference on Mobile Services . 25–32. https://doi.org/10.1109/MobServ.2015.14

work page doi:10.1109/mobserv.2015.14 2015
[22]

Gaurang Shah, Prayag Shah, and Rishikesh Muchhala. 2014. Software testing automation using appium. International Journal of Current Engineering and Technology 4, 5 (2014), 3528–3531

work page 2014
[23]

Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd annual ACM symposium on User interface software and technology . ACM, 183–192

work page 2009

[1] [1]

Emil Alégroth, Zebao Gao, Rafael Oliveira, and Atif Memon. 2015. Conceptualiza- tion and evaluation of component-based testing unified with visual gui testing: an empirical study. In Software Testing, Verification and Validation (ICST), 2015 IEEE 8th International Conference on . IEEE, 1–10

work page 2015

[2] [2]

Emil Alégroth, Arvid Karlsson, and Alexander Radway. 2018. Continuous Inte- gration and Visual GUI Testing: Benefits and Drawbacks in Industrial Practice. In Software Testing, Verification and Validation (ICST), 2018 IEEE 11th International Conference on. IEEE, 172–181

work page 2018

[3] [3]

Luca Ardito, Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2019. Espresso vs. EyeAutomate: An Experiment for the Comparison of Two Gen- erations of Android GUI Testing. In Proceedings of the Evaluation and Assess- ment on Software Engineering (EASE ’19) . ACM, New York, NY, USA, 13–22. https://doi.org/10.1145/3319008.3319022

work page doi:10.1145/3319008.3319022 2019

[4] [4]

Ardito, R

L. Ardito, R. Coppola, M. Torchiano, and E. Alegroth. 2018. Towards Automated Translation between Generations of GUI-based Tests for Mobile Devices. In Pro- ceedings of INTUITESTBEDS 2018, joint Workshop of the 4th International Workshop on User Interface Test Automation, and 8th Workshop on TESting Techniques for event BasED Software. ACM

work page 2018

[5] [5]

Stefan Bosnic, Ištvan Papp, and Sebastian Novak. 2016. The development of hybrid mobile applications with Apache Cordova. In2016 24th Telecommunications Forum (TELFOR). IEEE, 1–4

work page 2016

[6] [6]

Andreas Bruns, Andreas Kornstadt, and Dennis Wichmann. 2009. Web application tests with selenium. IEEE software 26, 5 (2009)

work page 2009

[7] [7]

Victor R Basili-Gianluigi Caldiera and H Dieter Rombach. 1994. Goal question metric paradigm. Encyclopedia of software engineering 1 (1994), 528–532

work page 1994

[8] [8]

2012.Experimentation in software engineering

Claes Claes, Wohlin, R Runeson, Per, H Höst, Martin, CO Ohlsson, Magnus, R Regnell, Björn, and Anders Wesslén. 2012.Experimentation in software engineering. Springer

work page 2012

[9] [9]

Riccardo Coppola, Maurizio Morisio, and Marco Torchiano. 2017. Scripted GUI Testing of Android Apps: A Study on Diffusion, Evolution and Fragility. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering. ACM, 22–32

work page 2017

[10] [10]

Coppola, M

R. Coppola, M. Morisio, and M. Torchiano. 2018. Maintenance of Android Widget- Based GUI Testing: A Taxonomy of Test Case Modification Causes. In 2018 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). 151–158. https://doi.org/10.1109/ICSTW.2018.00044

work page doi:10.1109/icstw.2018.00044 2018

[11] [11]

doi:10.1109/TNNLS.2018.2869225

R. Coppola, M. Morisio, and M. Torchiano. 2018. Mobile GUI Testing Fragility: A Study on Open-Source Android Applications. IEEE Transactions on Reliability (2018), 1–24. https://doi.org/10.1109/TR.2018.2869227

work page doi:10.1109/tr.2018.2869227 2018

[12] [12]

Riccardo Coppola, Maurizio Morisio, Marco Torchiano, and Luca Ardito. 2019. Scripted GUI testing of Android open-source apps: evolution of test code and fragility causes. Empirical Software Engineering (18 May 2019). https://doi.org/ 10.1007/s10664-019-09722-9

work page doi:10.1007/s10664-019-09722-9 2019

[13] [13]

Isabelle Dalmasso, Soumya Kanti Datta, Christian Bonnet, and Navid Nikaein

work page

[14] [14]

In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC)

Survey, comparison and evaluation of cross platform mobile application development tools. In 2013 9th International Wireless Communications and Mobile Computing Conference (IWCMC). IEEE, 323–328

work page 2013

[15] [15]

Ronald Jabangwe, Henry Edison, and Anh Nguyen Duc. 2018. Software engineer- ing process models for mobile app development: a systematic literature review. Journal of Systems and Software 145 (2018), 98–111

work page 2018

[16] [16]

Mona Erfani Joorabchi, Ali Mesbah, and Philippe Kruchten. 2013. Real challenges in mobile app development. In Empirical Software Engineering and Measurement, 2013 ACM/IEEE International Symposium on . IEEE, 15–24

work page 2013

[17] [17]

P. S. Kochhar, F. Thung, N. Nagappan, T. Zimmermann, and D. Lo. 2015. Un- derstanding the Test Automation Culture of App Developers. In2015 IEEE 8th International Conference on Software Testing, Verification and Validation (ICST) . 1–10. https://doi.org/10.1109/ICST.2015.7102609

work page doi:10.1109/icst.2015.7102609 2015

[18] [18]

Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. 2018. Pesto: Automated migration of DOM-based Web tests towards the visual approach. Software Testing, Verification And Reliability 28, 4 (2018), e1665

work page 2018

[19] [19]

Mario Linares-Vásquez, Kevin Moran, and Denys Poshyvanyk. 2017. Continuous, evolutionary and large-scale: A new perspective for automated mobile app testing. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on. IEEE, 399–410

work page 2017

[20] [20]

Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. 2014. An empir- ical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering . ACM, 643–653

work page 2014

[21] [21]

Malavolta, S

I. Malavolta, S. Ruberto, T. Soru, and V. Terragni. 2015. End Users’ Perception of Hybrid Mobile Apps in the Google Play Store. In 2015 IEEE International Conference on Mobile Services . 25–32. https://doi.org/10.1109/MobServ.2015.14

work page doi:10.1109/mobserv.2015.14 2015

[22] [22]

Gaurang Shah, Prayag Shah, and Rishikesh Muchhala. 2014. Software testing automation using appium. International Journal of Current Engineering and Technology 4, 5 (2014), 3528–3531

work page 2014

[23] [23]

Tom Yeh, Tsung-Hsiang Chang, and Robert C Miller. 2009. Sikuli: using GUI screenshots for search and automation. In Proceedings of the 22nd annual ACM symposium on User interface software and technology . ACM, 183–192

work page 2009