
arxiv: 1907.08164 · v2 · pith:RWAXGQ4Unew · submitted 2019-07-18 · 💻 cs.SE

Fragility of Layout-Based and Visual GUI Test Scripts: An Assessment Study on a Hybrid Mobile Application

Pith reviewed 2026-05-24 19:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords: GUI testing, test fragility, hybrid mobile applications, layout-based testing, visual testing, software maintenance, automated testing

The pith

In a small test suite for one hybrid mobile app, 20% of layout-based test methods and 30% of visual test methods had to be modified at least once, and each release induced fragilities in 3-4% of the test methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks how test scripts for a hybrid mobile application's GUI must be updated as the app evolves through releases. It compares layout-based scripts, which locate elements by structure, against visual scripts, which match screen images, in a small test suite. The observed modification rates indicate that test maintenance can become a recurring cost. This matters because frequent updates may discourage developers from relying on automated GUI testing for mobile apps. The study also lists specific fragility causes and derives guidelines to address them.
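The difference between the two locator styles can be illustrated with a toy model. The sketch below is not Appium or EyeAutomate code; `Widget`, `render`, `find_by_id`, `find_by_image`, the widget ids, and the four releases are all hypothetical names invented for illustration. It also uses exact pixel equality, whereas real visual tools typically tolerate some image variation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Grid = Tuple[Tuple[int, ...], ...]


@dataclass(frozen=True)
class Widget:
    widget_id: str              # structural identifier (what a layout-based locator uses)
    pixels: Grid                # rendered appearance (what a visual locator matches)
    position: Tuple[int, int]   # (row, col) of the top-left corner on screen


def render(widgets: Sequence[Widget], height: int = 6, width: int = 8) -> List[List[int]]:
    """Draw all widgets onto a blank screen of zeros."""
    screen = [[0] * width for _ in range(height)]
    for w in widgets:
        r0, c0 = w.position
        for dr, row in enumerate(w.pixels):
            for dc, value in enumerate(row):
                screen[r0 + dr][c0 + dc] = value
    return screen


def find_by_id(widgets: Sequence[Widget], widget_id: str) -> Optional[Widget]:
    """Layout-based lookup: depends only on the identifier."""
    return next((w for w in widgets if w.widget_id == widget_id), None)


def find_by_image(screen: List[List[int]], template: Grid) -> Optional[Tuple[int, int]]:
    """Visual lookup: exact template match, returns the first matching (row, col)."""
    th, tw = len(template), len(template[0])
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            if all(screen[r + dr][c + dc] == template[dr][dc]
                   for dr in range(th) for dc in range(tw)):
                return (r, c)
    return None


# Hypothetical releases of one screen containing a single button.
LOGIN_ICON: Grid = ((1, 1), (1, 0))
release_a = [Widget("btn_login", LOGIN_ICON, (1, 1))]          # baseline
release_b = [Widget("btn_signin", LOGIN_ICON, (1, 1))]         # id renamed, same look
release_c = [Widget("btn_login", ((2, 2), (2, 2)), (1, 1))]    # restyled, same id
release_d = [Widget("btn_login", LOGIN_ICON, (3, 4))]          # moved, same id and look
```

In this toy setup, renaming the id (release_b) breaks only the layout-based lookup, restyling (release_c) breaks only the visual lookup, and moving the widget (release_d) breaks neither. This shows only that the two styles depend on different properties of the GUI; it says nothing about how often each kind of change occurred in the studied app.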

Core claim

We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.

What carries the argument

Observing how test methods co-evolve with successive GUI releases, and counting the fraction that requires modification under layout-based and under visual locators.
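As a rough sketch of how such counts could be turned into the two headline metrics, the code below computes the share of methods modified at least once and the mean share modified per release. The function names, the log structure, and the toy data are assumptions made for illustration; this page does not state the paper's exact metric definitions, and the numbers below are not the paper's data.

```python
from typing import Dict, Set


def fraction_modified_at_least_once(all_methods: Set[str],
                                    log: Dict[str, Set[str]]) -> float:
    """Share of test methods that needed a change in at least one release."""
    touched = set().union(*log.values())
    return len(touched & all_methods) / len(all_methods)


def mean_fraction_per_release(all_methods: Set[str],
                              log: Dict[str, Set[str]]) -> float:
    """Average, over releases, of the share of methods modified in that release."""
    if not log:
        return 0.0
    total = sum(len(modified & all_methods) for modified in log.values())
    return total / (len(log) * len(all_methods))


# Hypothetical modification log: release label -> methods changed for that release.
methods = {f"m{i}" for i in range(8)}
modification_log = {
    "r1": {"m0"},
    "r2": set(),
    "r3": {"m0", "m3"},
    "r4": set(),
}

at_least_once = fraction_modified_at_least_once(methods, modification_log)
per_release = mean_fraction_per_release(methods, modification_log)
```

A method changed in several releases counts once in the first metric but once per release in the second, which is why the two figures can differ widely on the same log.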

If this is right

  • Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications.
  • Each release can induce fragilities in 3-4% of the test methods.
  • Principal causes for fragilities can be identified from the changes observed in the hybrid application.
  • Guidelines for developers can be deduced from the identified fragility causes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The relative fragility difference between layout-based and visual approaches may guide tool selection for similar apps.
  • The causes identified here could inform test design patterns that reduce updates when GUIs change.
  • If the rates scale with app size, large production suites would face substantial cumulative maintenance.
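The scaling argument in the last bullet can be made concrete with simple arithmetic. The sketch below uses the 3-4% per-release range reported above, but the suite size, the number of releases, and the assumption that each release affects methods independently at a constant rate are hypothetical and are not taken from the paper.

```python
def expected_repairs_per_release(suite_size: int, per_release_rate: float) -> float:
    """Expected number of test methods needing a change in one release."""
    return suite_size * per_release_rate


def probability_untouched(per_release_rate: float, releases: int) -> float:
    """Chance that a given method survives all releases unchanged,
    assuming independent releases with a constant rate (an assumption)."""
    return (1.0 - per_release_rate) ** releases


SUITE_SIZE = 1000   # hypothetical large suite
RELEASES = 10       # hypothetical release count

low = expected_repairs_per_release(SUITE_SIZE, 0.03)
high = expected_repairs_per_release(SUITE_SIZE, 0.04)
survive_low = probability_untouched(0.03, RELEASES)
survive_high = probability_untouched(0.04, RELEASES)

print(f"repairs per release: {low:.0f} to {high:.0f}")
print(f"share untouched after {RELEASES} releases: "
      f"{survive_high:.2f} to {survive_low:.2f}")
```

Under these assumptions a 1000-method suite would need on the order of 30 to 40 repairs per release. Whether the observed rate actually carries over to larger suites is exactly what the study does not establish.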

Load-bearing premise

The small test suite evaluated is representative of typical GUI test suites for hybrid mobile applications.

What would settle it

A study of a larger hybrid mobile app or different test suite showing modification rates below 10% across multiple releases would indicate the observed fragility levels do not hold generally.

Figures

Figures reproduced from arXiv: 1907.08164 by Luca Ardito, Marco Torchiano, Riccardo Coppola.

Figure 1: Sample screens of PoliTO App
Figure 2: Defect detected by Visual test case T3
  Text from Section 2.4 (Experimental Procedure) captured with this caption: The main usage scenarios of the app were exercised with fifteen different test cases, listed in table 3. Test scripts were not defined for usage scenarios producing output that rapidly changes over time, based on time/date or device location (e.g., News Feed, Lectures Timetable, Public Transportation, Available Rooms). Several usage scenari…
Figure 3: Non-fragile and fragile Layout-based and visual
Figure 4: Examples of Pure Graphic Fragilities: (a) Visual TC11, rel. 1.5.6; (b) Visual TC13, rel. 1.4.0
Figure 5: Examples of Widget Arrangement Fragilities
Original abstract

Context: Albeit different approaches exist for automated GUI testing of hybrid mobile applications, the practice appears to be not so commonly adopted by developers. A possible reason for such a low diffusion can be the fragility of the techniques, i.e. the frequent need for maintaining test cases when the GUI of the app is changed. Goal: In this paper, we perform an assessment of the maintenance needed by test cases for a hybrid mobile app, and the related fragility causes. Methods: We evaluated a small test suite with a Layout-based testing tool (Appium) and a Visual one (EyeAutomate) and observed the changes needed by tests during the co-evolution with the GUI of the app. Results: We found that 20% Layout-based test methods and 30% Visual test methods had to be modified at least once, and that each release induced fragilities in 3-4% of the test methods. Conclusion: Fragility of GUI tests can induce relevant maintenance efforts in test suites of large applications. Several principal causes for fragilities have been identified for the tested hybrid application, and guidelines for developers are deduced from them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports an empirical assessment of GUI test fragility for one hybrid mobile application. A small test suite is evaluated with a layout-based tool (Appium) and a visual tool (EyeAutomate) across app releases; the study finds that 20% of layout-based test methods and 30% of visual test methods required at least one modification, with each release inducing fragility in 3-4% of methods. Principal fragility causes are identified and guidelines for developers are deduced.

Significance. If the reported rates and causes are reliable, the work supplies concrete empirical data on maintenance costs for hybrid-app GUI testing and identifies actionable fragility sources. This could inform testing-tool choice and test-design practices, but the single-app, small-suite design restricts extrapolation to larger suites or other frameworks.

major comments (3)
  1. [Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.
  2. [Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.
  3. [Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'a small test suite' should be replaced by the actual counts once they are added to the Methods section.
  2. [Introduction] The paper could cite additional recent studies on GUI-test maintenance for mobile or hybrid apps to better situate its contribution.
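To illustrate the first major comment, the sketch below shows one common way to attach an uncertainty range to a proportion from a small sample, the Wilson score interval. The counts used (4 of 20 and 40 of 200) are hypothetical; this page does not give the paper's number of test methods, and the study itself reports only descriptive percentages.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts: the same 20% rate observed on a small and a larger suite.
small_lo, small_hi = wilson_interval(4, 20)
large_lo, large_hi = wilson_interval(40, 200)

print(f"4/20   -> [{small_lo:.3f}, {small_hi:.3f}]")
print(f"40/200 -> [{large_lo:.3f}, {large_hi:.3f}]")
```

The same observed rate yields a much wider interval on the smaller sample, which is the practical reason the referee asks for the underlying totals alongside the percentages.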

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our empirical assessment of GUI test fragility. We will revise the manuscript to supply the requested contextual details and to qualify our conclusions to match the scope of the single-app case study. Point-by-point responses follow.

  1. Referee: [Methods/Results] Methods/Results sections: the abstract and results report the headline percentages (20%, 30%, 3-4%) without stating the total number of test methods, the number of releases examined, or any statistical measures; this information is required to assess whether the observed rates are robust or merely descriptive of a tiny sample.

    Authors: We agree these details should be explicit. The full paper describes the test suite size and the sequence of releases examined; we will add the exact totals for test methods and releases to the abstract, Methods, and Results sections and will state that the percentages are descriptive statistics from this case study, with no inferential statistics applied given the small sample. revision: yes

  2. Referee: [Conclusion] Conclusion: the claim that fragility 'can induce relevant maintenance efforts in test suites of large applications' rests on data from a single small test suite on one hybrid app; no multi-app comparison, scaling analysis, or evidence that the per-release rate generalizes is provided, rendering the broader claim load-bearing yet unsupported.

    Authors: We accept that the original wording overreaches the evidence. We will revise the Conclusion to limit the claim to the observed rates in the studied hybrid application and will add an explicit statement that generalization to large suites or other apps requires further work. revision: yes

  3. Referee: [Results] Results: the study reports fragility rates but supplies no breakdown by cause frequency, no comparison of modification effort between the two tools, and no discussion of how the 'small test suite' was constructed, all of which are needed to substantiate the identified principal causes.

    Authors: We will add a frequency breakdown of fragility causes to the Results section. The study recorded whether a modification was required rather than measuring effort (time or change size); we will note this scope limitation. We will expand the Methods description of how the small test suite was constructed and the criteria used for selecting the test methods. revision: partial

Circularity Check

0 steps flagged

Empirical reporting with no derivation or fitting

full rationale

This is a purely observational empirical study that reports measured fragility percentages (20% Layout-based, 30% Visual modified at least once; 3-4% per release) from direct inspection of one small test suite on a single hybrid app. No equations, parameter fitting, predictions, or first-principles derivations exist that could reduce to inputs by construction. The conclusion's generalization to large applications is an interpretive statement, not a circular reduction via self-definition or self-citation chains. No load-bearing steps match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions in empirical software engineering studies about representativeness and measurement validity.

axioms (1)
  • Domain assumption: The test methods' modifications accurately measure fragility caused by GUI changes.
    This is the core measurement assumption in the study.

