Android Instrumentation Testing in Continuous Integration: Practices, Patterns, and Performance
Pith reviewed 2026-05-13 18:24 UTC · model grok-4.3
The pith
Community-based emulator setups prove most reliable and efficient for running Android instrumentation tests in everyday CI, while custom scripts increase reruns and third-party labs raise costs for regressions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By examining CI workflow files, scripts, and Gradle configurations in 4,518 repositories, we find that instrumentation tests run in CI in only 481 cases (10.6 percent), typically via community components or repository-specific custom scripts; these setups remain stable over time with a gradual shift toward reusable components; and performance metrics from GitHub Actions metadata indicate community-based setups are most reliable and efficient for daily checks, third-party labs suit regressions despite higher costs and failures, and custom scripting provides flexibility but correlates with more reruns.
What carries the argument
Classification of CI setup styles (community reusable components, custom scripts, third-party device labs) and their measured outcomes via GitHub Actions run-level and step-level metadata on success, duration, reruns, and queue delay.
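The classification step above can be sketched in code. This is a minimal illustration, not the paper's actual taxonomy rules: it assumes setup style can be detected by matching known markers in workflow text, such as ReactiveCircus's android-emulator-runner action (cited in the reference graph) for community components and device-lab CLI invocations for third-party labs; the specific marker strings are assumptions.

```python
# Hypothetical sketch: classify one CI workflow's setup style from its raw
# YAML text. Marker strings are illustrative assumptions, not the paper's
# actual classification rules.

COMMUNITY_ACTIONS = (
    "reactivecircus/android-emulator-runner",  # reusable emulator action
)
DEVICE_LAB_MARKERS = (
    "gcloud firebase test android run",  # Firebase Test Lab CLI
    "browserstack",
    "saucelabs",
)

def classify_setup(workflow_text: str) -> str:
    """Return 'community', 'device-lab', or 'custom-script' for one workflow."""
    text = workflow_text.lower()
    if any(marker in text for marker in COMMUNITY_ACTIONS):
        return "community"
    if any(marker in text for marker in DEVICE_LAB_MARKERS):
        return "device-lab"
    # Anything that still runs instrumentation tests but matches no known
    # reusable component is treated as a repository-specific custom script.
    return "custom-script"
```

A workflow step like `uses: ReactiveCircus/android-emulator-runner@v2` would classify as "community", while a bespoke `./scripts/start_emulator.sh` step would fall through to "custom-script".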
If this is right
- Projects gain reliability and lower rerun rates by adopting community-based emulator setups for routine CI runs.
- Third-party device labs become practical only when full regression coverage outweighs their added cost and failure frequency.
- Custom scripting remains useful when flexibility is required but demands extra effort to handle higher rerun rates.
- Changes in CI setup are most often driven by the desire to expand test coverage rather than performance alone.
- Setups tend to stabilize once chosen, with migration toward community components when evolution occurs.
Where Pith is reading between the lines
- Open-source Android teams could reduce maintenance by contributing more reusable community components that standardize emulator setup.
- The observed patterns may generalize to other mobile ecosystems where device testing is similarly fragile.
- Teams could mix setup styles within one project, using community components for daily checks and third-party labs only for periodic full suites.
- Future studies could track how project scale and test suite size influence the choice and success of each approach.
Load-bearing premise
The GitHub Actions metadata and single repository snapshot capture unbiased performance differences across setup styles without major effects from varying project sizes or test complexities.
What would settle it
Run identical instrumentation test suites on the same set of Android projects using each of the three setup styles in parallel CI pipelines and compare measured rerun rates, durations, and failure counts.
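If such a parallel-pipeline experiment were run, its per-run records could be compared with the ordinal statistics the paper's reference graph points to (Mann-Whitney, Cliff's delta). The sketch below, with invented data shapes, shows the two metrics such a comparison would hinge on; it is not the paper's analysis code.

```python
# Sketch of the proposed head-to-head comparison: given per-run metadata
# collected from parallel CI pipelines (record shapes are assumptions),
# compare two setup styles by rerun rate and an ordinal effect size.

def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]: positive when values in xs tend to exceed ys."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def rerun_rate(runs):
    """Fraction of workflow runs that needed at least one rerun."""
    return sum(1 for r in runs if r["reruns"] > 0) / len(runs)
```

For example, `cliffs_delta(community_durations, lab_durations)` near -1 would indicate community-setup runs are almost uniformly faster, while values near 0 indicate no ordinal difference.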
Original abstract
Android instrumentation tests (end-to-end tests that run on a device or emulator) can catch problems that simpler tests miss. However, running these tests automatically in continuous integration (CI) is often difficult because emulator setup is fragile and configurations tend to drift over time. We study how open-source Android apps run instrumentation tests in CI by analyzing 4,518 repositories that use CI (snapshot: Aug. 10, 2025). We examine CI workflow files, scripts, and build configurations to identify cases where device setup is defined in Gradle (e.g., Gradle Managed Devices). Our results answer three questions about adoption, evolution, and outcomes. First, only about one in ten repositories (481/4,518; 10.6%) run instrumentation tests in CI, typically using either reusable community components or repository-specific custom scripts to set up emulators. Second, these setups usually stay the same over time; when changes happen, projects tend to move from custom scripts toward reusable community components. Third, we study why projects change their CI setup by analyzing their commits, pull requests, and issue messages. We evaluate how different setup styles perform using GitHub Actions run- and step-level metadata (e.g., outcomes, duration, reruns, and queue delay). We find that teams often change approaches to expand test coverage, and that each approach fits different needs: community-based setups are typically the most reliable and efficient for everyday checks on new code, third-party device labs suit scheduled regression testing but can be costlier and fail more often, and custom scripting provides flexibility but is associated with more reruns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical analysis of Android instrumentation testing practices in continuous integration across 4,518 GitHub repositories. It identifies low adoption rates (10.6%), common setup approaches including community-based components, custom scripts, and third-party device labs, examines their temporal evolution, and evaluates performance differences in terms of reliability, efficiency, reruns, and queue delays using GitHub Actions metadata. The central conclusion is that community-based setups offer the best reliability and efficiency for routine checks, third-party labs are suitable for regression testing despite higher costs and failure rates, and custom scripts provide flexibility at the cost of more reruns.
Significance. If the observed performance differences are not confounded by project characteristics, the study offers actionable insights for practitioners selecting CI configurations for Android instrumentation tests. The scale of the repository analysis and the focus on real-world GitHub Actions data contribute to its relevance in software engineering research on testing practices. The identification of evolution patterns from custom to community setups is particularly noteworthy.
major comments (2)
- [Results (performance evaluation)] The performance comparison (results section on outcomes, duration, reruns, and queue delay) attributes differences in reliability and reruns to setup style without stratification or multivariate controls for confounders such as repository size, commit volume, test-suite scale, or app complexity. Projects adopting custom scripts may systematically differ in these dimensions, so reported associations could reflect selection effects rather than causal properties of the setup approach.
- [Methodology] The methodology provides limited detail on repository selection criteria, inclusion/exclusion filters, and how the snapshot of 4,518 repositories was constructed to ensure representativeness. This weakens the generalizability of the 10.6% adoption rate and the evolution findings.
minor comments (2)
- [Abstract] The snapshot date 'Aug. 10, 2025' appears to be in the future; confirm whether this is a typographical error for 2024.
- [Abstract] Clarify in the abstract and results whether any statistical tests (e.g., chi-square or regression) were used to compare metrics across setup styles, or if comparisons are purely descriptive.
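To make the referee's second point concrete: one way the paper could move beyond purely descriptive comparison is a chi-square test on a contingency table of setup style versus run outcome. The sketch below computes the Pearson statistic in pure Python; the counts are invented for illustration and do not come from the paper.

```python
# Illustrative Pearson chi-square statistic for a contingency table of
# observed counts (rows = setup styles, columns = run outcomes).
# The example counts are invented, not the paper's data.

def chi_square(table):
    """Pearson chi-square statistic for a 2D list of observed counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / n  # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# e.g. two setup styles x (success, failure):
observed = [[30, 10], [20, 40]]
```

In practice the statistic would be passed to a chi-square distribution with (rows-1)(cols-1) degrees of freedom for a p-value, e.g. via SciPy, which the paper already cites.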
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below, acknowledging where the manuscript can be strengthened through additional analysis and detail.
Point-by-point responses
- Referee: [Results (performance evaluation)] The performance comparison (results section on outcomes, duration, reruns, and queue delay) attributes differences in reliability and reruns to setup style without stratification or multivariate controls for confounders such as repository size, commit volume, test-suite scale, or app complexity. Projects adopting custom scripts may systematically differ in these dimensions, so reported associations could reflect selection effects rather than causal properties of the setup approach.
  Authors: We agree that the performance analysis is observational and does not include multivariate controls, raising the possibility that differences partly reflect project characteristics rather than setup style. In the revision we will add stratification by repository size (stars and commit count) and a multivariate regression controlling for commit volume, test-suite scale (where measurable from build files), and app complexity proxies. We will also explicitly note the observational nature of the findings and avoid causal language. revision: yes
- Referee: [Methodology] The methodology provides limited detail on repository selection criteria, inclusion/exclusion filters, and how the snapshot of 4,518 repositories was constructed to ensure representativeness. This weakens the generalizability of the 10.6% adoption rate and the evolution findings.
  Authors: We acknowledge the need for greater transparency. The 4,518 repositories were obtained via GitHub API queries for projects with Android Gradle files and GitHub Actions workflows containing instrumentation-test steps, filtered to non-fork, non-archived repositories with at least 10 commits in the preceding year; the snapshot was taken on 10 August 2025. In the revision we will expand the Methodology section with the exact search criteria, inclusion/exclusion rules, and any checks performed for representativeness. revision: yes
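The inclusion filter the rebuttal describes can be sketched as a predicate. Field names here follow the GitHub REST API's repository object (`fork`, `archived`); `commits_last_year` is a hypothetical stand-in for a count that would be computed separately, since the API does not return it directly.

```python
# Hypothetical sketch of the rebuttal's inclusion filter: non-fork,
# non-archived repositories with at least 10 commits in the preceding year.
# "commits_last_year" is an assumed field computed in a separate step.

MIN_COMMITS_LAST_YEAR = 10

def include_repo(repo: dict) -> bool:
    """Return True if the repository passes the stated inclusion criteria."""
    return (
        not repo.get("fork", False)
        and not repo.get("archived", False)
        and repo.get("commits_last_year", 0) >= MIN_COMMITS_LAST_YEAR
    )
```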
Circularity Check
No circularity: purely observational repository analysis
full rationale
The paper conducts an empirical snapshot study of 4,518 GitHub repositories, classifying CI setups from workflow files and measuring outcomes via direct GitHub Actions metadata. No equations, derivations, fitted parameters, or predictions appear; claims about reliability, efficiency, and reruns are presented as observed associations from the data itself rather than reductions to prior inputs or self-citations. The analysis is self-contained against the external repository corpus with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 4,518 repositories represent a valid sample of Android apps using CI.
Reference graph
Works this paper leans on
- [1] Pingfan Kong, Li Li, Jun Gao, Kui Liu, Tegawendé F. Bissyandé, and Jacques Klein. Automated testing of Android apps: A systematic literature review. IEEE Transactions on Reliability, 68(1):45–66, 2019.
- [2] Fabiano Pecorelli, Gemma Catolino, Filomena Ferrucci, Andrea De Lucia, and Fabio Palomba. Software testing and Android applications: A large-scale empirical study. Empirical Software Engineering, 27(2):31, 2022.
- [3] Tarek Mahmud, Meiru Che, Anne H. H. Ngu, and Guowei Yang. Why Android app testing falls short: Empirical insights from open-source projects and a practitioner survey. Empirical Software Engineering, 30(6):163, 2025.
- [4] Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and Darko Marinov. An empirical analysis of flaky tests. In Proceedings of the 22nd ACM SIGSOFT International Symposium on the Foundations of Software Engineering, pages 643–653, 2014.
- [5] Wing Lam, Kıvanç Muşlu, Hitesh Sajnani, and Suresh Thummalapenta. A study on the lifecycle of flaky tests. In Proceedings of the 42nd International Conference on Software Engineering, pages 1471–1482, 2020.
- [6] Michael Hilton, Timothy Tunnell, Kevin Huang, Darko Marinov, and Danny Dig. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pages 426–437, 2016.
- [7] Bogdan Vasilescu, Yue Yu, Huaimin Wang, Premkumar T. Devanbu, and Vladimir Filkov. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, pages 805–816, 2015.
- [8] Taher A. Ghaleb, Osamah Abduljalil, and Safwat Hassan. CI/CD configuration practices in open-source Android apps: An empirical study. ACM Transactions on Software Engineering and Methodology, 2025.
- [9] Fiorella Zampetti, Simone Scalabrino, and Rocco Oliveto. CI/CD pipelines evolution and restructuring. In Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2021.
- [10] Parisa Reza Mazrae, Tom Mens, Mahshid Golzadeh, and Alexandre Decan. On the usage, co-usage and migration of CI/CD tools: A qualitative analysis. Empirical Software Engineering, 2023.
- [11] Android Developers. Build instrumented tests, 2025. Accessed: 2025-08-19.
- [12] Android Developers. AndroidJUnitRunner | Test your app on Android. Accessed: 2025-08-19.
- [13]
- [14] Android Developers. Test from the command line, 2024. Accessed: 2025-08-19.
- [15] Android Developers. Run apps on the Android Emulator, 2024. Accessed: 2025-01-15.
- [16] ReactiveCircus. GitHub Action - Android Emulator Runner. Accessed: 2025-10-16.
- [17]
- [18] Android Developers. Scale your tests with build-managed devices, 2026. Accessed: 2025-10-16.
- [19]
- [20] Mattia Fazzini and Alessandro Orso. Managing app testing device clouds: Issues and opportunities. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 1257–1259, 2020.
- [21] Hao Lin, Jiaxing Qiu, Hongyi Wang, Zhenhua Li, Liangyi Gong, Di Gao, Yunhao Liu, Feng Qian, Zhao Zhang, Ping Yang, and Tianyin Xu. Virtual device farms for mobile app testing at scale: A pursuit for fidelity, efficiency, and accessibility. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, 2023.
- [22] GitHub Docs. Workflow syntax for GitHub Actions, 2025. Accessed: 2025-12-15.
- [23] GitHub Docs. Using pre-written building blocks in your workflow, 2025. Documents that workflow steps can use actions defined in other public repositories or container images, which can encapsulate environment setup outside the caller repository.
- [24] GitHub Docs. Creating a composite action, 2025. Accessed: 2025-12-15.
- [25]
- [26] GitHub Docs. Self-hosted runners, 2025. Describes self-hosted runners and environment customization to run GitHub Actions jobs, which can make parts of environment setup external to repository artifacts.
- [27] GitHub. GitHub Actions is generally available. GitHub Changelog, November 2019.
- [28] Lianyu Zheng, Shuang Li, Xi Huang, Jiangnan Huang, Bin Lin, Jinfu Chen, and Jifeng Xuan. Why do GitHub Actions workflows fail? An empirical study. ACM Transactions on Software Engineering and Methodology, 2025.
- [29] Islem Bouzenia and Michael Pradel. Resource usage and optimization opportunities in workflows of GitHub Actions. In Proceedings of the IEEE/ACM International Conference on Software Engineering, 2024.
- [30] Sakina Fatima, Taher A. Ghaleb, and Lionel Briand. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering, 49(4):1912–1927, 2023.
- [31] Swapna Thorve, Chandani Sreshtha, and Na Meng. An empirical study of flaky tests in Android apps. In Proceedings of the IEEE International Conference on Software Maintenance and Evolution, 2018.
- [32] Alan Romano, Zihe Song, Sampath Grandhi, Wei Yang, and Weihang Wang. An empirical analysis of UI-based flaky tests. In Proceedings of the IEEE/ACM International Conference on Software Engineering, 2021.
- [33] Valeria Pontillo, Fabio Palomba, and Filomena Ferrucci. Test code flakiness in mobile apps: The developer's perspective. Information and Software Technology, 168:107394, 2024.
- [34] Taher A. Ghaleb, Safwat Hassan, and Ying Zou. Studying the interplay between the durations and breakages of continuous integration builds. IEEE Transactions on Software Engineering, 49(4):2476–2497, 2023.
- [35] Xiaoxin Zhou, Taher A. Ghaleb, and Safwat Hassan. Role of CI adoption in mobile app success: An empirical study of open-source Android projects. In Proceedings of the 23rd International Conference on Mining Software Repositories (MSR '26), pages 1–12, New York, NY, USA.
- [36]
- [37] Taher A. Ghaleb and Dulina Rathnayake. Can LLMs write CI? A study on automatic generation of GitHub Actions configurations. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 767–772. IEEE, 2025.
- [38]
- [39] Nitika Chopra and Taher A. Ghaleb. From first use to final commit: Studying the evolution of multi-CI service adoption. In 2025 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 773–778. IEEE, 2025.
- [40] Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan. LogSieve: Task-aware CI log reduction for sustainable LLM-based analysis. In Proceedings of the 23rd International Conference on Mining Software Repositories (MSR '26), pages 1–12, New York, NY, USA.
- [41] Marcus Emmanuel Barnes, Taher A. Ghaleb, and Safwat Hassan. Task-aware reduction for scalable LLM-database systems. In 2025 IEEE International Conference on Collaborative Advances in Software and COmputiNg (CASCON), pages 631–635, 2025.
- [42] Jun-Wei Lin, Navid Salehnamadi, and Sam Malek. Test automation in open-source Android apps: A large-scale empirical study. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, pages 1078–1089, 2020.
- [43] Tarek Mahmud, Meiru Che, Anne H. H. Ngu, and Guowei Yang. An empirical investigation on Android app testing practices. In Proceedings of the 2024 IEEE 35th International Symposium on Software Reliability Engineering, pages 355–366, 2024.
- [44] Dingbang Wang, Yu Zhao, Lu Xiao, and Tingting Yu. An empirical study of regression testing for Android apps in continuous integration environment. In Proceedings of the 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), 2023.
- [45] GitHub Docs. REST API endpoints for search, 2025. Accessed: 2025-12-22.
- [46] Travis CI Documentation. Build an Android project, 2025. See "Default Test Commands": for Gradle projects, Travis CI runs gradle build connectedCheck (or ./gradlew build connectedCheck when gradlew is present).
- [47] Junji Shimagaki, Yasutaka Kamei, Shane McIntosh, David Pursehouse, and Naoyasu Ubayashi. Why are commits being reverted? A comparative study of industrial and open source projects. In Proceedings of the 2016 IEEE International Conference on Software Maintenance and Evolution, pages 301–310, 2016.
- [48] Wil M. P. van der Aalst. Process Mining: Data Science in Action. Springer, 2nd edition, 2016.
- [49] Andreas Mauczka, Markus Huber, Christian Schanes, Wolfgang Schramm, Mario Bernhart, and Thomas Grechenig. Tracing your maintenance work—a cross-project validation of an automated classification dictionary for commit messages. In International Conference on Fundamental Approaches to Software Engineering, pages 301–315. Springer, 2012.
- [50] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
- [51] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
- [52] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.
- [53] Eirini Papagiannopoulou and Grigorios Tsoumakas. A review of keyphrase extraction, 2019. arXiv:1905.05044 [cs.IR].
- [54] Karl Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50:157–175, 1900.
- [55] Alan Agresti. Categorical Data Analysis. Wiley, 3rd edition, 2013.
- [56] Jacob Cohen. Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, 2nd edition, 1988.
- [57] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
- [58] Henry B. Mann and Donald R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.
- [59] Norman Cliff. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114(3):494–509, 1993.
- [60] Jeanine Romano, Jeffrey D. Kromrey, Jesse Coraggio, Jeff Skowronek, and Lindsey Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen's d indices the most appropriate choices? In Proceedings of the Annual Meeting of the Southern Association for Institutional Research, 2006.
- [61] Pauli Virtanen et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.