
arxiv: 2605.25356 · v1 · pith:3JUDDE4Xnew · submitted 2026-05-25 · 💻 cs.SE

Names Are All You Need: Effective and Safe Regression Test Selection for Python

Pith reviewed 2026-06-29 21:04 UTC · model grok-4.3

classification 💻 cs.SE
keywords regression test selection · Python · bipartite graph · dependency analysis · software testing · dynamic languages · reachability · test selection safety

The pith

NameRTS models Python code as a bipartite graph of elements and names to select affected tests via reachability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NameRTS, a regression test selection technique for Python that represents programs as a bipartite graph linking code elements to the names they define or reference. Selection reduces to a reachability query: a test runs if any changed element is reachable from a name appearing in the test. This formulation sidesteps imprecise call-graph construction in a dynamically typed language and avoids the over-conservatism of file-level dependency tracking. Two pruning steps that draw on prior execution data and local context limit cascades from ambiguous name matches. A new benchmark dataset with commit-level ground truth lets the authors measure both the fraction of tests skipped and the rate at which all truly affected tests are retained.

Core claim

NameRTS models a Python program as a bipartite graph of code element nodes and name nodes, with edges capturing definitions and references. RTS is formulated as a reachability problem on this graph: a test is selected if any modified code element is reachable from the names used in that test. This design avoids call-graph construction, enabling a conservative analysis amenable to safety. To control dependency cascades introduced by coarse name matching, NameRTS applies two pruning strategies that leverage prior test executions and context information to refine name matching.
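To make the distinction between code elements and names concrete, the sketch below uses Python's standard `ast` module to list, for each top-level function or class in a source string, the names it references. The helper `element_names`, the choice of top-level definitions as elements, and the sample source are illustrative assumptions; they are not NameRTS's actual extraction rules.

```python
import ast


def element_names(source: str) -> dict[str, set[str]]:
    """Map each top-level def/class to the identifiers and attribute names it references."""
    tree = ast.parse(source)
    result: dict[str, set[str]] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            used: set[str] = set()
            for child in ast.walk(node):
                # Plain identifiers that are read, e.g. `scale` in `scale(x)`.
                if isinstance(child, ast.Name) and isinstance(child.ctx, ast.Load):
                    used.add(child.id)
                # Attribute names, e.g. `magnify` in `obj.magnify(...)`.
                elif isinstance(child, ast.Attribute):
                    used.add(child.attr)
            result[node.name] = used
    return result


# Hypothetical sample program, loosely following the A1::magnify example.
SOURCE = """
class A1:
    def magnify(self, x):
        return scale(x) * 2

def scale(x):
    return abs(x)

def test_magnify():
    assert A1().magnify(3) == 6
"""

names = element_names(SOURCE)
for element, used in sorted(names.items()):
    print(element, "references", sorted(used))
```

Note that the test references the bare name `magnify` without any type information about the receiver, which is why name matching is coarse and why some form of pruning is needed.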

What carries the argument

Bipartite graph of code element nodes and name nodes whose reachability relation determines test selection, augmented by execution-history and context-based pruning of name matches.
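A minimal sketch of the reachability idea follows, under stated assumptions. The class `NameGraph`, the function `select_tests`, and the example elements are hypothetical. The owner check is one reading of the context rule quoted in the Figure 2 excerpt (an attribute is reachable only when its defining class is reachable); the execution-history pruning is not modeled here.

```python
from collections import defaultdict


class NameGraph:
    """Bipartite graph: name -> defining elements, element -> referenced names."""

    def __init__(self):
        self.defines = defaultdict(set)     # name -> elements defined under that name
        self.references = defaultdict(set)  # element -> names it references
        self.owner = {}                     # attribute element -> owning class element

    def add_element(self, element, name, uses=(), owner=None):
        self.defines[name].add(element)
        self.references[element].update(uses)
        if owner is not None:
            self.owner[element] = owner

    def reachable(self, start_names):
        """Return all elements reachable from the given names (fixpoint iteration)."""
        names, elements = set(start_names), set()
        changed = True
        while changed:
            changed = False
            for name in list(names):
                for element in self.defines.get(name, ()):
                    if element in elements:
                        continue
                    owner = self.owner.get(element)
                    # Context rule (assumed): skip an attribute until its class is reachable.
                    if owner is not None and owner not in elements:
                        continue
                    elements.add(element)
                    names |= self.references.get(element, set())
                    changed = True
        return elements


def select_tests(graph, test_names, changed_elements):
    """Select every test whose reachable set contains a changed element."""
    changed = set(changed_elements)
    return sorted(t for t, used in test_names.items() if graph.reachable(used) & changed)


graph = NameGraph()
graph.add_element("A1", "A1")
graph.add_element("A2", "A2")
graph.add_element("A1::magnify", "magnify", uses={"scale"}, owner="A1")
graph.add_element("A2::magnify", "magnify", owner="A2")
graph.add_element("scale", "scale")

tests = {"test_1.py": {"A1", "magnify"}, "test_2.py": {"A2", "magnify"}}
print(select_tests(graph, tests, {"A1::magnify"}))
```

Without the owner check, the shared name `magnify` would pull both `A1::magnify` and `A2::magnify` into each test's reachable set, so a change to either would select both tests. That is the kind of cascade the pruning is meant to limit.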

If this is right

  • NameRTS skips 69.90 percent of test files on average across the benchmark.
  • It reduces end-to-end testing time by 45.59 percent.
  • It selects every affected test for 99.6 percent of commits.
  • It outperforms BabelRTS, a file-level baseline, both in the fraction of test files skipped and in the fraction of commits handled safely (99.6 percent versus 76.6 percent).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The name-reachability model could be adapted to other dynamically typed languages that rely on eager imports.
  • The new dataset could become a shared resource for comparing additional Python testing techniques.
  • Extending the pruning rules with more execution context might further reduce the small number of missed tests.
  • Combining the static reachability check with lightweight runtime monitoring could address the exceptional cases where dynamic behavior evades the graph.

Load-bearing premise

The newly constructed Python RTS dataset supplies accurate ground truth that identifies exactly which test files are affected by each commit.

What would settle it

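A commit for which the dataset labels certain test files as unaffected, yet those files change outcome when run after the commit; or, conversely, a commit whose labeled-affected test files demonstrably cannot be influenced by the change.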

Figures

Figures reproduced from arXiv: 2605.25356 by Michael Pradel, You Wang, Zhongxin Liu.

Figure 2: Overview of NameRTS.
Accompanying text (truncated in the source): "… example, an attribute (e.g., A1::magnify) is reachable only when its defining class (e.g., A1) is reachable. The final reachable set consists of code elements that the test may use. A change to any element in this set, such as A1::magnify, causes test_1.py to be selected for re-execution. A change outside this set, such as A2::magnify, does not. Thus, this graph avoids the unnecessary …"

Figure 3: Example Code with Module and SharedVariable elements.
Accompanying text (truncated in the source): "Constructing SharedVariable elements. Global variables and class static variables are initialized at import time and constitute shared state that may be accessed across multiple contexts. The analysis constructs a SharedVariable code element for each such variable, extracting used external names from top-level statements executed at import time that defi…"

Figure 5: Cumulative Relative Testing Time.
Accompanying text (truncated in the source): "… most common safety issues stem from BabelRTS ignoring implicit parent package imports. The safety degradation is most pronounced in matplotlib and pylint, where BabelRTS achieves high test reduction (97.63% and 60.84%) but attains safe rates of only 14.00% and 48.00%. In matplotlib, the unsafety is caused by BabelRTS assuming that package sources are located under the proje…"

Figure 6: Parameter sensitivity analysis (RQ4).
Accompanying text (truncated in the source): "… the time reduction by 27.65%, because NEM not only reduces the selection overhead but also helps NameRTS avoid the tests that are relatively slow. When both pruning mechanisms are removed, test reduction and time reduction drop by 36.46% and 46.75%, respectively. Even without any pruning, NameRTS still reduces more tests and time than EkstaP and BabelRTS, thanks to its…"
Original abstract

Regression test selection reduces the cost of regression testing by executing only those tests affected by a code change. Despite extensive study of RTS in statically typed languages, achieving effective and safe RTS in Python is challenging. Python's dynamic typing makes precise call-graph construction difficult, which can cause call-graph-based RTS to miss affected tests. Python's eager importing mechanism, in contrast, renders file-level dependency analysis overly conservative. This paper presents NameRTS, the first Python RTS approach based on fine-grained dependency analysis. NameRTS models a Python program as a bipartite graph of code element nodes and name nodes, with edges capturing definitions and references. RTS is formulated as a reachability problem on this graph: a test is selected if any modified code element is reachable from the names used in that test. This design avoids call-graph construction, enabling a conservative analysis amenable to safety. To control dependency cascades introduced by coarse name matching, NameRTS applies two pruning strategies that leverage prior test executions and context information to refine name matching. To evaluate NameRTS, we construct the first Python RTS dataset with a ground truth indicating which test files are affected by each commit. We compare NameRTS with the best-performing baseline, BabelRTS, an RTS technique based on coarse file-level dependencies. On this benchmark, NameRTS skips 69.90% of test files on average, outperforming BabelRTS by 146.5%. It also reduces end-to-end testing time by 45.59%, yielding a 107.7% improvement over BabelRTS. In terms of safety, NameRTS selects all affected tests for 99.6% of commits, with only rare misses in exceptional cases. In contrast, BabelRTS is safe for 76.6% of commits. These results demonstrate the effectiveness of NameRTS, paving the way for more efficient regression testing in Python.
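The relative figures in the abstract can be unpacked with a little arithmetic. The sketch below assumes that "outperforming BabelRTS by 146.5%" denotes a relative improvement over the baseline's own value; the abstract does not spell this out. The resulting baseline values are derived estimates under that assumption, not numbers reported in the abstract.

```python
def implied_baseline(value_pct: float, relative_gain_pct: float) -> float:
    """Solve value = baseline * (1 + gain / 100) for the baseline."""
    return value_pct / (1.0 + relative_gain_pct / 100.0)


# Inputs are the NameRTS figures and relative gains quoted in the abstract.
skip_baseline = implied_baseline(69.90, 146.5)
time_baseline = implied_baseline(45.59, 107.7)

print(f"implied BabelRTS skip rate:      {skip_baseline:.2f}%")
print(f"implied BabelRTS time reduction: {time_baseline:.2f}%")
```

Under this reading, the implied baseline skips a bit over a quarter of test files and saves a bit over a fifth of testing time on average, while being safe on 76.6% of commits.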

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes NameRTS, a regression test selection (RTS) technique for Python that the authors describe as the first based on fine-grained dependency analysis. It models programs as a bipartite graph of code-element nodes and name nodes, formulates RTS as a reachability query on this graph, and applies two pruning strategies based on prior test executions and context. The authors construct a new Python RTS benchmark dataset providing ground-truth affected test files per commit, and report that NameRTS skips 69.90% of test files on average (a 146.5% improvement over BabelRTS), reduces end-to-end testing time by 45.59% (a 107.7% improvement over BabelRTS), and is safe on 99.6% of commits (vs. 76.6% for BabelRTS).

Significance. If the ground-truth labels prove accurate, the work would be significant for the RTS literature by demonstrating that name-based fine-grained analysis can be both effective and safe in a dynamically typed language where call-graph and file-level methods struggle; the bipartite-graph formulation and pruning heuristics are a concrete, falsifiable contribution that could be replicated or extended.

major comments (2)
  1. [Abstract and Evaluation section (dataset construction)] The manuscript provides no description of the labeling procedure used to establish ground truth (which test files are affected by each commit) in the newly constructed dataset referenced in the abstract and evaluation. All headline metrics—69.90% skip rate, 99.6% safety, 45.59% time reduction, and the 146.5% outperformance—are computed directly against these labels; without details on whether labels were obtained via full re-execution, coverage instrumentation, differential outcomes, or static reachability, the validity of the safety and effectiveness claims cannot be assessed.
  2. [Approach section (pruning strategies)] The two pruning strategies that refine name matching (to control dependency cascades) are presented as preserving conservatism, yet no ablation or separate quantification is given showing their effect on the safety rate; if pruning ever drops an affected test, the 99.6% safety figure would be overstated.
minor comments (1)
  1. [Approach] Notation for the bipartite graph (code-element nodes vs. name nodes) should be introduced with a small example or diagram early in the approach section to aid readability.
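To show how the headline metrics depend on the ground-truth labels, here is a minimal sketch of how a safe rate and a test reduction could be computed per configuration. It assumes a commit counts as safe when the selected set contains every test file labeled affected, and that test reduction is the fraction of test files not selected. The `CommitResult` structure and the example values are hypothetical and are not taken from the paper's dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitResult:
    all_tests: frozenset   # every test file in the project at this commit
    affected: frozenset    # ground-truth label: test files affected by the commit
    selected: frozenset    # test files chosen by the RTS tool


def is_safe(result: CommitResult) -> bool:
    """Safe means no affected test file was skipped."""
    return result.affected <= result.selected


def reduction(result: CommitResult) -> float:
    """Fraction of test files that the tool skipped."""
    return 1.0 - len(result.selected) / len(result.all_tests)


def summarize(results: list) -> tuple:
    """Return (safe rate, mean test reduction) over a list of commits."""
    safe_rate = sum(is_safe(r) for r in results) / len(results)
    mean_reduction = sum(reduction(r) for r in results) / len(results)
    return safe_rate, mean_reduction


# Hypothetical data: two commits over four test files.
TESTS = frozenset({"t1", "t2", "t3", "t4"})
results = [
    CommitResult(TESTS, frozenset({"t1"}), frozenset({"t1", "t2"})),   # safe, skips 2 of 4
    CommitResult(TESTS, frozenset({"t1", "t3"}), frozenset({"t1"})),   # unsafe, misses t3
]

safe_rate, mean_reduction = summarize(results)
print(f"safe rate: {safe_rate:.2f}, mean test reduction: {mean_reduction:.3f}")
```

Running the same summary once per configuration (with and without each pruning strategy) would give the kind of safety ablation the second major comment asks for, and any error in the `affected` labels propagates directly into both numbers.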

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to improve clarity on dataset construction and to add supporting analysis on the pruning strategies.

  1. Referee: [Abstract and Evaluation section (dataset construction)] The manuscript provides no description of the labeling procedure used to establish ground truth (which test files are affected by each commit) in the newly constructed dataset referenced in the abstract and evaluation. All headline metrics—69.90% skip rate, 99.6% safety, 45.59% time reduction, and the 146.5% outperformance—are computed directly against these labels; without details on whether labels were obtained via full re-execution, coverage instrumentation, differential outcomes, or static reachability, the validity of the safety and effectiveness claims cannot be assessed.

    Authors: We agree that the absence of a description of the ground-truth labeling procedure is a significant omission that prevents readers from assessing the validity of the reported metrics. In the revised manuscript we will add a dedicated subsection (likely in Section 5) that fully documents the labeling process, including the exact steps taken to determine which test files are affected by each commit. revision: yes

  2. Referee: [Approach section (pruning strategies)] The two pruning strategies that refine name matching (to control dependency cascades) are presented as preserving conservatism, yet no ablation or separate quantification is given showing their effect on the safety rate; if pruning ever drops an affected test, the 99.6% safety figure would be overstated.

    Authors: We acknowledge that an explicit ablation or quantification of the pruning strategies' impact on safety would strengthen the paper. In the revision we will add an ablation study (new table or figure in the Evaluation section) that reports safety rates with and without each pruning strategy, thereby directly addressing whether the 99.6% safety figure is affected by pruning. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain


The paper defines NameRTS via an explicit bipartite-graph reachability model, introduces two pruning heuristics, constructs an independent benchmark dataset, and reports empirical comparisons against an external baseline (BabelRTS). None of the load-bearing claims reduce by construction to the method's own inputs, fitted parameters, or self-citations; the safety and effectiveness numbers are measured against the separately constructed ground-truth labels rather than being tautological with the algorithm itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach depends on the modeling assumption that name reachability on the bipartite graph captures affected tests conservatively and that the new dataset supplies reliable ground truth.

axioms (1)
  • domain assumption: Python programs can be modeled as a bipartite graph of code element nodes and name nodes with edges capturing definitions and references.
    This modeling choice enables the reachability formulation for RTS without call-graph construction.
invented entities (1)
  • Bipartite graph of code elements and name nodes (no independent evidence)
    purpose: To enable conservative dependency analysis for RTS in dynamic Python code
    New modeling construct introduced to avoid limitations of call graphs and file-level analysis.

pith-pipeline@v0.9.1-grok · 5880 in / 1234 out tokens · 39017 ms · 2026-06-29T21:04:42.951380+00:00 · methodology

