arxiv: 2605.04189 · v1 · submitted 2026-05-05 · 💻 cs.HC · cs.SE

Exploring the Output of Software Testing Tools through a Visual Comparative Analysis

Pith reviewed 2026-05-08 17:35 UTC · model grok-4.3

classification 💻 cs.HC cs.SE
keywords software testing · test result visualization · HCI · CLI output · GUI output · comparative analysis · output formatting · color usage

The pith

A comparison of 50 testing tools shows shared patterns in how they format and visualize test results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a visual comparative analysis on the outputs of 50 software testing tools and harnesses, with 44 using command-line interfaces and 6 using graphical ones, drawn from four programming languages. It identifies recurring interface elements, the ways test results are displayed and visualized, and the detailed composition of those outputs. A sympathetic reader would care because earlier studies indicate that good visualizations help testers make decisions, yet no prior HCI work has mapped the common elements across tools. The results point to consistent trends in formatting and color use that developers of new tools could follow.

Core claim

Our analysis reveals the common interface elements in software testing tools, how these tools display and visualize test results, as well as the specific make-up of the output. Our findings provide insight on how visual testing output is formatted and how colour is used across both CLI and GUI environments, identifying trends that can be applied by developers of testing tools.

What carries the argument

The visual comparative analysis of outputs from 44 CLI and 6 GUI testing tools, which surfaces recurring elements, display methods, and formatting details.

If this is right

  • Testing tool developers can adopt the observed formatting and color conventions to align with existing patterns.
  • Shared display methods for results can be used to make test output more consistent across tools.
  • Trends identified in both CLI and GUI settings can inform interface design choices for new harnesses.
  • The specific composition of outputs can guide how results are structured to support tester decisions.
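
As a rough illustration of what a colour-coded test suite summary line can look like in a CLI, here is a minimal sketch. The green-for-pass and red-for-fail mapping, the wording of the summary, and the function name `format_summary` are assumptions made for this example; they are not conventions reported by the paper.

```python
# ANSI SGR escape sequences for foreground colours (standard terminal codes).
GREEN = "\033[32m"
RED = "\033[31m"
RESET = "\033[0m"


def format_summary(passed, failed, use_colour=True):
    """Build a one-line test suite summary.

    The colour mapping (green = passed, red = failed) is an assumption
    for illustration, not a finding taken from the paper.
    """
    parts = [(f"{passed} passed", GREEN), (f"{failed} failed", RED)]
    if use_colour:
        rendered = [f"{colour}{text}{RESET}" for text, colour in parts]
    else:
        # Plain fallback for terminals or logs without colour support.
        rendered = [text for text, _ in parts]
    return ", ".join(rendered)


if __name__ == "__main__":
    print(format_summary(12, 3))
    print(format_summary(12, 3, use_colour=False))
```

Keeping a plain-text fallback means the same summary stays readable when output is piped to a file or viewed by someone who cannot rely on colour alone.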

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If these patterns prove stable, integrated development environments could standardize result views based on them.
  • The same analysis method could be repeated on tools for other programming languages to test whether the trends generalize.
  • Designers might explore whether adopting the common elements reduces the time testers spend interpreting outputs.

Load-bearing premise

The 50 chosen tools are representative enough of the broader population of testing tools to support claims about general trends.

What would settle it

A follow-up survey of testing tools in additional languages or domains that reveals substantially different visualization patterns or color usage would show the identified common elements are not general.

Figures

Figures reproduced from arXiv: 2605.04189 by Anthony Maocheia-Ricci, Brandon Lit, Thomas Driscoll.

Figure 1: Our visual comparative analysis methodology phases, adapted from Frappier et al. [ …
Figure 2: The full mosaic of all testing outputs, each composed of the 8 common interface elements. The CLIs and GUIs are …
Figure 3: A screenshot of CUnit CLI with interface element …
Figure 4: A screenshot of QUnit’s GUI with interface element …
Figure 6: An example of the “details-on-the-outside” class of …
Figure 7: CUnit’s interactive CLI mode, with an example …
Figure 8: Examples of error location identifiers and code lines/blocks: (a) Pytest displaying line numbers, carats, and code block; …
Figure 9: Example showing a differing amount of detail be …
Figure 10: Example test suite summary blocks from our sample: (a) CHEAT, using two colours with period and colon symbols …
Original abstract

Software testing is a fundamental process of software development, and prior work has shown that visualizations of test results support testers' decision-making. However, Human-Computer Interaction research on software testing has yet to explore and understand the shared interface elements and patterns in visualization of testing outputs. To address this, we conducted a visual comparative analysis of the output of 50 software testing tools and harnesses (44 with CLI output, 6 with GUI output) across four popular programming languages. Our analysis reveals the common interface elements in software testing tools, how these tools display and visualize test results, as well as the specific make-up of the output. Our findings provide insight on how visual testing output is formatted and how colour is used across both CLI and GUI environments, identifying trends that can be applied by developers of testing tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a visual comparative analysis of outputs from 50 software testing tools (44 CLI, 6 GUI) across four programming languages. It claims to identify common interface elements, how test results are displayed and visualized, the specific composition of outputs, and trends in formatting and color usage that can inform testing tool development.

Significance. If the methodology were transparent and the sample justified, the work could usefully map design patterns in testing visualizations for HCI researchers and tool developers, addressing a noted gap in understanding shared elements that support tester decision-making. The descriptive approach has practical potential but currently lacks the rigor needed for reliable insights or generalization.

major comments (2)
  1. [Methods] Methods section: The visual comparative analysis is described only at a high level in the abstract and introduction, with no details on the procedure, coding scheme for identifying 'common' elements, criteria for patterns, or validation steps such as inter-rater checks. This is load-bearing for the central claims, as the reported findings on interface elements, visualization, and trends cannot be assessed or replicated without it.
  2. [Tool selection] Tool selection (likely §3 or equivalent): The sample of 50 tools (44 CLI, 6 GUI) across four languages is presented without explicit selection criteria, popularity metrics, stratification, or diversity audit. The severe CLI/GUI imbalance risks the observed 'common' patterns being artifacts of over-represented open-source CLI harnesses rather than general trends.
minor comments (1)
  1. [Abstract] Abstract: 'Four popular programming languages' is stated without naming them (e.g., Java, Python), which reduces immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the transparency and rigor of our work. We address each major comment point by point below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Methods] Methods section: The visual comparative analysis is described only at a high level in the abstract and introduction, with no details on the procedure, coding scheme for identifying 'common' elements, criteria for patterns, or validation steps such as inter-rater checks. This is load-bearing for the central claims, as the reported findings on interface elements, visualization, and trends cannot be assessed or replicated without it.

    Authors: We agree that the current Methods section lacks sufficient detail for full assessment and replication. In the revised manuscript, we will expand this section to describe the full analysis procedure step by step, the coding scheme used to categorize interface elements and visualization formats, explicit criteria for identifying 'common' patterns and trends (including thresholds for commonality), and validation measures such as inter-rater reliability checks between coders. These additions will directly address the load-bearing nature of the claims. revision: yes

  2. Referee: [Tool selection] Tool selection (likely §3 or equivalent): The sample of 50 tools (44 CLI, 6 GUI) across four languages is presented without explicit selection criteria, popularity metrics, stratification, or diversity audit. The severe CLI/GUI imbalance risks the observed 'common' patterns being artifacts of over-represented open-source CLI harnesses rather than general trends.

    Authors: We acknowledge that the tool selection process requires more explicit justification. The revised manuscript will include a dedicated subsection detailing the selection criteria (e.g., popularity via GitHub stars, download metrics, and official documentation), any stratification by language or output type, and a diversity audit. We will also explain the rationale for the CLI/GUI distribution, noting that it mirrors the real-world prevalence of CLI-based testing tools, while discussing limitations and potential impacts on generalizability. Where feasible, we will explore adding more GUI examples to mitigate the imbalance. revision: yes
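
The rebuttal mentions inter-rater reliability checks and thresholds for commonality without saying how either would be computed. One common option for two coders is Cohen's kappa, sketched below alongside a simple share-of-tools threshold. This is an illustrative assumption, not the authors' stated procedure; the tool names, element labels, and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(coder_a)
    # Observed agreement: share of items both coders labelled identically.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both coders used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def common_elements(tool_elements, threshold=0.5):
    """Return elements present in at least `threshold` share of tools."""
    counts = Counter(
        element
        for elements in tool_elements.values()
        for element in set(elements)
    )
    n_tools = len(tool_elements)
    return sorted(e for e, c in counts.items() if c / n_tools >= threshold)


if __name__ == "__main__":
    # Hypothetical codes: is a summary block present in each of four outputs?
    a = ["present", "present", "absent", "absent"]
    b = ["present", "absent", "absent", "absent"]
    print(cohens_kappa(a, b))

    # Hypothetical tools and element labels.
    sample = {
        "tool_a": ["summary", "progress"],
        "tool_b": ["summary"],
        "tool_c": ["summary", "progress"],
        "tool_d": ["banner"],
    }
    print(common_elements(sample, threshold=0.5))
```

Reporting the threshold explicitly lets a reader see how sensitive the list of "common" elements is to that choice.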

Circularity Check

0 steps flagged

No circularity: purely descriptive empirical survey with no derivations or self-referential claims

Full rationale

This paper performs a visual comparative analysis of outputs from 50 testing tools (44 CLI, 6 GUI) across four languages. The central claims consist of observed common interface elements, display patterns, and color usage identified through direct inspection. There are no equations, fitted parameters, predictions, uniqueness theorems, or self-citations invoked to justify core results. The analysis is self-contained as an empirical description; any concern about sample representativeness is a question of external validity, not a reduction of the reported findings to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical comparative study of existing tools; no mathematical models, free parameters, axioms, or new entities are introduced or required.

pith-pipeline@v0.9.0 · 5435 in / 878 out tokens · 35384 ms · 2026-05-08T17:35:33.213049+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages

  1. [1]

    List of unit testing frameworks

    2026. List of unit testing frameworks. https://en.wikipedia.org/w/index.php?title=List_of_unit_testing_frameworks&oldid=1343929439 Page Version ID: 1343929439

  2. [2]

    Abdulaziz Alaboudi and Thomas D. Latoza. 2023. Hypothesizer: A Hypothesis-Based Debugger to Find and Test Debugging Hypotheses. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23). Association for Computing Machinery, New York, NY, USA, 1–14. doi:10.1145/3586183.3606781

  3. [3]

    Paul Ayres and John Sweller. 2005. The Split-Attention Principle in Multimedia Learning. In The Cambridge Handbook of Multimedia Learning, Richard Mayer (Ed.). Cambridge University Press, Cambridge, 135–146. doi:10.1017/CBO9780511816819.009

  4. [4]

    Benjamin Bach, Zezhong Wang, Matteo Farinella, Dave Murray-Rust, and Nathalie Henry Riche. 2018. Design Patterns for Data Comics. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3173574.3173612

  5. [5]

    Andrea Borg, Chris Porter, and Mark Micallef. 2015. Is Carmen better than George? testing the exploratory tester using HCI techniques. In Proceedings of the 37th International Conference on Software Engineering - Volume 2 (ICSE ’15). IEEE Press, Florence, Italy, 815–816. https://dl.acm.org/doi/10.5555/2819009.2819181

  6. [6]

    Yuanliang Chen, Yu Jiang, Fuchen Ma, Jie Liang, Mingzhe Wang, Chijin Zhou, Xun Jiao, and Zhuo Su. 2019. EnFuzz: Ensemble Fuzzing with Seed Synchronization among Diverse Fuzzers. 1967–1983. https://www.usenix.org/conference/usenixsecurity19/presentation/chen-yuanliang

  7. [7]

    Yan Chen, Maulishree Pandey, Jean Y. Song, Walter S. Lasecki, and Steve Oney

  8. [8]

    Improving Crowd-Supported GUI Testing with Structural Guidance. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. doi:10.1145/3313831.3376835

  9. [9]

    Yiqun T. Chen, Rahul Gopinath, Anita Tadakamalla, Michael D. Ernst, Reid Holmes, Gordon Fraser, Paul Ammann, and René Just. 2021. Revisiting the relationship between fault detection, test adequacy criteria, and test set size. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE ’20). Association for Computing...

  10. [10]

    Paul T. Chiou, Ali S. Alotaibi, and William G.J. Halfond. 2023. BAGEL: An Approach to Automatically Detect Navigation-Based Web Accessibility Barriers for Keyboard Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI ’23). Association for Computing Machinery, New York, NY, USA, 1–17. doi:10.1145/3544548.3580749

  11. [11]

    Lisa G Dirks, Miranda Belarde-Lewis, and Wanda Pratt. 2025. Amplifying Cultural Values with Collaborative Photo-Elicitation: Strengths-Focused Co-Design with Alaska Native People. In Proceedings of the 2025 ACM Designing Interactive Systems Conference (DIS ’25). Association for Computing Machinery, New York, NY, USA, 1349–1365. doi:10.1145/3715336.3735688

  12. [12]

    Isabel Evans, Chris Porter, and Mark J. Micallef. 2024. Breaking Tester Stereotypes: who is testing and why it matters. BCS Learning & Development, 115–126. doi:10.14236/ewic/BCSHCI2024.11

  13. [13]

    Tallullah Frappier, Nathalie Bressa, and Samuel Huron. 2024. Jumping to Conclusions: A Visual Comparative Analysis of Online Debate Platform Layouts. In Proceedings of the 13th Nordic Conference on Human-Computer Interaction (NordiCHI ’24). Association for Computing Machinery, New York, NY, USA, 1–15. doi:10.1145/3679318.3685377

  14. [14]

    Xiaoxiao Gan, Huayu Liang, and Chris Brown. 2025. Challenges, Strategies, and Impacts: A Qualitative Study on UI Testing in CI/CD Processes from GitHub Developers’ Perspectives. In 2025 IEEE Conference on Software Testing, Verification and Validation (ICST). 186–197. doi:10.1109/ICST62969.2025.10988972 ISSN: 2159-4848

  15. [15]

    Nanna Gorm and Irina Shklovski. 2017. Participant Driven Photo Elicitation for Understanding Activity Tracking: Benefits and Limitations. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW ’17). Association for Computing Machinery, New York, NY, USA, 1350–1361. doi:10.1145/2998181.2998214

  16. [16]

    Nina Hollender, Cristian Hofmann, Michael Deneke, and Bernhard Schmitz. 2010. Integrating cognitive load theory and concepts of human–computer interaction. Computers in Human Behavior 26, 6 (Nov. 2010), 1278–1288. doi:10.1016/j.chb.2010.05.031

  17. [17]

    Waqas Javed and Niklas Elmqvist. 2012. Exploring the design space of composite visualization. In 2012 IEEE Pacific Visualization Symposium. 1–8. doi:10.1109/PacificVis.2012.6183556 ISSN: 2165-8773

  18. [18]

    Alla Katsnelson. 2021. Colour me better: fixing figures for colour blindness. Nature 598, 7879 (Oct. 2021), 224–225. doi:10.1038/d41586-021-02696-z

  19. [19]

    George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18). Association for Computing Machinery, New York, NY, USA, 2123–2138. doi:10.1145/3243734.3243804

  20. [20]

    Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang

  21. [21]

    Guided Bug Crush: Assist Manual GUI Testing of Android Apps via Hint Moves. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI ’22). Association for Computing Machinery, New York, NY, USA, 1–14. doi:10.1145/3491102.3501903

  22. [22]

    Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proc. ACM Program. Lang. 4, OOPSLA (Nov. 2020), 196:1–196:25. doi:10.1145/3428264

  23. [23]

    Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and Inter-rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proc. ACM Hum.-Comput. Interact. 3, CSCW (Nov. 2019), 72:1–72:23. doi:10.1145/3359174

  24. [24]

    Miriah Meyer and Jason Dykes. 2019. Criteria for Rigor in Visualization Design Study. IEEE Transactions on Visualization and Computer Graphics (2019), 1–1. doi:10.1109/TVCG.2019.2934539

  25. [25]

    Inês Coimbra Morgado and Ana C. R. Paiva. 2019. The iMPAcT Tool for Android Testing. Proc. ACM Hum.-Comput. Interact. 3, EICS (June 2019), 4:1–4:23. doi:10.1145/3300963

  26. [26]

    Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2025. The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4 (ASPLOS ’24). Association for Computing Machinery, New York, NY...

  27. [27]

    Marllos Paiva Prado and Auri Marcelo Rizzo Vincenzi. 2018. Towards cognitive support for unit testing: A qualitative study with practitioners. Journal of Systems and Software 141 (July 2018), 66–84. doi:10.1016/j.jss.2018.03.052

  28. [28]

    Gillian Rose. 2023. Visual Methodologies: An Introduction to Researching with Visual Materials (fifth edition ed.). SAGE Publications Ltd, 55 City Road. doi:10.4135/9781036231576

  29. [29]

    Clive Seale, Giampietro Gobo, Jaber F. Gubrium, David Silverman, and Sarah Pink

  30. [30]

    Visual Methods. In Qualitative Research Practice. SAGE Publications Ltd, 361–377. doi:10.4135/9781848608191

  31. [31]

    B. Shneiderman. 1996. The eyes have it: a task by data type taxonomy for information visualizations. In Proceedings 1996 IEEE Symposium on Visual Languages. 336–343. doi:10.1109/VL.1996.545307 ISSN: 1049-2615

  32. [32]

    Per Erik Strandberg, Wasif Afzal, and Daniel Sundmark. 2018. Decision making and visualizations based on test results. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’18). Association for Computing Machinery, New York, NY, USA, 1–10. doi:10.1...

  33. [33]

    Per Erik Strandberg, Eduard Paul Enoiu, Wasif Afzal, Daniel Sundmark, and Robert Feldt. 2019. Information Flow in Software Testing – An Interview Study With Embedded Software Engineering Practitioners. IEEE Access 7 (2019), 46434–46453. doi:10.1109/ACCESS.2019.2909093

  34. [34]

    Zezhong Wang, Samuel Huron, Miriam Sturdee, and Sheelagh Carpendale. 2024. Summary of the Workshop on Visual Methods and Analyzing Visual Data in Human Computer Interaction. In Companion Proceedings of the 2024 Conference on Interactive Surfaces and Spaces (ISS Companion ’24). Association for Computing Machinery, New York, NY, USA, 29–32. doi:10.1145/36967...

  35. [35]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE ’24). Association for Computing Machinery, New York, NY, USA, 1–13. doi:10.1145/3597503.3639121

  36. [36]

    Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2024. WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models. Proc. ACM Program. Lang. 8, OOPSLA2 (Oct. 2024), 296:709–296:735. doi:10.1145/3689736

  37. [37]

    Leni Yang, Xian Xu, XingYu Lan, Ziyan Liu, Shunan Guo, Yang Shi, Huamin Qu, and Nan Cao. 2022. A Design Space for Applying the Freytag’s Pyramid Structure to Data Stories. IEEE Transactions on Visualization and Computer Graphics 28, 1 (Jan. 2022), 922–932. doi:10.1109/TVCG.2021.3114774

A Included Programs

Table A.1: List of all programs inc...