Evaluating data-flow coverage in spectrum-based fault localization
Pith reviewed 2026-05-25 14:30 UTC · model grok-4.3
The pith
Data-flow spectra place up to 50% more faults in the top-15 ranks than control-flow spectra in spectrum-based fault localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using data-flow spectra, up to 50% more faults are ranked in the top-15 positions compared to control-flow spectra. Most SFL ranking metrics present better effectiveness using data-flow to inspect up to the top-40 positions. The execution cost of data-flow spectra is higher, with an average overhead of 353% compared to 102% for control-flow.
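To put the two overhead figures on a common footing, here is a minimal arithmetic sketch. It assumes overhead is measured relative to the run time of the uninstrumented test suite; the source does not state the formula, so both function names and the timing values are hypothetical.

```python
def overhead_percent(baseline_s, instrumented_s):
    """Relative overhead in percent (assumed definition, not taken from the paper)."""
    return (instrumented_s - baseline_s) / baseline_s * 100.0


def slowdown_factor(overhead_pct):
    """Convert a percentage overhead into a run-time multiplier."""
    return 1.0 + overhead_pct / 100.0


# Under the assumed definition, the reported averages correspond to these multipliers.
print(round(slowdown_factor(353), 2))  # data-flow (DUA) spectra
print(round(slowdown_factor(102), 2))  # control-flow (line) spectra

# Hypothetical timings: a 10 s suite that takes 45.3 s when instrumented.
print(round(overhead_percent(10.0, 45.3), 1))
```

Under this reading, a 353% overhead means the instrumented run takes roughly four and a half times as long as the plain run, and 102% means roughly twice as long.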
What carries the argument
Definition-use association (DUA) spectra versus line spectra applied to ten SFL ranking metrics on 163 faults.
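The sketch below illustrates how one ranking metric can be applied to either kind of spectrum. It uses Ochiai purely as a well-known example; the source does not list the ten metrics, and the DUA encoding as a (variable, definition line, use line) triple is a simplification. All test names, coverage sets, and function names are hypothetical.

```python
from math import sqrt


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    denom = sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def rank_elements(coverage, outcomes):
    """Rank program elements by suspiciousness, most suspicious first.

    coverage: test name -> set of covered elements (lines or DUAs)
    outcomes: test name -> True if the test passed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    elements = set().union(*coverage.values())
    scores = {}
    for element in elements:
        ef = sum(1 for t, cov in coverage.items() if element in cov and not outcomes[t])
        ep = sum(1 for t, cov in coverage.items() if element in cov and outcomes[t])
        scores[element] = ochiai(ef, ep, total_failed)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one failing test (t2) and two passing tests.
outcomes = {"t1": True, "t2": False, "t3": True}
line_spectrum = {"t1": {10, 11, 12}, "t2": {10, 12, 13}, "t3": {10, 11}}
dua_spectrum = {
    "t1": {("x", 10, 12), ("y", 11, 12)},
    "t2": {("x", 10, 12), ("x", 10, 13)},
    "t3": {("y", 11, 11)},
}

print(rank_elements(line_spectrum, outcomes)[0])
print(rank_elements(dua_spectrum, outcomes)[0])
```

The same ranking code runs over both spectra; the DUA result additionally names the variable involved, which is the extra information the review refers to.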
If this is right
- Developers may need to inspect less code to find faults when using data-flow spectra.
- Most ranking metrics perform better with data-flow up to the top-40 positions.
- Data-flow spectra provide additional information about suspicious variables that can aid fault localization.
- The execution time with data-flow spectra, from 22 seconds to under 9 minutes, remains practical for use.
Where Pith is reading between the lines
- Combining data-flow and control-flow spectra could yield even better results in hybrid SFL techniques.
- Applying this to other types of faults or larger programs might reveal scalability limits.
- Integration with variable-level analysis could further reduce the code developers need to review.
Load-bearing premise
The 163 faults and five open-source programs with their test suites are representative of typical software systems without systematic bias from data-flow instrumentation.
What would settle it
Running the same comparison on a new set of programs and faults and observing no increase in the number of faults ranked in the top-15 with data-flow spectra.
Original abstract
Background: Debugging is a key task during the software development cycle. Spectrum-based Fault Localization (SFL) is a promising technique to improve and automate debugging. SFL techniques use control-flow spectra to pinpoint the most suspicious program elements. However, data-flow spectra provide more detailed information about the program execution, which may be useful for fault localization.
Aims: We evaluate the effectiveness and efficiency of ten SFL ranking metrics using data-flow spectra.
Method: We compare the performance of data- and control-flow spectra for SFL using 163 faults from 5 real-world open source programs, which contain from 468 to 4130 test cases. The data- and control-flow spectra types used in our evaluation are definition-use associations (DUAs) and lines, respectively.
Results: Using data-flow spectra, up to 50% more faults are ranked in the top-15 positions compared to control-flow spectra. Also, most SFL ranking metrics present better effectiveness using data-flow to inspect up to the top-40 positions. The execution cost of data-flow spectra is higher than control-flow, taking from 22 seconds to less than 9 minutes. Data-flow has an average overhead of 353% for all programs, while the average overhead for control-flow is of 102%.
Conclusions: The results suggest that SFL techniques can benefit from using data-flow spectra to classify faults in better positions, which may lead developers to inspect less code to find bugs. The execution cost to gather data-flow is higher compared to control-flow, but it is not prohibitive. Moreover, data-flow spectra also provide information about suspicious variables for fault localization, which may improve the developers' performance using SFL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates spectrum-based fault localization (SFL) using data-flow spectra (definition-use associations, DUAs) versus control-flow spectra (program lines) across ten ranking metrics. On 163 faults from five open-source programs (468–4130 tests each), it reports that data-flow spectra place up to 50% more faults in the top-15 positions, yield better effectiveness for most metrics up to the top-40 positions, incur higher but feasible overhead (353% average vs. 102%), and supply additional variable-suspiciousness information.
Significance. If the empirical comparison holds after addressing methodological gaps, the result would be useful for SFL research by showing that richer execution spectra can improve ranking quality on real programs without prohibitive cost. The direct head-to-head measurement on actual faults and test suites is a concrete strength; however, the absence of statistical testing and the narrow subject selection weaken the general claim that 'SFL techniques can benefit from using data-flow spectra'.
major comments (4)
- [Method] Method section: no statistical significance tests (e.g., paired Wilcoxon or bootstrap) are reported for the top-15 and top-40 effectiveness differences that underpin the 'up to 50%' and 'better effectiveness' claims; without them the headline numbers cannot be distinguished from sampling variation.
- [Method] Method / Results: the paper supplies no description of tie-breaking rules or how programs containing multiple faults are counted when computing the 'faults ranked in top-15' metric; both choices directly affect the reported percentages.
- [Evaluation] Evaluation setup: the five programs and 163 faults are presented without explicit selection criteria, stratification by fault type, or threats-to-validity discussion of domain or language bias, making it impossible to assess whether the observed DUA advantage generalizes beyond the chosen subjects.
- [Method] Method: potential systematic bias introduced by the DUA instrumentation itself (e.g., altered execution timing or coverage) is not measured or bounded, yet the spectra comparison treats the two kinds of spectra as directly comparable.
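The first major comment asks for a paired test on the effectiveness differences. A minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test for small samples follows. The paired values (rank position of the faulty element under each spectrum, per fault) are hypothetical, as is the function name, and dropping zero differences is one common convention rather than anything the source specifies.

```python
from itertools import product


def signed_rank_test(xs, ys):
    """Exact two-sided paired Wilcoxon signed-rank test (small n only).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences share their average rank.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    # Null distribution: every rank carries a + or - sign with equal probability.
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= observed:
            hits += 1
    return w_plus, w_minus, hits / 2 ** n


# Hypothetical rank positions of the faulty element for six faults.
line_positions = [12, 30, 7, 45, 20, 9]
dua_positions = [5, 18, 7, 22, 25, 3]
print(signed_rank_test(line_positions, dua_positions))
```

The enumeration is exponential in the number of non-zero pairs, so for the 163 faults in the study a library routine or a normal approximation would be the practical choice.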
minor comments (2)
- [Abstract] Abstract: the overhead sentence 'taking from 22 seconds to less than 9 minutes' should clarify whether these are per-program extremes or averages and should reference the corresponding table or figure.
- [Results] Results: tables or figures comparing the ten metrics should include the raw counts of faults localized at each rank threshold rather than only relative percentages, to allow independent verification.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We have revised the manuscript to strengthen its methodological transparency and address all major concerns raised.
- Referee: [Method] Method section: no statistical significance tests (e.g., paired Wilcoxon or bootstrap) are reported for the top-15 and top-40 effectiveness differences that underpin the 'up to 50%' and 'better effectiveness' claims; without them the headline numbers cannot be distinguished from sampling variation.
  Authors: We agree this is a gap. The revised manuscript now includes paired Wilcoxon signed-rank tests on the per-metric effectiveness differences at top-15 and top-40 positions, with p-values and effect sizes reported in the Results section. Most differences remain statistically significant (p < 0.05).
  revision: yes
- Referee: [Method] Method / Results: the paper supplies no description of tie-breaking rules or how programs containing multiple faults are counted when computing the 'faults ranked in top-15' metric; both choices directly affect the reported percentages.
  Authors: We have added an explicit subsection in Method describing tie-breaking (average rank assigned to tied elements, standard in SFL) and clarified that the study uses single-fault versions of the programs, consistent with the majority of prior SFL benchmarks. Multi-fault handling is noted as out of scope.
  revision: yes
- Referee: [Evaluation] Evaluation setup: the five programs and 163 faults are presented without explicit selection criteria, stratification by fault type, or threats-to-validity discussion of domain or language bias, making it impossible to assess whether the observed DUA advantage generalizes beyond the chosen subjects.
  Authors: The revised Threats to Validity section now states the selection criteria (programs drawn from prior SFL studies with available test suites and real faults), notes lack of stratification by fault type, and explicitly discusses language (Java) and domain limitations on generalizability.
  revision: yes
- Referee: [Method] Method: potential systematic bias introduced by the DUA instrumentation itself (e.g., altered execution timing or coverage) is not measured or bounded, yet the spectra comparison treats the two kinds of spectra as directly comparable.
  Authors: We acknowledge the concern. The revision adds a paragraph in Method noting that both spectra are collected from the same instrumented executions (ensuring internal comparability) and bounds the timing impact via the separately reported overhead figures. We could not retroactively quantify any differential coverage distortion without new instrumentation experiments.
  revision: partial
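The tie-breaking rule named in the second response (average rank for tied elements) and the top-n count it feeds can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, element labels, scores, and fault ranks are all hypothetical.

```python
def average_ranks(scores):
    """Map each element to its rank (1 = most suspicious).

    Elements with equal scores share the mean of the positions they occupy.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = mean_rank
        i = j + 1
    return ranks


def faults_in_top_n(fault_ranks, n):
    """Count faults whose faulty element is ranked within the first n positions."""
    return sum(1 for rank in fault_ranks if rank <= n)


# Hypothetical suspiciousness scores with a three-way tie.
scores = {"a": 0.9, "b": 0.7, "c": 0.7, "d": 0.7, "e": 0.2}
ranks = average_ranks(scores)
print(ranks["a"], ranks["b"], ranks["e"])

# Hypothetical ranks of the faulty element for four single-fault versions.
print(faults_in_top_n([1.0, 3.0, 16.5, 15.0], 15))
```

With average ranks, a fault tied across positions 15 to 18 lands at 16.5 and falls outside the top-15, which is why the referee notes that the tie-breaking choice directly affects the reported percentages.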
Circularity Check
No circularity: direct empirical comparison of measured spectra on fixed subjects
The paper performs an empirical evaluation comparing data-flow (DUA) and control-flow (line) spectra for SFL ranking metrics across 163 faults in 5 open-source programs. No equations, fitted parameters, predictions, or derivations appear in the abstract or described method. Effectiveness claims (e.g., up to 50% more faults in top-15) are reported as direct observations from the experiment, not reduced by construction to any self-defined quantity or prior self-citation. The reader's assessment of score 1.0 is consistent; generalizability concerns exist but are orthogonal to circularity. No load-bearing self-citation chains or ansatzes are present.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The five selected open-source programs and their 163 faults are representative of real-world software and faults.
- domain assumption: Definition-use association spectra can be collected by instrumentation without introducing measurement bias or altering program behavior.