Revisiting Code Debloating with Ground Truth-based Evaluation
Pith reviewed 2026-05-10 04:58 UTC · model grok-4.3
The pith
Ground-truth evaluation shows dynamic debloaters remove needed code while static ones retain excess or add variants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using ground truth that precisely identifies which code must be retained or removed, dynamic-analysis debloaters eliminate up to 94 percent of necessary code, static-analysis debloaters exhibit high false-retention rates from coarse dependency over-approximation, and some static passes introduce additional specialized function variants; the resulting programs therefore suffer functional incorrectness, systematic inconsistency, robustness failures, and exploitable vulnerabilities.
What carries the argument
Ground-truth-based evaluation paradigm that directly measures retained versus required code across eight tools and three transformation levels.
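The retained-versus-required comparison can be sketched as set arithmetic over code units. This is a minimal illustration under stated assumptions, not the paper's metric definitions: the helper name `debloat_error_rates`, the choice of function names as the unit of comparison, the rate denominators, and the toy program are all hypothetical.

```python
def debloat_error_rates(original, required, retained):
    """Compare a debloater's output against ground truth (hypothetical helper).

    original: all code units in the unmodified program
    required: ground-truth units that must be kept
    retained: units present in the debloated program
    """
    original, required, retained = set(original), set(required), set(retained)
    removable = original - required           # units that should be removed
    false_removals = required - retained      # needed code that was dropped
    false_retentions = removable & retained   # unneeded code that was kept
    added = retained - original               # new units, e.g. specialized variants
    return {
        "false_removal_rate": len(false_removals) / len(required) if required else 0.0,
        "false_retention_rate": len(false_retentions) / len(removable) if removable else 0.0,
        "added_units": sorted(added),
    }


# Toy program: names are illustrative only.
original = {"main", "parse", "compress", "decompress", "help", "debug"}
required = {"main", "parse", "compress"}

# A dynamic-style output that dropped a needed unit.
dynamic_report = debloat_error_rates(original, required, {"main", "compress"})

# A static-style output that kept unneeded units and added a specialized variant.
static_report = debloat_error_rates(
    original, required,
    {"main", "parse", "compress", "decompress", "help", "parse.spec1"},
)
```

In this toy case the dynamic-style output shows a nonzero false-removal rate and no false retention, while the static-style output shows the reverse plus one added unit, mirroring the complementary error patterns described above.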
If this is right
- Incorrect removals produce programs that no longer behave as originally intended.
- False retentions leave attack surfaces larger than the debloated size suggests.
- Added specialized function variants create new potential entry points for exploits.
- Imprecise debloating can introduce systematic inconsistencies and robustness failures across program executions.
Where Pith is reading between the lines
- Hybrid dynamic-static techniques may reduce the complementary error patterns observed in the two families of tools.
- Ground-truth benchmarks could replace size-reduction or gadget-count metrics as the primary correctness check in future debloating studies.
- Production use of debloated binaries would benefit from independent verification against full behavioral specifications rather than tool output alone.
Load-bearing premise
The ground truth constructed for each program correctly captures all intended behaviors and true code dependencies.
What would settle it
A program whose complete input space and exact dependency set are known, for which one of the evaluated debloaters produces a binary that preserves exactly the ground-truth code and passes every possible input.
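For a toy program with a small, fully enumerable input space, the settling condition could be checked as follows. This is a sketch under assumptions: `settles_claim`, the stand-in functions, and the unit names `main` and `neg_branch` are hypothetical, and real programs rarely have an enumerable input space.

```python
def settles_claim(original_fn, debloated_fn, input_space, ground_truth, retained):
    """Return True only if the debloated program keeps exactly the
    ground-truth code and agrees with the original on every input."""
    if set(retained) != set(ground_truth):
        return False
    return all(original_fn(x) == debloated_fn(x) for x in input_space)


def original(x):
    # Reference behavior: absolute value with an explicit negative branch.
    return x if x >= 0 else -x


def good_debloat(x):
    # Behaviorally identical stand-in for a correct debloated binary.
    return abs(x)


def bad_debloat(x):
    # Stand-in for a binary whose negative branch was removed.
    return x


inputs = range(-3, 4)
truth = {"main", "neg_branch"}

ok = settles_claim(original, good_debloat, inputs, truth, {"main", "neg_branch"})
broken = settles_claim(original, bad_debloat, inputs, truth, {"main"})
```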
Original abstract
Program debloating aims to remove unused code to reduce performance overhead, attack surfaces, and maintenance costs. Over time, debloating has evolved across multiple layers (container, library, and application), each building on the principles of application-level debloating. Despite its central role, application-level debloating continues to rely on imperfect proxies for measuring performance, such as test-case-driven evaluation for correctness, code size for runtime efficiency, and gadget count reduction for estimating security posture. While there is widespread skepticism about using such imperfect proxies, the community still lacks standardized methodologies or benchmarks to assess the true performance of application-level software debloating. This experience paper aims to address the gap. We revisit the foundations of application-level debloating through a ground-truth-based evaluation paradigm. Our analysis of eight state-of-the-art debloaters - Blade, Chisel, Cov, CovA, Lmcas, Trimmer, Occam, and Razor - uncovers insights previously unattainable through traditional evaluations. These tools collectively span the spectrum of source-to-source, IR-to-IR, and binary-to-binary transformation paradigms, characterizing a holistic reassessment across abstraction levels. Our analysis reveals that while dynamic analysis-based tools often remove up to 94% of code that should be retained, static analysis-based approaches exhibit the opposite behavior, showing high false retention rates due to coarse-grained dependency over-approximation. Additionally, static analyses may add code by introducing specialized variants of functions. False retentions and removals not only cause functional incorrectness but may also lead to systematic inconsistency, robustness failures, and exploitable vulnerabilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript is an experience paper that revisits application-level code debloating evaluation. It analyzes eight tools (Blade, Chisel, Cov, CovA, Lmcas, Trimmer, Occam, Razor) spanning source-to-source, IR-to-IR, and binary-to-binary paradigms using a ground-truth-based approach rather than traditional proxies such as test-case coverage, code size, or gadget counts. The central findings are that dynamic-analysis tools remove up to 94% of code that should be retained, while static-analysis tools exhibit high false-retention rates from coarse dependency over-approximation and can even introduce additional code via specialized function variants, potentially causing functional incorrectness, robustness issues, and vulnerabilities.
Significance. If the ground-truth construction proves robust, independent, and reproducible, the work could meaningfully advance the debloating field by replacing imperfect evaluation proxies with a more reliable paradigm. It provides concrete evidence of systematic weaknesses in both dynamic and static debloaters that prior proxy-based studies could not surface, potentially guiding future tool development toward better preservation of intended behavior.
major comments (1)
- Abstract: The claim that dynamic tools remove up to 94% of code that should be retained (and the contrasting static-tool behavior) is load-bearing for the entire dynamic-vs-static contrast and the paper's call for better evaluation. However, no details are provided on how the ground-truth set of 'code that should be retained' was constructed, whether it was derived independently of the tools' own dynamic or static approximations, or how it ensures coverage of all intended behaviors without test-case bias. This directly affects the validity of the reported rates.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify key aspects of our ground-truth evaluation approach. We address the major comment below and will revise the manuscript to improve transparency.
Referee: Abstract: The claim that dynamic tools remove up to 94% of code that should be retained (and the contrasting static-tool behavior) is load-bearing for the entire dynamic-vs-static contrast and the paper's call for better evaluation. However, no details are provided on how the ground-truth set of 'code that should be retained' was constructed, whether it was derived independently of the tools' own dynamic or static approximations, or how it ensures coverage of all intended behaviors without test-case bias. This directly affects the validity of the reported rates.
Authors: We agree that the abstract would benefit from a concise description of the ground-truth methodology to better support the central claims. The construction process is detailed in Section 3 of the manuscript. The ground-truth set is built independently by first executing each original (undebloated) program under a broad input corpus that includes the standard test suites plus additional inputs generated to exercise all documented features, public APIs, and edge cases derived from program specifications. Dynamic tracing records all executed code, while static reachability analysis identifies any additional code that could be required for intended behavior. This process occurs prior to and without reference to any debloater's internal approximations, ensuring independence. We acknowledge that relying on inputs introduces some potential for incomplete coverage, but the extended corpus is designed to reduce test-case bias relative to the test suites used in prior proxy-based studies. We will revise the abstract to incorporate a brief clause summarizing this independent construction, e.g., 'Ground truth is established independently via dynamic tracing and static reachability on the original programs to identify all code necessary for intended functionality.'
revision: yes
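The construction described in the rebuttal, dynamic traces combined with static reachability, can be sketched as a set union. This is a minimal illustration, not the procedure from Section 3 of the manuscript: the function-level granularity, the dictionary call graph, and the names `reachable`, `build_ground_truth`, and `feature_entry_points` are assumptions.

```python
from collections import deque


def reachable(call_graph, entry_points):
    """Breadth-first search over a call graph given as {caller: [callees]}."""
    seen = set(entry_points)
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def build_ground_truth(traces, call_graph, feature_entry_points):
    """Union of everything executed under the input corpus and everything
    statically reachable from the entry points of intended features."""
    executed = set().union(*traces)
    return executed | reachable(call_graph, feature_entry_points)


# Toy call graph and traces; all names are illustrative.
call_graph = {
    "main": ["parse", "compress"],
    "compress": ["write_out"],
    "decompress": ["read_in"],
}
traces = [{"main", "parse"}, {"main", "parse", "compress"}]

ground_truth = build_ground_truth(traces, call_graph, ["compress"])
```

Here `write_out` never appears in a trace but enters the ground truth through static reachability, while `decompress` and `read_in` stay out because no intended feature reaches them. The independence claim holds in this sketch only because no debloater output is consulted at any step.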
Circularity Check
No significant circularity in empirical tool evaluation
The paper is an experience report that evaluates eight external debloating tools (Blade, Chisel, Cov, CovA, Lmcas, Trimmer, Occam, Razor) by comparing their outputs to a separately constructed ground-truth set of necessary code. No derivation chain, equations, fitted parameters, or self-citations are presented as load-bearing; the central claims rest on direct observational contrasts (dynamic tools removing up to 94% of ground-truth code, static tools showing false retention) against an external benchmark. This structure is self-contained against the tools and ground truth and contains none of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Ground truth for which code should be retained or removed can be reliably established for the evaluated programs.