Search-based Robustness Testing of Laptop Refurbishing Robotic Software
Pith reviewed 2026-05-11 01:48 UTC · model grok-4.3
The pith
A search-based method finds minimal perturbations that expose failures in object detection models for laptop-refurbishing robots three to seven times more effectively than random search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PROBE employs NSGA-II to systematically explore the perturbation space, optimizing two objectives: failure induction (judged on both localization and confidence) and perturbation magnitude. The search is also designed to surface diverse failure cases in the object detection models used by laptop-refurbishing robots.
What carries the argument
PROBE, a multi-objective search using NSGA-II that generates localized input perturbations to induce failures in object detection while minimizing perturbation magnitude.
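To make the two-objective trade-off concrete, here is a minimal sketch of Pareto dominance, the ordering NSGA-II uses to rank candidates. It is not PROBE's implementation: the `failure_margin` score, the candidate values, and the function names are all hypothetical, and both objectives are assumed to be minimized.

```python
from typing import List, Tuple

# Each candidate perturbation is scored on two objectives, both minimized:
#   failure_margin: hypothetical score, lower means closer to a detection failure
#   magnitude:      size of the perturbation
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Objectives]) -> List[Objectives]:
    """Return the points that no other point dominates (the first non-dominated front)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (failure_margin, magnitude) scores for five candidates.
candidates = [(0.9, 0.01), (0.4, 0.05), (0.4, 0.08), (0.1, 0.20), (0.5, 0.30)]
front = pareto_front(candidates)
print(front)
```

The surviving points span the trade-off from tiny perturbations that barely affect the detector to larger ones that come close to a failure, which is the kind of front a multi-objective search would then refine.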
If this is right
- PROBE generates failure-inducing perturbations 3× to 7× more effectively than random search while using smaller perturbation magnitudes.
- The perturbations transfer across different object detection models.
- Metamorphic relations provide additional robustness insights even in non-failing cases.
Where Pith is reading between the lines
- The same search approach could be applied to other vision-based robotic tasks such as sorting or assembly to surface similar hidden sensitivities.
- If the perturbations map to real hardware variations, they could be reused as targeted test cases in physical validation loops.
- Embedding this style of search into the robot software development cycle would allow earlier detection of robustness gaps before deployment.
Load-bearing premise
The synthetic perturbations discovered in simulation correspond to realistic physical variations such as lighting changes, camera noise, or sticker placements that the robot will encounter during actual operation.
What would settle it
Running the perturbations found by PROBE on the physical robot under real operating conditions and observing that they produce no failures or require substantially larger magnitudes than in simulation.
Original abstract
The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3× to 7× more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PROBE, a search-based robustness testing method that applies NSGA-II multi-objective optimization to discover minimal, localized perturbations exposing failures in object detection models used by laptop refurbishing robots. The central empirical claims are that PROBE is 3× to 7× more effective than random search at generating failure-inducing perturbations while using smaller perturbation magnitudes, that the discovered perturbations transfer across models, and that metamorphic relations yield additional robustness insights even for non-failing cases.
Significance. If the reported ratios and transfer results hold under controlled conditions, the work provides a concrete demonstration of multi-objective evolutionary search for robustness testing in an industrial robotics application tied to circular-economy goals. The transferability finding and the use of metamorphic relations for stability assessment are useful for practitioners selecting or hardening vision models in refurbishment pipelines. The contribution is primarily empirical and domain-specific rather than foundational; its value depends on reproducible experimental protocols and clear separation between simulation results and physical deployment claims.
major comments (2)
- [Abstract] Abstract and experimental section: the claims of 3×–7× greater effectiveness and smaller magnitudes are presented without accompanying details on the perturbation parameterization, the precise failure metric (localization + confidence drop), the number of fitness evaluations allocated to PROBE versus random search, the number of independent runs, or any statistical tests or error bars. These omissions make the quantitative comparison impossible to verify or reproduce from the manuscript alone.
- [§4 (Experiments)] The manuscript must explicitly confirm that both PROBE and the random-search baseline receive identical evaluation budgets; otherwise the reported effectiveness ratio is not load-bearing for the central claim.
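One way to make the equal-budget requirement checkable is to route every fitness call through a counter that refuses calls once the budget is spent. The sketch below is an illustration under stated assumptions, not the paper's setup: `toy_fitness` is a hypothetical stand-in for the detector-based objective, the hill climber is a simple substitute for a guided search (it is not NSGA-II), and the budget of 2000 evaluations is the figure quoted in the authors' rebuttal.

```python
import random
from typing import Callable, List


class BudgetedEvaluator:
    """Wraps a fitness function and counts calls against a fixed budget."""

    def __init__(self, fitness: Callable[[List[float]], float], budget: int):
        self.fitness = fitness
        self.budget = budget
        self.used = 0

    def __call__(self, x: List[float]) -> float:
        if self.used >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1
        return self.fitness(x)


def toy_fitness(x: List[float]) -> float:
    """Hypothetical stand-in objective (lower is better)."""
    return sum(v * v for v in x)


def random_search(evaluate: BudgetedEvaluator, dim: int, rng: random.Random) -> float:
    """Baseline: sample uniformly until the budget is used up."""
    best = float("inf")
    while evaluate.used < evaluate.budget:
        x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        best = min(best, evaluate(x))
    return best


def hill_climb(evaluate: BudgetedEvaluator, dim: int, rng: random.Random,
               step: float = 0.1) -> float:
    """Simple guided search used as a placeholder; NOT NSGA-II."""
    current = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    best = evaluate(current)
    while evaluate.used < evaluate.budget:
        candidate = [v + rng.gauss(0.0, step) for v in current]
        score = evaluate(candidate)
        if score < best:
            current, best = candidate, score
    return best


BUDGET = 2000  # figure quoted in the rebuttal
rs_eval = BudgetedEvaluator(toy_fitness, BUDGET)
hc_eval = BudgetedEvaluator(toy_fitness, BUDGET)
rs_best = random_search(rs_eval, dim=5, rng=random.Random(0))
hc_best = hill_climb(hc_eval, dim=5, rng=random.Random(0))
print(rs_eval.used, hc_eval.used)
```

Because both searchers can only reach the objective through the wrapper, the `used` counters give a direct audit that neither method received extra evaluations.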
minor comments (3)
- Add a table or figure summarizing the object-detection models, datasets, and hyper-parameters used for both the search and the transfer experiments.
- Clarify whether the reported transfer results use held-out models under identical imaging conditions or introduce additional variation.
- The discussion of metamorphic relations would benefit from a concrete example of a relation and the stability metric applied to non-failing cases.
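As an illustration of what such a concrete example might look like, the sketch below uses a horizontal-flip relation: a detection on the mirrored image, mapped back to the original frame, should overlap the detection on the original image. The relation, the IoU-based stability score, and the box coordinates are all hypothetical choices made for this sketch; the manuscript's actual relations are not specified here.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mirror_box(box: Box, image_width: float) -> Box:
    """Map a box across the vertical centre line of an image of the given width."""
    x_min, y_min, x_max, y_max = box
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def flip_stability(pred_original: Box, pred_flipped: Box, image_width: float) -> float:
    """IoU between the original prediction and the flipped-image prediction mapped back."""
    return iou(pred_original, mirror_box(pred_flipped, image_width))


# Hypothetical detector outputs for one screw on a 100-pixel-wide image.
pred_original = (10.0, 20.0, 50.0, 60.0)
pred_flipped = (52.0, 20.0, 92.0, 60.0)  # prediction on the mirrored image
stability = flip_stability(pred_original, pred_flipped, 100.0)
print(round(stability, 3))
```

A score of 1.0 would mean the relation holds exactly; values below 1.0 quantify instability even when neither prediction counts as a failure on its own.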
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of reproducibility and experimental rigor that we will address in the revised manuscript. Below we respond point by point to the major comments.
- Referee: [Abstract] Abstract and experimental section: the claims of 3×–7× greater effectiveness and smaller magnitudes are presented without accompanying details on the perturbation parameterization, the precise failure metric (localization + confidence drop), the number of fitness evaluations allocated to PROBE versus random search, the number of independent runs, or any statistical tests or error bars. These omissions make the quantitative comparison impossible to verify or reproduce from the manuscript alone.
  Authors: We agree that the current presentation lacks sufficient detail for full reproducibility of the quantitative claims. In the revised manuscript we will expand both the abstract and §4 (Experiments) to explicitly describe: (1) the perturbation parameterization (localized pixel-level intensity changes confined to object bounding boxes with magnitude bounded by L∞ norm); (2) the precise failure metric (a composite score requiring both IoU drop below 0.5 and confidence reduction >30% relative to the clean image); (3) the evaluation budget (identical 2000 fitness evaluations per run for PROBE and random search); (4) the number of independent runs (10 runs per configuration); and (5) the statistical analysis (mean and standard deviation reported with Wilcoxon rank-sum tests and p-values). These additions will allow readers to verify the reported 3×–7× effectiveness ratios and magnitude reductions.
  revision: yes
- Referee: [§4 (Experiments)] The manuscript must explicitly confirm that both PROBE and the random-search baseline receive identical evaluation budgets; otherwise the reported effectiveness ratio is not load-bearing for the central claim.
  Authors: We confirm that PROBE and the random-search baseline were allocated identical evaluation budgets in all experiments. Section 4 already states that both methods perform the same number of fitness evaluations per run, but we acknowledge the need for greater explicitness. In the revision we will add a dedicated sentence in §4: “Both PROBE (NSGA-II) and the random-search baseline were given an identical budget of 2000 fitness evaluations per independent run across all 10 runs.” This clarification ensures the effectiveness ratios are directly comparable under controlled computational effort.
  revision: yes
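The composite failure criterion and the L∞ bound described in the first response can be written down in a few lines. The sketch below is one possible reading of that description, not the authors' code: the thresholds (IoU below 0.5, relative confidence drop above 30%) come from the rebuttal, while the function names, argument layout, and example values are hypothetical.

```python
from typing import Sequence


def linf_norm(delta: Sequence[float]) -> float:
    """L-infinity magnitude of a perturbation given as flattened per-pixel changes."""
    return max(abs(d) for d in delta)


def is_failure(iou_with_clean: float, clean_conf: float, perturbed_conf: float,
               iou_threshold: float = 0.5, conf_drop: float = 0.30) -> bool:
    """Composite criterion: the box has drifted AND the confidence has dropped.

    iou_with_clean: IoU between the detection on the perturbed image and the
                    detection on the clean image.
    clean_conf / perturbed_conf: detector confidence before and after perturbation.
    """
    if clean_conf <= 0.0:
        raise ValueError("clean confidence must be positive")
    relative_drop = (clean_conf - perturbed_conf) / clean_conf
    return iou_with_clean < iou_threshold and relative_drop > conf_drop


# Hypothetical cases.
print(is_failure(0.42, 0.90, 0.55))  # box drifted, confidence down about 39%
print(is_failure(0.42, 0.90, 0.80))  # box drifted, confidence down about 11%
print(is_failure(0.70, 0.90, 0.30))  # confidence collapsed, box still overlaps
print(linf_norm([0.0, -0.03, 0.02]))
```

Requiring both conditions is stricter than either alone, so a perturbation that only lowers confidence or only shifts the box would not count as a failure under this reading.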
Circularity Check
No significant circularity; purely empirical evaluation
The paper describes an application of NSGA-II multi-objective search (PROBE) to generate perturbations for testing object-detection robustness in robotic software. No equations, derivations, or parameter-fitting steps are present that could reduce outputs to inputs by construction. The central claims rest on direct experimental comparisons (PROBE vs. random search under identical budgets, transfer across held-out models) whose validity depends on the experimental protocol rather than any self-referential definition or self-citation chain. The work is self-contained as an empirical study.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Small localized pixel perturbations can induce failures in object detection models
- standard math: NSGA-II can efficiently explore the space of image perturbations for multi-objective trade-offs
invented entities (1)
- PROBE: no independent evidence