arxiv: 2605.07530 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.SE

Search-based Robustness Testing of Laptop Refurbishing Robotic Software

Erblin Isaku , Hassan Sartaj , Shaukat Ali , Malaika Din Hashmi , Francois Picard This is my paper

Pith reviewed 2026-05-11 01:48 UTC · model grok-4.3

classification 💻 cs.RO cs.SE

keywords robustness testingobject detectionsearch-based testingrobotic softwareperturbation generationmulti-objective optimizationlaptop refurbishment

0 comments p. Extension

The pith

A search-based method finds minimal perturbations that expose failures in object detection models for laptop-refurbishing robots three to seven times more effectively than random search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PROBE to test robustness of object detection models in robotic laptop refurbishment software by searching for small localized perturbations that trigger detection failures. It frames this as a multi-objective optimization problem that simultaneously seeks to induce failures while keeping perturbations minimal in size and location. This matters because undetected failures in identifying screws or stickers could damage laptops during automated disassembly or cleaning. PROBE uses NSGA-II to explore the space and produces more failure cases with smaller changes than random testing, with those cases carrying over to other models. Metamorphic relations extend the assessment to stability checks even when models do not fail.

Core claim

PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases in the object detection models used by laptop refurbishing robots.

What carries the argument

PROBE, a multi-objective search using NSGA-II that generates localized input perturbations to induce failures in object detection while minimizing perturbation magnitude.

If this is right

PROBE generates failure-inducing perturbations 3× to 7× more effectively than random search while using smaller perturbation magnitudes.
The perturbations transfer across different object detection models.
Metamorphic relations provide additional robustness insights even in non-failing cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same search approach could be applied to other vision-based robotic tasks such as sorting or assembly to surface similar hidden sensitivities.
If the perturbations map to real hardware variations, they could be reused as targeted test cases in physical validation loops.
Embedding this style of search into the robot software development cycle would allow earlier detection of robustness gaps before deployment.

Load-bearing premise

The synthetic perturbations discovered in simulation correspond to realistic physical variations such as lighting changes, camera noise, or sticker placements that the robot will encounter during actual operation.

What would settle it

Running the perturbations found by PROBE on the physical robot under real operating conditions and observing that they produce no failures or require substantially larger magnitudes than in simulation.

Figures

Figures reproduced from arXiv: 2605.07530 by Erblin Isaku, Francois Picard, Hassan Sartaj, Malaika Din Hashmi, Shaukat Ali.

**Figure 2.** Figure 2: Overview of PROBE, a search-based robustness testing approach for the screw detection component in the laptop refurbishment software. 3.2 Problem Formulation Let x ∈ X denote an input image and y its corresponding ground-truth annotations, consisting of a set of labeled bounding boxes. Let M be a perception model that, given an input image, produces a set of predictions yˆ = M(x), where each prediction inc… view at source ↗

**Figure 3.** Figure 3: Distribution and consistency of failure types across models. The top row shows the per-image distribution of [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of the failure types identified by [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

The Danish Technological Institute (DTI) focuses on transferring advanced technologies (including robots) to the industry and the public sector. One key application is laptop refurbishment using specialized robots, aimed at promoting reuse, reducing electronic waste, and supporting the European Circular Economy Action Plan. The software of such robots often includes features that use object detection models to detect objects for various purposes, such as identifying screws for laptop disassembly or detecting stickers to remove them. Ensuring the robustness of such models to small input variations remains a critical challenge, and addressing it is important to avoid potential damage to laptops during refurbishment. In this paper, we propose PROBE, a search-based robustness testing approach that leverages multi-objective optimization to identify minimal, localized perturbations that expose failures in object detection models used in the software of laptop refurbishing robots. PROBE employs NSGA-II to systematically explore the perturbation space, optimizing for failure induction considering both localization and confidence, and perturbation magnitude, while enabling the discovery of diverse failure cases. Results show that PROBE is 3$\times$ to 7$\times$ more effective than random search in generating failure-inducing perturbations, while requiring smaller perturbation magnitudes, and that the generated perturbations transfer across models. We further show that metamorphic relations provide additional insights into model robustness, enabling the assessment of stability even in non-failing cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PROBE applies NSGA-II to hunt for small perturbations that break object detectors in a laptop-refurbishing robot and reports 3-7x more failures than random search with better transfer.

read the letter

The paper takes the standard NSGA-II multi-objective search and points it at object detection models that a robot uses to find screws and stickers on laptops during refurbishment. It claims the search produces three to seven times more failure-inducing perturbations than random search, keeps the changes smaller, and that the failures carry over to other models. They also run metamorphic relations to check stability on cases that do not fail outright.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes PROBE, a search-based robustness testing method that applies NSGA-II multi-objective optimization to discover minimal, localized perturbations exposing failures in object detection models used by laptop refurbishing robots. The central empirical claims are that PROBE generates 3× to 7× more failure-inducing perturbations than random search while using smaller perturbation magnitudes, that the discovered perturbations transfer across models, and that metamorphic relations yield additional robustness insights even for non-failing cases.

Significance. If the reported ratios and transfer results hold under controlled conditions, the work provides a concrete demonstration of multi-objective evolutionary search for robustness testing in an industrial robotics application tied to circular-economy goals. The transferability finding and the use of metamorphic relations for stability assessment are useful for practitioners selecting or hardening vision models in refurbishment pipelines. The contribution is primarily empirical and domain-specific rather than foundational; its value depends on reproducible experimental protocols and clear separation between simulation results and physical deployment claims.

major comments (2)

[Abstract] Abstract and experimental section: the claims of 3×–7× greater effectiveness and smaller magnitudes are presented without accompanying details on the perturbation parameterization, the precise failure metric (localization + confidence drop), the number of fitness evaluations allocated to PROBE versus random search, the number of independent runs, or any statistical tests or error bars. These omissions make the quantitative comparison impossible to verify or reproduce from the manuscript alone.
[§4 (Experiments)] The manuscript must explicitly confirm that both PROBE and the random-search baseline receive identical evaluation budgets; otherwise the reported effectiveness ratio is not load-bearing for the central claim.

minor comments (3)

Add a table or figure summarizing the object-detection models, datasets, and hyper-parameters used for both the search and the transfer experiments.
Clarify whether the reported transfer results use held-out models under identical imaging conditions or introduce additional variation.
The discussion of metamorphic relations would benefit from a concrete example of a relation and the stability metric applied to non-failing cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of reproducibility and experimental rigor that we will address in the revised manuscript. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Abstract] Abstract and experimental section: the claims of 3×–7× greater effectiveness and smaller magnitudes are presented without accompanying details on the perturbation parameterization, the precise failure metric (localization + confidence drop), the number of fitness evaluations allocated to PROBE versus random search, the number of independent runs, or any statistical tests or error bars. These omissions make the quantitative comparison impossible to verify or reproduce from the manuscript alone.

Authors: We agree that the current presentation lacks sufficient detail for full reproducibility of the quantitative claims. In the revised manuscript we will expand both the abstract and §4 (Experiments) to explicitly describe: (1) the perturbation parameterization (localized pixel-level intensity changes confined to object bounding boxes with magnitude bounded by L∞ norm); (2) the precise failure metric (a composite score requiring both IoU drop below 0.5 and confidence reduction >30% relative to the clean image); (3) the evaluation budget (identical 2000 fitness evaluations per run for PROBE and random search); (4) the number of independent runs (10 runs per configuration); and (5) the statistical analysis (mean and standard deviation reported with Wilcoxon rank-sum tests and p-values). These additions will allow readers to verify the reported 3×–7× effectiveness ratios and magnitude reductions. revision: yes
Referee: [§4 (Experiments)] The manuscript must explicitly confirm that both PROBE and the random-search baseline receive identical evaluation budgets; otherwise the reported effectiveness ratio is not load-bearing for the central claim.

Authors: We confirm that PROBE and the random-search baseline were allocated identical evaluation budgets in all experiments. Section 4 already states that both methods perform the same number of fitness evaluations per run, but we acknowledge the need for greater explicitness. In the revision we will add a dedicated sentence in §4: “Both PROBE (NSGA-II) and the random-search baseline were given an identical budget of 2000 fitness evaluations per independent run across all 10 runs.” This clarification ensures the effectiveness ratios are directly comparable under controlled computational effort. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical evaluation

full rationale

The paper describes an application of NSGA-II multi-objective search (PROBE) to generate perturbations for testing object-detection robustness in robotic software. No equations, derivations, or parameter-fitting steps are present that could reduce outputs to inputs by construction. The central claims rest on direct experimental comparisons (PROBE vs. random search under identical budgets, transfer across held-out models) whose validity depends on the experimental protocol rather than any self-referential definition or self-citation chain. The work is self-contained as an empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on standard assumptions from adversarial machine learning and evolutionary computation; no new physical laws or entities are postulated.

axioms (2)

domain assumption Small localized pixel perturbations can induce failures in object detection models
Invoked implicitly when defining the search objective for failure induction.
standard math NSGA-II can efficiently explore the space of image perturbations for multi-objective trade-offs
Relies on the established properties of the NSGA-II algorithm from prior optimization literature.

invented entities (1)

PROBE no independent evidence
purpose: Named search-based robustness testing framework
New label for the combination of NSGA-II with specific objectives for localization, confidence, and magnitude; no independent evidence beyond the paper's experiments.

pith-pipeline@v0.9.0 · 5547 in / 1416 out tokens · 35328 ms · 2026-05-11T01:48:16.108985+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

Shaukat Ali, Lionel C Briand, Hadi Hemmati, and Rajwinder Kaur Panesar-Walawege. 2009. A systematic review of the application and empirical investigation of search-based test case generation.IEEE Transactions on Software Engineering36, 6 (2009), 742–762

work page 2009
[2]

Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. InProceedings of the 33rd international conference on software engineering. 1–10

work page 2011
[3]

Blank and K

J. Blank and K. Deb. 2020. pymoo: Multi-Objective Optimization in Python.IEEE Access8 (2020), 89497–89509

work page 2020
[4]

2020.A New Circular Economy Action Plan - For a cleaner and more competitive Europe

European Commission. 2020.A New Circular Economy Action Plan - For a cleaner and more competitive Europe. Technical Report. Publications Office of the European Union

work page 2020
[5]

Jianmin Guo, Yu Jiang, Yue Zhao, Quan Chen, and Jiaguang Sun. 2018. Dlfuzz: Differential fuzzing testing of deep learning systems. InProceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 739–743

work page 2018
[6]

Mark Harman, S Afshin Mansouri, and Yuanyuan Zhang. 2012. Search-based software engineering: Trends, techniques and applications.ACM Computing Surveys (CSUR)45, 1 (2012), 1–61. 14

work page 2012
[7]

Mark Harman and Phil McMinn. 2009. A theoretical and empirical study of search-based testing: Local, global, and hybrid search.IEEE Transactions on Software Engineering36, 2 (2009), 226–247

work page 2009
[8]

Dmytro Humeniuk, Foutse Khomh, and Giuliano Antoniol. 2023. Ambiegen: A search-based framework for autonomous systems testing.Science of Computer Programming230 (2023), 102990

work page 2023
[9]

Bing Liu, Shiva Nejati, Lionel C Briand, et al . 2017. Improving fault localization for Simulink models using search-based testing and prediction models. In2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 359–370

work page 2017
[10]

Chengjie Lu, Jiahui Wu, Shaukat Ali, Malaika Din Hashmi, Sebastian Mathias Thomle Mason, Francois Picard, Mikkel Labori Olsen, and Thomas Peyrucain. 2026. UAMTERS: Uncertainty-Aware Mutation Analysis for DL-enabled Robotic Software.arXiv preprint arXiv:2602.20334(2026)

work page arXiv 2026
[11]

Chengjie Lu, Jiahui Wu, Shaukat Ali, and Mikkel Labori Olsen. 2025. Assessing the uncertainty and robustness of the laptop refurbishing software. In2025 IEEE Conference on Software Testing, Verification and Validation (ICST). IEEE, 406–416

work page 2025
[12]

Chengjie Lu, Huihui Zhang, Tao Yue, and Shaukat Ali. 2021. Search-based selection and prioritization of test scenarios for autonomous driving systems. InInternational Symposium on Search Based Software Engineering. Springer, 41–55

work page 2021
[13]

Kiran Maharana, Surajit Mondal, and Bhushankumar Nemade. 2022. A review: Data pre-processing and data augmentation techniques.Global Transitions Proceedings3, 1 (2022), 91–99

work page 2022
[14]

Malaika Din Hashmi. 2024. RoboSAPIENS Extended Screw Detection Dataset (Fixtures) – Part A. https: //universe.roboflow.com/malaika-din-hashmi/robosapiens/dataset/12. Accessed: 2026-04-01

work page 2024
[15]

Malaika Din Hashmi. 2024. RoboSAPIENS Extended Screw Detection Dataset (Fixtures) – Part B. https: //universe.roboflow.com/malaika-din-hashmi/robosapiens/dataset/14. Accessed: 2026-04-01

work page 2024
[16]

Malaika Din Hashmi. 2024. RoboSAPIENS Screw Detection Dataset v0 (Black Screws Only). https:// universe.roboflow.com/malaika-din-hashmi/robosapiens/dataset/5. Accessed: 2026-04-01

work page 2024
[17]

Malaika Din Hashmi. 2024. RoboSAPIENS Screw Detection Dataset v1 (Screws and Non-Screw Objects). https://universe.roboflow.com/malaika-din-hashmi/robosapiens/dataset/8. Accessed: 2026- 04-01

work page 2024
[18]

Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. Deepxplore: Automated whitebox testing of deep learning systems. Inproceedings of the 26th Symposium on Operating Systems Principles. 1–18

work page 2017
[19]

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. InProceedings of the IEEE conference on computer vision and pattern recognition. 779–788

work page 2016
[20]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks.Advances in neural information processing systems28 (2015)

work page 2015
[21]

Ke Shang, Hisao Ishibuchi, Linjun He, and Lie Meng Pang. 2020. A survey on the hypervolume indicator in evolutionary multiobjective optimization.IEEE Transactions on Evolutionary Computation25, 1 (2020), 1–20

work page 2020
[22]

Fabio Henrique Kiyoiti Dos Santos Tanaka and Claus Aranha. 2019. Data augmentation using GANs.arXiv preprint arXiv:1904.09135(2019)

work page arXiv 2019
[23]

Shuncheng Tang, Zhenya Zhang, Yi Zhang, Jixiang Zhou, Yan Guo, Shuang Liu, Shengjian Guo, Yan-Fu Li, Lei Ma, Yinxing Xue, et al. 2023. A survey on automated driving system testing: Landscapes and trends.ACM Transactions on Software Engineering and Methodology32, 5 (2023), 1–62

work page 2023
[24]

Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. Deeptest: Automated testing of deep-neural- network-driven autonomous cars. InProceedings of the 40th international conference on software engineering. 303–314

work page 2018
[25]

Jingyi Wang, Jialuo Chen, Youcheng Sun, Xingjun Ma, Dongxia Wang, Jun Sun, and Peng Cheng. 2021. Robot: Robustness-oriented testing for deep learning systems. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 300–311

work page 2021
[26]

Amirhossein Zolfagharian, Manel Abdellatif, Lionel C Briand, Mojtaba Bagherzadeh, et al. 2023. A search-based testing approach for deep reinforcement learning agents.IEEE Transactions on Software Engineering49, 7 (2023), 3715–3735. 15

work page 2023