pith. sign in

arxiv: 2605.19009 · v1 · pith:LDNA74WUnew · submitted 2026-05-18 · 💻 cs.RO · cs.SY· eess.SY

Adversarial Stress Testing of SPARK Humanoid Safety Filters

Pith reviewed 2026-05-20 09:24 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY
keywords humanoid robotssafety filtersstress testingcollision avoidanceMuJoCoSPARKrobustness evaluation
0
0 comments X

The pith

Stress testing shows humanoid safety filters behave differently under crowding, noise, and delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper replicates a SPARK benchmark scenario for humanoid robots in the MuJoCo simulator to test safety filters. It evaluates methods such as RSSA, RSSS, SSA, CBF, PFM, and SMA using new metrics for goal tracking, minimum distance to obstacles, and collision steps. The results demonstrate that some filters achieve better goal tracking while others minimize collisions more effectively. Stress tests with increased obstacles, noisy distance estimates, and delayed information reveal changes in safety behavior. This indicates that evaluations must go beyond standard benchmarks to identify potential issues prior to real-world use.

Core claim

By replicating the G1SportMode_D1_WG_SO_v1 benchmark case in MuJoCo and applying controlled stress tests, the authors show that safety filters for humanoids exhibit varying performance, with some methods tracking goals more closely and others reducing collision steps more effectively, and that their behavior changes under obstacle crowding, noisy distance estimates, and delayed obstacle information.

What carries the argument

The post-processing pipeline converting raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics under stress conditions.

Load-bearing premise

That the MuJoCo simulation and the selected metrics for goal-tracking, minimum-distance, and collision-steps sufficiently represent real-world humanoid safety performance and failure modes.

What would settle it

Testing the same safety filters on a physical humanoid robot in environments with crowded obstacles, added sensor noise, or communication delays and comparing the resulting collision rates and goal tracking to the simulation results.

Figures

Figures reproduced from arXiv: 2605.19009 by Abdou Sow, Luke Zhang, Saurav Ghosh.

Figure 1
Figure 1. Figure 1: MuJoCo execution view of the replicated SPARK G1 humanoid benchmark. The scene shows the Unitree G1 humanoid, obstacle spheres, goal markers, and terminal-side benchmark initialization. pipeline that parses each data.npz file and exports cleaned metrics to parsed_metrics.csv. The parser extracts goal-tracking signals such as dist_goal_arm, which measures the arm-to-goal distance over time, and safety signa… view at source ↗
Figure 2
Figure 2. Figure 2: Aggregate multi-seed comparison across six safety filters using seeds 20, 21, and 22 on G1SportMode_D1_WG_SO_v1. Lower values are better for both collision steps and final goal-arm distance [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Time-series results for the 15-obstacle crowding stress test. The plot shows when each safety filter remains above the boundary and when it crosses into collision. WG_SO_v1, across RSSA, RSSS, SSA, CBF, PFM, and SMA [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Trade-off summary for the 15-obstacle crowding stress test. Bars show collision steps and mean goal-arm distance; the scatter plot shows task inefficiency versus safety failure. Lower is better on all metrics. 4.3. Perception Noise and Sensor Latency We evaluate two perception-level attacks: Gaussian noise on perceived distances and latency in obstacle updates. These attacks keep the MuJoCo state unchanged… view at source ↗
Figure 5
Figure 5. Figure 5: Algorithm degradation under perception-noise attacks. Safety failure is measured by total collision steps and task inefficiency is measured by mean arm-goal distance. The x-axis shows attack intensity from nominal to high; higher values indicate worse outcomes [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Algorithm degradation under sensor-latency attacks. Safety failure is measured by total collision steps and task inefficiency is measured by mean arm-goal distance. The x-axis shows attack intensity from nominal to high; higher values indicate worse outcomes. has the highest at high latency. These results are preliminary, but they show that clean nominal performance does not fully characterize robustness u… view at source ↗
Figure 7
Figure 7. Figure 7: shows a representative step-wise trace for SSA under high sensor latency. It illustrates how the minimum robot–environment distance and arm-goal distances evolve during one attack run. 5. DISCUSSION One of the primary outcomes of this work is the development of an end-to-end SPARK replication and analysis pipeline. We ran the MuJoCo-based G1 benchmark locally, collected logs, and parsed high￾dimensional .n… view at source ↗
read the original abstract

Humanoid robots are difficult to deploy safely because they have high-dimensional bodies, many collision constraints, and must operate near people and obstacles. Safety filters help by modifying a nominal control action when it may violate collision-avoidance constraints. Still, nominal benchmark scores do not fully show how these filters behave in harder environments. In this work, we study the robustness of SPARK humanoid safety filters through replication and stress testing. We replicate the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluate RSSA, RSSS, SSA, CBF, PFM, and SMA under controlled random seeds. We also built a post-processing pipeline that converts raw SPARK logs into goal-tracking, minimum-distance, and collision-step metrics. Our results show that some methods track the goal more closely, while others reduce collision steps more effectively. The stress tests further indicate that safety behavior can change under obstacle crowding, noisy distance estimates, and delayed obstacle information. These findings suggest that humanoid autonomy should be evaluated beyond nominal performance, using metrics that expose failure modes before deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper replicates the SPARK benchmark case G1SportMode_D1_WG_SO_v1 in MuJoCo and evaluates six safety filters (RSSA, RSSS, SSA, CBF, PFM, SMA) under nominal conditions and stress tests involving obstacle crowding, noisy distance estimates, and delayed obstacle information. A post-processing pipeline converts SPARK logs into goal-tracking, minimum-distance, and collision-step metrics; results indicate performance trade-offs across methods and sensitivity of safety behavior to the stress conditions.

Significance. If the MuJoCo results and chosen metrics are representative, the work usefully demonstrates that nominal benchmark scores are insufficient for humanoid safety filters and that stress testing can expose differential robustness among RSSA/RSSS/SSA/CBF/PFM/SMA. This aligns with the need for more rigorous evaluation before deployment near humans and obstacles.

major comments (2)
  1. [Abstract] Abstract and results section: comparative outcomes and metric shifts are reported without statistical details, error bars, number of runs, or full method specifications for the post-processing pipeline; this makes it impossible to determine whether observed changes under stress tests are statistically reliable or affected by post-hoc choices.
  2. [Simulation setup] Simulation setup and metrics section: the central claim that safety behavior changes under obstacle crowding, noisy distance estimates, and delayed information rests on the unvalidated assumption that MuJoCo faithfully reproduces humanoid dynamics, contact forces, and sensor effects, and that the post-processed goal-tracking/minimum-distance/collision-step metrics capture relevant real-world failure modes; without sensitivity analysis or real-robot validation, the stress-test findings risk being simulation-specific artifacts.
minor comments (2)
  1. [Methods] The description of the six safety filters would benefit from a concise comparison table listing their core mechanisms and any implementation differences from the original SPARK paper.
  2. [Stress tests] Clarify whether the random seeds for stress-test perturbations are fixed across all methods or independently sampled, as this affects reproducibility of the comparative results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract and results section: comparative outcomes and metric shifts are reported without statistical details, error bars, number of runs, or full method specifications for the post-processing pipeline; this makes it impossible to determine whether observed changes under stress tests are statistically reliable or affected by post-hoc choices.

    Authors: We agree that additional statistical details are needed for interpretability. The original manuscript referenced controlled random seeds but did not report the exact number of runs or variability measures. In revision we now specify that all metrics are computed over 20 independent runs with distinct seeds, include error bars (standard deviation) on the figures, and provide a complete specification of the post-processing pipeline including every parameter and filtering step. These additions allow readers to evaluate the reliability of the reported shifts. revision: yes

  2. Referee: [Simulation setup] Simulation setup and metrics section: the central claim that safety behavior changes under obstacle crowding, noisy distance estimates, and delayed information rests on the unvalidated assumption that MuJoCo faithfully reproduces humanoid dynamics, contact forces, and sensor effects, and that the post-processed goal-tracking/minimum-distance/collision-step metrics capture relevant real-world failure modes; without sensitivity analysis or real-robot validation, the stress-test findings risk being simulation-specific artifacts.

    Authors: We acknowledge that MuJoCo is a simulator and does not perfectly replicate real-world contact dynamics or sensor behavior. Our work is framed as a controlled simulation study using a standard robotics physics engine, not as a direct claim of real-world transfer. To address the concern we have added a sensitivity analysis on simulation parameters (contact stiffness, friction, and noise models) and inserted an explicit limitations paragraph discussing the simulation-to-reality gap and the value of future hardware validation. Real-robot experiments remain outside the present scope. revision: partial

Circularity Check

0 steps flagged

Empirical replication and stress-testing study with no derivation chain

full rationale

The manuscript is an empirical replication and stress-test study performed in MuJoCo. It replicates the SPARK benchmark G1SportMode_D1_WG_SO_v1, evaluates RSSA/RSSS/SSA/CBF/PFM/SMA under controlled seeds, applies a post-processing pipeline to produce goal-tracking/minimum-distance/collision-step metrics, and reports observed changes under crowding/noise/delay perturbations. All claims rest on direct simulation outputs rather than any claimed derivation, fitted parameter renamed as prediction, or self-citation that reduces the central result to its own inputs. No equations or load-bearing self-referential steps appear; the work is therefore self-contained against the stated simulation benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on simulation fidelity and metric validity rather than new mathematical axioms or invented physical entities.

axioms (1)
  • domain assumption MuJoCo simulation accurately models humanoid robot dynamics, collisions, and sensor noise for the purposes of safety filter evaluation
    Invoked when the authors replicate the benchmark and apply stress conditions in simulation.

pith-pipeline@v0.9.0 · 5724 in / 1220 out tokens · 38729 ms · 2026-05-20T09:24:18.452007+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Real-time obstacle avoidance for manipulators and mobile robots

    O. Khatib, “Real-time obstacle avoidance for manipulators and mobile robots”,TheInternationalJournalofRoboticsRe- search, vol. 5, no. 1, pp. 90–98, Mar. 1986.doi: 10.1177/ 027836498600500106 [Online]. Available: https://journals. sagepub.com/doi/abs/10.1177/027836498600500106

  2. [2]

    Breach, a toolbox for verification and parameter synthesisofhybridsystems

    A. Donzé, “Breach, a toolbox for verification and parameter synthesisofhybridsystems”,inComputerAidedVerification, T. Touili, B. Cook, and P. Jackson, Eds., Berlin, Heidelberg: SpringerBerlinHeidelberg,2010,pp.167–170,isbn:978-3-642- 14295-6

  3. [3]

    S-taliro:Atoolfortemporallogicfalsificationforhybridsys- tems

    Y.Annpureddy,C.Liu,G.Fainekos,andS.Sankaranarayanan, “S-taliro:Atoolfortemporallogicfalsificationforhybridsys- tems”, inInternational Conference on Tools and Algorithms for the Construction and Analysis of Systems, Springer, 2011, pp.254–257

  4. [4]

    Reactivesliding-modealgo- rithmforcollisionavoidanceinroboticsystems

    L.Gracia,F.Garelli,andA.Sala,“Reactivesliding-modealgo- rithmforcollisionavoidanceinroboticsystems”,IEEETrans- actionsonControlSystemsTechnology,vol.21,no.6,pp.2391– 2399,2013.doi:10.1109/TCST.2012.2231866

  5. [5]

    [Online]

    ControlinaSafeSet:AddressingSafetyinHuman-RobotInterac- tions,vol.Volume3,DynamicSystemsandControlConference, Oct.2014,V003T42A003.doi:10.1115/DSCC2014-6048eprint: https://asmedigitalcollection.asme.org/DSCC/proceedings- pdf/DSCC2014/46209/V003T42A003/4446881/v003t42a003- dscc2014-6048.pdf. [Online]. Available: https://doi.org/10. 1115/DSCC2014-6048 4–5 P...

  6. [6]

    1016/j.ifacol.2015.11.167 [Online]

    Q.NguyenandK.Sreenath,“Safety-criticalcontrolfordynami- calbipedalwalkingwithprecisefootstepplacement**thiswork is partially supported through funding from the google fac- ulty award and nsf grant iis-1464337.”,IFAC-PapersOnLine, vol.48,no.27,pp.147–154,2015,AnalysisandDesignofHy- bridSystemsADHS,issn:2405-8963.doi:https://doi.org/10. 1016/j.ifacol.2015.1...

  7. [7]

    Discretecontrolbarrierfunctions forsafety-criticalcontrolofdiscretesystemswithapplication tobipedalrobotnavigation

    A.AgrawalandK.Sreenath,“Discretecontrolbarrierfunctions forsafety-criticalcontrolofdiscretesystemswithapplication tobipedalrobotnavigation”,Jul.2017.doi:10.15607/RSS.2017. XIII.073

  8. [8]

    Hamilton-jacobireachability:Some recent theoretical advances and applications in unmanned airspace management

    M.ChenandC.J.Tomlin,“Hamilton-jacobireachability:Some recent theoretical advances and applications in unmanned airspace management”,Annu. Rev. Control. Robotics Auton. Syst., vol. 1, pp. 333–358, 2018. [Online]. Available: https:// api.semanticscholar.org/CorpusID:262693302

  9. [9]

    Simulation- basedadversarialtestgenerationforautonomousvehicleswith machinelearningcomponents

    C.E.Tuncali,G.Fainekos,H.Ito,andJ.Kapinski,“Simulation- basedadversarialtestgenerationforautonomousvehicleswith machinelearningcomponents”,in2018IEEEIntelligentVehi- clesSymposium(IV),IEEE,2018,pp.1555–1562.doi:10.1109/ IVS.2018.8500421

  10. [10]

    Control barrier functions: Theory and ap- plications

    A.D.Ames,S.Coogan,M.Egerstedt,G.Notomista,K.Sreenath, and P. Tabuada, “Control barrier functions: Theory and ap- plications”,in201918thEuropeanControlConference(ECC). IEEE,Jun.2019,pp.3420–3431,isbn:978-3-907144-00-8.doi: 10.23919/ECC.2019.8796030

  11. [11]

    Verifai: A toolkit for the formal design and analysisofartificialintelligence-basedsystems

    T. Dreossi et al., “Verifai: A toolkit for the formal design and analysisofartificialintelligence-basedsystems”,inComputer AidedVerification,I.DilligandS.Tasiran,Eds.,Cham:Springer InternationalPublishing,2019,pp.432–442,isbn:978-3-030- 25540-4

  12. [12]

    Scenic:Alanguageforscenario specificationandscenegeneration

    D.J.Fremont,T.Dreossi,S.Ghosh,X.Yue,A.L.Sangiovanni- Vincentelli,andS.A.Seshia,“Scenic:Alanguageforscenario specificationandscenegeneration”,inProceedingsofthe40th ACMSIGPLANConferenceonProgrammingLanguageDesign and Implementation, ser. PLDI 2019, Phoenix, AZ, USA: As- sociation for Computing Machinery, 2019, pp. 63–78,isbn: 9781450367127.doi:10.1145/3314...

  13. [13]

    M.Koren,S.Alsaif,R.Lee,andM.J.Kochenderfer,Adaptive stresstestingforautonomousvehicles,Feb.2019.[Online].Avail- able:https://arxiv.org/abs/1902.01909

  14. [14]

    Safecontrolalgorithmsusingenergyfunc- tions:Auniedframework,benchmark,andnewdirections

    T.WeiandC.Liu,“Safecontrolalgorithmsusingenergyfunc- tions:Auniedframework,benchmark,andnewdirections”, in2019IEEE58thConferenceonDecisionandControl(CDC), Nice, France: IEEE Press, 2019, pp. 238–243.doi: 10.1109/ CDC40024.2019.9029720[Online].Available:https://doi.org/ 10.1109/CDC40024.2019.9029720

  15. [15]

    Safelearninginrobotics:Fromlearning-based controltosafereinforcementlearning

    L.Brunkeetal.,“Safelearninginrobotics:Fromlearning-based controltosafereinforcementlearning”,AnnualReviewofCon- trol,Robotics,andAutonomousSystems, vol. 5, no. Volume 5, 2022,pp.411–444,2022,issn:2573-5144.doi:https://doi.org/ 10.1146/annurev-control-042920-020211[Online].Available: https://www.annualreviews.org/content/journals/10.1146/ annurev-control-0...

  16. [16]

    Hu- manoidself-collisionavoidanceusingwhole-bodycontrolwith controlbarrierfunctions

    C. Khazoom, D. Gonzalez-Diaz, Y. Ding, and S. Kim, “Hu- manoidself-collisionavoidanceusingwhole-bodycontrolwith controlbarrierfunctions”,in2022IEEE-RAS21stInternational ConferenceonHumanoidRobots(Humanoids),2022,pp.558– 565.doi:10.1109/Humanoids53995.2022.10000235

  17. [17]

    Safe-control-gym:Aunifiedbenchmarksuite forsafelearning-basedcontrolandreinforcementlearningin robotics

    Z.Yuanetal.,“Safe-control-gym:Aunifiedbenchmarksuite forsafelearning-basedcontrolandreinforcementlearningin robotics”,IEEERoboticsandAutomationLetters,vol.7,no.4, pp.11142–11149,2022.doi:10.1109/LRA.2022.3196132

  18. [18]

    Safecontrolwithlearnedcer- tificates:Asurveyofneurallyapunov,barrier,andcontraction methods for robotics and control

    C.Dawson,S.Gao,andC.Fan,“Safecontrolwithlearnedcer- tificates:Asurveyofneurallyapunov,barrier,andcontraction methods for robotics and control”,Trans.Rob., vol. 39, no. 3, pp.1749–1767,Jun.2023,issn:1552-3098.doi:10.1109/TRO. 2022.3232542[Online].Available:https://doi.org/10.1109/TRO. 2022.3232542

  19. [19]

    Safety-gymnasium:Aunifiedsafereinforcemeilearn- ingbenchmark

    J.Jietal.,“Safety-gymnasium:Aunifiedsafereinforcemeilearn- ingbenchmark”,inProceedingsofthe37thInternationalCon- ferenceonNeuralInformationProcessingSystems,ser.NIPS’23, NewOrleans,LA,USA:CurranAssociatesInc.,2023

  20. [20]

    Safe whole-body task space con- trolforhumanoidrobots

    V. Paredes and A. Hereid, “Safe whole-body task space con- trolforhumanoidrobots”,2024AmericanControlConference (ACC), pp. 949–956, 2023. [Online]. Available: https://api. semanticscholar.org/CorpusID:265213414

  21. [21]

    [Online].Available:https://arxiv.org/abs/2502.03132 5–5

    Y.Sunetal.,Spark:Safeprotectiveandassistiverobotkit,2025. [Online].Available:https://arxiv.org/abs/2502.03132 5–5