pith. machine review for the scientific record. sign in

arxiv: 2604.24018 · v1 · submitted 2026-04-27 · 💻 cs.RO

Recognition: unknown

Betting for Sim-to-Real Performance Evaluation

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords sim-to-realperformance evaluationbettingMonte Carlo estimatorroboticsvariance reductionsimulatorpick-and-place
0
0 comments X

The pith

A betting mechanism yields more accurate real-world robot performance estimates than Monte Carlo sampling by constructing simulator-guided bets under specific theoretical conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines robot performance evaluation when physical experiments are scarce and costly. It shows that a betting approach, which places informed wagers based on simulator data, can produce better estimates of real behavior than simply averaging random trials. The work derives conditions under which these bets are provably more efficient, develops practical approximations, and supplies rules to verify the bets are functioning as expected. This setup allows unconventional uses of multiple synthetic distributions to infer real pick-and-place accuracy for manipulators. The result matters because it offers a way to stretch limited real-world testing while maintaining statistical reliability for benchmarking and validation.

Core claim

The paper establishes theoretical conditions under which a betting mechanism can yield accurate and efficient estimates of real-world robot performance, provably outperforming the Monte Carlo estimator. It characterizes how such bets should be constructed from available simulators, develops theoretically grounded yet practically implementable approximations of the ideal bet, and provides concrete decision rules that diagnose when these approximate betting strategies are working as intended. The approach is demonstrated on synthetic examples, cross-fidelity computational simulators, and an illustrative case using synthetic distributions to infer real-world pick-and-place accuracy of a robotic

What carries the argument

the betting mechanism, which constructs simulator-derived bets to produce lower-variance estimates of real-world performance than direct Monte Carlo averaging of physical trials.

If this is right

  • When the stated conditions hold, betting reduces the number of physical trials needed for a target estimation accuracy compared with Monte Carlo.
  • Approximate bets remain effective even without exact knowledge of the underlying distributions, provided the diagnostic rules confirm reliability.
  • The same betting framework supports inference from groups of synthetic distributions to real manipulator accuracy without direct real-world sampling of that specific scenario.
  • Decision rules allow users to detect and avoid cases where the betting strategy fails to deliver its promised advantage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The betting perspective could be combined with existing variance-reduction techniques such as importance sampling to achieve further efficiency gains in sim-to-real settings.
  • If the sim-to-real gap violates the construction assumptions, the method would revert to no better than Monte Carlo, pointing to the value of adaptive bet updating during real tests.
  • The framework suggests a general template for other expensive evaluation domains where cheap simulators can be turned into informed bets rather than used only for pre-filtering.

Load-bearing premise

Theoretical conditions exist that let properly constructed bets from simulators outperform plain Monte Carlo sampling while the sim-to-real transfer assumptions remain valid.

What would settle it

A side-by-side experiment on a physical robot task with known ground-truth performance where the mean-squared error of the betting estimator exceeds the Monte Carlo estimator despite following the paper's construction and diagnostic rules.

Figures

Figures reproduced from arXiv: 2604.24018 by Bowen Weng, Yujia Chen, Zaid Mahboob.

Figure 1
Figure 1. Figure 1: A roadmap of Section II. Theoretical results (Theorems 1-3) establish when and how betting improves estimation; algorithms translate these insights into practice. Arrows show logical dependencies. drawing auxiliary information from the accumulated simulator samples, a bet bt is chosen as a fraction of the (future) payoff, reflecting the algorithm’s belief about the correctness of a forthcoming prediction (… view at source ↗
Figure 2
Figure 2. Figure 2: Experiment results for Section III: 2a: win-rate comparison of all methods against the Monte Carlo baseline across different Real synthetic distributions, learning rates, and numbers of rounds; 2b: average wealth across methods on the Real_6 distributions, providing empirical support for Theorem 3; 2c: Round-by-round estimates of the placement error for Section III-B across different methods, the (Half) la… view at source ↗
Figure 3
Figure 3. Figure 3: Extended performance comparison results among various methods discussed in Section III-A with a fixed learning rate view at source ↗
Figure 4
Figure 4. Figure 4: A combined illustration of different variants of banks of view at source ↗
Figure 5
Figure 5. Figure 5: Synthetic Real distributions used in Section III-A. (a) Extended experiments on another policy evaluated following the same procedure as described in Section III-B. With the performance be￾ing significantly different from the one shown in the paper, Sim_172 becomes the most well-performed approximated Kelly variants. (b) The real-world experiment setup of the pick-and-place manipulation accuracy testing di… view at source ↗
Figure 6
Figure 6. Figure 6: Extended and complementary results to Section III-B. view at source ↗
Figure 7
Figure 7. Figure 7: A win-rate comparison with SureSim [37] on the locomotion task in Section III-C is shown under the same fair view at source ↗
Figure 8
Figure 8. Figure 8: Extending the case studies of Fig. 2a to comparisons against the “optimal” IS settings. view at source ↗
read the original abstract

This paper studies the problem of robot performance evaluation, focusing on how to obtain accurate and efficient estimates of real-world behavior under severe constraints on physical experimentation. Such estimates are essential for benchmarking algorithms, comparing design alternatives, validating controllers, and supporting certification or regulatory decision-making, yet real-world testing with physical robots is often expensive, time-consuming, and safety-limited. To mitigate the scarcity of real-world trials, sim-to-real methodologies are commonly employed, using low-cost simulators to inform, supplement, or prioritize physical experiments. Departing from (and complementary to) existing approaches in variance reduction (e.g., importance-sampling variants) or bias-correction (e.g., through prediction-powered inference or learned control variates), we examine this performance-evaluation problem through the lens of betting. We establish theoretical conditions under which a betting mechanism can yield accurate and efficient estimates (provably outperforming the Monte Carlo estimator) and we characterize how such bets should be constructed. We further develop theoretically grounded yet practically implementable approximations of the ideal bet, and we provide concrete decision rules that diagnose when these approximate betting strategies are working as intended. We demonstrate the effectiveness of the proposed methods using both synthetic examples and cross-fidelity computational simulators. Notably, we also showcase an illustrative case in which a group of synthetic distributions are used to infer the real-world pick-and-place accuracy of a robotic manipulator, a seemingly unconventional sim-to-real transfer that becomes natural and feasible under the proposed betting perspective. Programs for reproducing empirical results are available at https://github.com/ISUSAIL/Bet4Sim2Real.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a betting mechanism for efficient estimation of real-world robot performance from limited physical trials and abundant simulator data. It derives theoretical conditions under which suitably constructed bets yield unbiased estimates that provably outperform standard Monte Carlo sampling in terms of variance, develops practical approximations to the ideal bet together with diagnostic rules for when the approximations succeed, and demonstrates the approach on synthetic distributions and cross-fidelity simulators, including an unconventional synthetic-to-real transfer for pick-and-place accuracy.

Significance. If the stated theoretical conditions and variance-reduction guarantees hold, the betting perspective supplies a principled, complementary tool to importance sampling and prediction-powered inference for sim-to-real benchmarking and certification tasks. The explicit provision of reproducible code is a clear strength that allows direct verification of the empirical claims.

major comments (2)
  1. [§3, Theorem 1] §3, Theorem 1: the claimed strict dominance over Monte Carlo is stated to hold under 'mild conditions on the simulator,' yet the precise measurability and integrability requirements that make the betting estimator unbiased and lower-variance are not spelled out; without them it is unclear whether the result applies to the discontinuous or heavy-tailed performance metrics typical in robotics.
  2. [§5.2, Eq. (18)–(20)] §5.2, Eq. (18)–(20): the practical approximation replaces the ideal bet with a learned surrogate; the paper does not quantify the bias introduced by this surrogate or provide a finite-sample bound showing that the diagnostic rule still controls type-I error when the surrogate error is non-negligible.
minor comments (3)
  1. [§2 and §4] Notation for the payoff function and the betting fraction is introduced in §2 but reused with different subscripts in §4; a single consolidated table of symbols would improve readability.
  2. [Figure 4] Figure 4 (pick-and-place results) lacks error bars on the real-world reference and does not state how many physical trials were used to obtain the ground-truth accuracy; this makes it hard to judge whether the betting estimator’s reported improvement is statistically meaningful.
  3. [Abstract] The abstract claims 'provably outperforming the Monte Carlo estimator,' yet the main text only shows dominance under the derived conditions; a brief caveat sentence in the abstract would align the two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for the constructive comments. We address each major comment below.

read point-by-point responses
  1. Referee: [§3, Theorem 1] §3, Theorem 1: the claimed strict dominance over Monte Carlo is stated to hold under 'mild conditions on the simulator,' yet the precise measurability and integrability requirements that make the betting estimator unbiased and lower-variance are not spelled out; without them it is unclear whether the result applies to the discontinuous or heavy-tailed performance metrics typical in robotics.

    Authors: We agree that the assumptions underlying Theorem 1 should be stated more explicitly. The result requires the performance metric to be a measurable function with finite first and second moments under the real-world distribution, together with integrability of the likelihood ratio induced by the simulator. These conditions ensure unbiasedness of the betting estimator and allow the variance comparison. In the revised manuscript we will add a dedicated remark immediately after the theorem statement that lists these requirements and discusses their implications for common robotics metrics: discontinuous indicators (e.g., success/failure) remain admissible provided the expectation exists, while heavy-tailed distributions preserve unbiasedness but may lose the strict variance reduction if the second moment is infinite. revision: yes

  2. Referee: [§5.2, Eq. (18)–(20)] §5.2, Eq. (18)–(20): the practical approximation replaces the ideal bet with a learned surrogate; the paper does not quantify the bias introduced by this surrogate or provide a finite-sample bound showing that the diagnostic rule still controls type-I error when the surrogate error is non-negligible.

    Authors: The surrogate is obtained by minimizing a convex loss that approximates the ideal betting function, and the diagnostic rule monitors whether the empirical average of the surrogate bet remains close to its theoretical expectation. While we do not supply a finite-sample bound on type-I error under surrogate approximation error, the rule is constructed to be conservative and our synthetic and robotic experiments indicate that it reliably detects large deviations. In the revision we will augment §5.2 with a short analysis of the approximation bias, including a simple concentration argument under bounded surrogate error, together with practical guidance on when additional validation trials should be performed. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives theoretical conditions under which a betting mechanism yields accurate estimates provably outperforming Monte Carlo, characterizes ideal bets, develops practical approximations, and supplies diagnostic rules. These steps are supported by independent synthetic examples, cross-fidelity simulators, and an unconventional sim-to-real pick-and-place case, with external reproducible code. No load-bearing step reduces by construction to a fitted input, self-definition, or unverified self-citation chain; the central claims rest on explicit theoretical derivations and empirical validation outside the fitted values themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on unspecified theoretical conditions for betting to outperform Monte Carlo; no explicit free parameters, ad-hoc axioms, or invented entities are described.

axioms (1)
  • standard math Standard probabilistic assumptions underlying Monte Carlo estimation and betting mechanisms
    The comparison to Monte Carlo and construction of bets implicitly rely on foundational probability theory.

pith-pipeline@v0.9.0 · 5582 in / 1174 out tokens · 79175 ms · 2026-05-08T03:02:59.691380+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

77 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment.Nature Communications, 12(1):1–14, 2021

    Shuo Feng, Xintao Yan, Haowei Sun, Yiheng Feng, and Henry X Liu. Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment.Nature Communications, 12(1):1–14, 2021

  2. [2]

    Sim2real predictivity: Does evaluation in simulation predict real- world performance?IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020

    Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Sim2real predictivity: Does evaluation in simulation predict real- world performance?IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020

  3. [3]

    As- sessing transferability from simulation to reality for rein- forcement learning.IEEE transactions on pattern anal- ysis and machine intelligence, 43(4):1172–1183, 2019

    Fabio Muratore, Michael Gienger, and Jan Peters. As- sessing transferability from simulation to reality for rein- forcement learning.IEEE transactions on pattern anal- ysis and machine intelligence, 43(4):1172–1183, 2019

  4. [4]

    Towards standardized disturbance rejection testing of legged robot locomotion with lin- ear impactor: A preliminary study, observations, and implications

    Bowen Weng, Guillermo A Castillo, Yun-Seok Kang, and Ayonga Hereid. Towards standardized disturbance rejection testing of legged robot locomotion with lin- ear impactor: A preliminary study, observations, and implications. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9946–9952. IEEE, 2024

  5. [5]

    Real-time sampling-based safe motion planning for robotic manipulators in dynamic environments,

    Bowen Weng, Linda Capito, Guillermo A. Castillo, and Dylan Khor. Rethink Repeatable Measures of Robot Performance with Statistical Query.IEEE Transactions on Robotics, 42:561–578, 2025. doi: 10.1109/TRO.2025. 3645934

  6. [6]

    Deep rein- forcement learning at the edge of the statistical precipice

    Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep rein- forcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34: 29304–29320, 2021

  7. [7]

    Benchmarking deep reinforcement learn- ing for continuous control

    Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learn- ing for continuous control. InInternational conference on machine learning, pages 1329–1338. PMLR, 2016

  8. [8]

    Robot learning as an empirical science: Best practices for policy evaluation, 2024

    Hadas Kress-Gazit, Kunimatsu Hashimoto, Naveen Kup- puswamy, Paarth Shah, Phoebe Horgan, Gordon Richard- son, Siyuan Feng, and Benjamin Burchfiel. Robot learn- ing as an empirical science: Best practices for policy evaluation.arXiv preprint arXiv:2409.09491, 2024

  9. [9]

    On the comparability and optimal aggressiveness of the adversarial scenario-based safety testing of robots.IEEE Transactions on Robotics, 39(4): 3299–3318, 2023

    Bowen Weng, Guillermo A Castillo, Wei Zhang, and Ayonga Hereid. On the comparability and optimal aggressiveness of the adversarial scenario-based safety testing of robots.IEEE Transactions on Robotics, 39(4): 3299–3318, 2023

  10. [10]

    Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?Transportation research part A: policy and practice, 94:182–193, 2016

    Nidhi Kalra and Susan M Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?Transportation research part A: policy and practice, 94:182–193, 2016

  11. [11]

    Dense reinforcement learning for safety validation of autonomous vehicles.Nature, 615(7953):620–627, 2023

    Shuo Feng, Haowei Sun, Xintao Yan, Haojie Zhu, Zhengxia Zou, Shengyin Shen, and Henry X Liu. Dense reinforcement learning for safety validation of autonomous vehicles.Nature, 615(7953):620–627, 2023

  12. [12]

    Performance evaluation of manipulators from a kinematic viewpoint.NBS Special Publication, 459:39–62, 1976

    Bernard Roth. Performance evaluation of manipulators from a kinematic viewpoint.NBS Special Publication, 459:39–62, 1976

  13. [13]

    How generalizable is my behavior cloning policy? a statistical approach to trustworthy performance evaluation.IEEE Robotics and Automation Letters, 2024

    Joseph A Vincent, Haruki Nishimura, Masha Itkina, Paarth Shah, Mac Schwager, and Thomas Kollar. How generalizable is my behavior cloning policy? a statistical approach to trustworthy performance evaluation.IEEE Robotics and Automation Letters, 2024

  14. [14]

    A hitchhiker’s guide to statistical comparisons of reinforcement learning algorithms.arXiv preprint arXiv:1904.06979, 2019

    C ´edric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. A hitchhiker’s guide to statistical comparisons of reinforcement learning algorithms.arXiv preprint arXiv:1904.06979, 2019

  15. [15]

    ANSI/RIA R15.05: Industrial Robots and Robot Systems – Performance Characteristics, 1992

    American National Standards Institute/Robotic Industries Association. ANSI/RIA R15.05: Industrial Robots and Robot Systems – Performance Characteristics, 1992

  16. [16]

    ISO 9283: Manipulating Industrial Robots – Performance Criteria and Related Test Methods, 1998

    International Organization for Standardization. ISO 9283: Manipulating Industrial Robots – Performance Criteria and Related Test Methods, 1998

  17. [17]

    van Ratingen

    Michiel R. van Ratingen. The Euro NCAP safety rating. In Alexander Piskun, editor,Karosseriebautage Hamburg 2017, pages 11–20, Wiesbaden, 2017. Springer Fachmedien Wiesbaden. ISBN 978-3-658-18107-9

  18. [18]

    MIT press Cam- bridge, 1998

    Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cam- bridge, 1998

  19. [19]

    Sim-to-real transfer of robotic control with dynamics randomization

    Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automa- tion (ICRA), pages 3803–3810. IEEE, 2018

  20. [20]

    Passivity-based full- body force control for humanoids and application to dynamic balancing and locomotion

    SangHo Hyon and Gordon Cheng. Passivity-based full- body force control for humanoids and application to dynamic balancing and locomotion. In2006 IEEE/RSJ International Conference on Intelligent Robots and Sys- tems, pages 4915–4922. IEEE, 2006

  21. [21]

    Cambridge university press, 2004

    Stephen Boyd and Lieven Vandenberghe.Convex opti- mization. Cambridge university press, 2004

  22. [22]

    The monte carlo method.Journal of the American statistical asso- ciation, 44(247):335–341, 1949

    Nicholas Metropolis and Stanislaw Ulam. The monte carlo method.Journal of the American statistical asso- ciation, 44(247):335–341, 1949

  23. [23]

    Equa- tion of state calculations by fast computing machines

    Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equa- tion of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953

  24. [24]

    Monte Carlo sampling methods using Markov chains and their applications.Biometrika, 57(1): 97–109, 1970

    W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.Biometrika, 57(1): 97–109, 1970

  25. [25]

    Monte carlo methods.Ltd., London, 40:32, 1964

    JM Hammersley and DC Handscomb. Monte carlo methods.Ltd., London, 40:32, 1964

  26. [26]

    Curse of rarity for autonomous vehicles.nature communications, 15(1): 4808, 2024

    Henry X Liu and Shuo Feng. Curse of rarity for autonomous vehicles.nature communications, 15(1): 4808, 2024

  27. [27]

    A study on challenges of testing robotic systems

    Afsoon Afzal, Claire Le Goues, Michael Hilton, and Christopher Steven Timperley. A study on challenges of testing robotic systems. In2020 IEEE 13th inter- national conference on software testing, validation and verification (ICST), pages 96–107. IEEE, 2020

  28. [28]

    Challenges in autonomous vehicle testing and validation.SAE Inter- national Journal of Transportation Safety, 4(1):15–24, 2016

    Philip Koopman and Michael Wagner. Challenges in autonomous vehicle testing and validation.SAE Inter- national Journal of Transportation Safety, 4(1):15–24, 2016

  29. [29]

    Rare-event simula- tion

    Søren Asmussen and Peter W Glynn. Rare-event simula- tion. InStochastic Simulation: Algorithms and Analysis, pages 158–205. Springer, 2007

  30. [30]

    Estimation of particle transmission by random sampling.National Bureau of Standards applied mathematics series, 12:27– 30, 1951

    Herman Kahn and Theodore E Harris. Estimation of particle transmission by random sampling.National Bureau of Standards applied mathematics series, 12:27– 30, 1951

  31. [31]

    Springer, 2007

    Søren Asmussen and Peter W Glynn.Stochastic sim- ulation: algorithms and analysis, volume 57. Springer, 2007

  32. [32]

    Scalable end-to- end autonomous vehicle testing via rare-event simulation

    Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, and John C Duchi. Scalable end-to- end autonomous vehicle testing via rare-event simulation. Advances in neural information processing systems, 31, 2018

  33. [33]

    The sample size required in importance sampling.The Annals of Applied Probability, 28(2):1099–1135, 2018

    Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling.The Annals of Applied Probability, 28(2):1099–1135, 2018

  34. [34]

    Adaptive stress testing for autonomous vehicles

    Mark Koren, Saud Alsaif, Ritchie Lee, and Mykel J Kochenderfer. Adaptive stress testing for autonomous vehicles. In2018 IEEE Intelligent Vehicles Symposium (IV), pages 1–7. IEEE, 2018

  35. [35]

    Closing the sim-to-real loop: Adapting simulation randomization with real world experience

    Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979. IEEE, 2019

  36. [36]

    Prediction- powered inference.Science, 382(6671):669–674, 2023

    Anastasios N Angelopoulos, Stephen Bates, Clara Fan- njiang, Michael I Jordan, and Tijana Zrnic. Prediction- powered inference.Science, 382(6671):669–674, 2023

  37. [37]

    Reliable and scalable robot policy eval- uation with imperfect simulators.arXiv preprint arXiv:2510.04354, 2025

    Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O’Kelly, Anushri Dixit, and Anirudha Majumdar. Reliable and scalable robot policy eval- uation with imperfect simulators.arXiv preprint arXiv:2510.04354, 2025

  38. [38]

    Black box variational inference

    Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. InArtificial intelligence and statistics, pages 814–822. PMLR, 2014

  39. [39]

    Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation

    Rachel Luo, Heng Yang, Michael Watson, Apoorva Sharma, Sushant Veer, Edward Schmerling, and Marco Pavone. Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation. arXiv preprint arXiv:2506.20553, 2025

  40. [40]

    Domain ran- domization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain ran- domization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ in- ternational conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

  41. [41]

    Sim-to-Real: Learning Agile Locomotion for Quadruped Robots

    Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-Real: Learning Agile Locomotion for Quadruped Robots. InRobotics: Science and Systems, 2018

  42. [42]

    Data-efficient domain randomization with bayesian optimization.IEEE Robotics and Automation Letters, 6(2):911–918, 2021

    Fabio Muratore, Christian Eilers, Michael Gienger, and Jan Peters. Data-efficient domain randomization with bayesian optimization.IEEE Robotics and Automation Letters, 6(2):911–918, 2021

  43. [43]

    Solving Rubik's Cube with a Robot Hand

    Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand.arXiv preprint arXiv:1910.07113, 2019

  44. [44]

    Col- lision avoidance and navigation for a quadrotor swarm using end-to-end deep reinforcement learning

    Zhehui Huang, Zhaojing Yang, Rahul Krupani, Baskın S ¸enbas ¸lar, Sumeet Batra, and Gaurav S Sukhatme. Col- lision avoidance and navigation for a quadrotor swarm using end-to-end deep reinforcement learning. In2024 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 300–306. IEEE, 2024

  45. [45]

    A survey on transfer learning, author=Pan, Sinno Jialin and Yang, Qiang.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009

  46. [46]

    Robust adversarial reinforcement learn- ing

    Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learn- ing. InInternational conference on machine learning, pages 2817–2826. PMLR, 2017

  47. [47]

    Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots.Journal of machine learning research, 20(49): 1–24, 2019

    Akshara Rai, Rika Antonova, Franziska Meier, and Christopher G Atkeson. Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots.Journal of machine learning research, 20(49): 1–24, 2019

  48. [48]

    Time-uniform Chernoff bounds via nonnegative supermartingales.Probability Surveys, 17: 257–317, 2020

    Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales.Probability Surveys, 17: 257–317, 2020

  49. [49]

    A new interpretation of information rate

    John L Kelly. A new interpretation of information rate. the bell system technical journal, 35(4):917–926, 1956

  50. [50]

    Portfolio choice and the Kelly crite- rion

    Edward O Thorp. Portfolio choice and the Kelly crite- rion. InStochastic optimization models in finance, pages 599–619. Elsevier, 1975

  51. [51]

    Understanding the Kelly criterion

    Edward O Thorp. Understanding the Kelly criterion. In The Kelly capital growth investment criterion: theory and practice, pages 509–523. World Scientific, 2011

  52. [52]

    The Kelly crite- rion and the stock market.The American Mathematical Monthly, 99(10):922–931, 1992

    Louis M Rotando and Edward O Thorp. The Kelly crite- rion and the stock market.The American Mathematical Monthly, 99(10):922–931, 1992

  53. [53]

    John Wiley & Sons, 1999

    Thomas M Cover.Elements of information theory. John Wiley & Sons, 1999

  54. [54]

    Growth versus security in dynamic investment analysis.Management Science, 38(11):1562–1585, 1992

    Leonard C MacLean, William T Ziemba, and George Blazenko. Growth versus security in dynamic investment analysis.Management Science, 38(11):1562–1585, 1992

  55. [55]

    Good and bad properties of the Kelly criterion.The Best of Wilmott, page 65, 2006

    Bill Ziemba. Good and bad properties of the Kelly criterion.The Best of Wilmott, page 65, 2006

  56. [56]

    Universal portfolios.Mathematical finance, 1(1):1–29, 1991

    Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991

  57. [57]

    Universal port- folios with side information.IEEE Transactions on Information Theory, 42(2):348–363, 2002

    Thomas M Cover and Erik Ordentlich. Universal port- folios with side information.IEEE Transactions on Information Theory, 42(2):348–363, 2002

  58. [58]

    Asymptotic optimality and asymptotic equipartition properties of log- optimum investment.The Annals of Probability, pages 876–898, 1988

    Paul H Algoet and Thomas M Cover. Asymptotic optimality and asymptotic equipartition properties of log- optimum investment.The Annals of Probability, pages 876–898, 1988

  59. [59]

    The weighted majority algorithm.Information and computation, 108 (2):212–261, 1994

    Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108 (2):212–261, 1994

  60. [60]

    Cambridge university press, 2006

    Nicolo Cesa-Bianchi and G ´abor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

  61. [61]

    A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55 (1):119–139, 1997

    Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55 (1):119–139, 1997

  62. [62]

    Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

    Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

  63. [63]

    A game of prediction with expert advice

    Vladimir G V ovk. A game of prediction with expert advice. InProceedings of the eighth annual conference on Computational learning theory, pages 51–60, 1995

  64. [64]

    Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

    Aaditya Ramdas, Peter Gr ¨unwald, Vladimir V ovk, and Glenn Shafer. Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

  65. [65]

    ISO 18646: Robots and Robotic Devices – Performance Cri- teria and Related Test Methods for Service Robots, 2016

    International Organization for Standardization. ISO 18646: Robots and Robotic Devices – Performance Cri- teria and Related Test Methods for Service Robots, 2016

  66. [66]

    SO-ARM100: Open-Source Robotic Arm Platform

    The Robot Studio. SO-ARM100: Open-Source Robotic Arm Platform. https://github.com/TheRobotStudio/ SO-ARM100, 2024. Accessed: 2025-01-XX

  67. [67]

    Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Ma- nipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.016

  68. [68]

    Unitree RL Gym

    Unitree Robotics. Unitree RL Gym. https://github.com/ unitreerobotics/unitree rl gym, 2024

  69. [69]

    arXiv preprint arXiv:2509.10771 , year=

    Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. RSL-RL: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

  70. [70]

    Betting for Sim-to-Real Performance Evaluation

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012. Supplementary Material:“Betting for Sim-to-Real Performance Evaluation” This document supplements the paper titled “Betting for Sim-to-Real Performance Evalua...

  71. [71]

    Nevertheless, this allocation reflects the most balanced and fair use of the available simulator budget for our comparison

    The sim-real pairwise testing required by SureSim may be limited by the relatively small number of samples available in our setting (30 samples), and the additional sim-only samples (20 samples) may also be insufficient to fully realize its potential (as mentioned above). Nevertheless, this allocation reflects the most balanced and fair use of the availab...

  72. [72]

    SureSim involves a larger number of hyperparameters that may require careful tuning; in our reproduction, we did not perform extensive hyperparameter optimization

  73. [73]

    From this perspective, direct point- estimate comparison may not fully reflect its intended use, though it remains the most practical basis for comparison in our setting

    A primary strength of SureSim (and PPI-based methods more broadly) lies not in producing the most accurate point estimate of the mean, but in providing confidence intervals with guaranteed coverage. From this perspective, direct point- estimate comparison may not fully reflect its intended use, though it remains the most practical basis for comparison in ...

  74. [74]

    SureSim is primarily designed around a single simulator and relies on correlation-based adjustments, whereas the proposed Kelly-style betting variants naturally accommodate and benefit from a diverse bank of simulators

  75. [75]

    theoretically

    The two approaches are not mutually exclusive. As discussed in the paper, PPI-style bias correction and betting-based variance reduction address complementary aspects of the sim-to-real inference problem and could potentially be combined in future work. B. Comparisons with IS The practical implementation of IS (importance sampling) is highly case-specific...

  76. [76]

    The zero-variance guarantee is asymptotic: while variance vanishes asn→ ∞, both bias and variance can remain non-negligible for practical budgets (heren≤300)

    The self-normalized IS estimator (3) is biased at finite sample sizes, even whenq=q ∗. The zero-variance guarantee is asymptotic: while variance vanishes asn→ ∞, both bias and variance can remain non-negligible for practical budgets (heren≤300)

  77. [77]

    No edge” simply means no useful predictive signal. •Wealth: “Wealth

    Unlike IS, which draws samples from a fixed proposal, Kelly betting is sequential and adaptive. This adaptivity allows it to incorporate early outcomes and progressively allocate weight toward uncertainty reduction. As discussed in the main paper, the proposed Kelly-style betting mechanism is not intended to replace IS or debiasing methods such as PPI. Ra...