arxiv: 2604.24018 · v1 · submitted 2026-04-27 · 💻 cs.RO

Recognition: unknown

Betting for Sim-to-Real Performance Evaluation

Zaid Mahboob , Yujia Chen , Bowen Weng

Authors on Pith no claims yet

Pith reviewed 2026-05-08 03:02 UTC · model grok-4.3

classification 💻 cs.RO

keywords sim-to-realperformance evaluationbettingMonte Carlo estimatorroboticsvariance reductionsimulatorpick-and-place

0 comments

The pith

A betting mechanism yields more accurate real-world robot performance estimates than Monte Carlo sampling by constructing simulator-guided bets under specific theoretical conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines robot performance evaluation when physical experiments are scarce and costly. It shows that a betting approach, which places informed wagers based on simulator data, can produce better estimates of real behavior than simply averaging random trials. The work derives conditions under which these bets are provably more efficient, develops practical approximations, and supplies rules to verify the bets are functioning as expected. This setup allows unconventional uses of multiple synthetic distributions to infer real pick-and-place accuracy for manipulators. The result matters because it offers a way to stretch limited real-world testing while maintaining statistical reliability for benchmarking and validation.

Core claim

The paper establishes theoretical conditions under which a betting mechanism can yield accurate and efficient estimates of real-world robot performance, provably outperforming the Monte Carlo estimator. It characterizes how such bets should be constructed from available simulators, develops theoretically grounded yet practically implementable approximations of the ideal bet, and provides concrete decision rules that diagnose when these approximate betting strategies are working as intended. The approach is demonstrated on synthetic examples, cross-fidelity computational simulators, and an illustrative case using synthetic distributions to infer real-world pick-and-place accuracy of a robotic

What carries the argument

the betting mechanism, which constructs simulator-derived bets to produce lower-variance estimates of real-world performance than direct Monte Carlo averaging of physical trials.

If this is right

When the stated conditions hold, betting reduces the number of physical trials needed for a target estimation accuracy compared with Monte Carlo.
Approximate bets remain effective even without exact knowledge of the underlying distributions, provided the diagnostic rules confirm reliability.
The same betting framework supports inference from groups of synthetic distributions to real manipulator accuracy without direct real-world sampling of that specific scenario.
Decision rules allow users to detect and avoid cases where the betting strategy fails to deliver its promised advantage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The betting perspective could be combined with existing variance-reduction techniques such as importance sampling to achieve further efficiency gains in sim-to-real settings.
If the sim-to-real gap violates the construction assumptions, the method would revert to no better than Monte Carlo, pointing to the value of adaptive bet updating during real tests.
The framework suggests a general template for other expensive evaluation domains where cheap simulators can be turned into informed bets rather than used only for pre-filtering.

Load-bearing premise

Theoretical conditions exist that let properly constructed bets from simulators outperform plain Monte Carlo sampling while the sim-to-real transfer assumptions remain valid.

What would settle it

A side-by-side experiment on a physical robot task with known ground-truth performance where the mean-squared error of the betting estimator exceeds the Monte Carlo estimator despite following the paper's construction and diagnostic rules.

Figures

Figures reproduced from arXiv: 2604.24018 by Bowen Weng, Yujia Chen, Zaid Mahboob.

**Figure 1.** Figure 1: A roadmap of Section II. Theoretical results (Theorems 1-3) establish when and how betting improves estimation; algorithms translate these insights into practice. Arrows show logical dependencies. drawing auxiliary information from the accumulated simulator samples, a bet bt is chosen as a fraction of the (future) payoff, reflecting the algorithm’s belief about the correctness of a forthcoming prediction (… view at source ↗

**Figure 2.** Figure 2: Experiment results for Section III: 2a: win-rate comparison of all methods against the Monte Carlo baseline across different Real synthetic distributions, learning rates, and numbers of rounds; 2b: average wealth across methods on the Real_6 distributions, providing empirical support for Theorem 3; 2c: Round-by-round estimates of the placement error for Section III-B across different methods, the (Half) la… view at source ↗

**Figure 3.** Figure 3: Extended performance comparison results among various methods discussed in Section III-A with a fixed learning rate view at source ↗

**Figure 4.** Figure 4: A combined illustration of different variants of banks of view at source ↗

**Figure 5.** Figure 5: Synthetic Real distributions used in Section III-A. (a) Extended experiments on another policy evaluated following the same procedure as described in Section III-B. With the performance being significantly different from the one shown in the paper, Sim_172 becomes the most well-performed approximated Kelly variants. (b) The real-world experiment setup of the pick-and-place manipulation accuracy testing di… view at source ↗

**Figure 6.** Figure 6: Extended and complementary results to Section III-B. view at source ↗

**Figure 7.** Figure 7: A win-rate comparison with SureSim [37] on the locomotion task in Section III-C is shown under the same fair view at source ↗

**Figure 8.** Figure 8: Extending the case studies of Fig. 2a to comparisons against the “optimal” IS settings. view at source ↗

read the original abstract

This paper studies the problem of robot performance evaluation, focusing on how to obtain accurate and efficient estimates of real-world behavior under severe constraints on physical experimentation. Such estimates are essential for benchmarking algorithms, comparing design alternatives, validating controllers, and supporting certification or regulatory decision-making, yet real-world testing with physical robots is often expensive, time-consuming, and safety-limited. To mitigate the scarcity of real-world trials, sim-to-real methodologies are commonly employed, using low-cost simulators to inform, supplement, or prioritize physical experiments. Departing from (and complementary to) existing approaches in variance reduction (e.g., importance-sampling variants) or bias-correction (e.g., through prediction-powered inference or learned control variates), we examine this performance-evaluation problem through the lens of betting. We establish theoretical conditions under which a betting mechanism can yield accurate and efficient estimates (provably outperforming the Monte Carlo estimator) and we characterize how such bets should be constructed. We further develop theoretically grounded yet practically implementable approximations of the ideal bet, and we provide concrete decision rules that diagnose when these approximate betting strategies are working as intended. We demonstrate the effectiveness of the proposed methods using both synthetic examples and cross-fidelity computational simulators. Notably, we also showcase an illustrative case in which a group of synthetic distributions are used to infer the real-world pick-and-place accuracy of a robotic manipulator, a seemingly unconventional sim-to-real transfer that becomes natural and feasible under the proposed betting perspective. Programs for reproducing empirical results are available at https://github.com/ISUSAIL/Bet4Sim2Real.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper recasts sim-to-real performance evaluation as a betting problem and derives conditions where it can beat Monte Carlo sampling, with code and some practical approximations provided.

read the letter

This paper takes the problem of estimating real robot performance when physical tests are scarce and looks at it through betting instead of the usual variance reduction or bias correction tricks. The core move is to set up theoretical conditions under which a betting mechanism gives accurate estimates and provably outperforms plain Monte Carlo, then to characterize how the bets should be built and to supply workable approximations plus rules for checking when they are functioning as intended. They back this with synthetic examples and a cross-fidelity demonstration that uses multiple synthetic distributions to infer pick-and-place accuracy on a real manipulator, which they present as a natural fit under the betting view. The GitHub code is a clear positive for anyone wanting to inspect the implementation. What the work does well is stay complementary to prior methods rather than claiming to replace them, and it tries to make the theory usable by adding diagnostics and approximations instead of leaving everything at the ideal level. The soft spots sit mainly in how restrictive the theoretical conditions end up being once you move beyond the synthetic cases. The abstract asserts outperformance, but the practical value hinges on whether the approximations preserve the edge and whether constructing the right bets is feasible with typical simulators; the pick-and-place example is interesting but still feels like a special case that would benefit from broader testing. The math appears to rest on independent grounding rather than circular steps. This is for robotics researchers who routinely face tight limits on real-world trials and are open to new statistical tools for evaluation and benchmarking. A reader already working on sim-to-real transfer or efficient estimation methods will get the most out of it. The combination of a fresh framing, stated theoretical conditions, and reproducible code is enough to justify sending it to a serious referee rather than desk-rejecting it.

Referee Report

2 major / 3 minor

Summary. The paper proposes a betting mechanism for efficient estimation of real-world robot performance from limited physical trials and abundant simulator data. It derives theoretical conditions under which suitably constructed bets yield unbiased estimates that provably outperform standard Monte Carlo sampling in terms of variance, develops practical approximations to the ideal bet together with diagnostic rules for when the approximations succeed, and demonstrates the approach on synthetic distributions and cross-fidelity simulators, including an unconventional synthetic-to-real transfer for pick-and-place accuracy.

Significance. If the stated theoretical conditions and variance-reduction guarantees hold, the betting perspective supplies a principled, complementary tool to importance sampling and prediction-powered inference for sim-to-real benchmarking and certification tasks. The explicit provision of reproducible code is a clear strength that allows direct verification of the empirical claims.

major comments (2)

[§3, Theorem 1] §3, Theorem 1: the claimed strict dominance over Monte Carlo is stated to hold under 'mild conditions on the simulator,' yet the precise measurability and integrability requirements that make the betting estimator unbiased and lower-variance are not spelled out; without them it is unclear whether the result applies to the discontinuous or heavy-tailed performance metrics typical in robotics.
[§5.2, Eq. (18)–(20)] §5.2, Eq. (18)–(20): the practical approximation replaces the ideal bet with a learned surrogate; the paper does not quantify the bias introduced by this surrogate or provide a finite-sample bound showing that the diagnostic rule still controls type-I error when the surrogate error is non-negligible.

minor comments (3)

[§2 and §4] Notation for the payoff function and the betting fraction is introduced in §2 but reused with different subscripts in §4; a single consolidated table of symbols would improve readability.
[Figure 4] Figure 4 (pick-and-place results) lacks error bars on the real-world reference and does not state how many physical trials were used to obtain the ground-truth accuracy; this makes it hard to judge whether the betting estimator’s reported improvement is statistically meaningful.
[Abstract] The abstract claims 'provably outperforming the Monte Carlo estimator,' yet the main text only shows dominance under the derived conditions; a brief caveat sentence in the abstract would align the two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and for the constructive comments. We address each major comment below.

read point-by-point responses

Referee: [§3, Theorem 1] §3, Theorem 1: the claimed strict dominance over Monte Carlo is stated to hold under 'mild conditions on the simulator,' yet the precise measurability and integrability requirements that make the betting estimator unbiased and lower-variance are not spelled out; without them it is unclear whether the result applies to the discontinuous or heavy-tailed performance metrics typical in robotics.

Authors: We agree that the assumptions underlying Theorem 1 should be stated more explicitly. The result requires the performance metric to be a measurable function with finite first and second moments under the real-world distribution, together with integrability of the likelihood ratio induced by the simulator. These conditions ensure unbiasedness of the betting estimator and allow the variance comparison. In the revised manuscript we will add a dedicated remark immediately after the theorem statement that lists these requirements and discusses their implications for common robotics metrics: discontinuous indicators (e.g., success/failure) remain admissible provided the expectation exists, while heavy-tailed distributions preserve unbiasedness but may lose the strict variance reduction if the second moment is infinite. revision: yes
Referee: [§5.2, Eq. (18)–(20)] §5.2, Eq. (18)–(20): the practical approximation replaces the ideal bet with a learned surrogate; the paper does not quantify the bias introduced by this surrogate or provide a finite-sample bound showing that the diagnostic rule still controls type-I error when the surrogate error is non-negligible.

Authors: The surrogate is obtained by minimizing a convex loss that approximates the ideal betting function, and the diagnostic rule monitors whether the empirical average of the surrogate bet remains close to its theoretical expectation. While we do not supply a finite-sample bound on type-I error under surrogate approximation error, the rule is constructed to be conservative and our synthetic and robotic experiments indicate that it reliably detects large deviations. In the revision we will augment §5.2 with a short analysis of the approximation bias, including a simple concentration argument under bounded surrogate error, together with practical guidance on when additional validation trials should be performed. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper derives theoretical conditions under which a betting mechanism yields accurate estimates provably outperforming Monte Carlo, characterizes ideal bets, develops practical approximations, and supplies diagnostic rules. These steps are supported by independent synthetic examples, cross-fidelity simulators, and an unconventional sim-to-real pick-and-place case, with external reproducible code. No load-bearing step reduces by construction to a fitted input, self-definition, or unverified self-citation chain; the central claims rest on explicit theoretical derivations and empirical validation outside the fitted values themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on unspecified theoretical conditions for betting to outperform Monte Carlo; no explicit free parameters, ad-hoc axioms, or invented entities are described.

axioms (1)

standard math Standard probabilistic assumptions underlying Monte Carlo estimation and betting mechanisms
The comparison to Monte Carlo and construction of bets implicitly rely on foundational probability theory.

pith-pipeline@v0.9.0 · 5582 in / 1174 out tokens · 79175 ms · 2026-05-08T03:02:59.691380+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

77 extracted references · 8 canonical work pages · 1 internal anchor

[1]

Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment.Nature Communications, 12(1):1–14, 2021

Shuo Feng, Xintao Yan, Haowei Sun, Yiheng Feng, and Henry X Liu. Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment.Nature Communications, 12(1):1–14, 2021

2021
[2]

Sim2real predictivity: Does evaluation in simulation predict real- world performance?IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020

Abhishek Kadian, Joanne Truong, Aaron Gokaslan, Alexander Clegg, Erik Wijmans, Stefan Lee, Manolis Savva, Sonia Chernova, and Dhruv Batra. Sim2real predictivity: Does evaluation in simulation predict real- world performance?IEEE Robotics and Automation Letters, 5(4):6670–6677, 2020

2020
[3]

As- sessing transferability from simulation to reality for rein- forcement learning.IEEE transactions on pattern anal- ysis and machine intelligence, 43(4):1172–1183, 2019

Fabio Muratore, Michael Gienger, and Jan Peters. As- sessing transferability from simulation to reality for rein- forcement learning.IEEE transactions on pattern anal- ysis and machine intelligence, 43(4):1172–1183, 2019

2019
[4]

Towards standardized disturbance rejection testing of legged robot locomotion with lin- ear impactor: A preliminary study, observations, and implications

Bowen Weng, Guillermo A Castillo, Yun-Seok Kang, and Ayonga Hereid. Towards standardized disturbance rejection testing of legged robot locomotion with lin- ear impactor: A preliminary study, observations, and implications. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 9946–9952. IEEE, 2024

2024
[5]

Real-time sampling-based safe motion planning for robotic manipulators in dynamic environments,

Bowen Weng, Linda Capito, Guillermo A. Castillo, and Dylan Khor. Rethink Repeatable Measures of Robot Performance with Statistical Query.IEEE Transactions on Robotics, 42:561–578, 2025. doi: 10.1109/TRO.2025. 3645934

work page doi:10.1109/tro.2025 2025
[6]

Deep rein- forcement learning at the edge of the statistical precipice

Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare. Deep rein- forcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34: 29304–29320, 2021

2021
[7]

Benchmarking deep reinforcement learn- ing for continuous control

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learn- ing for continuous control. InInternational conference on machine learning, pages 1329–1338. PMLR, 2016

2016
[8]

Robot learning as an empirical science: Best practices for policy evaluation, 2024

Hadas Kress-Gazit, Kunimatsu Hashimoto, Naveen Kup- puswamy, Paarth Shah, Phoebe Horgan, Gordon Richard- son, Siyuan Feng, and Benjamin Burchfiel. Robot learn- ing as an empirical science: Best practices for policy evaluation.arXiv preprint arXiv:2409.09491, 2024

work page arXiv 2024
[9]

On the comparability and optimal aggressiveness of the adversarial scenario-based safety testing of robots.IEEE Transactions on Robotics, 39(4): 3299–3318, 2023

Bowen Weng, Guillermo A Castillo, Wei Zhang, and Ayonga Hereid. On the comparability and optimal aggressiveness of the adversarial scenario-based safety testing of robots.IEEE Transactions on Robotics, 39(4): 3299–3318, 2023

2023
[10]

Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?Transportation research part A: policy and practice, 94:182–193, 2016

Nidhi Kalra and Susan M Paddock. Driving to safety: How many miles of driving would it take to demonstrate autonomous vehicle reliability?Transportation research part A: policy and practice, 94:182–193, 2016

2016
[11]

Dense reinforcement learning for safety validation of autonomous vehicles.Nature, 615(7953):620–627, 2023

Shuo Feng, Haowei Sun, Xintao Yan, Haojie Zhu, Zhengxia Zou, Shengyin Shen, and Henry X Liu. Dense reinforcement learning for safety validation of autonomous vehicles.Nature, 615(7953):620–627, 2023

2023
[12]

Performance evaluation of manipulators from a kinematic viewpoint.NBS Special Publication, 459:39–62, 1976

Bernard Roth. Performance evaluation of manipulators from a kinematic viewpoint.NBS Special Publication, 459:39–62, 1976

1976
[13]

How generalizable is my behavior cloning policy? a statistical approach to trustworthy performance evaluation.IEEE Robotics and Automation Letters, 2024

Joseph A Vincent, Haruki Nishimura, Masha Itkina, Paarth Shah, Mac Schwager, and Thomas Kollar. How generalizable is my behavior cloning policy? a statistical approach to trustworthy performance evaluation.IEEE Robotics and Automation Letters, 2024

2024
[14]

A hitchhiker’s guide to statistical comparisons of reinforcement learning algorithms.arXiv preprint arXiv:1904.06979, 2019

C ´edric Colas, Olivier Sigaud, and Pierre-Yves Oudeyer. A hitchhiker’s guide to statistical comparisons of reinforcement learning algorithms.arXiv preprint arXiv:1904.06979, 2019

work page arXiv 1904
[15]

ANSI/RIA R15.05: Industrial Robots and Robot Systems – Performance Characteristics, 1992

American National Standards Institute/Robotic Industries Association. ANSI/RIA R15.05: Industrial Robots and Robot Systems – Performance Characteristics, 1992

1992
[16]

ISO 9283: Manipulating Industrial Robots – Performance Criteria and Related Test Methods, 1998

International Organization for Standardization. ISO 9283: Manipulating Industrial Robots – Performance Criteria and Related Test Methods, 1998

1998
[17]

van Ratingen

Michiel R. van Ratingen. The Euro NCAP safety rating. In Alexander Piskun, editor,Karosseriebautage Hamburg 2017, pages 11–20, Wiesbaden, 2017. Springer Fachmedien Wiesbaden. ISBN 978-3-658-18107-9

2017
[18]

MIT press Cam- bridge, 1998

Richard S Sutton, Andrew G Barto, et al.Reinforcement learning: An introduction, volume 1. MIT press Cam- bridge, 1998

1998
[19]

Sim-to-real transfer of robotic control with dynamics randomization

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In2018 IEEE international conference on robotics and automa- tion (ICRA), pages 3803–3810. IEEE, 2018

2018
[20]

Passivity-based full- body force control for humanoids and application to dynamic balancing and locomotion

SangHo Hyon and Gordon Cheng. Passivity-based full- body force control for humanoids and application to dynamic balancing and locomotion. In2006 IEEE/RSJ International Conference on Intelligent Robots and Sys- tems, pages 4915–4922. IEEE, 2006

2006
[21]

Cambridge university press, 2004

Stephen Boyd and Lieven Vandenberghe.Convex opti- mization. Cambridge university press, 2004

2004
[22]

The monte carlo method.Journal of the American statistical asso- ciation, 44(247):335–341, 1949

Nicholas Metropolis and Stanislaw Ulam. The monte carlo method.Journal of the American statistical asso- ciation, 44(247):335–341, 1949

1949
[23]

Equa- tion of state calculations by fast computing machines

Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equa- tion of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953

1953
[24]

Monte Carlo sampling methods using Markov chains and their applications.Biometrika, 57(1): 97–109, 1970

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications.Biometrika, 57(1): 97–109, 1970

1970
[25]

Monte carlo methods.Ltd., London, 40:32, 1964

JM Hammersley and DC Handscomb. Monte carlo methods.Ltd., London, 40:32, 1964

1964
[26]

Curse of rarity for autonomous vehicles.nature communications, 15(1): 4808, 2024

Henry X Liu and Shuo Feng. Curse of rarity for autonomous vehicles.nature communications, 15(1): 4808, 2024

2024
[27]

A study on challenges of testing robotic systems

Afsoon Afzal, Claire Le Goues, Michael Hilton, and Christopher Steven Timperley. A study on challenges of testing robotic systems. In2020 IEEE 13th inter- national conference on software testing, validation and verification (ICST), pages 96–107. IEEE, 2020

2020
[28]

Challenges in autonomous vehicle testing and validation.SAE Inter- national Journal of Transportation Safety, 4(1):15–24, 2016

Philip Koopman and Michael Wagner. Challenges in autonomous vehicle testing and validation.SAE Inter- national Journal of Transportation Safety, 4(1):15–24, 2016

2016
[29]

Rare-event simula- tion

Søren Asmussen and Peter W Glynn. Rare-event simula- tion. InStochastic Simulation: Algorithms and Analysis, pages 158–205. Springer, 2007

2007
[30]

Estimation of particle transmission by random sampling.National Bureau of Standards applied mathematics series, 12:27– 30, 1951

Herman Kahn and Theodore E Harris. Estimation of particle transmission by random sampling.National Bureau of Standards applied mathematics series, 12:27– 30, 1951

1951
[31]

Springer, 2007

Søren Asmussen and Peter W Glynn.Stochastic sim- ulation: algorithms and analysis, volume 57. Springer, 2007

2007
[32]

Scalable end-to- end autonomous vehicle testing via rare-event simulation

Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, and John C Duchi. Scalable end-to- end autonomous vehicle testing via rare-event simulation. Advances in neural information processing systems, 31, 2018

2018
[33]

The sample size required in importance sampling.The Annals of Applied Probability, 28(2):1099–1135, 2018

Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling.The Annals of Applied Probability, 28(2):1099–1135, 2018

2018
[34]

Adaptive stress testing for autonomous vehicles

Mark Koren, Saud Alsaif, Ritchie Lee, and Mykel J Kochenderfer. Adaptive stress testing for autonomous vehicles. In2018 IEEE Intelligent Vehicles Symposium (IV), pages 1–7. IEEE, 2018

2018
[35]

Closing the sim-to-real loop: Adapting simulation randomization with real world experience

Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In2019 International Conference on Robotics and Automation (ICRA), pages 8973–8979. IEEE, 2019

2019
[36]

Prediction- powered inference.Science, 382(6671):669–674, 2023

Anastasios N Angelopoulos, Stephen Bates, Clara Fan- njiang, Michael I Jordan, and Tijana Zrnic. Prediction- powered inference.Science, 382(6671):669–674, 2023

2023
[37]

Reliable and scalable robot policy eval- uation with imperfect simulators.arXiv preprint arXiv:2510.04354, 2025

Apurva Badithela, David Snyder, Lihan Zha, Joseph Mikhail, Matthew O’Kelly, Anushri Dixit, and Anirudha Majumdar. Reliable and scalable robot policy eval- uation with imperfect simulators.arXiv preprint arXiv:2510.04354, 2025

work page arXiv 2025
[38]

Black box variational inference

Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. InArtificial intelligence and statistics, pages 814–822. PMLR, 2014

2014
[39]

Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation

Rachel Luo, Heng Yang, Michael Watson, Apoorva Sharma, Sushant Veer, Edward Schmerling, and Marco Pavone. Sim2Val: Leveraging Correlation Across Test Platforms for Variance-Reduced Metric Estimation. arXiv preprint arXiv:2506.20553, 2025

work page arXiv 2025
[40]

Domain ran- domization for transferring deep neural networks from simulation to the real world

Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain ran- domization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ in- ternational conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

2017
[41]

Sim-to-Real: Learning Agile Locomotion for Quadruped Robots

Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-Real: Learning Agile Locomotion for Quadruped Robots. InRobotics: Science and Systems, 2018

2018
[42]

Data-efficient domain randomization with bayesian optimization.IEEE Robotics and Automation Letters, 6(2):911–918, 2021

Fabio Muratore, Christian Eilers, Michael Gienger, and Jan Peters. Data-efficient domain randomization with bayesian optimization.IEEE Robotics and Automation Letters, 6(2):911–918, 2021

2021
[43]

Solving Rubik's Cube with a Robot Hand

Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand.arXiv preprint arXiv:1910.07113, 2019

work page internal anchor Pith review arXiv 1910
[44]

Col- lision avoidance and navigation for a quadrotor swarm using end-to-end deep reinforcement learning

Zhehui Huang, Zhaojing Yang, Rahul Krupani, Baskın S ¸enbas ¸lar, Sumeet Batra, and Gaurav S Sukhatme. Col- lision avoidance and navigation for a quadrotor swarm using end-to-end deep reinforcement learning. In2024 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 300–306. IEEE, 2024

2024
[45]

A survey on transfer learning, author=Pan, Sinno Jialin and Yang, Qiang.IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009

2009
[46]

Robust adversarial reinforcement learn- ing

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learn- ing. InInternational conference on machine learning, pages 2817–2826. PMLR, 2017

2017
[47]

Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots.Journal of machine learning research, 20(49): 1–24, 2019

Akshara Rai, Rika Antonova, Franziska Meier, and Christopher G Atkeson. Using simulation to improve sample-efficiency of Bayesian optimization for bipedal robots.Journal of machine learning research, 20(49): 1–24, 2019

2019
[48]

Time-uniform Chernoff bounds via nonnegative supermartingales.Probability Surveys, 17: 257–317, 2020

Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Time-uniform Chernoff bounds via nonnegative supermartingales.Probability Surveys, 17: 257–317, 2020

2020
[49]

A new interpretation of information rate

John L Kelly. A new interpretation of information rate. the bell system technical journal, 35(4):917–926, 1956

1956
[50]

Portfolio choice and the Kelly crite- rion

Edward O Thorp. Portfolio choice and the Kelly crite- rion. InStochastic optimization models in finance, pages 599–619. Elsevier, 1975

1975
[51]

Understanding the Kelly criterion

Edward O Thorp. Understanding the Kelly criterion. In The Kelly capital growth investment criterion: theory and practice, pages 509–523. World Scientific, 2011

2011
[52]

The Kelly crite- rion and the stock market.The American Mathematical Monthly, 99(10):922–931, 1992

Louis M Rotando and Edward O Thorp. The Kelly crite- rion and the stock market.The American Mathematical Monthly, 99(10):922–931, 1992

1992
[53]

John Wiley & Sons, 1999

Thomas M Cover.Elements of information theory. John Wiley & Sons, 1999

1999
[54]

Growth versus security in dynamic investment analysis.Management Science, 38(11):1562–1585, 1992

Leonard C MacLean, William T Ziemba, and George Blazenko. Growth versus security in dynamic investment analysis.Management Science, 38(11):1562–1585, 1992

1992
[55]

Good and bad properties of the Kelly criterion.The Best of Wilmott, page 65, 2006

Bill Ziemba. Good and bad properties of the Kelly criterion.The Best of Wilmott, page 65, 2006

2006
[56]

Universal portfolios.Mathematical finance, 1(1):1–29, 1991

Thomas M Cover. Universal portfolios.Mathematical finance, 1(1):1–29, 1991

1991
[57]

Universal port- folios with side information.IEEE Transactions on Information Theory, 42(2):348–363, 2002

Thomas M Cover and Erik Ordentlich. Universal port- folios with side information.IEEE Transactions on Information Theory, 42(2):348–363, 2002

2002
[58]

Asymptotic optimality and asymptotic equipartition properties of log- optimum investment.The Annals of Probability, pages 876–898, 1988

Paul H Algoet and Thomas M Cover. Asymptotic optimality and asymptotic equipartition properties of log- optimum investment.The Annals of Probability, pages 876–898, 1988

1988
[59]

The weighted majority algorithm.Information and computation, 108 (2):212–261, 1994

Nick Littlestone and Manfred K Warmuth. The weighted majority algorithm.Information and computation, 108 (2):212–261, 1994

1994
[60]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and G ´abor Lugosi.Prediction, learning, and games. Cambridge university press, 2006

2006
[61]

A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55 (1):119–139, 1997

Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.Journal of computer and system sciences, 55 (1):119–139, 1997

1997
[62]

Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation.Journal of the American Statistical Association, 102(477):359–378, 2007

2007
[63]

A game of prediction with expert advice

Vladimir G V ovk. A game of prediction with expert advice. InProceedings of the eighth annual conference on Computational learning theory, pages 51–60, 1995

1995
[64]

Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

Aaditya Ramdas, Peter Gr ¨unwald, Vladimir V ovk, and Glenn Shafer. Game-theoretic statistics and safe anytime- valid inference.Statistical Science, 38(4):576–601, 2023

2023
[65]

ISO 18646: Robots and Robotic Devices – Performance Cri- teria and Related Test Methods for Service Robots, 2016

International Organization for Standardization. ISO 18646: Robots and Robotic Devices – Performance Cri- teria and Related Test Methods for Service Robots, 2016

2016
[66]

SO-ARM100: Open-Source Robotic Arm Platform

The Robot Studio. SO-ARM100: Open-Source Robotic Arm Platform. https://github.com/TheRobotStudio/ SO-ARM100, 2024. Accessed: 2025-01-XX

2024
[67]

Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning Fine-Grained Bimanual Ma- nipulation with Low-Cost Hardware. InProceedings of Robotics: Science and Systems, Daegu, Republic of Korea, July 2023. doi: 10.15607/RSS.2023.XIX.016

work page doi:10.15607/rss.2023.xix.016 2023
[68]

Unitree RL Gym

Unitree Robotics. Unitree RL Gym. https://github.com/ unitreerobotics/unitree rl gym, 2024

2024
[69]

arXiv preprint arXiv:2509.10771 , year=

Clemens Schwarke, Mayank Mittal, Nikita Rudin, David Hoeller, and Marco Hutter. RSL-RL: A learning library for robotics research.arXiv preprint arXiv:2509.10771, 2025

work page arXiv 2025
[70]

Betting for Sim-to-Real Performance Evaluation

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012. Supplementary Material:“Betting for Sim-to-Real Performance Evaluation” This document supplements the paper titled “Betting for Sim-to-Real Performance Evalua...

2012
[71]

Nevertheless, this allocation reflects the most balanced and fair use of the available simulator budget for our comparison

The sim-real pairwise testing required by SureSim may be limited by the relatively small number of samples available in our setting (30 samples), and the additional sim-only samples (20 samples) may also be insufficient to fully realize its potential (as mentioned above). Nevertheless, this allocation reflects the most balanced and fair use of the availab...
[72]

SureSim involves a larger number of hyperparameters that may require careful tuning; in our reproduction, we did not perform extensive hyperparameter optimization
[73]

From this perspective, direct point- estimate comparison may not fully reflect its intended use, though it remains the most practical basis for comparison in our setting

A primary strength of SureSim (and PPI-based methods more broadly) lies not in producing the most accurate point estimate of the mean, but in providing confidence intervals with guaranteed coverage. From this perspective, direct point- estimate comparison may not fully reflect its intended use, though it remains the most practical basis for comparison in ...
[74]

SureSim is primarily designed around a single simulator and relies on correlation-based adjustments, whereas the proposed Kelly-style betting variants naturally accommodate and benefit from a diverse bank of simulators
[75]

theoretically

The two approaches are not mutually exclusive. As discussed in the paper, PPI-style bias correction and betting-based variance reduction address complementary aspects of the sim-to-real inference problem and could potentially be combined in future work. B. Comparisons with IS The practical implementation of IS (importance sampling) is highly case-specific...
[76]

The zero-variance guarantee is asymptotic: while variance vanishes asn→ ∞, both bias and variance can remain non-negligible for practical budgets (heren≤300)

The self-normalized IS estimator (3) is biased at finite sample sizes, even whenq=q ∗. The zero-variance guarantee is asymptotic: while variance vanishes asn→ ∞, both bias and variance can remain non-negligible for practical budgets (heren≤300)
[77]

No edge” simply means no useful predictive signal. •Wealth: “Wealth

Unlike IS, which draws samples from a fixed proposal, Kelly betting is sequential and adaptive. This adaptivity allows it to incorporate early outcomes and progressively allocate weight toward uncertainty reduction. As discussed in the main paper, the proposed Kelly-style betting mechanism is not intended to replace IS or debiasing methods such as PPI. Ra...