
arxiv: 2606.05159 · v1 · pith:UOZ4XMVZnew · submitted 2026-06-03 · 💻 cs.RO

X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation

Pith reviewed 2026-06-28 05:43 UTC · model grok-4.3

classification 💻 cs.RO
keywords variance reduction · control variates · policy evaluation · robotic systems · neural surrogates · multi-domain data · autonomous driving · robot manipulation

The pith

X4Val learns a neural predictor from auxiliary data to cut variance in real-world robotic policy evaluation without paired samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces X4Val to make rigorous evaluation of robotic policies more efficient when real-world test data is scarce and expensive. It embeds samples from real and auxiliary domains such as simulation or historical logs into a shared space, then trains a transferable predictor of real metrics. This predictor feeds into a control-variates estimator that subtracts predictable components from the evaluation, lowering variance even without matched pairs across domains. A sympathetic reader cares because abundant mismatched data could then support high-confidence estimates, reducing the real-world samples needed for deployment validation. If correct, the approach directly improves sample efficiency in iterative policy development.

Core claim

X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. The framework supplies theoretical analysis and achieves up to 38.4 percent variance reduction with consistent gains over baselines on autonomous driving and real-world robot manipulation tasks.
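As a rough illustration of the control-variates step described above, here is a minimal sketch on synthetic data. It is a generic textbook estimator, not the paper's implementation: the function name `cv_estimate`, the centering value `p_mean`, and the plug-in coefficient are assumptions made for this example, and the surrogate outputs are simulated rather than produced by a learned network.

```python
import numpy as np

def cv_estimate(m, p, p_mean, beta=None):
    """Control-variates estimate of E[M] from real samples.

    m      : real-world metric values M_i
    p      : surrogate predictions P_i on the same real samples
    p_mean : value used to center the surrogate (assumed known here)
    beta   : coefficient; if None, the plug-in Cov(M, P) / Var(P) is used
    """
    m = np.asarray(m, dtype=float)
    p = np.asarray(p, dtype=float)
    if beta is None:
        var_p = p.var(ddof=1)
        beta = np.cov(m, p)[0, 1] / var_p if var_p > 0 else 0.0
    return m.mean() - beta * (p.mean() - p_mean), beta

# Synthetic illustration: the surrogate tracks a latent factor x that drives M.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
m = x + 0.5 * rng.normal(size=50)   # real metric, true mean 0
p = 0.9 * x                         # surrogate output, true mean 0
estimate, beta = cv_estimate(m, p, p_mean=0.0)
print(f"plain mean = {m.mean():.3f}, control-variates = {estimate:.3f}, beta = {beta:.3f}")
```

The correction term subtracts the part of the sampling noise that the surrogate can explain, so the closer the surrogate tracks the real metric, the more the estimate tightens around the true mean.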

What carries the argument

The neural surrogate predictor trained in a shared embedding space and inserted into the control-variates estimator.
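The page gives no architecture for the predictor, so the following is only a toy stand-in for the idea of training on abundant auxiliary embeddings and applying the result to a few real ones. It assumes the embeddings are already computed as fixed-length vectors and swaps the neural network for ridge regression; `fit_ridge`, `predict`, and all data are hypothetical.

```python
import numpy as np

def fit_ridge(z, y, lam=1.0):
    """Fit y ~ z @ w + b by ridge regression (the bias term is not penalized)."""
    z1 = np.hstack([z, np.ones((z.shape[0], 1))])
    penalty = lam * np.eye(z1.shape[1])
    penalty[-1, -1] = 0.0
    return np.linalg.solve(z1.T @ z1 + penalty, z1.T @ y)

def predict(w, z):
    """Apply the fitted weights (last entry is the bias) to embeddings z."""
    return np.hstack([z, np.ones((z.shape[0], 1))]) @ w

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
z_aux = rng.normal(size=(500, 3))                     # abundant auxiliary embeddings
y_aux = z_aux @ w_true + 0.1 * rng.normal(size=500)   # auxiliary metric labels
z_real = rng.normal(loc=0.3, size=(20, 3))            # few real embeddings, shifted inputs
w = fit_ridge(z_aux, y_aux)
p_real = predict(w, z_real)   # surrogate outputs that could serve as a control variate
```

In this toy setup the real inputs are shifted but the embedding-to-metric relationship is shared across domains, which is the favorable case for transfer; nothing here tests what happens when that relationship differs.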

If this is right

  • Non-paired heterogeneous data sources become usable for high-confidence real-world metric estimation.
  • Variance reduction reaches up to 38.4 percent on autonomous driving and robot manipulation tasks.
  • Empirical results show consistent improvements over strong baselines.
  • Theoretical analysis backs the variance reduction property of the estimator.
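To put the headline number in sample-size terms, here is a back-of-the-envelope calculation. It assumes, as a simplification of this sketch rather than a statement from the paper, that the 38.4 percent figure is a reduction in estimator variance at a fixed number of real samples $n$, that estimator variance scales as $1/n$, and that confidence-interval width is proportional to the estimator's standard deviation.

```latex
\begin{align*}
\frac{\operatorname{Var}(\hat\mu_{\mathrm{CV}})}{\operatorname{Var}(\hat\mu_{\mathrm{MC}})}
  &= 1 - 0.384 = 0.616 \\
\text{equivalent Monte Carlo sample size}
  &= \frac{n}{0.616} \approx 1.62\,n \\
\text{confidence-interval width ratio}
  &= \sqrt{0.616} \approx 0.785
\end{align*}
```

Under those assumptions, the best reported case is worth roughly 62 percent more real-world samples, or confidence intervals about 21 percent narrower.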

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The shared embedding might support evaluation of policies in entirely new environments if the representation proves policy-invariant.
  • The method could shorten iteration cycles in robotics by lowering the real-world data volume required per policy update.
  • Similar surrogate-augmented control-variates setups might apply to other domains that rely on abundant but non-representative auxiliary data, such as simulation-heavy engineering validation.

Load-bearing premise

The predictor learned from auxiliary domains must remain accurate and unbiased enough when plugged into the control-variates estimator for the target real metric.

What would settle it

Collect a large held-out set of real-world samples, compute the estimator variance both with and without the learned predictor, and check whether the reduction matches or exceeds the reported levels or whether predictor bias nullifies the gain.
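The check described above can be sketched as a resampling experiment. This is a hypothetical harness on synthetic data, not the paper's protocol: `compare_variance` is an illustrative name, the held-out mean stands in for the true metric, and a single coefficient is computed on the full held-out set for simplicity.

```python
import numpy as np

def compare_variance(m, p, p_mean, n_sub=30, n_trials=2000, seed=0):
    """Resample small evaluation sets from a large held-out real set and
    compare plain Monte Carlo with a control-variates estimate."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)
    p = np.asarray(p, dtype=float)
    # Simplification: one fixed coefficient computed on the full held-out set.
    var_p = p.var(ddof=1)
    beta = np.cov(m, p)[0, 1] / var_p if var_p > 0 else 0.0
    mc = np.empty(n_trials)
    cv = np.empty(n_trials)
    for t in range(n_trials):
        idx = rng.choice(m.size, size=n_sub, replace=False)
        mc[t] = m[idx].mean()
        cv[t] = mc[t] - beta * (p[idx].mean() - p_mean)
    truth = m.mean()  # the held-out mean stands in for the true metric
    return {
        "variance_reduction": 1.0 - cv.var(ddof=1) / mc.var(ddof=1),
        "mc_bias": mc.mean() - truth,
        "cv_bias": cv.mean() - truth,
    }

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
m_heldout = x + 0.8 * rng.normal(size=5000)
p_heldout = x + 0.2                      # surrogate with a constant offset
honest = compare_variance(m_heldout, p_heldout, p_mean=p_heldout.mean())
naive = compare_variance(m_heldout, p_heldout, p_mean=0.0)  # wrong centering
print(honest)
print(naive)
```

Both runs report the same variance reduction, because a constant centering error shifts every estimate by the same amount; only the bias column reveals the problem. That is why the test has to look at bias and variance together.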

Figures

Figures reproduced from arXiv: 2606.05159 by Apoorva Sharma, Boris Ivanovic, Edward Schmerling, Han Qi, Heng Yang, Marco Pavone, Michael Watson, Rachel Luo, Sushant Veer.

Figure 1: Comparison of X4Val with standard control variates-based estimation. Standard approaches (bottom) can only use real world data with its log-replay simulations enforcing a strict pairing between the real and simulated data and limiting the amount of data that can be used. X4Val (top) on the other hand, can use a diverse range of data sources by projecting them to a shared embedding space and learning a tr…

Figure 2: AV deployment to a new region. With limited target-domain evaluation data from a new region (Germany), leveraging auxiliary data from past evaluations in another region (United States) can reduce variance of performance estimation in the target region. X4Val most efficiently combines data from tests in Germany with auxiliary data to yield consistent variance reduction relative to baselines. Error bars sh…

Figure 3: Driving examples for the US and Germany geographical regions. Differences include lane-marker types, signage, architecture, typical road features/junctions, etc.

Figure 4: Iterative AV policy development. When validating a newly trained policy with limited evaluation data, historical evaluation data from earlier policy versions can serve as auxiliary information to reduce uncertainty in estimating current performance. X4Val most efficiently combines the limited current-policy evaluation data with historical data from earlier policies, achieving the largest variance reductio…

Figure 5: X4Val for policy evaluation in a block-stacking manipulation task. (a) Example evaluation in the ManiSkill simulator. (b) Example evaluation on a real robot. (c) Variance reduction achieved by X4Val compared to Monte Carlo when estimating the policy’s mean success rate (each boxplot summarizes 20 random seeds).

Figure 6: Effect of control-weight optimization on CV_MCF variance reduction relative to Simple Monte Carlo. (a) In the geographic-transfer case study, enabling optimization yields modest but consistent gains across MCF train fractions, with the largest gain (11% → 14%) occurring at fraction = 0.0. (b) In the iterative-policy-development case study, enabling optimization is essential: it transforms an estimator that…
Original abstract

Rigorous evaluation of learning-based robotic systems is an essential prerequisite for deployment. However, real-world test data is expensive to gather; moreover, in a typical iterative development context, data gathered from the latest policy is necessarily limited in scale. This motivates evaluation methodologies that make use of heterogeneous data sources, including simulation, historical policy logs, and data collected from related platforms or environments. While such auxiliary data are abundant and inexpensive, they are generally not directly representative of real-world outcomes -- for example, performance in simulation may differ substantially from performance in the real world -- making their principled use for high-confidence performance estimation challenging. In this paper, we introduce X4Val, a general framework for variance-reduced real-world metric estimation in the presence of non-paired, multi-domain data. X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. We provide theoretical analysis and empirical evaluations on autonomous driving and real-world robot manipulation tasks, domains across which X4Val achieves up to 38.4% variance reduction and demonstrates consistent improvements over strong baselines. These results show that non-paired, heterogeneous data can be leveraged to substantially improve the sample efficiency of rigorous robotic system validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces X4Val, a framework for variance-reduced real-world metric estimation in robotics using non-paired multi-domain data. Samples from real and auxiliary domains (e.g., simulation, historical logs) are embedded into a shared representation space; a transferable neural predictor of the target real-world metric is learned from this space and then plugged into a control-variates estimator. The method is supported by theoretical analysis and evaluated on autonomous driving and real-world robot manipulation tasks, where it reports up to 38.4% variance reduction and consistent improvements over strong baselines.

Significance. If the unbiasedness of the control-variates estimator is preserved under domain shift, the approach would meaningfully improve sample efficiency for rigorous policy evaluation in robotics, where real-world data collection is costly and auxiliary data sources are abundant but non-representative. The explicit use of learned surrogates inside control variates, together with the reported empirical gains, would constitute a practical advance over standard Monte-Carlo or paired-sample methods.

major comments (1)
  1. [Theoretical Analysis] The central claim that the estimator remains unbiased (and therefore that reported variance reductions are meaningful for high-confidence estimation) rests on the learned predictor satisfying E[P(real)] = E[M] (or a known offset) after transfer from auxiliary domains. The skeptic note correctly identifies this as the least secure assumption; the theoretical analysis section must therefore contain an explicit derivation or bound showing that any residual domain gap does not introduce a non-zero bias term in the control-variates estimator E[M - β(P - E[P])]. Without such a derivation or a sensitivity analysis, the variance-reduction numbers alone do not establish that the estimator is suitable for rigorous validation.
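The bias question raised in this comment can be made concrete with the textbook control-variates algebra. The sketch below is generic and is not taken from the paper: $\bar M$ and $\bar P$ are sample means of the real metric and the predictor output over $n$ i.i.d. real samples, $c$ is whatever constant the estimator uses to center the predictor (a symbol introduced here for illustration), and $\beta$ is treated as fixed.

```latex
\begin{align*}
\hat\mu_\beta &= \bar M - \beta\,(\bar P - c) \\
\mathbb{E}[\hat\mu_\beta] - \mathbb{E}[M]
  &= \beta\,\bigl(c - \mathbb{E}_{\mathrm{real}}[P]\bigr) \\
\operatorname{Var}(\hat\mu_\beta)
  &= \tfrac{1}{n}\bigl(\operatorname{Var}(M) - 2\beta\,\operatorname{Cov}(M,P)
     + \beta^{2}\operatorname{Var}(P)\bigr) \\
\beta^{\star} &= \frac{\operatorname{Cov}(M,P)}{\operatorname{Var}(P)},
  \qquad
\operatorname{Var}(\hat\mu_{\beta^{\star}})
  = \tfrac{1}{n}\operatorname{Var}(M)\,\bigl(1 - \rho_{M,P}^{2}\bigr)
\end{align*}
```

In this generic form the bias is the coefficient times the centering error, and it does not shrink as $n$ grows. If $c$ is itself estimated from other data, its sampling error contributes to the estimator's variance as well.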
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from a short statement clarifying whether the control-variates coefficient β is estimated from the same data or held out, as this choice directly affects both bias and variance.
  2. [Experiments] Figure captions should explicitly state the number of independent runs and whether error bars represent standard error or standard deviation.
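On the first minor comment, one common way to keep the coefficient from being fitted on the same samples it corrects is cross-fitting. The following is a minimal sketch under stated assumptions: `cross_fitted_cv` is a hypothetical helper, the centering value `p_mean` is taken as given, and the standard error is the simple plug-in from the corrected terms. It is not claimed to match the estimator in the paper.

```python
import numpy as np

def cross_fitted_cv(m, p, p_mean, n_folds=5, seed=0):
    """Control-variates estimate where each fold's coefficient is fitted
    on the other folds only."""
    m = np.asarray(m, dtype=float)
    p = np.asarray(p, dtype=float)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(m.size), n_folds)
    corrected = np.empty_like(m)
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        var_p = p[train].var(ddof=1)
        beta = np.cov(m[train], p[train])[0, 1] / var_p if var_p > 0 else 0.0
        # Apply the out-of-fold coefficient to the held-out samples.
        corrected[held_out] = m[held_out] - beta * (p[held_out] - p_mean)
    estimate = corrected.mean()
    std_err = corrected.std(ddof=1) / np.sqrt(m.size)  # plug-in approximation
    return estimate, std_err

rng = np.random.default_rng(2)
x = rng.normal(size=60)
m_real = x + 0.5 * rng.normal(size=60)
p_real = 0.9 * x
est, se = cross_fitted_cv(m_real, p_real, p_mean=0.0)
print(f"estimate = {est:.3f} +/- {se:.3f}")
```

Fitting and applying the coefficient on disjoint samples removes the dependence between the coefficient and the terms it multiplies, at the cost of a slightly noisier coefficient per fold.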

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for emphasizing the need to rigorously address potential bias under domain shift. We respond to the major comment below and will revise the manuscript to include the requested derivation and sensitivity analysis.

Point-by-point responses
  1. Referee: [Theoretical Analysis] The central claim that the estimator remains unbiased (and therefore that reported variance reductions are meaningful for high-confidence estimation) rests on the learned predictor satisfying E[P(real)] = E[M] (or a known offset) after transfer from auxiliary domains. The skeptic note correctly identifies this as the least secure assumption; the theoretical analysis section must therefore contain an explicit derivation or bound showing that any residual domain gap does not introduce a non-zero bias term in the control-variates estimator E[M - β(P - E[P])]. Without such a derivation or a sensitivity analysis, the variance-reduction numbers alone do not establish that the estimator is suitable for rigorous validation.

    Authors: We agree that the unbiasedness claim requires explicit handling of residual domain gap after transfer. Section 3 derives that the control-variates estimator E[M - β(P - E[P])] is unbiased whenever E[P] = E[M] holds in the target (real) domain; the analysis treats the learned predictor as satisfying this equality after embedding into the shared space. However, the current write-up does not provide a quantitative bound on the bias that would arise if transfer is imperfect. In the revision we will add (i) a derivation bounding the absolute bias of the control-variates estimator by the product of the control-variate coefficient and an integral probability metric (e.g., Wasserstein-1) between the embedded real and auxiliary distributions, and (ii) a sensitivity study in the experimental section that injects controlled domain discrepancies and reports both the resulting bias and the observed variance reduction. These additions will make the theoretical guarantees and empirical claims directly comparable. revision: yes
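A sensitivity study of the kind promised in this response could look roughly like the sketch below. It uses the simplest possible discrepancy, a constant error `delta` in the value used to center the predictor, on a synthetic Gaussian model with known correlation `rho`. The function name, the model, and the choice of discrepancy are all assumptions of this example, not the authors' planned experiment.

```python
import numpy as np

def sensitivity_sweep(offsets, n=40, n_trials=4000, rho=0.8, seed=0):
    """For each centering error delta, report bias, variance reduction,
    and the MSE ratio of control variates against plain Monte Carlo."""
    rng = np.random.default_rng(seed)
    # Synthetic pairs (M, P): unit variance, correlation rho, true means 0.
    z = rng.normal(size=(n_trials, n, 2))
    m = z[..., 0]
    p = rho * z[..., 0] + np.sqrt(1.0 - rho**2) * z[..., 1]
    beta = rho  # Cov(M, P) / Var(P) for this synthetic model
    mc = m.mean(axis=1)
    rows = []
    for delta in offsets:
        # delta is the error in the assumed mean of the predictor.
        cv = mc - beta * (p.mean(axis=1) - delta)
        rows.append({
            "delta": delta,
            "bias": cv.mean(),  # true E[M] is 0
            "variance_reduction": 1.0 - cv.var(ddof=1) / mc.var(ddof=1),
            "mse_ratio": np.mean(cv**2) / np.mean(mc**2),
        })
    return rows

for row in sensitivity_sweep([0.0, 0.1, 0.5]):
    print(row)
```

In this model the variance reduction is the same for every `delta`, while the bias grows linearly with it, so the MSE ratio is the column that shows when the centering error has eaten the gain.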

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

Full rationale

The provided abstract and description present X4Val as a new framework that learns a transferable neural predictor from multi-domain embeddings and plugs it into a standard control-variates estimator. No equations, self-citations, or fitted quantities are shown that reduce the claimed variance reduction to a tautology or to inputs by construction. The method relies on independent statistical theory (control variates) and standard supervised learning, with the domain-transfer assumption stated explicitly rather than smuggled in via prior self-work. This is the common honest case of a self-contained proposal against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no equations or implementation details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5799 in / 1132 out tokens · 15192 ms · 2026-06-28T05:43:30.113209+00:00 · methodology

