pith. sign in

arxiv: 2606.29898 · v1 · pith:PEZTGLNYnew · submitted 2026-06-29 · 💻 cs.RO · cs.AI

Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies

Pith reviewed 2026-06-30 05:56 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robot manipulationpolicy validationoffline evaluationmean squared errorcritical intervalsaction alignmentSpearman correlation
0
0 comments X

The pith

Restricting MSE to task-critical intervals and aligning actions yields stronger correlation with robot policy rollout performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Critical Interval MSE as an offline metric to predict how well robot manipulation policies will perform in the real world. Standard validation loss on demonstrations often fails to correlate well with actual deployment results, making policy iteration slow and expensive. By focusing error calculation only on the segments that matter most for task success and adjusting actions to better mimic rollout conditions, CI-MSE achieves a Spearman's correlation of -0.87 with performance, compared to -0.61 for raw MSE. This closer alignment to the ideal value of -1 means developers can more reliably rank and select policies without constant real-world testing. The method is shown to be robust across simulation and real experiments with various checkpoints.

Core claim

Critical Interval MSE restricts the computation of mean squared error to task-critical segments of demonstrations and incorporates action-alignment procedures to better approximate rollout-time dynamics, resulting in validation errors that correlate more strongly with real-world policy performance than standard MSE across a range of checkpoints.

What carries the argument

Critical Interval MSE (CI-MSE), which limits error computation to identified task-critical segments and applies action alignment to match rollout behavior.

If this is right

  • Offline validation becomes a more reliable proxy, reducing the need for frequent real-world evaluations.
  • Policies can be compared and iterated more efficiently using only expert demonstrations.
  • The metric remains effective under some distribution shifts but has design boundaries.
  • Sensitivity analysis shows robustness to hyperparameter choices in segment identification and alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar interval-based restrictions could improve validation metrics in other sequential decision domains like autonomous driving.
  • If critical segments prove hard to identify automatically, the method's practicality may depend on domain-specific knowledge.
  • Extending the alignment procedure to handle longer horizons might further improve correlation in complex tasks.

Load-bearing premise

Task-critical segments can be identified reliably without introducing bias, and the action-alignment step accurately reflects actual rollout dynamics.

What would settle it

Running the same experiments on additional policy checkpoints or different robot tasks where the correlation of CI-MSE falls to levels comparable with raw MSE would falsify the claimed improvement.

Figures

Figures reproduced from arXiv: 2606.29898 by Haoxu Huang, Jiacheng You, Tongsam Zheng, Yang Gao, Yifan Chen.

Figure 1
Figure 1. Figure 1: Illustration of critical intervals in a robot manipulation trajectory. This illustration shows [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline for critical interval MSE computation. Critical intervals filter out uninformative [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Simulation evaluation-validation correlation. Left: evaluation success rate versus valida [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Sensitivity analysis of CI-MSE to its hyperparameters and interval annotation. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Confidence intervals for Elo scores and partial success scores. The confidence intervals [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Episode-level composition of the LBM-Eval dataset used in the simulation experiments. [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Composition of the LBM-Eval training and validation sets used in the simulation experi [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Environment and table textures used in visual distribution shift evaluation. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Validation error and rollout performance under different inference-time methods on the [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
read the original abstract

Real-world evaluation is the gold standard for robot policies because it tests them against the physical conditions and deployment challenges they are ultimately designed to handle. However, real-world evaluation is also the bottleneck for iterating on robot policies: it is costly, difficult to reproduce, and often too sparse to reliably compare nearby model variants. A straightforward proxy for performance is validation loss on expert demonstrations, but this proxy is often poorly correlated with real-world performance. In this paper, we introduce Critical Interval MSE (CI-MSE), an intuitively simple yet effective offline validation metric. CI-MSE restricts error computation to task-critical segments and pairs it with simple action-alignment procedures that better match rollout-time behavior. Across simulation and real-world experiments, CI-MSE yields a stronger correlation between validation error and rollout performance than raw MSE. Across a wide range of policy checkpoints, CI-MSE achieves a Spearman's rank correlation of $-0.87$, much closer to the ideal value of $-1$ than raw MSE's $-0.61$, demonstrating a significant improvement. We show through sensitivity analysis that our metric is robust to a wide range of hyperparameters. We further study the effectiveness of CI-MSE under evaluation distribution shifts and suggest design boundaries when using this metric. In summary, this paper provides a simple and reliable offline validation tool for accelerating policy iteration. Project webpage: https://ci-mse.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces Critical Interval MSE (CI-MSE) as an offline validation metric for robot manipulation policies. It restricts MSE computation to task-critical segments of expert demonstrations and augments this with action-alignment procedures, claiming stronger correlation with real-world rollout performance than raw MSE (Spearman's rank correlation of -0.87 versus -0.61). The work reports results across simulation and real-world experiments on multiple policy checkpoints, plus sensitivity analysis and tests under distribution shifts.

Significance. If the critical-interval identification procedure proves reproducible from offline data alone without leakage from rollout outcomes, CI-MSE could meaningfully accelerate policy iteration in robotics by supplying a more reliable proxy than standard validation loss. The reported sensitivity analysis and distribution-shift experiments are positive features that strengthen the practical claim.

major comments (2)
  1. [Abstract / Methods description] The procedure for locating task-critical segments is described only at a high level in the abstract and is not accompanied by an explicit, reproducible algorithm (e.g., pseudocode, decision rules, or dependence on task-specific knowledge or human annotation). Because this step is load-bearing for the claimed correlation improvement, its absence prevents verification that the -0.87 result generalizes without bias to new policies or tasks.
  2. [Experimental protocol / Results] Implementation details of the action-alignment procedure and the precise experimental protocol used to compute CI-MSE (including how alignment is performed from offline data only) are not provided. This omission directly affects assessment of whether the alignment step introduces information unavailable at true offline validation time, undermining evaluation of the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on reproducibility. We address each major comment below and will revise the manuscript to supply the requested details.

read point-by-point responses
  1. Referee: [Abstract / Methods description] The procedure for locating task-critical segments is described only at a high level in the abstract and is not accompanied by an explicit, reproducible algorithm (e.g., pseudocode, decision rules, or dependence on task-specific knowledge or human annotation). Because this step is load-bearing for the claimed correlation improvement, its absence prevents verification that the -0.87 result generalizes without bias to new policies or tasks.

    Authors: We agree the abstract is high-level. The revised manuscript will add an explicit section with pseudocode, decision rules, and a description of the offline-only procedure for identifying critical intervals from expert demonstrations, ensuring full reproducibility without rollout leakage. revision: yes

  2. Referee: [Experimental protocol / Results] Implementation details of the action-alignment procedure and the precise experimental protocol used to compute CI-MSE (including how alignment is performed from offline data only) are not provided. This omission directly affects assessment of whether the alignment step introduces information unavailable at true offline validation time, undermining evaluation of the central claim.

    Authors: We agree that the current manuscript lacks sufficient implementation detail. The revision will include the complete action-alignment algorithm, the exact offline-only protocol for computing CI-MSE, and confirmation that no rollout information is used at validation time. revision: yes

Circularity Check

0 steps flagged

No significant circularity; metric definition independent of reported correlation

full rationale

The paper introduces CI-MSE by restricting MSE to task-critical segments identified from task structure plus action-alignment rules, then reports an empirical Spearman's correlation of -0.87 versus raw MSE's -0.61 on held-out policy checkpoints. No equation or procedure in the provided text defines the critical intervals or alignment using the correlation value itself, nor does any self-citation chain supply a uniqueness result that forces the metric. The correlation is presented as an outcome of applying the independently motivated metric, not as an input used to tune or rename it. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits visibility into exact definitions; critical intervals and alignment likely rest on domain-specific choices whose independence from the final correlation numbers cannot be verified here.

axioms (1)
  • domain assumption Expert demonstration data is available and representative enough for offline validation
    The metric is computed on validation loss from expert demonstrations

pith-pipeline@v0.9.1-grok · 5787 in / 1237 out tokens · 29831 ms · 2026-06-30T05:56:52.874557+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 17 canonical work pages · 6 internal anchors

  1. [1]

    Z. Zhou, P. Atreya, Y . L. Tan, K. Pertsch, and S. Levine. Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world.arXiv preprint arXiv:2503.24278, 2025

  2. [2]

    X. Li, K. Hsu, J. Gu, O. Mees, K. Pertsch, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipu- lation policies in simulation. InProceedings of the 8th Conference on Robot Learning, pages 3705–3728, 2025

  3. [3]

    Atreya, K

    P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, J. Tremblay, K. Arora, K. Ellis, L. Macesanu, M. Leonard, M. Cho, O. Aslan, S. Dass, J. Wang, X. Yuan, X. Yang, A. Gupta, D. Jayaraman, G. Berseth, K. Daniilidis, R. Martin-Martin, Y . Lee, P. Liang, C. Finn, and S. Levine. Roboarena: Distributed real- w...

  4. [4]

    URLhttps://proceedings.mlr.press/v305/atreya25a.html

  5. [5]

    Y . Li, Y . Zhu, J. Wen, C. Shen, and Y . Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025. URLhttps://arxiv.org/abs/2505. 19017

  6. [6]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. InRobotics: Science and Systems, 2024. doi:10.15607/RSS.2024.XX.120. URL https://arxiv.org/abs/2403.12945

  7. [7]

    Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models.arXiv preprint arXiv:2310.08864, 2023. URLhttps: //arxiv.org/abs/2310.08864

  8. [8]

    Hussenot, M

    L. Hussenot, M. Andrychowicz, D. Vincent, R. Dadashi, A. Raichuk, S. Ramos, N. Mom- chev, S. Girgin, R. Marinier, L. Stafiniak, M. Orsini, O. Bachem, M. Geist, and O. Pietquin. Hyperparameter selection for imitation learning. In M. Meila and T. Zhang, editors,Pro- ceedings of the 38th International Conference on Machine Learning, volume 139 ofPro- ceeding...

  9. [9]

    Mandlekar, D

    A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y . Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In A. Faust, D. Hsu, and G. Neumann, editors,Proceedings of the 9 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Re- sea...

  10. [10]

    F. Lin, Y . Hu, P. Sheng, C. Wen, J. You, and Y . Gao. Data scaling laws in imitation learning for robotic manipulation. InInternational Conference on Learning Representations, volume 2025, pages 54877–54910, 2025

  11. [11]

    Florence, C

    P. Florence, C. Lynch, A. Zeng, O. A. Ramirez, A. Wahid, L. Downs, A. Wong, J. Lee, I. Mordatch, and J. Tompson. Implicit behavioral cloning. In A. Faust, D. Hsu, and G. Neumann, editors,Proceedings of the 5th Conference on Robot Learning, volume 164 ofProceedings of Machine Learning Research, pages 158–168. PMLR, 2022. URLhttps: //proceedings.mlr.press/v...

  12. [12]

    J. Pari, N. M. Shafiullah, S. P. Arunachalam, and L. Pinto. The surprising effectiveness of representation learning for visual imitation.arXiv preprint arXiv:2112.01511, 2022. URL https://arxiv.org/abs/2112.01511

  13. [13]

    Tune to Learn: How Controller Gains Shape Robot Policy Learning

    A. Bronars, Y . Park, and P. Agrawal. Tune to learn: How controller gains shape robot policy learning.arXiv preprint arXiv:2604.02523, 2026. URLhttps://arxiv.org/abs/2604. 02523

  14. [14]

    Tiezzi, T

    M. Tiezzi, T. Apicella, C. Cardenas-Perez, G. Fregonese, S. Dafarra, P. Morerio, D. Pucci, and A. Del Bue. Learning to evaluate autonomous behaviour in human-robot interaction.arXiv preprint arXiv:2507.06404, 2025. URLhttps://arxiv.org/abs/2507.06404

  15. [15]

    C. Wen, J. Lin, J. Qian, Y . Gao, and D. Jayaraman. Keyframe-focused visual imitation learn- ing. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 11123– 11133. PMLR, 18–24 Jul 2021. URLhttps://proceedings.mlr.press/v139/wen21d. html

  16. [16]

    E. Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstra- tion. In2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4613–4619, 2021. doi:10.1109/ICRA48506.2021.9560942. URLhttps://arxiv.org/ abs/2105.06411

  17. [17]

    Tsuji, Y

    T. Tsuji, Y . Kato, G. Solak, H. Zhang, T. Petri ˇc, F. Nori, and A. Ajoudani. A survey on imitation learning for contact-rich tasks in robotics.The International Journal of Robotics Research, 2026. doi:10.1177/02783649261417694. URLhttps://doi.org/10.1177/ 02783649261417694

  18. [18]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. InRobotics: Science and Systems, 2023. doi:10.15607/RSS.2023. XIX.016. URLhttps://arxiv.org/abs/2304.13705

  19. [19]

    Black, M

    K. Black, M. Galliker, and S. Levine. Real-time execution of action chunking flow policies. Advances in Neural Information Processing Systems, 38:33383–33407, 2026

  20. [21]

    URLhttps://arxiv.org/abs/2402.08191

  21. [22]

    G. Zhou, V . Dean, M. K. Srirama, A. Rajeswaran, J. Pari, K. Hatch, A. Jain, T. Yu, P. Abbeel, L. Pinto, C. Finn, and A. Gupta. Train offline, test online: A real robot learning benchmark. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9197–9203,

  22. [23]

    doi:10.1109/ICRA48891.2023.10160594. 10

  23. [24]

    Y . Chen, K. Kimble, E. H. Adelson, T. Asfour, P. Chanrungmaneekul, S. Chitta, Y . Chitambar, Z. Chen, K. Goldberg, D. Kragic, H. Li, X. Li, Y . Li, A. Prather, N. Pollard, M. A. Roa-Garzon, R. Seney, S. Sha, S. Wang, Y . Xiang, K. Zhang, Y . Zhu, and K. Hang. Manipulationnet: An infrastructure for benchmarking real-world robot manipulation with physical ...

  24. [25]

    James, Z

    S. James, Z. Ma, D. R. Arrojo, and A. J. Davison. Rlbench: The robot learning benchmark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020. doi: 10.1109/LRA.2020.2974707

  25. [26]

    J. Gu, F. Xiang, X. Li, Z. Ling, X. Liu, T. Mu, Y . Tang, S. Tao, X. Wei, Y . Yao, X. Yuan, P. Xie, Z. Huang, R. Chen, and H. Su. Maniskill2: A unified benchmark for generalizable manipulation skills. InInternational Conference on Learning Representations, 2023

  26. [27]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning. InAdvances in Neural Information Processing Sys- tems, 2023

  27. [28]

    Jiang and L

    N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In M. F. Balcan and K. Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 652– 661, New York, New York, USA, 20–22 Jun 2016. PMLR. URLhttps://proceedings. mlr.press/...

  28. [29]

    J. Fu, M. Norouzi, O. Nachum, G. Tucker, Z. Wang, A. Novikov, M. Yang, M. R. Zhang, Y . Chen, A. Kumar, C. Paduraru, S. Levine, and T. Le Paine. Benchmarks for deep off-policy evaluation. InInternational Conference on Learning Representations, 2021

  29. [30]

    L. Da, P. Jenkins, T. Schwantes, J. Dotson, and H. Wei. Probabilistic offline policy ranking with approximate bayesian computation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 20370–20378, 2024

  30. [31]

    P. Gu, M. Zhao, X. He, Y . Cai, and B. An. Porank: A practical framework for learning to rank policies. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 4044–4052, 2024. doi:10.24963/ijcai.2024/447

  31. [32]

    Zheng, W.-L

    L. Zheng, W.-L. Chiang, Y . Sheng, S. Zhuang, Z. Wu, Y . Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023. Datasets and Benchmarks Track

  32. [33]

    S. Tan, S. Zhuang, K. Montgomery, W. Tang, A. Cuadron, C. Wang, R. Popa, and I. Stoica. Judgebench: A benchmark for evaluating llm-based judges. InInternational Conference on Learning Representations, 2025

  33. [34]

    Barreiros, A

    J. Barreiros, A. Beaulieu, A. Bhat, R. Cory, E. Cousineau, H. Dai, C.-H. Fang, K. Hashimoto, M. Z. Irshad, M. Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation.Science Robotics, 11(113):eaea6201, 2026

  34. [35]

    Black, N

    K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, et al.π0.5: a vision-language-action model with open-world general- ization. In9th Annual Conference on Robot Learning, 2025

  35. [36]

    X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

    J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y . Feng, Y . Zheng, J. Zou, Y . Chen, J. Zeng, et al. X- vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025

  36. [37]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025. 11 A Appendix A.1 Few-Shot Prompt for Critical Interval Annotation We use the following prompt template to ask the vision-language model to annota...