
arxiv: 2605.19986 · v1 · pith:WU3KZD3Bnew · submitted 2026-05-19 · 💻 cs.RO · cs.CV · cs.LG

Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

Pith reviewed 2026-05-20 04:48 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG
keywords fine-grained manipulation · diagnostic evaluation · vision-language-action models · robotic manipulation · spatial perception · benchmarking framework · causal intervention · embodied AI

The pith

The visual encoder's ability to preserve local spatial structure is a key bottleneck for fine-grained robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current robot manipulation benchmarks rely on binary success rates that mask specific failures in precise tasks and can inflate reported capabilities by as much as 70 percent. The paper introduces MetaFine, a diagnostic framework that rebuilds existing benchmarks into scenarios testing three separate capacities: understanding, perception, and controlled behavior. Targeted causal interventions within this framework indicate that the visual encoder's handling of local spatial detail is a key limit on fine-grained precision. Improving only that aspect of the encoder enables manipulation behaviors that were previously out of reach, without any change to the downstream policy.
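To make the masking effect concrete, here is a minimal sketch of how a single binary success flag can hide an axis-specific failure. The `EpisodeResult` record, its field names, and the assumption that each axis is scored in [0, 1] are hypothetical illustrations, not MetaFine's actual schema or scoring rule, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; field names are illustrative only.
    success: bool         # conventional binary outcome
    understanding: float  # each axis score is assumed to lie in [0, 1]
    perception: float
    control: float


def summarize(episodes):
    """Return the binary success rate alongside the mean score per axis."""
    return {
        "binary_success": mean(1.0 if e.success else 0.0 for e in episodes),
        "understanding": mean(e.understanding for e in episodes),
        "perception": mean(e.perception for e in episodes),
        "control": mean(e.control for e in episodes),
    }


# Made-up episodes: most "succeed", yet perception scores are low throughout.
episodes = [
    EpisodeResult(True, 0.9, 0.3, 0.8),
    EpisodeResult(True, 0.8, 0.2, 0.9),
    EpisodeResult(False, 0.9, 0.1, 0.7),
    EpisodeResult(True, 1.0, 0.4, 0.8),
]
summary = summarize(episodes)
```

In this toy data three of four episodes count as successes, while the mean perception score is only 0.25; a ranking built on the binary column alone would not surface that gap.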

Core claim

Through causal interventions on vision-language-action models, the work identifies the visual encoder's capacity to preserve local spatial structure as a key bottleneck for fine-grained manipulation precision. Enhancing this capacity alone unlocks previously inaccessible behaviors in tasks that require tight coupling of local attribute grounding and constraint-respecting motor execution.

What carries the argument

MetaFine, a diagnostic meta-evaluation framework built on a compositional task graph that absorbs existing benchmarks and reconstructs them into unified scenarios of graded complexity.

Load-bearing premise

The compositional task graph can absorb heterogeneous external benchmarks and turn them into diagnostic scenarios of varying complexity without introducing artifacts or biases.

What would settle it

Apply an intervention that specifically improves local spatial preservation in a visual encoder and check whether fine-grained success rates rise substantially in the MetaFine scenarios while other model components remain unchanged.
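One way to read "rise substantially" is as a difference in success rates with an uncertainty interval. The sketch below compares a baseline encoder against an intervened encoder with everything else held fixed, using a normal-approximation interval for the difference of two proportions. The function name and the rollout counts are hypothetical; the paper's own statistical procedure is not described in this summary.

```python
import math


def success_rate_gain(base_successes, base_trials, new_successes, new_trials, z=1.96):
    """Difference in success rates (new - base) with a normal-approximation CI.

    z=1.96 corresponds to an approximate 95% interval.
    """
    p_base = base_successes / base_trials
    p_new = new_successes / new_trials
    diff = p_new - p_base
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_base * (1 - p_base) / base_trials + p_new * (1 - p_new) / new_trials
    )
    return diff, (diff - z * se, diff + z * se)


# Made-up counts: 12/100 successes before the intervention, 41/100 after.
diff, (lo, hi) = success_rate_gain(12, 100, 41, 100)
```

If the lower bound of the interval stays above zero, as it does for these illustrative counts, the gain is unlikely to be sampling noise. With small rollout budgets an exact or bootstrap interval would be a safer choice than the normal approximation.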

Original abstract

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.
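The abstract does not say how the paired real-world rollouts calibrate the simulation estimates. As a minimal sketch under a stated assumption, the code below uses a simple mean-difference correction in the spirit of prediction-powered inference: the large simulation sample supplies the bulk estimate, and the paired rollouts estimate the sim-to-real bias. The function and argument names are hypothetical, and this should not be read as MetaFine's actual procedure.

```python
from statistics import mean


def calibrated_success_estimate(sim_all, sim_paired, real_paired):
    """Correct a large-sample simulation estimate using paired real rollouts.

    estimate = mean(sim_all) + mean(real_paired - sim_paired)

    sim_all:     outcomes (0.0 or 1.0) from many simulation-only rollouts
    sim_paired:  simulation outcomes for the scenarios that were also run on hardware
    real_paired: real-world outcomes for those same scenarios, in the same order
    """
    if len(sim_paired) != len(real_paired):
        raise ValueError("paired rollouts must align one-to-one")
    if not sim_all or not sim_paired:
        raise ValueError("both the simulation and paired samples must be non-empty")
    # Average sim-to-real gap measured on the small paired set.
    bias = mean(r - s for r, s in zip(real_paired, sim_paired))
    return mean(sim_all) + bias


# Made-up outcomes: simulation is optimistic on one of the four paired scenarios.
sim_all = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0]
sim_paired = [1.0, 1.0, 0.0, 1.0]
real_paired = [1.0, 0.0, 0.0, 1.0]
estimate = calibrated_success_estimate(sim_all, sim_paired, real_paired)
```

The correction term is only as reliable as the paired set is representative, so a small or skewed set of real rollouts can shift the estimate in the wrong direction.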

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MetaFine, a diagnostic meta-evaluation framework for fine-grained manipulation in vision-language-action (VLA) models. It argues that binary success rates in existing embodied AI benchmarks inflate reported capabilities by up to 70% and obscure architectural bottlenecks. MetaFine uses a compositional task graph to absorb heterogeneous external benchmarks and reconstruct them into diagnostic scenarios of varying complexity under a unified protocol, disentangling competencies along understanding, perception, and controlled behavior axes. Evaluations expose dimension-specific failures, and targeted causal interventions identify the visual encoder's preservation of local spatial structure as a key bottleneck whose improvement unlocks new manipulation capabilities without changes to downstream policies. The framework also supports hybrid real-sim validation using limited paired rollouts to calibrate simulation estimates.

Significance. If the reconstruction process and causal interventions hold without introducing artifacts, this work would be significant for the robotics and embodied AI community. Shifting evaluation from aggregate rankings to dimension-specific diagnosis could guide targeted repairs to components like visual encoders, improving real-world fine-grained dexterity and providing an actionable benchmarking compass beyond current binary metrics.

major comments (2)
  1. Abstract: The central claim that targeted causal intervention isolates the visual encoder's local spatial structure preservation as the key bottleneck (whose improvement unlocks capabilities without policy changes) depends on the compositional task graph faithfully disentangling the axes. The description of how heterogeneous benchmarks are absorbed and reconstructed lacks explicit validation (e.g., comparisons of spatial fidelity distributions or attribute grounding before/after reconstruction) to rule out systematic biases that could confound the causal attribution.
  2. Abstract: The reported up to 70% inflation of capabilities by conventional metrics underpins the motivation and the 'severe dimension-specific failures' finding; this requires concrete quantification, including the exact benchmarks compared, the definition of the inflation metric, and supporting statistics such as error bars or sample sizes, as these are load-bearing for the framework's diagnostic value.
minor comments (1)
  1. Abstract: The manuscript would benefit from a brief outline of the unified protocol details (e.g., complexity calibration rules or data handling for heterogeneous inputs) to support reproducibility claims, even if full methods appear later.
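The before/after comparison requested in the first major comment could start from a two-sample distance between distributions. The sketch below computes the two-sample Kolmogorov-Smirnov statistic by hand, assuming each scenario yields one scalar spatial fidelity score. That scalar score is a hypothetical stand-in, since the page does not define how spatial fidelity is measured.

```python
from bisect import bisect_right


def ks_statistic(before, after):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    if not before or not after:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(before), sorted(after)
    gap = 0.0
    # The maximum gap between two step functions occurs at an observed sample value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of `before` values <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of `after` values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up scores for the same scenarios before and after reconstruction.
same = ks_statistic([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
shifted = ks_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

A statistic near 0 suggests reconstruction left the distribution largely intact, and a value near 1 signals a systematic shift. For a p-value alongside the statistic, `scipy.stats.ks_2samp` provides the same test.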

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity and validation that we address point by point below. We have prepared revisions to strengthen the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: The central claim that targeted causal intervention isolates the visual encoder's local spatial structure preservation as the key bottleneck (whose improvement unlocks capabilities without policy changes) depends on the compositional task graph faithfully disentangling the axes. The description of how heterogeneous benchmarks are absorbed and reconstructed lacks explicit validation (e.g., comparisons of spatial fidelity distributions or attribute grounding before/after reconstruction) to rule out systematic biases that could confound the causal attribution.

    Authors: We agree that explicit validation is necessary to support the causal attribution. In the revised manuscript we will add a dedicated validation subsection (with new figures) that reports quantitative comparisons of spatial fidelity distributions and attribute grounding accuracy before versus after reconstruction across the absorbed benchmarks. These results will confirm that the compositional task graph preserves the relevant properties and does not introduce systematic biases that could confound the identification of the visual encoder as the bottleneck.
    revision: yes

  2. Referee: Abstract: The reported up to 70% inflation of capabilities by conventional metrics underpins the motivation and the 'severe dimension-specific failures' finding; this requires concrete quantification, including the exact benchmarks compared, the definition of the inflation metric, and supporting statistics such as error bars or sample sizes, as these are load-bearing for the framework's diagnostic value.

    Authors: The 70% figure is obtained from the diagnostic versus binary evaluations reported in Section 4 and the associated tables. To address the request for concreteness, we will revise the abstract to briefly define the inflation metric (relative difference between binary success rate and the fine-grained diagnostic score) and will add explicit references to the exact benchmarks, sample sizes, and error bars already present in the main results. This makes the supporting evidence immediately accessible without altering the reported value.
    revision: yes
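The inflation metric named in the second response, a relative difference between binary success rate and diagnostic score, can be written in a few lines. The rebuttal does not state the normalization, so dividing by the binary success rate is an assumption here, and the function name and example values are hypothetical.

```python
def inflation(binary_success_rate, diagnostic_score):
    """Relative gap between the binary success rate and the diagnostic score.

    Normalizing by the binary success rate is an assumption; the source only
    says "relative difference" without naming the denominator.
    """
    if binary_success_rate <= 0:
        raise ValueError("binary success rate must be positive")
    return (binary_success_rate - diagnostic_score) / binary_success_rate


# Made-up values: 80% binary success against a 24% fine-grained diagnostic score.
example = inflation(0.80, 0.24)
```

Under this normalization the made-up pair above yields an inflation of about 0.70. Dividing by the diagnostic score instead would give a much larger number for the same inputs, which is why the referee's request for an explicit definition matters.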

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper introduces MetaFine as a diagnostic meta-evaluation framework that absorbs external benchmarks via a compositional task graph and performs causal interventions to identify bottlenecks. No equations, derivations, or fitted parameters appear that reduce by construction to the framework's own inputs or prior self-citations. The central claim regarding the visual encoder's local spatial structure preservation is presented as an empirical outcome of targeted interventions rather than a self-definitional or statistically forced result. The framework description remains self-contained against external benchmarks, consistent with the default expectation that most papers exhibit no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption about fine-grained manipulation requirements and introduces new evaluation structures with no numerical free parameters or independently evidenced entities.

axioms (1)
  • domain assumption: Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution.
    Invoked in the abstract opening to define the target regime.
invented entities (1)
  • MetaFine diagnostic meta-evaluation framework (no independent evidence)
    purpose: To disentangle manipulation competency along three axes and reconstruct benchmarks for diagnosis.
    Newly proposed in this work.

pith-pipeline@v0.9.0 · 11481 in / 1234 out tokens · 86599 ms · 2026-05-20T04:48:54.642733+00:00 · methodology


