Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation

He-Yang Xu; Pengyuan Zhang; Serge Belongie; Xiaoshuai Hao; Xin Geng; Xiu-Shen Wei; Yuxin Peng; Zongyuan Ge

arxiv: 2605.19986 · v1 · pith:WU3KZD3Bnew · submitted 2026-05-19 · 💻 cs.RO · cs.CV · cs.LG

Pith reviewed 2026-05-20 04:48 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG

keywords fine-grained manipulation · diagnostic evaluation · vision-language-action models · robotic manipulation · spatial perception · benchmarking framework · causal intervention · embodied AI

The pith

The visual encoder's ability to preserve local spatial structure is a key bottleneck for fine-grained robotic manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current robot manipulation benchmarks rely on binary success rates that mask specific failures in precise tasks and can inflate reported capabilities by up to 70%. The paper introduces MetaFine, a diagnostic framework that rebuilds existing benchmarks into scenarios testing three separate capacities: understanding, perception, and controlled behavior. Targeted interventions within this framework identify the visual encoder's handling of local spatial structure as a key bottleneck. Improving only that aspect of the encoder unlocks manipulation capabilities that were previously out of reach, with no changes to the downstream policy.
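To make the contrast with binary success concrete, here is a minimal Python sketch. The episode records, the field names (`understanding`, `perception`, `control`), and all numbers are hypothetical illustrations, not MetaFine's actual schema or results. It shows how one aggregate success rate can look healthy while individual axes fail often.

```python
from dataclasses import dataclass

# Hypothetical axis names mirroring the three capacities named in the summary.
AXES = ("understanding", "perception", "control")


@dataclass(frozen=True)
class Episode:
    success: bool        # conventional binary outcome
    understanding: bool  # hypothetical per-axis checks
    perception: bool
    control: bool


def binary_success_rate(episodes):
    """Fraction of episodes marked successful by the binary metric."""
    return sum(e.success for e in episodes) / len(episodes)


def axis_pass_rates(episodes):
    """Fraction of episodes passing each axis check, reported separately."""
    n = len(episodes)
    return {axis: sum(getattr(e, axis) for e in episodes) / n for axis in AXES}


def strict_success_rate(episodes):
    """Fraction of episodes that succeed and also pass every axis check."""
    passed = sum(e.success and all(getattr(e, a) for a in AXES) for e in episodes)
    return passed / len(episodes)


# Made-up rollouts: four of five "succeed", but only one passes all checks.
episodes = [
    Episode(True, True, True, True),
    Episode(True, True, False, True),
    Episode(True, True, False, False),
    Episode(True, True, True, False),
    Episode(False, True, False, False),
]

print(binary_success_rate(episodes))
print(axis_pass_rates(episodes))
print(strict_success_rate(episodes))
```

In this toy data the binary rate is driven by episodes that reach the goal while failing a perception or control check, which is the kind of masking the summary describes.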

Core claim

Through causal interventions on vision-language-action models, the work identifies the visual encoder's capacity to preserve local spatial structure as a key bottleneck for fine-grained manipulation precision. Enhancing this capacity alone unlocks previously inaccessible behaviors in tasks that require tight coupling of local attribute grounding and constraint-respecting motor execution.

What carries the argument

MetaFine, a diagnostic meta-evaluation framework built on a compositional task graph that absorbs existing benchmarks and reconstructs them into unified scenarios of graded complexity.

Load-bearing premise

The compositional task graph can absorb heterogeneous external benchmarks and turn them into diagnostic scenarios of varying complexity without introducing artifacts or biases.

What would settle it

Apply an intervention that specifically improves local spatial preservation in a visual encoder and check whether fine-grained success rates rise substantially in the MetaFine scenarios while other model components remain unchanged.
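One way to analyze such a test is a paired comparison over the same scenarios, sketched below in Python. The outcome lists are invented, and the paired bootstrap is an assumed choice of analysis rather than the paper's stated procedure. The code covers only the statistics; keeping the other model components unchanged is a property of the experimental setup.

```python
import random


def paired_bootstrap_diff(baseline, improved, n_resamples=2000, alpha=0.05, seed=0):
    """Success-rate gain (improved minus baseline) with a percentile bootstrap CI.

    Both inputs are 0/1 outcomes for the SAME scenarios, in the same order.
    """
    if len(baseline) != len(improved) or not baseline:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(baseline)
    deltas = [int(b) - int(a) for a, b in zip(baseline, improved)]
    point = sum(deltas) / n

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stats = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Hypothetical outcomes on 20 scenarios: 6 successes before, 14 after.
baseline = [1] * 6 + [0] * 14
improved = [1] * 14 + [0] * 6

point, lo, hi = paired_bootstrap_diff(baseline, improved)
print(f"gain = {point:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

A gain whose interval excludes zero, obtained with only the encoder changed, would count as evidence for the claim; an interval straddling zero would not.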

Original abstract

Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution. However, current embodied AI benchmarks collapse these capacities into binary success rates, systematically inflating reported capabilities by up to 70% and masking the architectural bottlenecks that impede real-world deployment. We introduce MetaFine, a diagnostic meta-evaluation framework that disentangles manipulation competency along three axes: understanding, perception, and controlled behavior. Built on a compositional task graph, MetaFine absorbs heterogeneous external benchmarks and reconstructs them into diagnostic scenarios of varying complexity under a unified protocol. Evaluating state-of-the-art vision-language-action (VLA) models through this lens exposes severe dimension-specific failures invisible to conventional metrics. Through targeted causal intervention, we identify the visual encoder's ability to preserve local spatial structure as a key bottleneck for fine-grained precision: improving it directly unlocks previously inaccessible manipulation capabilities without modifying downstream policies. MetaFine further supports hybrid real-sim validation, using limited paired real-world rollouts to calibrate scalable simulation-based estimates for more stable physical benchmarking. By shifting evaluation from ranking to diagnosis, MetaFine turns benchmarking into an actionable compass for repairing the layered capacities underlying genuine physical dexterity. The MetaFine framework, benchmarks, and supporting resources will be publicly released at our project page: https://metafine.github.io/.
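The hybrid real-sim idea in the abstract can be illustrated with a simple bias-corrected estimate. The abstract does not state which estimator MetaFine uses, so the Python sketch below assumes a prediction-powered-style mean correction: a large set of simulated rollouts gives the base rate, and a small set of paired real and simulated rollouts estimates the sim-to-real bias. The function name and all data are hypothetical.

```python
def calibrated_success_rate(sim_all, sim_paired, real_paired):
    """Simulation success rate corrected by the bias seen on paired rollouts.

    sim_all:     0/1 outcomes from many simulated rollouts
    sim_paired:  0/1 simulated outcomes for the scenarios also run on hardware
    real_paired: 0/1 real-world outcomes for those same scenarios
    """
    if not sim_all or not sim_paired or len(sim_paired) != len(real_paired):
        raise ValueError("need non-empty inputs and equal-length paired lists")
    sim_mean = sum(sim_all) / len(sim_all)
    # Average real-minus-sim gap on the paired subset (assumed estimator).
    bias = sum(r - s for r, s in zip(real_paired, sim_paired)) / len(sim_paired)
    # Clamp so the result stays a valid rate.
    return min(1.0, max(0.0, sim_mean + bias))


# Hypothetical data: simulation is optimistic relative to the real robot.
sim_all = [1] * 70 + [0] * 30                    # 70% success in simulation
sim_paired = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]      # 7 of 10 in simulation
real_paired = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]     # 5 of 10 on hardware

estimate = calibrated_success_rate(sim_all, sim_paired, real_paired)
print(round(estimate, 3))
```

The paired subset only has to be large enough to estimate the gap, which is why a limited number of real rollouts can calibrate a much larger simulated evaluation.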

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MetaFine gives a practical diagnostic split for manipulation failures and flags the visual encoder's spatial handling as a bottleneck, but the reconstruction step needs checks to rule out artifacts.

The main thing here is that this paper moves evaluation past binary success rates by building MetaFine, a framework that splits fine-grained manipulation into understanding, perception, and controlled behavior, then uses causal interventions to tie the visual encoder's local spatial preservation to performance limits that can be fixed without policy changes. The compositional task graph that pulls in outside benchmarks and rebuilds them into graded scenarios is the concrete new piece, along with the hybrid real-sim calibration for more stable physical estimates. Public release of the framework and resources is a straightforward plus for anyone who wants to try it. The approach does surface dimension-specific failures that standard metrics hide, which aligns with the field's need for more actionable diagnostics. The central claim holds up on its own terms if the interventions are clean, and the shift from ranking to diagnosis is a useful reframing.

The softer part is the reconstruction process itself. Absorbing heterogeneous benchmarks into a single protocol risks altering spatial fidelity or complexity in ways that could confound the causal attribution to the encoder. If the task graph introduces systematic biases in local attribute grounding, the finding that encoder fixes unlock capabilities without downstream changes would need extra controls to stand. The abstract is light on methods details and stats, so the full paper's error bars and fidelity checks matter for how much weight the bottleneck result carries.

This is for embodied AI researchers working on VLA models who want better ways to debug dexterity issues rather than just chase higher scores. A reader focused on evaluation methods or robot deployment would get direct value from the diagnostic lens and the released tools. It deserves peer review because the framework idea is timely and grounded enough in the manipulation problem to warrant referee input, even if the causal story would benefit from tighter validation on the reconstruction side.

Referee Report

2 major / 1 minor

Summary. The paper introduces MetaFine, a diagnostic meta-evaluation framework for fine-grained manipulation in vision-language-action (VLA) models. It argues that binary success rates in existing embodied AI benchmarks inflate reported capabilities by up to 70% and obscure architectural bottlenecks. MetaFine uses a compositional task graph to absorb heterogeneous external benchmarks and reconstruct them into diagnostic scenarios of varying complexity under a unified protocol, disentangling competencies along understanding, perception, and controlled behavior axes. Evaluations expose dimension-specific failures, and targeted causal interventions identify the visual encoder's preservation of local spatial structure as the key bottleneck whose improvement unlocks new manipulation capabilities without changes to downstream policies. The framework also supports hybrid real-sim validation using limited paired rollouts to calibrate simulation estimates.

Significance. If the reconstruction process and causal interventions hold without introducing artifacts, this work would be significant for the robotics and embodied AI community. Shifting evaluation from aggregate rankings to dimension-specific diagnosis could guide targeted repairs to components like visual encoders, improving real-world fine-grained dexterity and providing an actionable benchmarking compass beyond current binary metrics.

major comments (2)

Abstract: The central claim that targeted causal intervention isolates the visual encoder's local spatial structure preservation as the key bottleneck (whose improvement unlocks capabilities without policy changes) depends on the compositional task graph faithfully disentangling the axes. The description of how heterogeneous benchmarks are absorbed and reconstructed lacks explicit validation (e.g., comparisons of spatial fidelity distributions or attribute grounding before/after reconstruction) to rule out systematic biases that could confound the causal attribution.
Abstract: The reported up to 70% inflation of capabilities by conventional metrics underpins the motivation and the 'severe dimension-specific failures' finding; this requires concrete quantification, including the exact benchmarks compared, the definition of the inflation metric, and supporting statistics such as error bars or sample sizes, as these are load-bearing for the framework's diagnostic value.
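As an illustration of the error bars requested above, a success rate from a finite number of rollouts can be reported with a Wilson score interval. The Python sketch below uses made-up counts (12 successes in 50 trials); it is a standard binomial interval, not a procedure taken from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("require trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical example: 12 successful rollouts out of 50.
lo, hi = wilson_interval(12, 50)
print(f"success rate 0.24, 95% interval [{lo:.3f}, {hi:.3f}]")
```

With only 50 rollouts the interval spans roughly 0.14 to 0.37, which shows why sample sizes matter when comparing a binary score against a diagnostic one.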

minor comments (1)

Abstract: The manuscript would benefit from a brief outline of the unified protocol details (e.g., complexity calibration rules or data handling for heterogeneous inputs) to support reproducibility claims, even if full methods appear later.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of clarity and validation that we address point by point below. We have prepared revisions to strengthen the manuscript accordingly.

Referee: Abstract: The central claim that targeted causal intervention isolates the visual encoder's local spatial structure preservation as the key bottleneck (whose improvement unlocks capabilities without policy changes) depends on the compositional task graph faithfully disentangling the axes. The description of how heterogeneous benchmarks are absorbed and reconstructed lacks explicit validation (e.g., comparisons of spatial fidelity distributions or attribute grounding before/after reconstruction) to rule out systematic biases that could confound the causal attribution.

Authors: We agree that explicit validation is necessary to support the causal attribution. In the revised manuscript we will add a dedicated validation subsection (with new figures) that reports quantitative comparisons of spatial fidelity distributions and attribute grounding accuracy before versus after reconstruction across the absorbed benchmarks. These results will confirm that the compositional task graph preserves the relevant properties and does not introduce systematic biases that could confound the identification of the visual encoder as the bottleneck.
revision: yes

Referee: Abstract: The reported up to 70% inflation of capabilities by conventional metrics underpins the motivation and the 'severe dimension-specific failures' finding; this requires concrete quantification, including the exact benchmarks compared, the definition of the inflation metric, and supporting statistics such as error bars or sample sizes, as these are load-bearing for the framework's diagnostic value.

Authors: The 70% figure is obtained from the diagnostic versus binary evaluations reported in Section 4 and the associated tables. To address the request for concreteness, we will revise the abstract to briefly define the inflation metric (relative difference between binary success rate and the fine-grained diagnostic score) and will add explicit references to the exact benchmarks, sample sizes, and error bars already present in the main results. This makes the supporting evidence immediately accessible without altering the reported value.
revision: yes
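The inflation metric described in this response can be written down in a few lines. The rebuttal says only "relative difference", so the Python sketch below assumes the difference is normalized by the binary success rate. The function name and the example numbers are hypothetical and are not the paper's reported values.

```python
def relative_inflation(binary_success_rate, diagnostic_score):
    """Share of the binary success rate not backed by the diagnostic score.

    Assumed definition: (binary - diagnostic) / binary.
    """
    if not 0 < binary_success_rate <= 1:
        raise ValueError("binary_success_rate must be in (0, 1]")
    if not 0 <= diagnostic_score <= 1:
        raise ValueError("diagnostic_score must be in [0, 1]")
    return (binary_success_rate - diagnostic_score) / binary_success_rate


# Hypothetical numbers: 80% binary success, 24% fine-grained diagnostic score.
inflation = relative_inflation(0.80, 0.24)
print(f"{inflation:.0%}")
```

Under this definition, a binary rate of 0.80 against a diagnostic score of 0.24 corresponds to 70% inflation. Normalizing by the diagnostic score instead would give a different number, which is why the referee asks for the definition to be stated explicitly.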

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper introduces MetaFine as a diagnostic meta-evaluation framework that absorbs external benchmarks via a compositional task graph and performs causal interventions to identify bottlenecks. No equations, derivations, or fitted parameters appear that reduce by construction to the framework's own inputs or prior self-citations. The central claim regarding the visual encoder's local spatial structure preservation is presented as an empirical outcome of targeted interventions rather than a self-definitional or statistically forced result. The framework description remains self-contained against external benchmarks, consistent with the default expectation that most papers exhibit no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on a domain assumption about fine-grained manipulation requirements and introduces new evaluation structures with no numerical free parameters or independently evidenced entities.

axioms (1)

domain assumption: Fine-grained manipulation marks a regime where global scene context no longer suffices, and success hinges on the tight coupling of local attribute grounding, high-fidelity spatial perception, and constraint-respecting motor execution.
Invoked in the abstract opening to define the target regime.

invented entities (1)

MetaFine diagnostic meta-evaluation framework (no independent evidence)
purpose: To disentangle manipulation competency along three axes and reconstruct benchmarks for diagnosis.
Newly proposed in this work.

pith-pipeline@v0.9.0 · 11481 in / 1234 out tokens · 86599 ms · 2026-05-20T04:48:54.642733+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "visual encoder’s ability to preserve local spatial structure as a key bottleneck"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Nature Machine Intelligence7(4), 592–601 (2025)

Mon-Williams, R., Li, G., Long, R., Du, W., Lucas, C.G.: Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence7(4), 592–601 (2025)

work page 2025

[2] [2]

In: International Conference on Learning Representations (ICLR) (2026)

Yu, H.-T., Peng, Y., Belongie, S., Wei, X.-S.: Benchmarking large vision-language models on fine-grained image tasks: A comprehensive evaluation. In: International Conference on Learning Representations (ICLR) (2026)

work page 2026

[3] [3]

IEEE Transactions on Pattern Analysis and Machine Intelligence44(12), 8927–8948 (2022)

Wei, X.-S., Song, Y.-Z., Mac Aodha, O., Wu, J., Peng, Y., Tang, J., Yang, J., Belongie, S.: Fine-grained image analysis with deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence44(12), 8927–8948 (2022)

work page 2022

[4] [4]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Fei, S., Wang, S., Shi, J., Dai, Z., Cai, J., Qian, P., Ji, L., He, X., Zhang, S., Fei, Z., Fu, J., Gong, J., Qiu, X.: LIBERO-Plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Science 364(6446), 8414 (2019)

Billard, A., Kragic, D.: Trends and challenges in robot manipulation. Science 364(6446), 8414 (2019)

work page 2019

[6] [6]

Nature Machine Intelligence8, 158– 172 (2026)

Li, X., Li, P., Qian, L., Liu, M., Wang, D., Liu, J., Kang, B., Ma, X., Wang, X., Guo, D., Kong, T., Zhang, H., Liu, H.: What matters in building vision– language–action models for generalist robots. Nature Machine Intelligence8, 158– 172 (2026)

work page 2026

[7] [7]

In: Proceedings of the 2020 Conference on Robot Learning (CoRL)

Zeng, A., Florence, P., Tompson, J., Welker, S., Chien, J., Attarian, M., Arm- strong, T., Krasin, I., Duong, D., Sindhwani, V., Lee, J.: Transporter Networks: Rearranging the visual world for robotic manipulation. In: Proceedings of the 2020 Conference on Robot Learning (CoRL). Proceedings of Machine Learning Research, vol. 155, pp. 726–747 (2021)

work page 2020

[8] [8]

In: Proceedings of The 8th Conference on Robot Learning (CoRL)

Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E.P., Sanketi, P.R., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: OpenVLA: An open- source vision-language-action model. In: Proceedings of The 8th Conference on Robot Learning (CoRL). Proceedings of Machine...

[9]

Black, K., Brown, N., Driess, D., Esmail, A., Equi, M.R., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L.X., Smith, L., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U.: π0: A vision-language-action flow model for general robot cont... In: Proceedings of Robotics: Science and Systems (RSS) (2025)

[10]

Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. In: Proceedings of Robotics: Science and Systems (RSS) (2024)

[11]

Zhu, F., Wu, H., Guo, S., Liu, Y., Cheang, C., Kong, T.: Irasim: A fine-grained world model for robot manipulation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9834–9844 (2025)

[12]

Hendrycks, D., Dietterich, T.: Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations (ICLR) (2019)

[13]

Ma, G., Wang, Z., Yuan, Z., Wang, X., Yuan, B., Tao, D.: A comprehensive survey of data augmentation in visual reinforcement learning. International Journal of Computer Vision 133(10), 7368–7405 (2025)

[14]

Flash, T., Hogan, N.: The coordination of arm movements: an experimentally confirmed mathematical model. Journal of Neuroscience 5(7), 1688–1703 (1985)

[15]

Mu, Y., Chen, T., Chen, Z., Peng, S., Lan, Z., Gao, Z., Liang, Z., Yu, Q., Zou, Y., Xu, M., Lin, L., Xie, Z., Ding, M., Luo, P.: RoboTwin: Dual-arm robot benchmark with generative digital twins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27649–27660 (2025)

[16]

Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters 7(3), 7327–7334 (2022)

[17]

Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

[18]

Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. In: Robotics: Science and Systems (RSS) (2023)

[19]

Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

[20]

Kim, M.J., Finn, C., Liang, P.: Fine-tuning vision-language-action models: Optimizing speed and success. In: Proceedings of Robotics: Science and Systems (RSS) (2025)

[21]

Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M.R., Finn, C., Fusai, N., Galliker, M.Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Ren, A.Z., Shi, L.X., Smith, L., Springenberg, J.T., Stachowicz, K., Ta... In: Proceedings of The 9th Conference on Robot Learning (CoRL)

[22]

Angelopoulos, A.N., Bates, S., Candès, E.J., Jordan, M.I., Zrnic, T.: Prediction-powered inference. Science 382(6671), 669–674 (2023)

[23]

Gu, J., Xiang, F., Li, X., Ling, Z., Liu, X., Mu, T., Tang, Y., Tao, S., Wei, X., Yao, Y., Yuan, X., Xie, P., Huang, Z., Chen, R., Su, H.: ManiSkill2: A unified benchmark for generalizable manipulation skills. In: International Conference on Learning Representations (ICLR) (2023)

[24]

Song, Y., Sun, P., Jin, P., Ren, Y., Zheng, Y., Li, Z., Chu, X., Zhang, Y., Li, T., Gu, J.: Learning 6-DoF fine-grained grasp detection based on part affordance grounding. IEEE Transactions on Automation Science and Engineering 22, 15200–15214 (2025)

[25]

Huang, H., Lin, F., Hu, Y., Wang, S., Gao, Y.: CoPa: General robotic manipulation through spatial constraints of parts with foundation models. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 9488–9495 (2024)

[26]

Singh, G., Kalwar, S., Karim, M.F., Sen, B., Govindan, N., Sridhar, S., Krishna, K.M.: Constrained 6-DoF grasp generation on complex shapes for improved dual-arm manipulation. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7344–7350 (2024)

[27]

Chi, C., et al.: Diffusion policy: Visuomotor policy learning via action diffusion. In: Robotics: Science and Systems (RSS) (2023)

[28]

Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 909–918 (2019)

[29]

Yin, Y., Han, Z., Aarya, S., Xu, S., Wang, J., Peng, J., Wang, A., Yuille, A., Shu, T.: PartInstruct: Part-level instruction following for fine-grained robot manipulation. In: Proceedings of Robotics: Science and Systems (RSS) (2025)

[30]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 139, pp. 87...

[31]

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visua... Transactions on Machine Learning Research (2024)

[32]

Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.-W.E., Leal, I., Kuang, Y.,... In: Proceedings of The 7th Conference on Robot Learning (CoRL)

[33]

James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters 5(2), 3019–3026 (2020)

[34]

Xie, A., Lee, L., Xiao, T., Finn, C.: Decomposing the generalization gap in imitation learning for visual robotic manipulation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 3153–3160 (2024)

[35]

Li, X., Hsu, K., Gu, J., Mees, O., Pertsch, K., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., Levine, S., Wu, J., Finn, C., Su, H., Vuong, Q., Xiao, T.: Evaluating real-world robot manipulation policies in simulation. In: Proceedings of The 8th Conference on Robot Learning (CoRL). Proceedings of Machine Learning Research, vol. 270, pp. 3705–3...

[36]

Torne, M., Simeonov, A., Li, Z., Chan, A., Chen, T., Gupta, A., Agrawal, P.: Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation. In: Proceedings of Robotics: Science and Systems (RSS) (2024)

[37]

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian Splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023)

[38]

Li, X., Li, J., Zhang, Z., Zhang, R., Jia, F., Wang, T., Fan, H., Tseng, K.-K., Wang, R.: RoboGSim: A real2sim2real robotic Gaussian Splatting simulator. arXiv preprint arXiv:2411.11839 (2024)

[39]

Han, X., Liu, M., Chen, Y., Yu, J., Lyu, X., Tian, Y., Wang, B., Zhang, W., Pang, J.: Re3Sim: Generating high-fidelity simulation data via 3D-photorealistic real-to-sim for robotic manipulation. arXiv preprint arXiv:2502.08645 (2025)

[40]

Zhang, K., Sha, S., Jiang, H., Loper, M., Song, H., Cai, G., Xu, Z., Hu, X., Zheng, C., Li, Y.: Real-to-sim robot policy evaluation with Gaussian Splatting simulation of soft-body interactions. arXiv preprint arXiv:2511.04665 (2025)

[41]

Sedlacek, M., Yefanov, P., Ponimatkin, G., Bardhan, J., Pilc, S., Fourmy, M., Kazakos, E., Snoek, C.G.M., Sivic, J., Petrik, V.: REALM: A real-to-sim validated benchmark for generalization in robotic manipulation. arXiv preprint arXiv:2512.19562 (2025)

[42]

Nasiriany, S., Maddukuri, A., Zhang, L., Parikh, A., Lo, A., Joshi, A., Mandlekar, A., Zhu, Y.: RoboCasa: Large-scale simulation of everyday tasks for generalist robots. In: Proceedings of Robotics: Science and Systems (RSS) (2024)

Appendix Overview

This appendix provides additional details on MetaFine, extended experimental results, and architectural ...
