VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models

Alex S. Huang; Jiahui Zhang; Shiqing Tang; Yu Xiang

arxiv: 2605.20774 · v1 · pith:BIK2UAXAnew · submitted 2026-05-20 · 💻 cs.RO


Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3

classification 💻 cs.RO

keywords Vision-Language-Action · robotic manipulation benchmark · real-world evaluation · reproducibility · imitation learning · domain adaptation · off-the-shelf components

The pith

VLA-REPLICA offers a low-cost benchmark using standard parts for consistent real-world testing of vision-language-action models worldwide.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models promise general robotic manipulation but lack accessible real-world tests: simulations miss real-world complexity, and existing real-world benchmarks demand expensive hardware or centralized facilities. The paper presents VLA-REPLICA as an alternative built from off-the-shelf components that labs can assemble quickly and replicate to produce matching environments. It supplies a range of manipulation tasks along with a small demonstration dataset to support adaptation to the target setup, and it defines protocols for assessing models under both in-distribution and out-of-distribution conditions. Experiments with imitation learning and current VLA models expose their capabilities and gaps, while matching results from independently built setups support the claim that the benchmark behaves consistently. A reader would care because this setup could let more groups run fair, real-world comparisons without high costs or custom builds.

Core claim

VLA-REPLICA is a benchmark for real-world evaluation of vision-language-action models that is constructed from off-the-shelf components so it can be quickly assembled and replicated across laboratories to create a consistent evaluation environment. The benchmark features a diverse suite of manipulation tasks, a small-scale demonstration dataset for target-domain adaptation, and real-world evaluation protocols for in-distribution and out-of-distribution settings. Experiments reveal model strengths and limitations, and consistent results across independently constructed setups demonstrate the benchmark's reproducibility.

What carries the argument

The VLA-REPLICA benchmark assembled from off-the-shelf components to enable quick replication and consistent policy evaluation across different laboratories.

If this is right

Researchers worldwide can evaluate VLA models in real-world settings using the same tasks and protocols.
Performance differences between models can be attributed to the models themselves rather than varying hardware setups.
Small demonstration datasets allow models to adapt to the specific benchmark environment before testing.
Protocols support testing both on tasks similar to training and on out-of-distribution scenarios.
Imitation learning and state-of-the-art VLA models can be compared directly on the same real-world platform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could encourage broader participation in VLA research by lowering the entry cost for real-world experiments.
Standardizing on such a benchmark might help the field converge on reliable progress metrics beyond simulation.
Future extensions might include more complex tasks or integration with additional sensors using the same base components.
Consistent replication might allow crowdsourced data collection or collaborative model training across sites.

Load-bearing premise

Different groups can assemble the off-the-shelf components into environments that are similar enough that any performance differences come from the models and not from variations in the physical setups.

What would settle it

Running the same VLA model on two separately assembled VLA-REPLICA systems and observing substantially different success rates on the manipulation tasks would show the benchmark lacks sufficient reproducibility.
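
The comparison described above can be sketched as a significance test on success counts from two builds. The snippet below is a minimal sketch, not the paper's protocol: it assumes each build reports a number of successful trials out of a fixed number of attempts for the same policy, and it uses a two-sided Fisher exact test as one reasonable choice. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(s1: int, n1: int, s2: int, n2: int) -> float:
    """Two-sided Fisher exact p-value for s1/n1 successes vs. s2/n2 successes."""
    total_s = s1 + s2          # successes pooled over both builds
    total_n = n1 + n2          # trials pooled over both builds

    def prob(k: int) -> float:
        # Hypergeometric probability that build 1 holds k of the pooled successes.
        return comb(total_s, k) * comb(total_n - total_s, n1 - k) / comb(total_n, n1)

    lo = max(0, n1 - (total_n - total_s))
    hi = min(n1, total_s)
    p_obs = prob(s1)
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts: the same policy succeeds 15/20 times on build A and 13/20 on build B.
p = fisher_exact_two_sided(15, 20, 13, 20)
print(f"p-value for a difference between builds: {p:.3f}")
```

A large p-value only means the data do not show a difference; with few trials per build the test has little power, so a non-significant result is weak evidence of reproducibility on its own.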

Figures

Figures reproduced from arXiv: 2605.20774 by Alex S. Huang, Jiahui Zhang, Shiqing Tang, Yu Xiang.

**Figure 2.** VLA-REPLICA standardized method ensuring reproducibility. (a) Align the task space, …

**Figure 3.** (a) Examples of expert demonstrations collected in our dataset. (b) Examples of reference …

**Figure 4.** For the task Open the Oven, the blue dots indicate the oven center locations in the training set. The green star indicates the selected oven location for a test scene. To enable standardized and reproducible evaluation, we define a test scene as the initial configuration of all objects in the workspace, including both target and distractor objects. The VLA-REPLICA benchmark provides a total of 90 test s…

**Figure 5.** Original and reproduced setups.

5 Conclusion & Limitation. We introduced VLA-REPLICA, a low-cost and reproducible real-world benchmark for evaluating vision-language-action (VLA) models. Our benchmark combines an affordable hardware setup, standardized environment design, and a unified evaluation protocol covering both in-distribution adaptation and out-of-distribution generalization. Experiments with imita…
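
The test-scene definition quoted in the Figure 4 caption (the initial configuration of all objects, both target and distractor) can be pictured as a small data structure. This is an illustrative sketch only: the class names, the planar `position_xy` field, and the coordinates are assumptions and do not come from the benchmark's code or data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObjectPlacement:
    name: str
    position_xy: Tuple[float, float]  # hypothetical planar position in the workspace
    is_target: bool                   # True for task objects, False for distractors


@dataclass(frozen=True)
class SceneSpec:
    task: str
    objects: Tuple[ObjectPlacement, ...]

    def targets(self) -> List[ObjectPlacement]:
        return [o for o in self.objects if o.is_target]

    def distractors(self) -> List[ObjectPlacement]:
        return [o for o in self.objects if not o.is_target]


# Hypothetical scene for the "Open the Oven" task; the coordinates are made up.
scene = SceneSpec(
    task="Open the Oven",
    objects=(
        ObjectPlacement("oven", (0.10, 0.25), is_target=True),
        ObjectPlacement("cup", (0.30, 0.05), is_target=False),
    ),
)
print(len(scene.targets()), "target,", len(scene.distractors()), "distractor")
```

Freezing the dataclasses reflects the idea that a test scene is a fixed specification that every lab restores before each trial.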

Original abstract

Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VLA-REPLICA gives a workable low-cost kit for real-world VLA testing that smaller labs could actually build, but the reproducibility claims need concrete numbers on hardware variation to hold up.

This paper's main point is that a cheap, off-the-shelf robotics setup can let different labs run comparable real-world tests on vision-language-action models for tabletop tasks. The idea is practical and could help move evaluation beyond simulation or expensive centralized rigs. The authors back it with experiments on imitation learning and current VLA models plus a small demonstration dataset for adaptation, and they report consistent outcomes across separate builds for both in-distribution and out-of-distribution cases. That combination of low cost, task diversity, and dual evaluation protocols is the clearest new element here compared with prior real-world benchmarks. It directly addresses the access problem that has kept many groups from doing rigorous physical testing. The paper does a reasonable job describing the assembly process and showing where the tested models perform well or struggle, which gives readers something concrete to build on.

The central claim of reproducibility across independent setups is the part that needs more scrutiny. The abstract states that results were consistent, yet it offers no measurements of actual differences in camera calibration, arm repeatability, gripper behavior, or lighting between the builds, and no per-setup success rates with variation shown. Without those details it is difficult to judge whether the consistency comes from robust task design or from setups that were not different enough. This matters because the whole value of the benchmark rests on performance gaps reflecting the models rather than hardware artifacts.

The work is aimed at robotics researchers who want to test VLA policies without big budgets or shared facilities. Groups already running manipulation experiments or designing their own benchmarks would find the most immediate use.

It deserves peer review because the core idea is useful and the experiments are a start, but referees should ask for the missing quantitative checks on inter-build variance before the reproducibility story can be taken as settled.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces VLA-REPLICA, a low-cost real-world benchmark for evaluating Vision-Language-Action (VLA) models. Built from off-the-shelf components for quick assembly and replication across laboratories, it provides a consistent environment with a diverse suite of manipulation tasks, a small-scale demonstration dataset for target-domain adaptation, and protocols for in-distribution and out-of-distribution evaluation. Experiments using imitation learning and state-of-the-art VLA models are reported, along with claims of consistent results across independently constructed setups.

Significance. If the reproducibility claims are substantiated with quantitative evidence of low inter-build variation, the benchmark would represent a meaningful advance by enabling accessible, standardized real-world testing of VLA models without reliance on expensive or centralized hardware. The combination of task diversity, adaptation data, and in/out-of-distribution protocols could help the community better assess generalization, addressing gaps in both simulation and prior real-world benchmarks.

major comments (1)

[Abstract and Experiments section] The central claim that 'consistent results across independently constructed setups demonstrate the reproducibility of our benchmark' is load-bearing for the contribution, yet the manuscript provides no quantitative characterization of inter-build variation. There are no reported measurements of camera intrinsics/extrinsics differences, gripper force calibration spread, arm positioning repeatability, lighting/background variance, or per-setup success rates with standard deviations for the same policy across multiple independent assemblies. Without these, it is unclear whether observed consistency reflects robust task design or insufficient hardware diversity, directly affecting the assertion that performance gaps will reflect model quality rather than setup artifacts.

minor comments (2)

[Benchmark description] The description of the 'small-scale demonstration dataset' lacks specifics on its size, collection protocol, number of demonstrations per task, and exact usage for target-domain adaptation; adding these details would improve clarity.
[Results] Ensure that success-rate tables or figures include error bars or standard deviations across trials and, where relevant, across independent builds to support the reproducibility narrative.
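
The error bars requested in the comments above could be computed along the following lines. This is a hedged sketch rather than the authors' analysis: it assumes each build reports successes out of a fixed number of trials, uses a Wilson score interval for the per-build uncertainty, and reports the sample standard deviation of success rates across builds. The function names, build labels, and counts are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def summarize_builds(results: Dict[str, Tuple[int, int]]) -> dict:
    """results maps a build name to (successes, trials) for one policy."""
    rates = {name: s / n for name, (s, n) in results.items()}
    values = list(rates.values())
    return {
        "per_build": rates,
        "ci95": {name: wilson_interval(s, n) for name, (s, n) in results.items()},
        "mean": mean(values),
        # Sample standard deviation across builds; undefined for a single build.
        "std": stdev(values) if len(values) > 1 else 0.0,
    }


# Hypothetical counts for one policy evaluated on two independent builds.
summary = summarize_builds({"build_A": (8, 10), "build_B": (6, 10)})
print(summary["mean"], summary["std"], summary["ci95"]["build_A"])
```

The Wilson interval is used here instead of the simpler normal approximation because it stays within [0, 1] and behaves sensibly with the small trial counts typical of real-robot evaluation.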

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that quantitative evidence is needed to substantiate the reproducibility claims and will revise the manuscript to address this.

Referee: [Abstract and Experiments section] The central claim that 'consistent results across independently constructed setups demonstrate the reproducibility of our benchmark' is load-bearing for the contribution, yet the manuscript provides no quantitative characterization of inter-build variation. There are no reported measurements of camera intrinsics/extrinsics differences, gripper force calibration spread, arm positioning repeatability, lighting/background variance, or per-setup success rates with standard deviations for the same policy across multiple independent assemblies. Without these, it is unclear whether observed consistency reflects robust task design or insufficient hardware diversity, directly affecting the assertion that performance gaps will reflect model quality rather than setup artifacts.

Authors: We agree that the absence of quantitative inter-build variation metrics weakens the reproducibility claim. The manuscript reports only that results were consistent across two independently assembled setups without providing the specific measurements or statistical characterizations requested. In the revised manuscript we will add a new subsection in the Experiments section that reports: measured differences in camera intrinsics/extrinsics, gripper force calibration spread, arm positioning repeatability, lighting and background variance, and per-setup success rates (with means and standard deviations) for the same policies evaluated on multiple independent assemblies. These additions will clarify that observed performance differences primarily reflect model quality rather than setup artifacts.

Revision: yes

Circularity Check

0 steps flagged

No circularity in empirical benchmark introduction

The paper introduces VLA-REPLICA as a new real-world benchmark assembled from off-the-shelf components, including a task suite, small demonstration dataset, and evaluation protocols for in- and out-of-distribution settings. It reports experimental results with imitation learning and VLA models plus consistency across independent setups. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling appear in the abstract or described content. The work is self-contained as an empirical contribution rather than a mathematical derivation, making circularity analysis inapplicable and yielding a clean finding of none.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper adds a concrete hardware and protocol specification rather than new theoretical entities or fitted constants; it relies on the domain assumption that standardized physical setups can be achieved with consumer parts.

axioms (1)

Domain assumption: Real-world robotic evaluation benefits from standardized physical hardware that different laboratories can replicate.
Invoked to justify the need for a low-cost reproducible benchmark instead of simulation or centralized facilities.

pith-pipeline@v0.9.0 · 5711 in / 1211 out tokens · 33641 ms · 2026-05-21T04:55:16.924828+00:00 · methodology

A VLA-REPLICA Benchmark Setup Instructions

This section provides step-by-step instructions for reliably reproducing our benchmark environment across different labor...
Table A.1: Parts list for the benchmark setup.
Qty     Item
1       Glendan 32×32 in box set (box tarp, 12× PVC pipes, 8× PVC edge connectors, white PP background sheet, white light diffuser sheet, 3× LED panel set, power cables)
1       Intel RealSense D455
1 set   3-D printed camera mount (1× backplate, 2× snap-hoo...

Print two copies of the snap-hook part (Part1.stl).

Print one copy of the camera backplate (Part2.stl).

Attach one snap-hook (Part 1) to the backplate (Part 2) using one M3×6 mm screw. Repeat for the second hook. The assembled unit is referred to as Part 3 (Fig. A.2(a)).

Screw Part 3 tightly to the rear mounting holes of the D455 camera using two M4×6 mm screws (Fig. A.2(b)).

To prevent the hooks from sliding on the PVC pipe, attach a small piece of rubber grip tape to the inside of each hook (Fig. A.2(c)). Do not over-tighten the screws or apply excessive force to the hooks, as they may snap. Note: the CAD file for the snap-hook has an inner diameter that matches the outer diameter of the PVC pipe for the Glendan light box. Othe...

If applicable (i.e., the SO-101 did not come pre-assembled), follow LeRobot's SO-101 documentation page to assemble the SO-101 Follower arm (https://huggingface.co/docs/lerobot/so101). Don't calibrate the assembled SO-101 arm yet.

Follow TheRobotStudio's page to print and set up the wrist camera mount with the Vinmooog webcam (https://github.com/TheRobotStudio/SO-ARM100/tree/main/Optional/Wrist_Cam_Mount_Vinmooog_Webcam).

Secure the camera mount onto the end-effector of the SO-101 with one M3×12 mm screw and the M3 nut.

Important Checklist:
□ Both snap-hooks are attached with M3 screws and sit flush against the backplate.
□ The mount is fastened to the D455 with M4 screws; the camera does not wobble.
□ Rubber grip tape is applied to the inside of both hooks.
□ The SO-101 fo...
Construct the cube-shaped PVC frame using the 12 pipes and 8 edge connectors supplied with the Glendan kit. Do not attach the zipper tarp yet; complete all internal installations first.

Attach the white light diffuser sheet to the top (+z) face of the frame using the supplied velcro strips. Secure the sheet using the velcro strips on both the −y and +y pipes, as close as possible to the +z face (Fig. A.4(a)).

Attach the provided white hooks onto the LED panels and then mount the three LED panels on the PVC frame before fitting the tarp (Fig. A.4(b)):
(a) LED Panel 1: +z face, pointed downward toward −z; center ≈ 7.5 inches from the −y face.
(b) LED Panel 2: +z face, pointed downward toward −z; center ≈ 7.5 inches from the +y face (mirror image of Strip 1).
(c) LED P...

Slide the zipper tarp over the completed PVC/LED frame, ensuring that the −y face of the frame matches the side of the tarp with the zippers. Ensure that the −z side (the object workspace) is actually on the bottom.

Attach the white PP background sheet to the inner +y face using the velcro strips on the tarp and the sheet (Fig. A.4(c)):
(a) Tuck the sheet under the −z PVC pipes so that the workspace is as flat as possible. Fold the sides of the sheet under the pipes if necessary to flatten the −z workspace.
(b) If the sheet bunches near the −y face, cut a small triangular...
Clamp both sides of the SO-101 follower arm to the edge of the table so that the front edge of its base touches the PVC pipe running between the −z and −y faces of the box. The center of the SO-101 base should be ≈ 16.5 inches from the −x face (Fig. A.5(a)).

Attach the 12 V power adaptor to the SO-101 arm.

Place the 4 cm AprilTag on the right side of the SO-101 base (Fig. A.5(b)):
(a) The northwest corner of the tag must touch the vertical edge of the SO-101 base.
(b) The south black border of the tag must be aligned with the bottom edge of the SO-101 base.

Double-check the tag orientation. Wrinkled paper causes unreliable detection; affix it flat using double-sided tape on all four corners.

Figure A.5: SO-101 arm placement and AprilTag positioning. (a) SO-101 arm clamped to table. Center of base is 16.5 inch from the −x face. (b) AprilTag aligned with the base edges.

Important Checklist:
□ The SO-101 base center...
[53]

Clone the benchmark repository, create a new Conda environment, and install dependencies: git clone https://github.com/IRVLUTD/VLAReplica.git cd VLAReplica conda env create -f environment.yml conda activate vlareplica

work page
[54]

Find available cameras indices with the command (note down the numbers): lerobot-find-cameras 17 Record the camera indices for the two cameras

work page
[55]

The terminal will output something like/dev/ttyACM1

Find USB device serial ports from the following command: lerobot-find-port Then unplug the SO-101 USB cable from the computer, and press Enter. The terminal will output something like/dev/ttyACM1. Record the serial port for the follwoer arm. A.6 Calibrate the SO-101 arm Next, calibrate the SO-101 follower according to the LeRobot Docs (https://huggingface...

[56] Locate the calibration file that LeRobot saved to your device. It should be under ~/.cache/huggingface/lerobot/calibration/robots/<your-robot-id>, relative to your home directory.

[57] Copy this .json file to: VLAReplica/calibration/robots/so101_follower
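The copy can also be scripted, folding in the file name that the next step asks for. The sketch below is a hedged illustration using only the Python standard library; `install_calibration` is a hypothetical helper that does not exist in VLAReplica or LeRobot, and it assumes you pass the path of the .json file you located yourself rather than guessing the cache layout.

```python
import shutil
from pathlib import Path

# Destination directory and file name quoted in the setup steps.
DEST_SUBDIR = Path("calibration") / "robots" / "so101_follower"
DEST_NAME = "so101_follower_arm.json"


def install_calibration(src_json: Path, repo_root: Path) -> Path:
    """Copy a calibration file into the repository under the expected name.

    Hypothetical helper for illustration; not part of VLAReplica or LeRobot.
    """
    src_json = Path(src_json).expanduser()
    if src_json.suffix != ".json" or not src_json.is_file():
        raise FileNotFoundError(f"not a calibration .json file: {src_json}")

    dest_dir = Path(repo_root).expanduser() / DEST_SUBDIR
    dest_dir.mkdir(parents=True, exist_ok=True)  # create folders if missing

    dest = dest_dir / DEST_NAME
    shutil.copy2(src_json, dest)  # copy only; the cached original stays in place
    return dest
```

Calling `install_calibration(<path-to-your-file>, Path("VLAReplica"))` returns the destination path, which makes it easy to confirm where the file ended up. Copying rather than moving keeps LeRobot's cached calibration intact.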

[58] Rename that file to: so101_follower_arm.json

A.7 Camera Calibration

We provide a calibration script that detects the AprilTag and reports the camera pose in real time, allowing fine adjustment of the camera mount before locking it in place. The target pose values are listed in Table A.2.

Table A.2: Target front-view camera pose relative to the AprilTag...

[59] In a new terminal inside the virtual environment, run the calibration script (replace <your-top-camera-index> with the number you recorded in Appendix A.5):
    python calibration/camera/detect_apriltag.py --camera-index <your-top-camera-index>

[60] A GUI window will display the live camera feed alongside the estimated AprilTag pose (Fig. A.6). Reach into the box and physically slide or tilt the camera mount along the PVC pipe until all reported values match Table A.2 as closely as possible.

[61] Some error is acceptable (see Table A.2). Once satisfied, press q to exit the program.
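The "match the table within an acceptable error" check can be made explicit with a small script. The following is a minimal sketch under stated assumptions: the component names, target values, and tolerances are PLACEHOLDERS invented for illustration (the real ones are in Table A.2), and `pose_errors` and `out_of_tolerance` are hypothetical helpers, not functions of the calibration script.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    value: float      # desired reading for one pose component
    tolerance: float  # largest acceptable absolute error


# PLACEHOLDER names and numbers, chosen only for illustration.
# Replace them with the targets and tolerances from Table A.2.
EXAMPLE_TARGETS: Dict[str, Target] = {
    "x": Target(0.0, 1.0),
    "y": Target(0.0, 1.0),
    "z": Target(50.0, 1.0),
    "pitch": Target(-30.0, 2.0),
}


def pose_errors(measured: Dict[str, float], targets: Dict[str, Target]) -> Dict[str, float]:
    """Signed error (measured minus target) for every target component."""
    return {name: measured[name] - t.value for name, t in targets.items()}


def out_of_tolerance(measured: Dict[str, float], targets: Dict[str, Target]) -> List[str]:
    """Names of the components whose absolute error exceeds the tolerance."""
    errors = pose_errors(measured, targets)
    return [name for name, err in errors.items() if abs(err) > targets[name].tolerance]
```

An empty list from `out_of_tolerance` means every component is within its tolerance. The sketch compares angles directly and does not handle wrap-around (for example 179° versus −179°), which would need extra care if a target sits near that boundary.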

[62] Although the AprilTag pose estimator may output values close to Table A.2, there may still be slight camera misalignment. To solve this, we utilize visual overlay matching (see Fig. A.7) to ensure the camera view is as close as possible to VLA-REPLICA’s original view.
(a) First, calibrate the top camera for the second time. Run the following, replacing your-t...

[1] So101 arm. https://huggingface.co/docs/lerobot/so101

[2] So101 camera mount. https://github.com/TheRobotStudio/SO-ARM100/tree/main/Optional/Wrist_Cam_Mount_Vinmooog_Webcam

[3] P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary, E. Hu, F. Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies. In Proceedings of the Conference on Robot Learning (CoRL 2025), 2025.

[4] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky. π0: A vision-language-action flow model for general robot control. arXiv preprint arxiv:2410.24...

[5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T.-W. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. Pe...

[6] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, T. Jackson, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, K.-H. Lee, S. Levine, Y. Lu, U. Malla, D. Manjunath, I. Mordatch, O. Nachum, C. Parada, J. Peralta, E. Perez, K. Pertsc...

[7] R. Cadene, S. Alibert, A. Soare, Q. Gallouedec, A. Zouitine, S. Palma, P. Kooijmans, M. Aractingi, M. Shukor, D. Aubakirova, M. Russi, F. Capuano, C. Pascal, J. Choghari, J. Moss, and T. Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/huggingface/lerobot, 2024.

[8] Y. Chen, K. Kimble, E. H. Adelson, T. Asfour, P. Chanrungmaneekul, S. Chitta, Y. Chitambar, Z. Chen, K. Goldberg, D. Kragic, et al. Manipulationnet: An infrastructure for benchmarking real-world robot manipulation with physical skill challenges and embodied multimodal reasoning. arXiv preprint arXiv:2603.04363, 2026.

[9] C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[10] J. Collins, M. Robson, J. Yamada, M. Sridharan, K. Janik, and I. Posner. Ramp: A benchmark for evaluating robotic assembly manipulation and planning. IEEE Robotics and Automation Letters, 9(1):9–16, 2023.

[11] H. Geng, F. Wang, S. Wei, Y. Li, B. Wang, B. An, C. T. Cheng, H. Lou, P. Li, Y.-J. Wang, Y. Liang, D. Goetting, C. Xu, H. Chen, Y. Qian, Y. Geng, J. Mao, W. Wan, M. Zhang, J. Lyu, S. Zhao, J. Zhang, J. Zhang, C. Zhao, H. Lu, Y. Ding, R. Gong, Y. Wang, Y. Kuang, R. Wu, B. Jia, C. Sferrazza, H. Dong, S. Huang, Y. Wang, J. Malik, and P. Abbeel. Robo...

[12] M. Heo, Y. Lee, D. Lee, and J. J. Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, 44(10-11):1863–1891, 2025.

[13] $\pi^{*}_{0.6}$: a VLA That Learns From Experience. P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc, S. Levin...

[14] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[15] B. Jones. Dissecting and open-sourcing multitask diffusion transformer policy, 2025. URL https://brysonkjones.substack.com/p/dissecting-and-open-sourcing-multitask-diffusion-transformer-policy. Blog post.

[16] N. Khargonkar, S. H. Allu, Y. Lu, B. Prabhakaran, and Y. Xiang. Scenereplica: Benchmarking real-world robot manipulation by creating replicable scenes. In IEEE International Conference on Robotics and Automation (ICRA), pages 8258–8264. IEEE, 2024.

[17] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.

[18] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning (CoRL), 2024.

[19] X. Li, K. Hsu, J. Gu, K. Pertsch, O. Mees, H. R. Walke, C. Fu, I. Lunawat, I. Sieh, S. Kirmani, S. Levine, J. Wu, C. Finn, H. Su, Q. Vuong, and T. Xiao. Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.

[20] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023.

[21] J. Luo, C. Xu, F. Liu, L. Tan, Z. Lin, J. Wu, P. Abbeel, and S. Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44(4):592–606, 2025.

[22] R. McLean, E. Chatzaroulas, L. McCutcheon, F. Röder, T. Yu, Z. He, K. Zentner, R. Julian, J. K. Terry, I. Woungang, N. Farsad, and P. S. Castro. Meta-world+: An improved, standardized, RL benchmark. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=1de3azE606.

[23] O. Mees, L. Hermann, E. Rosete-Beas, and W. Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022.

[24] S. Nasiriany, A. Maddukuri, L. Zhang, A. Parikh, A. Lo, A. Joshi, A. Mandlekar, and Y. Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots. In Robotics: Science and Systems (RSS), 2024.

[25] S. Nasiriany, S. Nasiriany, A. Maddukuri, and Y. Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. In International Conference on Learning Representations (ICLR), 2026.

[26] NVIDIA, J. Bjorck, N. C. Fernando Castañeda, X. Da, R. Ding, L. J. Fan, Y. Fang, D. Fox, F. Hu, S. Huang, J. Jang, Z. Jiang, J. Kautz, K. Kundalia, L. Lao, Z. Li, Z. Lin, K. Lin, G. Liu, E. Llontop, L. Magne, A. Mandlekar, A. Narayan, S. Nasiriany, S. Reed, Y. L. Tan, G. Wang, Z. Wang, J. Wang, Q. Wang, J. Xiang, Y. Xie, Y. Xu, Z. Xu, S. Ye, Z. Yu, A....

[27] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[28] M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

[29] A. Yakefu, B. Xie, C. Xu, E. Zhang, E. Zhou, F. Jia, H. Yang, H. Fan, H. Zhang, H. Peng, et al. Robochallenge: Large-scale real-robot evaluation of embodied policies. arXiv preprint arXiv:2510.17950, 2025.

[30] B. Yang, J. Zhang, V. Pong, S. Levine, and D. Jayaraman. Replab: A reproducible low-cost arm benchmark platform for robotic learning. arXiv preprint arXiv:1905.07447, 2019.

[31] X. Yang, R. Dagli, A. Zook, H. Hadfield, A. Goyal, S. Birchfield, F. Ramos, and J. Tremblay. Robolab: A high-fidelity simulation benchmark for analysis of task generalist policies, 2026. URL https://arxiv.org/abs/2604.09860

[32] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.

[33] T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

[34] J. Zheng, J. Li, Z. Wang, D. Liu, X. Kang, Y. Feng, Y. Zheng, J. Zou, Y. Chen, J. Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[35] Z. Zhou, P. Atreya, Y. L. Tan, K. Pertsch, and S. Levine. Autoeval: Autonomous evaluation of generalist robot manipulation policies in the real world. arXiv preprint arXiv:2503.24278, 2025.

A VLA-REPLICA Benchmark Setup Instructions

This section provides step-by-step instructions for reliably reproducing our benchmark environment across different labor...

[36] Print two copies of the snap-hook part (Part1.stl).

Table A.1: Parts list for the benchmark setup.
Qty    Item
1      Glendan 32×32 in box set (box tarp, 12× PVC pipes, 8× PVC edge connectors, white PP background sheet, white light diffuser sheet, 3× LED panel set, power cables) (link)
1      Intel RealSense D455
1 set  3-D printed camera mount (1× backplate, 2× snap-hoo...

[37] Print one copy of the camera backplate (Part2.stl).

[38] Attach one snap-hook (Part 1) to the backplate (Part 2) using one M3×6 mm screw. Repeat for the second hook. The assembled unit is referred to as Part 3 (Fig. A.2(a)).

[39] Screw Part 3 tightly to the rear mounting holes of the D455 camera using two M4×6 mm screws (Fig. A.2(b)).

[40] To prevent the hooks from sliding on the PVC pipe, attach a small piece of rubber grip tape to the inside of each hook (Fig. A.2(c)). Do not over-tighten the screws or apply excessive force to the hooks, as they may snap. Note: the CAD file for the snap-hook has an inner diameter that matches the outer diameter of the PVC pipe for the Glendan light box. Othe...

[41] If applicable (i.e., the SO-101 did not come pre-assembled), follow LeRobot’s SO-101 documentation page to assemble the SO-101 follower arm (https://huggingface.co/docs/lerobot/so101). Don’t calibrate the assembled SO-101 arm yet.

[42] Follow TheRobotStudio’s page to print and set up the wrist camera mount with the Vinmooog webcam (https://github.com/TheRobotStudio/SO-ARM100/tree/main/Optional/Wrist_Cam_Mount_Vinmooog_Webcam).

[43] Secure the camera mount onto the end-effector of the SO-101 with one M3×12 mm screw and the M3 nut.

Important Checklist:
□ Both snap-hooks are attached with M3 screws and sit flush against the backplate.
□ The mount is fastened to the D455 with M4 screws; the camera does not wobble.
□ Rubber grip tape is applied to the inside of both hooks.
□ The SO-101 fo...

[44] Construct the cube-shaped PVC frame using the 12 pipes and 8 edge connectors supplied with the Glendan kit. Do not attach the zipper tarp yet; complete all internal installations first.

[45] Attach the white light diffuser sheet to the top (+z) face of the frame using the supplied velcro strips. Secure the sheet using the velcro strips on both the −y and +y pipes, as close as possible to the +z face (Fig. A.4(a)).

[46] Attach the provided white hooks onto the LED panels and then mount the three LED panels on the PVC frame before fitting the tarp (Fig. A.4(b)):
(a) LED Panel 1: +z face, pointed downward toward −z; center ≈ 7.5 inches from the −y face.
(b) LED Panel 2: +z face, pointed downward toward −z; center ≈ 7.5 inches from the +y face (mirror image of Panel 1).
(c) LED P...

[47] Slide the zipper tarp over the completed PVC/LED frame, ensuring that the −y face of the frame matches the side of the tarp with the zippers. Ensure that the −z side (the object workspace) is actually on the bottom.
