VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models
Pith reviewed 2026-05-21 04:55 UTC · model grok-4.3
The pith
VLA-REPLICA offers a low-cost benchmark using standard parts for consistent real-world testing of vision-language-action models worldwide.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLA-REPLICA is a benchmark for real-world evaluation of vision-language-action models that is constructed from off-the-shelf components so it can be quickly assembled and replicated across laboratories to create a consistent evaluation environment. The benchmark features a diverse suite of manipulation tasks, a small-scale demonstration dataset for target-domain adaptation, and real-world evaluation protocols for in-distribution and out-of-distribution settings. Experiments reveal model strengths and limitations, and consistent results across independently constructed setups demonstrate the benchmark's reproducibility.
What carries the argument
The VLA-REPLICA benchmark assembled from off-the-shelf components to enable quick replication and consistent policy evaluation across different laboratories.
If this is right
- Researchers worldwide can evaluate VLA models in real-world settings using the same tasks and protocols.
- Performance differences between models can be attributed to the models themselves rather than varying hardware setups.
- Small demonstration datasets allow models to adapt to the specific benchmark environment before testing.
- Protocols support testing both on tasks similar to training and on out-of-distribution scenarios.
- Imitation learning and state-of-the-art VLA models can be compared directly on the same real-world platform.
Where Pith is reading between the lines
- This could encourage broader participation in VLA research by lowering the entry cost for real-world experiments.
- Standardizing on such a benchmark might help the field converge on reliable progress metrics beyond simulation.
- Future extensions might include more complex tasks or integration with additional sensors using the same base components.
- Consistent replication might allow crowdsourced data collection or collaborative model training across sites.
Load-bearing premise
Different groups can assemble the off-the-shelf components into environments that are similar enough that any performance differences come from the models and not from variations in the physical setups.
What would settle it
Running the same VLA model on two separately assembled VLA-REPLICA systems and observing substantially different success rates on the manipulation tasks would show the benchmark lacks sufficient reproducibility.
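One way to make "substantially different" concrete is a two-proportion z-test on the success counts from the two builds. The sketch below is illustrative only: the function name and the trial counts are hypothetical and do not come from the paper, and with few trials per task an exact test would likely be the safer choice.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Two-sided z-test for a difference between two success rates.

    Returns (z, p_value). A small p_value suggests the two builds
    produce different success rates for the same policy.
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Pooled success rate under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if se == 0.0:
        # All trials succeeded or all failed on both builds: no evidence of a gap.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the same policy run for 20 trials on each build.
z, p = two_proportion_z_test(14, 20, 12, 20)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value here does not prove the builds are equivalent; it only means the trial budget was too small to detect a gap of that size.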
Original abstract
Vision-Language-Action (VLA) models have shown strong promise for general-purpose robotic manipulation, but their real-world evaluation remains limited by a lack of accessible, reproducible, and consistent benchmarks. Simulation benchmarks fail to capture real-world complexity, while existing real-world benchmarks often require expensive hardware, centralized evaluation, or are limited in task diversity. We introduce VLA-REPLICA, a low-cost, easily reproducible real-world benchmark for evaluating VLA models. Built from off-the-shelf components, our system can be quickly assembled and replicated across laboratories, providing a consistent environment for policy evaluation anywhere in the world. VLA-REPLICA includes a diverse suite of manipulation tasks and a small-scale demonstration dataset for target-domain adaptation, with real-world evaluation protocols for both in-distribution and out-of-distribution settings. Experiments with imitation learning and state-of-the-art VLA models reveal model strengths and limitations, while consistent results across independently constructed setups demonstrate the reproducibility of our benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VLA-REPLICA, a low-cost real-world benchmark for evaluating Vision-Language-Action (VLA) models. Built from off-the-shelf components for quick assembly and replication across laboratories, it provides a consistent environment with a diverse suite of manipulation tasks, a small-scale demonstration dataset for target-domain adaptation, and protocols for in-distribution and out-of-distribution evaluation. Experiments using imitation learning and state-of-the-art VLA models are reported, along with claims of consistent results across independently constructed setups.
Significance. If the reproducibility claims are substantiated with quantitative evidence of low inter-build variation, the benchmark would represent a meaningful advance by enabling accessible, standardized real-world testing of VLA models without reliance on expensive or centralized hardware. The combination of task diversity, adaptation data, and in/out-of-distribution protocols could help the community better assess generalization, addressing gaps in both simulation and prior real-world benchmarks.
major comments (1)
- [Abstract and Experiments section] The central claim that 'consistent results across independently constructed setups demonstrate the reproducibility of our benchmark' is load-bearing for the contribution, yet the manuscript provides no quantitative characterization of inter-build variation. There are no reported measurements of camera intrinsics/extrinsics differences, gripper force calibration spread, arm positioning repeatability, lighting/background variance, or per-setup success rates with standard deviations for the same policy across multiple independent assemblies. Without these, it is unclear whether observed consistency reflects robust task design or insufficient hardware diversity, directly affecting the assertion that performance gaps will reflect model quality rather than setup artifacts.
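As an illustration of what a camera-extrinsics measurement could look like, the sketch below compares two camera poses by the angle of their relative rotation and the distance between their positions. It is not the authors' procedure: the function names are hypothetical, and it assumes each build reports its camera pose as a 3×3 rotation matrix plus a translation vector in a shared reference frame (for example, the AprilTag frame).

```python
import math


def rotation_angle_deg(r_a, r_b):
    """Angle (degrees) of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_a^T R_b) equals the element-wise product summed over all entries.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_distance(t_a, t_b):
    """Euclidean distance between two camera positions (same units as the inputs)."""
    return math.dist(t_a, t_b)


# Hypothetical poses from two independently assembled builds.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(rotation_angle_deg(identity, quarter_turn_z))
print(translation_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))
```

Reporting these two numbers per pair of builds would give readers a direct sense of how far apart the camera viewpoints actually are.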
minor comments (2)
- [Benchmark description] The description of the 'small-scale demonstration dataset' lacks specifics on its size, collection protocol, number of demonstrations per task, and exact usage for target-domain adaptation; adding these details would improve clarity.
- [Results] Ensure that success-rate tables or figures include error bars or standard deviations across trials and, where relevant, across independent builds to support the reproducibility narrative.
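For success rates measured over a small number of trials, one common choice of error bar is the Wilson score interval. The sketch below is a generic implementation under that assumption; the paper does not state which interval, if any, it uses, and the counts in the example are hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical result: 7 successes in 10 trials of one task on one build.
low, high = wilson_interval(7, 10)
print(f"success rate 0.70, interval [{low:.2f}, {high:.2f}]")
```

Unlike the simple normal approximation, this interval stays within [0, 1] and remains informative when a policy succeeds on every trial or on none.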
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that quantitative evidence is needed to substantiate the reproducibility claims and will revise the manuscript to address this.
- Referee: [Abstract and Experiments section] The central claim that 'consistent results across independently constructed setups demonstrate the reproducibility of our benchmark' is load-bearing for the contribution, yet the manuscript provides no quantitative characterization of inter-build variation. There are no reported measurements of camera intrinsics/extrinsics differences, gripper force calibration spread, arm positioning repeatability, lighting/background variance, or per-setup success rates with standard deviations for the same policy across multiple independent assemblies. Without these, it is unclear whether observed consistency reflects robust task design or insufficient hardware diversity, directly affecting the assertion that performance gaps will reflect model quality rather than setup artifacts.
  Authors: We agree that the absence of quantitative inter-build variation metrics weakens the reproducibility claim. The manuscript reports only that results were consistent across two independently assembled setups without providing the specific measurements or statistical characterizations requested. In the revised manuscript we will add a new subsection in the Experiments section that reports: measured differences in camera intrinsics/extrinsics, gripper force calibration spread, arm positioning repeatability, lighting and background variance, and per-setup success rates (with means and standard deviations) for the same policies evaluated on multiple independent assemblies. These additions will clarify that observed performance differences primarily reflect model quality rather than setup artifacts.
  Revision: yes
Circularity Check
No circularity in empirical benchmark introduction
The paper introduces VLA-REPLICA as a new real-world benchmark assembled from off-the-shelf components, including a task suite, small demonstration dataset, and evaluation protocols for in- and out-of-distribution settings. It reports experimental results with imitation learning and VLA models plus consistency across independent setups. No derivation chain, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. No self-definitional steps, load-bearing self-citations, or ansatz smuggling appear in the abstract or described content. The work is self-contained as an empirical contribution rather than a mathematical derivation, making circularity analysis inapplicable and yielding a clean finding of none.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Real-world robotic evaluation benefits from standardized physical hardware that different laboratories can replicate.
A VLA-REPLICA Benchmark Setup Instructions
This section provides step-by-step instructions for reliably reproducing our benchmark environment across different labor...
Table A.1: Parts list for the benchmark setup.
Qty    Item
1      Glendan 32×32 in box set (box tarp, 12× PVC pipes, 8× PVC edge connectors, white PP background sheet, white light diffuser sheet, 3× LED panel set, power cables)
1      Intel RealSense D455
1 set  3-D printed camera mount (1× backplate, 2× snap-hoo...

- Print two copies of the snap-hook part (Part1.stl).
- Print one copy of the camera backplate (Part2.stl).
- Attach one snap-hook (Part 1) to the backplate (Part 2) using one M3×6 mm screw. Repeat for the second hook. The assembled unit is referred to as Part 3 (Fig. A.2(a)).
- Screw Part 3 tightly to the rear mounting holes of the D455 camera using two M4×6 mm screws (Fig. A.2(b)).
- To prevent the hooks from sliding on the PVC pipe, attach a small piece of rubber grip tape to the inside of each hook (Fig. A.2(c)). Do not over-tighten the screws or apply excessive force to the hooks, as they may snap.
  Note: the CAD file for the snap-hook has an inner diameter that matches the outer diameter of the PVC pipe for the Glendan light box. Othe...
- If applicable (i.e., the SO-101 did not come pre-assembled), follow LeRobot’s SO-101 documentation page to assemble the SO-101 Follower arm (https://huggingface.co/docs/lerobot/so101). Don’t calibrate the assembled SO-101 arm yet.
- Follow TheRobotStudio’s page to print and set up the wrist camera mount with the Vinmooog webcam (https://github.com/TheRobotStudio/SO-ARM100/tree/main/Optional/Wrist_Cam_Mount_Vinmooog_Webcam).
- Secure the camera mount onto the end-effector of the SO-101 with one M3×12 mm screw and the M3 nut.

Important Checklist:
□ Both snap-hooks are attached with M3 screws and sit flush against the backplate.
□ The mount is fastened to the D455 with M4 screws; the camera does not wobble.
□ Rubber grip tape is applied to the inside of both hooks.
□ The SO-101 fo...
- Construct the cube-shaped PVC frame using the 12 pipes and 8 edge connectors supplied with the Glendan kit. Do not attach the zipper tarp yet; complete all internal installations first.
- Attach the white light diffuser sheet to the top (+z) face of the frame using the supplied velcro strips. Secure the sheet using the velcro strips on both the −y and +y pipes, as close as possible to the +z face (Fig. A.4(a)).
- Attach the provided white hooks onto the LED panels and then mount the three LED panels on the PVC frame before fitting the tarp (Fig. A.4(b)):
  (a) LED Panel 1: +z face, pointed downward toward −z; center ≈ 7.5 inches from the −y face.
  (b) LED Panel 2: +z face, pointed downward toward −z; center ≈ 7.5 inches from the +y face (mirror image of Panel 1).
  (c) LED P...
- Slide the zipper tarp over the completed PVC/LED frame, ensuring that the −y face of the frame matches the side of the tarp with the zippers. Ensure that the −z side (the object workspace) is actually on the bottom.
- Attach the white PP background sheet to the inner +y face using the velcro strips on the tarp and the sheet (Fig. A.4(c)):
  (a) Tuck the sheet under the −z PVC pipes so that the workspace is as flat as possible. Fold the sides of the sheet under the pipes if necessary to flatten the −z workspace.
  (b) If the sheet bunches near the −y face, cut a small triangular...
- Clamp both sides of the SO-101 follower arm to the edge of the table so that the front edge of its base touches the PVC pipe running between the −z and −y faces of the box. The center of the SO-101 base should be ≈ 16.5 inches from the −x face (Fig. A.5(a)).
- Attach the 12 V power adaptor to the SO-101 arm.
- Place the 4 cm AprilTag on the right side of the SO-101 base (Fig. A.5(b)):
  (a) The northwest corner of the tag must touch the vertical edge of the SO-101 base.
  (b) The south black border of the tag must be aligned with the bottom edge of the SO-101 base.
- Double-check the tag orientation. Wrinkled paper causes unreliable detection; affix it flat using double-sided tape on all four corners.

Figure A.5: SO-101 arm placement and AprilTag positioning. (a) SO-101 arm clamped to the table; the center of the base is 16.5 inches from the −x face. (b) AprilTag aligned with the base edges.

Important Checklist:
□ The SO-101 base center...
- Clone the benchmark repository, create a new Conda environment, and install dependencies:
    git clone https://github.com/IRVLUTD/VLAReplica.git
    cd VLAReplica
    conda env create -f environment.yml
    conda activate vlareplica
- Find the available camera indices with the following command:
    lerobot-find-cameras
  Record the camera indices for the two cameras.
- Find the USB device serial ports with the following command:
    lerobot-find-port
  Then unplug the SO-101 USB cable from the computer and press Enter. The terminal will output something like /dev/ttyACM1. Record the serial port for the follower arm.

A.6 Calibrate the SO-101 arm
Next, calibrate the SO-101 follower according to the LeRobot Docs (https://huggingface...
- Locate the calibration file that LeRobot saved to your device. It should be in your home folder, under:
    ~/.cache/huggingface/lerobot/calibration/robots/<your-robot-id>
- Copy this .json file to: VLAReplica/calibration/robots/so101_follower
- Rename that file to: so101_follower_arm.json

A.7 Camera Calibration
We provide a calibration script that detects the AprilTag and reports the camera pose in real time, allowing fine adjustment of the camera mount before locking it in place. The target pose values are listed in Table A.2.
Table A.2: Target front-view camera pose relative to the AprilTag...
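The adjustment described for A.7 amounts to comparing each reported pose value against its target and accepting the mount once every deviation is within tolerance. A minimal sketch of that check follows. The field names, target values, and tolerances are placeholders invented for illustration, since the actual Table A.2 values are not reproduced on this page, and the function is not part of the benchmark repository.

```python
def out_of_tolerance(measured, target, tolerance):
    """Return {field: deviation} for every pose field outside its tolerance."""
    failures = {}
    for field, target_value in target.items():
        deviation = abs(measured[field] - target_value)
        if deviation > tolerance[field]:
            failures[field] = deviation
    return failures


# Placeholder numbers only; substitute the real Table A.2 targets and tolerances.
target = {"x_cm": 0.0, "yaw_deg": 10.0}
tolerance = {"x_cm": 0.5, "yaw_deg": 2.0}
measured = {"x_cm": 0.4, "yaw_deg": 13.0}

failures = out_of_tolerance(measured, target, tolerance)
if failures:
    print("Keep adjusting the mount:", failures)
else:
    print("Pose is within tolerance; lock the mount in place.")
```

Logging the final deviations for each build would also provide the per-setup camera measurements that the referee report asks for.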
- In a new terminal inside the virtual environment, run the calibration script (replace <your-top-camera-index> with the number you recorded in Appendix A.5):
    python calibration/camera/detect_apriltag.py --camera-index <your-top-camera-index>
- A GUI window will display the live camera feed alongside the estimated AprilTag pose (Fig. A.6). Reach into the box and physically slide or tilt the camera mount along the PVC pipe until all reported values match Table A.2 as closely as possible.
- Some error is acceptable (see Table A.2). Once satisfied, press q to exit the program.
- Although the AprilTag pose estimator may output values close to Table A.2, there may still be slight camera misalignment. To solve this, we utilize visual overlay matching (see Fig. A.7) to ensure the camera view is as close as possible to VLA-REPLICA’s original view.
  (a) First, calibrate the top camera for the second time. Run the following, replacing your-t...
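The page does not show how visual overlay matching is implemented. One plausible reading is that a reference frame from the original setup is alpha-blended over the live frame so that misalignment shows up as ghosting. The sketch below illustrates that idea on tiny grayscale frames stored as nested lists; the function names are hypothetical, and a real tool would more likely use an image library on full camera frames.

```python
def blend_frames(live, reference, alpha=0.5):
    """Alpha-blend two equally sized grayscale frames (rows of 0-255 values)."""
    if len(live) != len(reference) or any(
        len(a) != len(b) for a, b in zip(live, reference)
    ):
        raise ValueError("frames must have the same shape")
    return [
        [round((1 - alpha) * lv + alpha * rv) for lv, rv in zip(live_row, ref_row)]
        for live_row, ref_row in zip(live, reference)
    ]


def mean_abs_difference(live, reference):
    """Average per-pixel absolute difference; lower means better alignment."""
    total = 0
    count = 0
    for live_row, ref_row in zip(live, reference):
        for lv, rv in zip(live_row, ref_row):
            total += abs(lv - rv)
            count += 1
    return total / count if count else 0.0


# Toy 1x2 frames standing in for the live view and the reference view.
live = [[0, 100]]
reference = [[200, 100]]
print(blend_frames(live, reference))
print(mean_abs_difference(live, reference))
```

The blended frame is what an operator would look at while nudging the camera; the difference score is an optional numeric companion to that visual check.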