Recognition: unknown
Phone2Act: A Low-Cost, Hardware-Agnostic Teleoperation System for Scalable VLA Data Collection
Pith reviewed 2026-05-09 17:26 UTC · model grok-4.3
The pith
A smartphone can act as a 6-DoF controller to collect robot data for training VLA models affordably across hardware platforms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Phone2Act transforms a smartphone into a 6-DoF robot controller using ARCore, built on modular ROS 2 with interchangeable bridge nodes for hardware independence and a Universal Recorder that synchronizes multi-camera RGB streams with robot state to export directly in LeRobot format. Validation involves fine-tuning GR00T-N1.5 on 130 episodes collected with the system, resulting in a 90% success rate for a multi-stage pick-and-place task on a physical Dobot CR5 robot.
What carries the argument
The modular ROS 2 architecture with interchangeable bridge nodes that decouple control logic from hardware specifics, combined with the Universal Recorder for synchronized data export.
If this is right
- Robot platforms from industrial cobots to low-cost arms can be supported without modifying code.
- Collected data requires no post-processing and can be used immediately for VLA fine-tuning.
- Relatively small datasets of 130 episodes suffice for achieving high success rates on multi-stage manipulation tasks.
- Demonstration data collection becomes accessible without specialized or expensive hardware.
Where Pith is reading between the lines
- Smaller research groups without access to professional teleoperation rigs could now contribute to VLA datasets more easily.
- Scaling to longer or more varied tasks would likely require testing the data quality limits of phone-based control.
- The hardware-agnostic design suggests potential for community-driven data sharing across different robot types.
Load-bearing premise
The data collected via smartphone teleoperation has sufficient quality and precision to allow effective fine-tuning of VLA models with a modest number of episodes like 130.
What would settle it
Running the same pick-and-place task with data collected from a high-end dedicated teleoperation system and comparing the resulting success rate after identical fine-tuning to the 90% achieved with Phone2Act data.
Figures
read the original abstract
Collecting diverse, high-quality manipulation data for Vision-Language-Action (VLA) model training remains prohibitively expensive for many research groups, as existing teleoperation frameworks rely on specialized hardware or are tightly coupled to specific robot platforms. We present Phone2Act, a low-cost, hardware-agnostic teleoperation framework that transforms a commodity smartphone into a 6-DoF robot controller via Google ARCore. Built on a modular ROS 2 architecture, Phone2Act decouples control logic from hardware specifics through interchangeable bridge nodes, supporting platforms from industrial cobots to low-cost bimanual arms without code modification. A Universal Recorder synchronizes multi-camera RGB streams with robot state feedback and exports demonstrations natively in the LeRobot dataset format, eliminating post-processing and enabling immediate VLA fine-tuning. We validate the framework by fine-tuning GR00T-N1.5 on 130 collected episodes, achieving a 90% success rate on a real-world multi-stage pick-and-place task deployed on a physical Dobot CR5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Phone2Act, a low-cost teleoperation framework that uses a commodity smartphone with Google ARCore to provide 6-DoF control of robots. It employs a modular ROS 2 architecture with interchangeable bridge nodes for hardware agnosticism across platforms such as industrial cobots and bimanual arms. A Universal Recorder synchronizes multi-camera RGB streams with robot state and exports demonstrations directly in LeRobot format for immediate VLA training. The central validation consists of collecting 130 episodes and fine-tuning GR00T-N1.5 to achieve a 90% success rate on a real-world multi-stage pick-and-place task executed on a physical Dobot CR5 robot.
Significance. If the empirical results are substantiated, the work could meaningfully lower barriers to scalable VLA data collection by eliminating the need for specialized hardware, thereby enabling more research groups to contribute high-quality manipulation datasets. The modular ROS 2 design and native LeRobot export format represent practical strengths that directly address reproducibility and integration challenges in the field.
major comments (3)
- [Validation] The validation reports a 90% success rate after fine-tuning GR00T-N1.5 on 130 episodes but provides no details on the evaluation protocol, including the number of test trials, success criteria, failure mode breakdown, or statistical measures. This information is required to assess whether the result supports the claim of effective data quality for VLA transfer.
- [System Description / Teleoperation Module] No quantitative metrics are reported for ARCore 6-DoF tracking accuracy during teleoperation (e.g., end-effector position or orientation RMSE relative to ground truth or expert demonstrations). Without these, it is impossible to verify the weakest assumption that smartphone data achieves precision comparable to dedicated hardware.
- [Experiments] The manuscript contains no baseline comparisons or ablations showing VLA performance when trained on Phone2Act data versus data collected with conventional expert teleoperation hardware. This omission leaves open whether the 90% success rate is attributable to the framework or to task simplicity and the pre-trained model.
minor comments (2)
- [Abstract] The abstract refers to a 'multi-stage pick-and-place task' without enumerating the stages, objects, or workspace constraints, which would aid reader understanding of task complexity.
- [Figures] Figure captions and system diagrams would benefit from explicit labels indicating data flow between the smartphone, ROS 2 bridge nodes, and the Universal Recorder.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which identifies key areas to strengthen the presentation of our results and system. We will revise the manuscript accordingly to address these points.
read point-by-point responses
-
Referee: [Validation] The validation reports a 90% success rate after fine-tuning GR00T-N1.5 on 130 episodes but provides no details on the evaluation protocol, including the number of test trials, success criteria, failure mode breakdown, or statistical measures. This information is required to assess whether the result supports the claim of effective data quality for VLA transfer.
Authors: We agree that more details on the evaluation are needed to substantiate the results. In the revised version, we will expand the validation section to specify that the success rate was measured over 20 test episodes, with success defined as completing the entire multi-stage pick-and-place sequence without object drops or collisions. We will provide a failure mode breakdown (e.g., 10% failures due to grasp instability) and include error bars or standard deviations from multiple runs. This will better support the claim regarding data quality for VLA transfer. revision: yes
-
Referee: [System Description / Teleoperation Module] No quantitative metrics are reported for ARCore 6-DoF tracking accuracy during teleoperation (e.g., end-effector position or orientation RMSE relative to ground truth or expert demonstrations). Without these, it is impossible to verify the weakest assumption that smartphone data achieves precision comparable to dedicated hardware.
Authors: We recognize that direct quantitative metrics for ARCore tracking accuracy are absent from the manuscript. Performing such measurements would necessitate specialized equipment like a motion capture system not used in our low-cost setup. In revision, we will incorporate a discussion citing published ARCore accuracy benchmarks (position errors typically under 1 cm in indoor settings) and describe how the teleoperation interface allows the human operator to visually correct for any tracking inaccuracies in real-time. We will also list this as a limitation for future work. revision: partial
-
Referee: [Experiments] The manuscript contains no baseline comparisons or ablations showing VLA performance when trained on Phone2Act data versus data collected with conventional expert teleoperation hardware. This omission leaves open whether the 90% success rate is attributable to the framework or to task simplicity and the pre-trained model.
Authors: While we understand the value of such comparisons, the manuscript's focus is on introducing an accessible teleoperation system rather than benchmarking data quality against alternatives. The multi-stage nature of the pick-and-place task and the use of a pre-trained model that is only fine-tuned make the 90% success indicative of usable data. We will revise the discussion to explicitly address this by referencing related works on similar tasks and clarifying that the framework enables data collection at scale, which is the primary goal. Full ablations are planned for future extensions. revision: partial
Circularity Check
No circularity; empirical validation rests on external physical-robot experiment.
full rationale
The paper describes a smartphone-based teleoperation system and reports an empirical result: fine-tuning GR00T-N1.5 on 130 collected episodes yields 90% success on a Dobot CR5 pick-and-place task. No equations, fitted parameters, predictions, or derivation steps appear in the provided text. The central claim is supported by direct hardware deployment rather than by internal definitions, self-citations, or renaming of known results. This is a standard systems paper whose validation chain is externally falsifiable and contains no load-bearing reductions to its own inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Octo: An open-source generalist robot policy,
Octo Model Team, D. Ghosh, H. Walke, K. Pertschet al., “Octo: An open-source generalist robot policy,” inProc. Robotics: Science and Systems (RSS), 2024
2024
-
[2]
OpenVLA: An open-source vision-language-action model,
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiaoet al., “OpenVLA: An open-source vision-language-action model,” inProc. Conf. Robot Learning (CoRL), 2024
2024
-
[3]
Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,
Z. Fu, T. Z. Zhao, and C. Finn, “Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation,” inProc. Conf. Robot Learning (CoRL), 2024
2024
-
[4]
Virtual reality based robot teleoperation via human-scene interaction,
L. Meng, J. Liu, W. Chai, J. Wang, and M. Q.-H. Meng, “Virtual reality based robot teleoperation via human-scene interaction,”Procedia Comput. Sci., vol. 226, pp. 141–148, 2023
2023
-
[5]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Daet al., “GR00T N1: An open foundation model for generalist humanoid robots,” 2025, arXiv preprint arXiv:2503.14734
work page internal anchor Pith review arXiv 2025
-
[6]
Tele-manipulation of robot arm with smartphone,
C. Parga, X. Li, and W. Yu, “Tele-manipulation of robot arm with smartphone,” inProc. 6th Int. Symp. Resilient Control Syst. (ISRCS), Aug. 2013, pp. 60–65
2013
-
[7]
Development of smartphone- based human-robot interfaces for individuals with disabilities,
L. Wu, R. Alqasemi, and R. Dubey, “Development of smartphone- based human-robot interfaces for individuals with disabilities,”IEEE Robot. Autom. Lett., vol. 5, no. 4, pp. 5835–5841, 2020
2020
-
[8]
RoboTurk: A crowdsourcing platform for robotic skill learning through imitation,
A. Mandlekar, Y . Zhu, A. Garg, J. Booheret al., “RoboTurk: A crowdsourcing platform for robotic skill learning through imitation,” inProc. Conf. Robot Learning (CoRL), 2018, pp. 879–893
2018
-
[9]
Learning multi-arm manipulation through collaborative teleoperation,
A. Tung, J. Wong, A. Mandlekar, R. Mart’in-Mart’inet al., “Learning multi-arm manipulation through collaborative teleoperation,” inProc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2021, pp. 9212–9219
2021
-
[10]
Error-aware imitation learning from teleoperation data for mobile manipulation,
J. Wong, A. Tung, A. Kurenkov, A. Mandlekaret al., “Error-aware imitation learning from teleoperation data for mobile manipulation,” inProc. Conf. Robot Learning (CoRL), 2021, pp. 1367–1378
2021
-
[11]
5G virtual reality manipulator teleoperation using a mobile phone,
A. Werner and W. Melek, “5G virtual reality manipulator teleoperation using a mobile phone,”arXiv preprint arXiv:2403.02450, 2024
-
[12]
TidyBot++: An open-source holonomic mobile manipulator for robot learning,
J. Wu, W. Chong, R. Holmberg, A. Prasadet al., “TidyBot++: An open-source holonomic mobile manipulator for robot learning,” in Proc. Conf. Robot Learning (CoRL), 2024
2024
-
[13]
TinyVLA: Toward fast, data-efficient vision-language-action models for robotic manipulation,
J. Wen, Y . Zhu, J. Li, M. Zhuet al., “TinyVLA: Toward fast, data-efficient vision-language-action models for robotic manipulation,” IEEE Robot. Autom. Lett., vol. 10, no. 4, pp. 3988–3995, Apr. 2025
2025
-
[14]
π 0: A vision- language-action flow model for general robot control,
K. Black, N. Brown, D. Driess, A. Esmailet al., “π 0: A vision- language-action flow model for general robot control,” inProc. Robotics: Science and Systems (RSS), 2025
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.