VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

Bill Cai; Hadi Khezam; Lifeng Zhou; Ran Yang; Shijie Geng; Yiming Feng; Yue Zheng; Zijian An

arxiv: 2605.02037 · v2 · pith:OVCLD7AMnew · submitted 2026-05-03 · 💻 cs.RO · cs.AI

Zijian An, Hadi Khezam, Bill Cai, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Lifeng Zhou

Pith reviewed 2026-05-08 19:29 UTC · model grok-4.3

keywords: low-cost robotics · vision-language-action · soft grasping · kirigami gripper · robotic manipulation · VLA policy learning · modular hardware

The pith

Low-cost modular robots with a kirigami soft gripper can train and run vision-language-action policies for delicate grasping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VILAS, a complete low-cost platform that combines an affordable collaborative arm, an electric gripper upgraded with a soft kirigami extension, dual cameras, and a unified communication layer. The setup supports teleoperation for data collection, and existing vision-language-action models are then fine-tuned on the collected demonstrations. The authors test the resulting policies on a grape-grasping task that requires gentle contact without force sensors. If the approach holds, it shows that advanced manipulation learning does not need expensive specialized hardware, which puts such systems within reach of more users and labs.

Core claim

VILAS integrates a Fairino FR5 arm, Jodell RG52-50 gripper with kirigami soft extension, and dual-camera module through a ZMQ architecture to handle teleoperation, data collection, and policy deployment in one framework. Fine-tuning of pi_0, pi_0.5, and GR00T N1.6 models on the same teleoperation dataset enables successful deployment on grape grasping, confirming that capable manipulation policies can be trained and run on low-cost modular hardware without explicit force sensing.
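The summary above names ZMQ as the transport but gives no message format. As a minimal sketch of how one message on such a bus might be framed, the Python below packs a topic, a JSON header, and a binary payload into three byte frames. The topic name `arm/state`, the header fields, and the little-endian float64 joint encoding are all assumptions made for illustration and are not taken from the paper.

```python
import json
import struct
import time


def pack_joints(joints: list[float]) -> bytes:
    """Encode joint values as little-endian float64 (assumed encoding)."""
    return struct.pack(f"<{len(joints)}d", *joints)


def unpack_joints(payload: bytes) -> list[float]:
    """Decode a payload produced by pack_joints."""
    return list(struct.unpack(f"<{len(payload) // 8}d", payload))


def pack_frames(topic: str, payload: bytes, **meta) -> list[bytes]:
    """Build [topic, header, payload] frames for one message.

    The header carries a wall-clock timestamp plus any caller metadata,
    so a logger can align arm state with camera frames later.
    """
    header = {"stamp": time.time(), **meta}
    return [topic.encode("utf-8"), json.dumps(header).encode("utf-8"), payload]


def unpack_frames(frames: list[bytes]) -> tuple[str, dict, bytes]:
    """Split a three-frame message back into topic, header, and payload."""
    topic, header, payload = frames
    return topic.decode("utf-8"), json.loads(header), payload


# Hypothetical usage: one arm-state message with three joint values.
frames = pack_frames("arm/state", pack_joints([0.0, 0.5, -1.0]), seq=7)
topic, header, payload = unpack_frames(frames)
print(topic, header["seq"], unpack_joints(payload))
```

With pyzmq, a frame list of this shape could be handed to `socket.send_multipart(frames)` and read back with `socket.recv_multipart()`. Putting the topic in the first frame lets a SUB socket filter by prefix, so teleoperation, logging, and policy processes can each subscribe only to the streams they need.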

What carries the argument

The kirigami-based soft compliant gripper extension, which deforms predictably under compressive loading and so provides gentle, repeatable contact with delicate objects without force sensing.
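The page gives no mechanical model for the gripper extension. As a textbook idealization only, and not the authors' analysis, a single thin strip of the pattern can be treated as an Euler column, which buckles once the compressive load reaches a critical value:

```latex
P_{\mathrm{cr}} = \frac{\pi^{2} E I}{(K L)^{2}},
\qquad
I = \frac{w\, t^{3}}{12}
```

Here $E$ is the Young's modulus of the sheet material, $L$ the strip length, $K$ the effective-length factor set by the end conditions, and $I$ the second moment of area of a rectangular strip of width $w$ and thickness $t$. All of these symbols are assumptions introduced for this sketch. Under this idealization the load carried by a buckled strip stays close to $P_{\mathrm{cr}}$ for small deflections, which is one way a purely passive structure could bound contact force without a sensor. Whether the actual kirigami pattern behaves this way would have to be measured.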

If this is right

Capable VLA policies can be successfully fine-tuned and deployed using only low-cost modular components and standard teleoperation demonstrations.
The soft gripper enables safe handling of fragile items like grapes without dedicated force or tactile sensors.
A single ZMQ-based framework can coordinate perception, control, data logging, and policy execution on accessible hardware.
Multiple pretrained VLA models show comparable real-world performance when adapted to the same platform and dataset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar kirigami extensions could be adapted for other delicate tasks such as handling produce or laboratory samples.
The modular ZMQ architecture might simplify integration of new arms or sensors across different robot setups.
If the soft gripper pattern proves reliable, it offers a low-cost route to compliant grasping that avoids specialized end-effectors.

Load-bearing premise

The kirigami gripper produces consistent gentle deformation on contact and the teleoperation data supplies enough variety for the VLA models to fine-tune effectively.

What would settle it

Run repeated grape-grasping trials with the fine-tuned models and check whether the gripper damages fruit or fails to pick it at rates no better than an untrained baseline.
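The page does not prescribe a statistical test for this comparison. One conventional choice for small trial counts is a one-sided Fisher exact test on the 2x2 table of successes and failures. The sketch below uses only the standard library. The function name and the trial counts in the usage lines are hypothetical and are not results from the paper.

```python
from math import comb


def fisher_one_sided(tuned_ok: int, tuned_fail: int,
                     base_ok: int, base_fail: int) -> float:
    """P-value for 'the fine-tuned policy succeeds more often than baseline'.

    Sums the hypergeometric probability of every table, with the same
    margins, in which the fine-tuned policy has at least tuned_ok successes.
    """
    n_tuned = tuned_ok + tuned_fail
    n_ok = tuned_ok + base_ok
    n_total = n_tuned + base_ok + base_fail
    if n_tuned == 0 or n_total == n_tuned:
        raise ValueError("both policies need at least one trial")

    favourable = 0
    for k in range(tuned_ok, min(n_tuned, n_ok) + 1):
        # k successes drawn for the tuned policy, the rest are its failures.
        favourable += comb(n_ok, k) * comb(n_total - n_ok, n_tuned - k)
    return favourable / comb(n_total, n_tuned)


# Hypothetical counts: 10 trials per policy.
print(fisher_one_sided(9, 1, 3, 7))   # small p-value: tuned looks better
print(fisher_one_sided(5, 5, 5, 5))   # large p-value: no evidence of a gain
```

A large p-value here corresponds to the "no better than an untrained baseline" outcome described above. Fruit damage could be tabulated the same way by counting damaged and undamaged grapes instead of successes and failures.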

Figures

Figures reproduced from arXiv: 2605.02037 by Bill Cai, Hadi Khezam, Lifeng Zhou, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Zijian An.

**Figure 1.** Overview of the VILAS system. (Left) The physical robotic platform.

**Figure 2.** Kirigami structure design and experimental demonstration. (a) Photograph of the fabricated kirigami structure buckled within a gripper during testing, (b) Top view in Fusion 360, (c) Isometric view in Fusion 360. A low cost kirigami based pattern was developed to be used as a soft extension grabber for a safe and effective method of handling delicate objects as shown in …

**Figure 3.** Communication architecture of the VILAS system during data collection.

**Figure 4.** Policy deployment overview and representative execution sequence using …

**Figure 5.** Execution sequence of the cherry grasping task using the …

Original abstract

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

VILAS is a straightforward engineering paper that assembles existing VLA models, a cheap arm, and a kirigami soft gripper into one working low-cost platform, with the main new piece being the specific integration and gripper design for delicate grasping.

VILAS shows how to run and fine-tune three public VLA models on a modular low-cost setup built around a Fairino FR5 arm, Jodell gripper, dual cameras, and ZMQ messaging. The authors add a kirigami-based soft extension to the gripper so that it deforms predictably under load and handles fragile items like grapes without force sensors. The system supports teleoperation, data collection, and policy deployment in one framework, and pi_0, pi_0.5, and GR00T N1.6 are each fine-tuned on the same demonstration set for a grape-grasping task.

Referee Report

2 major / 2 minor

Summary. The manuscript presents VILAS, a low-cost modular robotic manipulation platform that integrates a Fairino FR5 collaborative arm, Jodell RG52-50 electric gripper, dual-camera perception, and a ZMQ-based communication architecture to support teleoperation, data collection, and end-to-end VLA policy deployment. It introduces a kirigami-based soft compliant gripper extension for safe handling of fragile objects without explicit force sensing. Three state-of-the-art VLA models (pi_0, pi_0.5, and GR00T N1.6) are fine-tuned from public checkpoints on an identical teleoperation dataset and evaluated on a grape grasping task, with the abstract claiming that the experiments validate effective training and deployment on low-cost hardware.

Significance. If the empirical validation holds with quantitative support, the work offers a practical contribution to accessible robotics by lowering hardware barriers for VLA research and addressing safe manipulation of delicate items via the soft gripper design. The unified ZMQ framework and identical-dataset fine-tuning across models provide a useful engineering demonstration and deployment insights, though the absence of metrics limits its value as a reproducible benchmark.

major comments (2)

[Abstract] Abstract: the claim that 'Experiments on a grape grasping task validate the effectiveness of the proposed system' is unsupported by any reported success rates, failure analysis, baselines, or trial counts, which is load-bearing for the central empirical assertion that capable policies can be trained and deployed on the hardware.
[Experiments] Experiments section: no details are provided on demonstration dataset size, data collection protocol, fine-tuning hyperparameters, or quantitative performance metrics for the three models, preventing assessment of whether the teleoperation data suffices for effective fine-tuning or whether the kirigami gripper performs predictably under load.

minor comments (2)

The ZMQ architecture description would benefit from a system diagram or pseudocode to clarify coordination between teleoperation, data logging, and policy inference.
[Abstract] Consider adding a brief comparison table of the three VLA models' deployment characteristics (e.g., inference latency or success patterns) to strengthen the practical insights claimed in the abstract.
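The second minor comment mentions inference latency as one deployment characteristic worth tabulating. A minimal sketch of how per-step latency could be collected is shown below. `policy_step` is a hypothetical callable standing in for one forward pass of any of the three models, and the warm-up count is an arbitrary assumption.

```python
import statistics
import time
from typing import Any, Callable, Sequence


def measure_latency(policy_step: Callable[[Any], Any],
                    observations: Sequence[Any],
                    warmup: int = 2) -> dict:
    """Time policy_step on each observation and summarise in milliseconds.

    The first `warmup` calls are run but not timed, since first calls
    often include one-off costs such as lazy initialisation.
    """
    if len(observations) <= warmup:
        raise ValueError("need more observations than warm-up steps")

    for obs in observations[:warmup]:
        policy_step(obs)

    samples_ms = []
    for obs in observations[warmup:]:
        start = time.perf_counter()
        policy_step(obs)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "n": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": max(samples_ms),
    }


# Hypothetical usage with a dummy policy that just echoes its input.
summary = measure_latency(lambda obs: obs, list(range(10)))
print(summary)
```

Running the same helper against each fine-tuned model, on the same recorded observations, would give one row per model for the comparison table the referee suggests.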

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The points raised highlight important areas for strengthening the empirical claims and reproducibility. We will revise the manuscript to incorporate quantitative details and metrics as outlined below.

Referee: [Abstract] Abstract: the claim that 'Experiments on a grape grasping task validate the effectiveness of the proposed system' is unsupported by any reported success rates, failure analysis, baselines, or trial counts, which is load-bearing for the central empirical assertion that capable policies can be trained and deployed on the hardware.

Authors: We agree that the abstract's validation claim would benefit from explicit quantitative support to stand on its own. In the revised manuscript, we will update the abstract to include key metrics such as success rates (e.g., X/Y trials for each model), trial counts, and a concise failure analysis summary. This will directly substantiate the assertion that capable policies can be trained and deployed on the low-cost hardware without relying solely on the experiments section.

Revision: yes

Referee: [Experiments] Experiments section: no details are provided on demonstration dataset size, data collection protocol, fine-tuning hyperparameters, or quantitative performance metrics for the three models, preventing assessment of whether the teleoperation data suffices for effective fine-tuning or whether the kirigami gripper performs predictably under load.

Authors: We acknowledge that the current Experiments section lacks these specifics, which are necessary for full reproducibility and evaluation. We will expand this section in the revision to report: the demonstration dataset size (number of trajectories collected), the teleoperation data collection protocol, fine-tuning hyperparameters for pi_0, pi_0.5, and GR00T N1.6, and quantitative performance metrics including success rates, failure modes, and any observations on the kirigami gripper's behavior under load during the grape grasping task. These additions will enable readers to assess the sufficiency of the data and the gripper's predictability.

Revision: yes
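Success counts of the X/Y form promised in the rebuttal are easier to compare across models when each is reported with an interval. The sketch below computes a Wilson score interval, a standard choice for small trial counts. It is offered as one possible reporting convention, not as the authors' method, and the counts in the usage line are hypothetical.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical result: 8 successful grasps out of 10 trials.
low, high = wilson_interval(8, 10)
print(f"8/10 -> [{low:.2f}, {high:.2f}]")
```

With only ten or twenty trials per model the intervals are wide, which is itself useful information when judging whether the three models really perform comparably.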

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper is an engineering demonstration of a low-cost robotic manipulation platform (VILAS) that integrates hardware components, a kirigami soft gripper, ZMQ architecture, and fine-tuning of three publicly released VLA models (pi_0, pi_0.5, GR00T N1.6) on teleoperation data for a grape-grasping task. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claim rests on empirical system performance and experimental validation rather than any reduction of outputs to inputs by construction, self-citation chains, or ansatz smuggling. The validation uses external pretrained checkpoints and real-world task results, rendering the presentation self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper is an engineering systems contribution focused on hardware integration and empirical evaluation. No free parameters or mathematical axioms are introduced in the central claim; the one invented entity is the custom kirigami gripper extension.

invented entities (1)

kirigami-based soft compliant gripper extension (no independent evidence)
Purpose: to enable safe manipulation of fragile objects without relying on explicit force sensing, by inducing predictable deformation under compressive loading.
A custom design choice for the platform; the abstract provides no independent evidence or external validation beyond the grape grasping experiment.

pith-pipeline@v0.9.0 · 5547 in / 1273 out tokens · 72352 ms · 2026-05-08T19:29:26.733353+00:00 · methodology
