AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation
Pith reviewed 2026-05-12 03:40 UTC · model grok-4.3
The pith
AffordSim generates robot manipulation trajectories at scale by grounding language-described affordances in simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AffordSim integrates open-vocabulary 3D affordance prediction directly into simulation-based trajectory generation. Given a task description, the system synthesizes a scene, emits affordance queries, grounds them on object geometry, samples region-conditioned grasps, and retains only those motions that succeed under planning. On the resulting 50-task benchmark it collects trajectories at 93 percent of the rate achieved by manual contact annotations for affordance-critical skills and 89 percent for hard composite skills, while vision-language-action policies trained solely on the synthetic data transfer zero-shot to a real Franka FR3 arm with 24 percent average success.
What carries the argument
The pipeline that converts natural-language task descriptions into executable trajectories by predicting and grounding open-vocabulary 3D affordances on object surfaces to condition grasp sampling and motion planning.
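To make that control flow concrete, here is a minimal sketch of the generation loop as the review describes it. Every helper name (`synthesize_scene`, `ground_affordance`, and so on) is a hypothetical placeholder standing in for a component the paper composes; this illustrates the stage ordering, not the authors' implementation.

```python
# Hypothetical sketch of the AffordSim generation loop; all helper names
# are placeholders for the components described above, not a published API.

def generate_trajectories(task_description: str, n_grasps: int = 64) -> list:
    """One language-described task in, a set of executable trajectories out."""
    scene = synthesize_scene(task_description)            # task-relevant objects and layout
    queries = emit_affordance_queries(task_description)   # e.g. "the handle of the mug"

    trajectories = []
    for query in queries:
        obj = scene.object_for(query)
        # Ground the open-vocabulary query to a contact region on the geometry.
        region = ground_affordance(obj, query)
        # Sample grasps conditioned on that region rather than on stability alone.
        for grasp in sample_region_conditioned_grasps(region, n=n_grasps):
            plan = motion_plan(scene, grasp)              # None if no collision-free path
            if plan is not None and executes_in_simulation(scene, plan):
                trajectories.append(plan)                 # keep only motions that succeed
    return trajectories
```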
If this is right
- Trajectory collection no longer requires per-object manual contact rewriting, enabling datasets spanning hundreds of objects and tasks.
- Vision-language-action policies trained entirely in simulation can execute affordance-dependent skills on physical hardware without additional fine-tuning.
- Randomization of pose, appearance, lighting, and viewpoint during generation improves robustness when the learned policies cross the sim-to-real boundary (a randomization sketch follows this list).
- A single benchmark now standardizes evaluation across multiple robot embodiments and both rigid and articulated objects.
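As a rough illustration of that randomization step, the sketch below samples per-episode parameters for the factors the abstract lists (pose, texture, lighting, image noise, backgrounds). The ranges and pool sizes are invented for illustration; the paper does not report them.

```python
import random
from dataclasses import dataclass

@dataclass
class RandomizationConfig:
    # Illustrative values only; the paper names these factors but not their ranges.
    pose_jitter_m: float = 0.05          # object position jitter, meters
    yaw_jitter_rad: float = 0.5          # object orientation jitter, radians
    n_textures: int = 100                # size of the texture pool per object
    light_range: tuple = (0.3, 1.5)      # relative light-intensity range
    pixel_noise_std: float = 0.02        # additive image-noise standard deviation
    n_backgrounds: int = 20              # cross-viewpoint background pool

def sample_episode(cfg: RandomizationConfig) -> dict:
    """Draw the randomized parameters for one generated episode."""
    return {
        "dx": random.uniform(-cfg.pose_jitter_m, cfg.pose_jitter_m),
        "dy": random.uniform(-cfg.pose_jitter_m, cfg.pose_jitter_m),
        "dyaw": random.uniform(-cfg.yaw_jitter_rad, cfg.yaw_jitter_rad),
        "texture_id": random.randrange(cfg.n_textures),
        "light_intensity": random.uniform(*cfg.light_range),
        "noise_std": cfg.pixel_noise_std,
        "background_id": random.randrange(cfg.n_backgrounds),
    }
```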
Where Pith is reading between the lines
- The same grounding step could be applied to generate data for skills involving deformable or fluid objects once suitable affordance models exist.
- The 24 percent real-robot success rate sets a concrete baseline that future improvements in affordance grounding or policy architecture can be measured against.
- Because the method separates affordance prediction from embodiment-specific planning, it may generalize to new robot platforms with only motion-planner changes.
Load-bearing premise
Open-vocabulary 3D affordance predictions accurately locate task-relevant contact regions on previously unseen objects inside simulation.
What would settle it
Run the same set of tasks with both AffordSim-generated trajectories and manual-contact trajectories, then compare the resulting real-robot success rates; a statistically significant gap favoring manual data would falsify the central performance claim.
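A concrete way to run that comparison: collect matched real-robot rollouts under both data sources and test whether the success-rate gap is significant. The sketch below uses a standard two-proportion z-test; the trial counts are hypothetical placeholders, not numbers from the paper.

```python
from math import erf, sqrt

def two_proportion_z_test(k_a: int, n_a: int, k_b: int, n_b: int):
    """Two-sided z-test for a difference between two success rates."""
    p_a, p_b = k_a / n_a, k_b / n_b
    p_pool = (k_a + k_b) / (n_a + n_b)                       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))    # via the normal CDF
    return z, p_value

# Hypothetical example: 30 rollouts per condition on the same task set.
z, p = two_proportion_z_test(k_a=12, n_a=30,   # policy trained on manual-contact data
                             k_b=7,  n_b=30)   # policy trained on AffordSim data
print(f"z = {z:.2f}, p = {p:.3f}")             # a significant gap would favor manual data
```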
Original abstract
Many everyday robot manipulation skills are affordance-dependent, with success determined by whether the robot contacts the functional object region required by the subsequent action. Current simulation data generators obtain contacts from generic grasp estimators or per-object manual contact annotations, but generic estimators rank stable grasps without task semantics and often select contacts that are misaligned with the downstream action, while manual contact annotations must be rewritten for each new object and task. To solve these challenges, we introduce AffordSim, a scalable data generator and benchmark that integrates open-vocabulary 3D affordance prediction into simulation-based trajectory generation. Given a natural-language task description, AffordSim synthesizes a task-relevant scene, emits affordance queries, grounds them on object surfaces, samples region-conditioned grasps, and selects executable candidates with motion planning. It further randomizes object pose, texture, lighting, image noise, and cross-viewpoint backgrounds for sim-to-real transfer. We instantiate AffordSim as a 50-task benchmark across diverse manipulation skills, five robot embodiments, and 500+ rigid and articulated objects. AffordSim achieves 93% of the trajectory collection success rate of manual contact annotations on affordance-critical tasks and 89% on hard composite tasks. Vision-language-action policies trained on AffordSim data transfer zero-shot to a real Franka FR3, reaching 24% average success.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AffordSim, a scalable simulation data generator and benchmark for affordance-aware robotic manipulation. It integrates open-vocabulary 3D affordance prediction with scene synthesis, surface grounding, region-conditioned grasp sampling, and motion planning to produce task-relevant trajectories from natural-language descriptions. The work instantiates a 50-task benchmark spanning five robot embodiments and 500+ rigid/articulated objects, with domain randomization for sim-to-real transfer. It claims AffordSim achieves 93% of manual contact annotation trajectory success on affordance-critical tasks and 89% on hard composite tasks, and that VLA policies trained on the generated data reach 24% average zero-shot success on a real Franka FR3.
Significance. If the central results hold, AffordSim addresses a practical scalability bottleneck in generating high-quality, affordance-aware manipulation data without per-object manual annotations. The scale of the benchmark (50 tasks, 500+ objects) and the reported sim-to-real transfer performance would be useful for the community if the generator and data are released reproducibly. The pipeline's combination of language grounding with standard randomization techniques is a pragmatic engineering contribution that could support better generalization in vision-language-action policies.
major comments (2)
- [Abstract and §5] Abstract and §5 (Evaluation): The reported 93% and 89% trajectory collection success rates relative to manual annotations, and the 24% real-robot transfer, are presented without any quantitative metrics on the open-vocabulary 3D affordance grounding accuracy itself (e.g., no precision, recall, or IoU for predicted contact regions on the 500+ novel objects across the 50 tasks). This is load-bearing for the central claim because the method's value over generic grasp estimators rests on reliable task-relevant region identification; without these metrics it is impossible to isolate the affordance module's contribution from the motion-planning filter.
- [§4] §4 (Method): The pipeline description states that affordance queries are 'grounded on object surfaces' and used to condition grasps, but provides no ablation or controlled comparison quantifying how much the affordance predictions improve trajectory executability or downstream policy performance over non-affordance baselines on the same objects and tasks.
minor comments (2)
- [Abstract] The abstract introduces 'hard composite tasks' without a brief definition or reference to the specific task breakdown in the benchmark.
- [§4] Ensure the specific open-vocabulary 3D affordance model and its hyperparameters are stated explicitly in the method section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract and §5] The reported 93% and 89% trajectory collection success rates relative to manual annotations, and the 24% real-robot transfer, are presented without any quantitative metrics on the open-vocabulary 3D affordance grounding accuracy itself (e.g., no precision, recall, or IoU for predicted contact regions on the 500+ novel objects across the 50 tasks). This is load-bearing for the central claim because the method's value over generic grasp estimators rests on reliable task-relevant region identification; without these metrics it is impossible to isolate the affordance module's contribution from the motion-planning filter.
Authors: We agree that direct metrics on affordance grounding accuracy (precision, recall, IoU) are absent and would better isolate the module's contribution. The manuscript prioritizes end-to-end trajectory success versus manual annotations as the key utility metric for data generation. To address this, the revision will add an evaluation of affordance prediction accuracy on a representative subset of objects and tasks, using the ground-truth contact regions available from the manual annotations for comparison (a metric sketch follows these responses). revision: yes
Referee: [§4] The pipeline description states that affordance queries are 'grounded on object surfaces' and used to condition grasps, but provides no ablation or controlled comparison quantifying how much the affordance predictions improve trajectory executability or downstream policy performance over non-affordance baselines on the same objects and tasks.
Authors: We acknowledge the value of explicit ablations to quantify the affordance module's impact. The current results report full-pipeline performance but lack controlled comparisons removing affordance conditioning. The revision will include new experiments ablating the affordance predictions, reporting differences in trajectory success rates and downstream VLA policy performance on identical objects and tasks. revision: yes
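If ground-truth and predicted contact regions are both expressed as per-point masks over a shared object-surface sampling (a common convention, assumed here rather than stated in the paper), the metrics the first response commits to reporting reduce to a few lines:

```python
import numpy as np

def contact_region_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Precision, recall, and IoU for boolean per-point contact-region masks.

    pred, gt: boolean arrays over the same set of surface points, True where
    a point lies in the predicted / manually annotated contact region.
    """
    tp = np.sum(pred & gt)    # points in both regions
    fp = np.sum(pred & ~gt)   # predicted but not annotated
    fn = np.sum(~pred & gt)   # annotated but missed
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }
```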
Circularity Check
No circularity: pipeline composes external affordance and planning modules without self-referential reductions or fitted predictions
full rationale
The paper presents AffordSim as a compositional pipeline: natural-language task input → scene synthesis → open-vocabulary 3D affordance query emission and grounding → region-conditioned grasp sampling → motion-planning filtering → randomization for sim-to-real. No equations, parameter fits, or definitions appear that make any output (e.g., trajectory success rates) equivalent to its inputs by construction. Claims of 93% and 89% relative success versus manual annotations are empirical comparisons to an independent baseline, not statistical artifacts of fitting. No self-citation chains or uniqueness theorems are invoked to justify core choices. The derivation is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Open-vocabulary 3D affordance prediction models can accurately ground natural-language task descriptions to functional regions on object surfaces in simulation (see the grounding sketch below).
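To make the assumption concrete, here is what such a grounding step typically looks like in CLIP-style open-vocabulary models: score per-point features against the query embedding and keep the top-scoring region. The feature inputs are assumed to come from some such model; the paper does not name one, so this is an illustration of the assumed capability, not its method.

```python
import numpy as np

def ground_query(point_feats: np.ndarray, text_feat: np.ndarray,
                 keep_fraction: float = 0.05) -> np.ndarray:
    """Boolean mask over surface points for one open-vocabulary query.

    point_feats: (N, D) per-point features; text_feat: (D,) query embedding.
    Both are assumed L2-normalized, so the dot product is cosine similarity.
    """
    scores = point_feats @ text_feat              # (N,) similarity to the query
    k = max(1, int(keep_fraction * len(scores)))
    kth_largest = np.partition(scores, -k)[-k]    # score of the k-th best point
    return scores >= kth_largest                  # the predicted contact region
```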
Reference graph
Works this paper leans on
- [1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164.
- [2] Jiayuan Gu, Fanbo Xiang, Xuanlin Li, Zhan Ling, Xiqiang Liu, Tongzhou Mu, Yihe Tang, Stone Tao, Xinyue Wei, Yunchao Yao, et al. ManiSkill2: A unified benchmark for generalizable manipulation skills. arXiv preprint arXiv:2302.04659.
- [3] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. GenSim2: Scaling robot data generation with multi-modal and reasoning LLMs. arXiv preprint arXiv:2410.03645.
- [4] Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. RoboTwin: Dual-arm robot benchmark with generative digital twins. arXiv preprint arXiv:2409.02920.
- [5] Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, et al. RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523.
- [6] RoboVerse Team. RoboVerse: Towards a unified platform for scalable and generalizable robot learning. arXiv preprint arXiv:2504.09837.
- [7] Yufei Wang, Zhou Fan, Zackory Jia, Siddhartha Srinivasa, and Danfei Xu. RoboGen: Towards unleashing infinite data for automated robot learning via generative simulation. arXiv preprint arXiv:2311.01455.