BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances
Pith reviewed 2026-05-08 07:50 UTC · model grok-4.3
The pith
BridgeACT transfers human video demonstrations into executable robot actions by using affordances as an embodiment-agnostic bridge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BridgeACT models affordance as an embodiment-agnostic intermediate representation that bridges human demonstrations and robot actions. It decomposes each manipulation into grounding task-relevant affordance regions in the current scene and predicting task-conditioned 3D motion affordances from human videos. The resulting affordances are executed on a robot via a grasping module and a lightweight closed-loop motion controller, supporting direct real-world deployment and composition of complex tasks from basic affordance operations.
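The two-part decomposition described above (where to grasp, how to move) can be sketched as a small pipeline. This is a minimal illustration, not the paper's implementation: every name, type, and placeholder heuristic below is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative types; names are assumptions, not from the paper.
Point3D = Tuple[float, float, float]

@dataclass
class Affordance:
    """Embodiment-agnostic intermediate: where to grasp, how to move."""
    grasp_region: List[Point3D]   # task-relevant region grounded in the scene
    motion: List[Point3D]         # task-conditioned 3D motion trajectory

def ground_grasp_region(scene_points: List[Point3D], task: str) -> List[Point3D]:
    # Placeholder for the grounding stage: select task-relevant points
    # (here, a trivial heuristic keeping points above the table plane).
    return [p for p in scene_points if p[2] > 0.0]

def predict_motion(task: str, start: Point3D) -> List[Point3D]:
    # Placeholder for the video-derived motion affordance: a straight lift.
    x, y, z = start
    return [(x, y, z + 0.05 * i) for i in range(4)]

def build_affordance(scene_points: List[Point3D], task: str) -> Affordance:
    region = ground_grasp_region(scene_points, task)
    return Affordance(grasp_region=region,
                      motion=predict_motion(task, region[0]))

def execute(affordance: Affordance):
    # Stand-in for the grasping module + closed-loop controller:
    # grasp inside the region, then track the motion waypoints.
    return [("grasp", affordance.grasp_region[0])] + \
           [("move", wp) for wp in affordance.motion]
```

The point of the sketch is the interface: everything upstream of `execute` is embodiment-agnostic, and only the execution step touches the robot.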
What carries the argument
The unified tool-target affordances, which act as the embodiment-agnostic bridge by extracting grasp regions and task-conditioned 3D motion trajectories from human videos and mapping them to robot execution modules.
If this is right
- Robots can perform real-world manipulation tasks using only models trained on human videos.
- Diverse tasks and object-to-object interactions are handled uniformly by composing sequences of affordance operations.
- Generalization holds to unseen objects, scenes, and viewpoints without retraining.
- Performance exceeds prior methods that require robot demonstration data or produce only perception-level outputs.
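The compositional claim above reduces to a simple control structure: a complex task is an ordered list of basic affordance operations run through one uniform loop. The operation names below are illustrative assumptions, not the paper's vocabulary.

```python
def run_task(operations, execute_one):
    """Run a complex task as an ordered composition of basic affordance
    operations, treating every operation uniformly."""
    return [execute_one(op) for op in operations]

# A hypothetical "put the spoon in the cup" task, decomposed into two
# basic affordance operations (operation names are illustrative).
task = [
    {"op": "grasp_move", "object": "spoon", "target": "cup"},
    {"op": "release", "object": "spoon"},
]
```

Under this framing, object-to-object interactions need no special case: they are just operations whose target is another object.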
Where Pith is reading between the lines
- Large collections of internet human videos could become practical training sources for robots at scale.
- The same bridging idea might transfer skills between different robot hardware without per-robot data collection.
- Adding handling for dynamic obstacles or contact-rich interactions would test the limits of the closed-loop controller.
Load-bearing premise
Affordance regions and 3D motion affordances extracted from human videos can be mapped accurately to robot actions by a grasping module and closed-loop controller without robot-specific data or adaptation.
What would settle it
A real-robot trial in which affordance predictions from human videos are accurate yet the physical grasps or motions repeatedly fail to match the intended task would show the mapping step is insufficient.
Original abstract
Learning robot manipulation from human videos is appealing due to the scale and diversity of human demonstrations, but transferring such demonstrations to executable robot behavior remains challenging. Prior work either relies on robot data for downstream adaptation or learns affordance representations that remain at the perception level and do not directly support real-world execution. We present BridgeACT, an affordance-driven framework that learns robotic manipulation directly from human videos without requiring any robot demonstration data. Our key idea is to model affordance as an embodiment-agnostic intermediate representation that bridges human demonstrations and robot actions. BridgeACT decomposes manipulation into two complementary problems: where to grasp and how to move. To this end, BridgeACT first grounds task-relevant affordance regions in the current scene, and then predicts task-conditioned 3D motion affordances from human demonstrations. The resulting affordances are mapped to robot actions through a grasping module and a lightweight closed-loop motion controller, enabling direct deployment on real robots. In addition, we represent complex manipulation tasks as compositions of affordance operations, which allows a unified treatment of diverse tasks and object-to-object interactions. Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BridgeACT, an affordance-driven framework for learning robotic manipulation directly from human videos without any robot demonstration data. It models affordances as embodiment-agnostic intermediates by grounding task-relevant regions in the scene and predicting task-conditioned 3D motion affordances from human demonstrations. These are mapped to executable robot actions via a grasping module and lightweight closed-loop controller, with complex tasks represented as compositions of affordance operations for unified handling of diverse manipulations and object interactions. The abstract claims outperformance over baselines and generalization to unseen objects, scenes, and viewpoints on real-world tasks.
Significance. If the central claims hold with rigorous validation, this would represent a meaningful advance in scalable robot learning by eliminating the need for robot-specific demonstration data and leveraging abundant human videos. The embodiment-agnostic affordance decomposition and compositional task representation are conceptually strong for improving generalization. The paper explicitly credits the use of external human video data as grounding, avoiding self-referential definitions.
Major comments (2)
- [Abstract] Abstract: The claim that 'Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints' is made without any metrics, baselines, success rates, or experimental protocol details. This is load-bearing for the central claim of superiority and generalization, as the evaluation cannot be assessed from the provided text.
- [Method] Approach description: The mapping step from extracted affordances to robot actions relies on an unspecified 'grasping module' and 'lightweight closed-loop motion controller' with no details on whether these incorporate learned components trained on robot data, robot kinematics, per-robot calibration, or any form of robot-specific adaptation. This directly affects the key assertion of learning 'directly from human videos without requiring any robot demonstration data,' as any implicit robot data here would undermine the guarantee.
Minor comments (1)
- [Abstract] The title references 'Unified Tool-Target Affordances' but the abstract does not explicitly define or distinguish tool versus target affordances; a short clarification in the introduction would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the presentation of our claims.
Point-by-point responses
- Referee: [Abstract] The claim that 'Experiments on real-world manipulation tasks show that BridgeACT outperforms prior baselines and generalizes to unseen objects, scenes, and viewpoints' is made without any metrics, baselines, success rates, or experimental protocol details. This is load-bearing for the central claim of superiority and generalization, as the evaluation cannot be assessed from the provided text.
Authors: We agree that the abstract, as a high-level summary, does not include specific quantitative metrics or protocol details, which can make the performance claims harder to evaluate at a glance. The full manuscript provides these in the Experiments section, including success rates, baseline comparisons, and evaluation protocols across real-world tasks. To address this directly, we will revise the abstract to concisely incorporate key results (such as average success rates and generalization metrics) while preserving its brevity. This revision will better substantiate the claims without changing the underlying findings. Revision: yes.
- Referee: [Method] Approach description: The mapping step from extracted affordances to robot actions relies on an unspecified 'grasping module' and 'lightweight closed-loop motion controller' with no details on whether these incorporate learned components trained on robot data, robot kinematics, per-robot calibration, or any form of robot-specific adaptation. This directly affects the key assertion of learning 'directly from human videos without requiring any robot demonstration data,' as any implicit robot data here would undermine the guarantee.
Authors: This is a valid concern for ensuring the embodiment-agnostic nature of our approach. The grasping module is a non-learned, geometry-driven component that directly uses the predicted task-relevant 3D affordance regions and standard point-cloud processing to compute grasp poses, with no training on robot demonstration data. The lightweight closed-loop motion controller executes the task-conditioned 3D motion affordances via simple feedback control (leveraging robot forward kinematics and real-time visual feedback) without any learned robot-specific components, demonstration data, or extensive per-robot calibration beyond standard deployment setup. We will revise the Method section to include explicit descriptions, implementation details, and pseudocode for these modules to unambiguously confirm that no robot demonstration data is involved, thereby reinforcing the core claim. Revision: yes.
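The kind of lightweight closed-loop controller the rebuttal describes, simple feedback toward each motion waypoint using observed end-effector position, can be sketched as a proportional tracker. This is an assumed minimal example, not the paper's controller; `observe` and `command` stand in for the robot's visual-feedback and Cartesian-command interfaces.

```python
import numpy as np

def follow_waypoints(waypoints, observe, command, gain=0.5,
                     tol=0.01, max_steps=200):
    """Minimal proportional closed-loop waypoint tracker (a sketch of the
    controller style described in the rebuttal, not the paper's code).

    observe()  -> current end-effector position, shape (3,), from feedback
    command(d) -> sends a small Cartesian displacement to the robot
    """
    for wp in waypoints:
        for _ in range(max_steps):
            err = np.asarray(wp) - observe()
            if np.linalg.norm(err) < tol:
                break                # waypoint reached within tolerance
            command(gain * err)      # step a fraction of the error each tick
    return observe()
```

Because each step is computed from the freshly observed position, the loop is robust to small perception or actuation errors without any learned, robot-specific component, which is the property the rebuttal leans on.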
Circularity Check
No circularity in derivation chain
Full rationale
The paper describes a framework that extracts embodiment-agnostic affordance regions and task-conditioned 3D motion affordances from human videos, then maps them to robot actions via a grasping module and closed-loop controller. No equations, fitted parameters, or self-referential definitions appear in the provided text. The central claim rests on external human video data and standard affordance concepts rather than reducing to a tautology or a self-citation chain. The mapping step is described at a high level without presenting fitted inputs as predictions or smuggling in ansatzes from the authors' prior work. This is a normal non-circular case for a systems paper.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Affordance can be modeled as an embodiment-agnostic intermediate representation bridging human demonstrations and robot actions.
Reference graph
Works this paper leans on
- [1] Y. Liu, Y. Liu, C. Jiang, K. Lyu, W. Wan, H. Shen, B. Liang, Z. Fu, H. Wang, and L. Yi, "HOI4D: A 4D egocentric dataset for category-level human-object interaction," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21013–21022.
- [2] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., "Scaling egocentric vision: The EPIC-KITCHENS dataset," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 720–736.
- [3] C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel, "Any-point trajectory modeling for policy learning," arXiv preprint arXiv:2401.00025, 2023.
- [4] C. Yuan, C. Wen, T. Zhang, and Y. Gao, "General flow as foundation affordance for scalable robot learning," arXiv preprint arXiv:2401.11439, 2024.
- [5] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu et al., "RT-1: Robotics Transformer for real-world control at scale," arXiv preprint arXiv:2212.06817, 2022.
- [6] P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai et al., "π0.5: A vision-language-action model with open-world generalization," arXiv preprint arXiv:2504.16054, 2025.
- [7] M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., "OpenVLA: An open-source vision-language-action model," arXiv preprint arXiv:2406.09246, 2024.
- [8] Y. Tang, W. Huang, Y. Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, "UAD: Unsupervised affordance distillation for generalization in robotic manipulation," in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 3822–3831.
- [9] H. Bharadhwaj, R. Mottaghi, A. Gupta, and S. Tulsiani, "Track2Act: Predicting point tracks from internet videos enables generalizable robot manipulation," in European Conference on Computer Vision. Springer, 2024, pp. 306–324.
- [10] W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, "ReKep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation," arXiv preprint arXiv:2409.01652, 2024.
- [11] B. Wang, J. Zhang, S. Dong, I. Fang, and C. Feng, "VLM see, robot do: Human demo video to robot action plan via vision language model," in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 17215–17222.
- [12] V. Myers, B. C. Zheng, O. Mees, S. Levine, and K. Fang, "Policy adaptation via language optimization: Decomposing tasks for few-shot imitation," arXiv preprint arXiv:2408.16228, 2024.
- [13] S. Lee, Y. Jung, I. Chun, Y.-C. Lee, Z. Cai, H. Huang, A. Talreja, T. D. Dao, Y. Liang, J.-B. Huang, and F. Huang, "TraceGen: World modeling in 3D trace space enables learning from cross-embodiment videos," 2025.
- [14] D. Niu, Y. Sharma, H. Xue, G. Biamby, J. Zhang, Z. Ji, T. Darrell, and R. Herzig, "Pre-training auto-regressive robotic models with 4D representations," arXiv preprint arXiv:2502.13142, 2025.
- [15] M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, "Flow as the cross-domain manipulation interface," arXiv preprint arXiv:2407.15208, 2024.
- [16] Y. Cao, Z. Bhaumik, J. Jia, X. He, and K. Fang, "Correspondence-oriented imitation learning: Flexible visuomotor control with 3D conditioning," 2025.
- [17] W. Huang, Y.-W. Chao, A. Mousavian, M.-Y. Liu, D. Fox, K. Mo, and L. Fei-Fei, "PointWorld: Scaling 3D world models for in-the-wild robotic manipulation," 2026.
- [18] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price et al., "The EPIC-KITCHENS dataset: Collection, challenges and baselines," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 11, pp. 4125–4141, 2020.
- [19] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., "Qwen3-VL technical report," arXiv preprint arXiv:2511.21631, 2025.
- [20] N. Carion, L. Gustafson, Y.-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V. Alwala, H. Khedr, A. Huang et al., "SAM 3: Segment anything with concepts," arXiv preprint arXiv:2511.16719, 2025.
- [21] N. Karaev, Y. Makarov, J. Wang, N. Neverova, A. Vedaldi, and C. Rupprecht, "CoTracker3: Simpler and better point tracking by pseudo-labelling real videos," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 6013–6022.
- [22] X. Wu, L. Jiang, P.-S. Wang, Z. Liu, X. Liu, Y. Qiao, W. Ouyang, T. He, and H. Zhao, "Point Transformer V3: Simpler, faster, stronger," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4840–4851.
- [23] G. Qian, Y. Li, H. Peng, J. Mai, H. Hammoud, M. Elhoseiny, and B. Ghanem, "PointNeXt: Revisiting PointNet++ with improved training and scaling strategies," Advances in Neural Information Processing Systems, vol. 35, pp. 23192–23204, 2022.
- [24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [26] T. Hang, S. Gu, C. Li, J. Bao, D. Chen, H. Hu, X. Geng, and B. Guo, "Efficient diffusion training via min-SNR weighting strategy," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7441–7451.
- [27] C. Yuan, S. Joshi, S. Zhu, H. Su, H. Zhao, and Y. Gao, "RoboEngine: Plug-and-play robot data augmentation with semantic robot segmentation and background generation," in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 7622–7629.
- [28] W. Bao, L. Chen, L. Zeng, Z. Li, Y. Xu, J. Yuan, and Y. Kong, "Uncertainty-aware state space transformer for egocentric 3D hand trajectory forecasting," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13702–13711.
- [29] S. Liu, S. Tripathi, S. Majumdar, and X. Wang, "Joint hand motion and interaction hotspots prediction from egocentric videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3282–3292.