FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

Alois Christian Knoll; Boyang Zhong; Hongli Xu; Jianwei Zhang; Kaixin Bai; Lei Zhang; Xiaoyue Hu; Zhaopeng Chen; Zhenshan Bing; Zolt\'an-Csaba M\'arton

arxiv: 2509.19102 · v2 · pith:BKBF5PIXnew · submitted 2025-09-23 · 💻 cs.RO · cs.AI· cs.CV

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

Hongli Xu , Lei Zhang , Xiaoyue Hu , Boyang Zhong , Kaixin Bai , Zolt\'an-Csaba M\'arton , Zhenshan Bing , Zhaopeng Chen

show 2 more authors

Alois Christian Knoll Jianwei Zhang

This is my paper

Pith reviewed 2026-05-21 22:11 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CV

keywords robotic manipulationimitation learningfunctional canonicalizationaction primitivesdiffusion policycategory generalizationsim-to-real transfer

0 comments

The pith

Functional object canonicalization using vision-language affordances lets imitation policies learn reusable pose-aware action primitives that generalize across object categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that long-horizon robotic tasks can be decomposed into short action chunks each tagged by actor, verb, and object. These chunks are then aligned across different objects by mapping each into a shared functional coordinate frame derived from affordance signals in large vision-language models. A diffusion policy called FuncDiffuser is trained directly on the resulting pose-aligned demonstrations. The alignment step supplies an inductive bias that lets the same learned primitives transfer to unseen objects, combine across tasks, and move from simulation to real hardware.

Core claim

FunCanon performs functional object canonicalization to place objects from different categories into common functional frames using affordance cues extracted from vision-language models. This produces aligned training data on which an object-centric and action-centric diffusion policy, FuncDiffuser, is trained. The policy therefore respects object poses and affordances by construction, turning task-specific imitation into reusable action-primitive learning.

What carries the argument

Functional object canonicalization: the mapping of arbitrary objects into shared functional frames via affordance cues from large vision-language models, which enables automatic trajectory transfer and pose-aware policy training.

If this is right

Category-level generalization becomes feasible because policies operate in functional rather than geometric coordinates.
Action primitives learned on one task can be reused on other tasks that share the same verb-object pair.
Sim-to-real transfer improves because the canonical frame removes appearance and pose differences between simulated and real objects.
Policy learning is simplified by focusing the model on short, pose-aware action chunks instead of full task sequences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment step could be applied to multi-step sequences involving several objects by composing multiple canonical frames.
If VLM affordance errors are the main failure mode, fine-tuning the VLM on robot-specific manipulation data would be a direct next experiment.
Extending the chunk representation to include tool or hand pose might further improve precision in contact-rich skills.

Load-bearing premise

Affordance cues extracted from large vision-language models are accurate and complete enough to produce correct functional alignments without losing task-critical details or adding errors.

What would settle it

Train two policies on the same demonstrations—one with functional canonicalization and one without—then measure success rate on a held-out object category; a large gap in favor of the canonicalized version would support the claim.

Figures

Figures reproduced from arXiv: 2509.19102 by Alois Christian Knoll, Boyang Zhong, Hongli Xu, Jianwei Zhang, Kaixin Bai, Lei Zhang, Xiaoyue Hu, Zhaopeng Chen, Zhenshan Bing, Zolt\'an-Csaba M\'arton.

**Figure 1.** Figure 1: We present FunCanon, a framework that transforms long-horizon manipulation into action chunks defined by actor, verb, and object. Through functional object canonicalization using affordance cues from vision–language models, we automatically generate manipulation trajectories based on functional alignment and train an object-centric and action-centric diffusion policies FuncDiffuser that generalize across o… view at source ↗

**Figure 2.** Figure 2: FunCanon Architecture. Our method consists of five components: (A) Action Primitive Modeling, where vision-language models perform AVO segmentation and task decomposition; (B) Functional Alignment, producing function-based canonicalization of objects; (C) Automatic Trajectory Transfer, which augments training data across instances and categories using online 3D datasets; (D) Action-Centric Diffusion Policy… view at source ↗

**Figure 3.** Figure 3: Real-World Experiment Setup with Franka Emika Robot [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Task Variants in Simulation with Generated Trajectories [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Qualitative Results of Functional Alignment. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative Results of Real-World Experiments. TABLE III: Real World Experiment Results. Methods Pick, Place Pick, Pour (L1) Pick, Pour (L2) 3DA [3] 48% 54% 62% SPOT [4] 52% 76% 60% FunCanon (Ours) 88% 90% 88% success rates are reported in Tab. III. Our method consistently outperforms the baselines. Compared with SPOT, FunCanon improves success by at lease +14%. The advantage is even more pronounced over 3… view at source ↗

**Figure 7.** Figure 7: Visualization of inference results in RLBench simulation. Each snapshot shows the robot executing sub-tasks generated [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: Object Reconstruction Results, including watering can, [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 10.** Figure 10: Additional results of functional alignment, from 3D self-supervised feature visualization, part-aware region proposal, [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗

**Figure 11.** Figure 11: Additional 2D visualization results of automatic trajectory transfer in different instance-level and category-level variants. [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗

read the original abstract

General-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object centric and action centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Functional canonicalization with VLM affordances is the main new move, but its reliability for consistent alignment across objects is the part that still needs checking.

read the letter

The core idea is to split manipulation into short action chunks labeled by actor, verb, and object, then map every object instance into a shared functional frame using affordance signals from a vision-language model. Once aligned, a diffusion policy learns pose-aware primitives that can be reused across categories and tasks. That alignment step is what they call functional canonicalization, and it is the piece that lets them claim category-level generalization and easier sim-to-real transfer without retraining from scratch for every new object shape. The experiments on simulated and real benchmarks are presented as evidence that this inductive bias actually works in practice. The project site with demos is a plus for seeing the behavior in motion. What stands out is the explicit separation of functional alignment from policy training; it keeps the diffusion model from having to discover object geometry and affordances at the same time. That separation is a reasonable engineering choice and matches the reported gains in cross-task reuse. The soft spot is exactly the one the stress-test flags. If the VLM occasionally swaps a handle for a spout or gives inconsistent part labels across instances of the same category, the resulting canonical frames carry systematic offsets. Those offsets go straight into the training distribution for FuncDiffuser, so the claimed robustness to pose variation could be fragile. The abstract and available description do not include error rates on the VLM step, ablation on alignment accuracy, or failure cases where transfer breaks. Without those numbers it is hard to judge how much of the generalization comes from the canonicalization versus careful dataset curation. The rest of the pipeline uses standard diffusion-policy machinery, so the technical risk sits mostly in that upstream assumption. This paper is aimed at people building imitation-learning systems for household or warehouse manipulation who already work with diffusion models or action chunking. A reader who cares about practical generalization would find the framework worth testing, even if they end up replacing the VLM cue with a more controlled source. It is coherent enough on its own terms to deserve referee time; the central claim is testable and the experiments are on real hardware, so an editor should send it out rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces FunCanon, a framework that decomposes long-horizon manipulation tasks into action chunks defined by actor, verb, and object to promote compositionality and reuse. It performs functional object canonicalization by mapping objects into shared functional frames using affordance cues extracted from large vision-language models, enabling automatic manipulation trajectory transfer across object categories. An object-centric and action-centric diffusion policy (FuncDiffuser) is then trained on the aligned data. The authors claim that this yields category-level generalization, cross-task behavior reuse, and robust sim2real transfer, with experiments on simulated and real-world benchmarks demonstrating that functional canonicalization supplies a strong inductive bias for scalable imitation learning.

Significance. If the central claims hold, the work would provide a novel inductive bias for imitation learning in robotics by leveraging functional alignment to achieve pose-aware, category-general policies that support behavior reuse without task-specific retraining. This could meaningfully advance generalizable manipulation in complex domains where end-to-end methods typically overfit to narrow distributions.

major comments (2)

[§3.2 and Figure 3] §3.2 and Figure 3: functional canonicalization is defined by using VLM affordance cues to establish canonical frames prior to training FuncDiffuser. No quantitative bound on VLM error rate, consistency across instances, or sensitivity analysis is reported. Systematic misidentification of functional parts would introduce pose offsets that propagate directly into the diffusion policy's training distribution, undermining the claimed inductive bias for category-level generalization and cross-task reuse.
[Abstract and experimental sections] Abstract and experimental sections: the manuscript asserts success on benchmarks for category-level generalization, cross-task reuse, and sim2real deployment, yet provides no quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without verifiable support.

minor comments (1)

The project website is referenced for demos and supplemental material; all quantitative results, implementation details, and evaluation protocols should be fully documented in the manuscript to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below and have revised the paper accordingly to strengthen the presentation of our method and results.

read point-by-point responses

Referee: [§3.2 and Figure 3] §3.2 and Figure 3: functional canonicalization is defined by using VLM affordance cues to establish canonical frames prior to training FuncDiffuser. No quantitative bound on VLM error rate, consistency across instances, or sensitivity analysis is reported. Systematic misidentification of functional parts would introduce pose offsets that propagate directly into the diffusion policy's training distribution, undermining the claimed inductive bias for category-level generalization and cross-task reuse.

Authors: We agree that quantifying the reliability of the VLM-based affordance cues is important for validating the robustness of functional canonicalization. In the revised manuscript we have added a new analysis subsection that reports VLM affordance prediction error rates on our collected dataset, measures consistency of canonical frame estimation across multiple instances of the same object category, and includes a sensitivity study that introduces controlled perturbations to the extracted affordance points and measures the resulting effect on downstream policy success rates. These additions show that moderate VLM inaccuracies do not critically degrade the learned policies, supporting the claimed inductive bias. revision: yes
Referee: [Abstract and experimental sections] Abstract and experimental sections: the manuscript asserts success on benchmarks for category-level generalization, cross-task reuse, and sim2real deployment, yet provides no quantitative results, baselines, ablation studies, or error analysis. This absence leaves the central empirical claim without verifiable support.

Authors: We acknowledge that the experimental claims in the abstract and main text would benefit from more explicit quantitative support. The original manuscript contains an experimental section with success-rate tables, baseline comparisons, and sim2real results; however, to directly address the referee’s concern we have expanded this section with additional ablation tables isolating the contribution of functional canonicalization, more comprehensive baseline methods, and error analysis with standard deviations across multiple random seeds. These revisions make the empirical evidence for category-level generalization, cross-task reuse, and sim2real transfer fully verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core derivation uses external affordance cues from large vision-language models to define functional canonical frames for object alignment and trajectory transfer, then trains a separate object-centric and action-centric diffusion policy (FuncDiffuser) on the resulting aligned data. This sequence depends on independent external components (VLMs and diffusion models) and experimental validation on sim/real benchmarks rather than any self-definitional reduction, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or steps in the abstract or described method collapse to their own inputs by construction; the claimed inductive bias for generalization arises from the alignment process and policy training, which remain falsifiable against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the reliability of VLM-derived affordances for functional alignment, which is treated as an input rather than derived or validated within the paper.

axioms (1)

domain assumption Large vision-language models can extract reliable affordance cues suitable for functional object alignment and trajectory transfer.
This premise is invoked when describing the mapping of objects into shared functional frames using VLM cues.

pith-pipeline@v0.9.0 · 5770 in / 1133 out tokens · 37745 ms · 2026-05-21T22:11:39.101623+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

R* = arg min_R∈SO(3) ||R v_s - v_t||_2^2

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 5 internal anchors

[1]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, p. 02783649241273668,

work page
[2]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki, “3d diffuser actor: Policy diffusion with 3d scene representations,” arXiv preprint arXiv:2402.10885, 2024. 1, 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Spot: Se (3) pose trajectory diffusion for object-centric manipulation,

C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield, “Spot: Se (3) pose trajectory diffusion for object-centric manipulation,”arXiv preprint arXiv:2411.00965, 2024. 1, 2, 6, 7

work page arXiv 2024
[5]

6-pack: Category-level 6d pose tracker with anchor-based keypoints,

C. Wang, R. Mart ´ın-Mart´ın, D. Xu, J. Lv, C. Lu, L. Fei- Fei, S. Savarese, and Y . Zhu, “6-pack: Category-level 6d pose tracker with anchor-based keypoints,” in2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10 059–10 066. 1

work page 2020
[6]

Learning 3d dynamic scene representations for robot manipulation,

Z. Xu, Z. He, J. Wu, and S. Song, “Learning 3d dynamic scene representations for robot manipulation,”arXiv preprint arXiv:2011.01968, 2020. 1

work page arXiv 2011
[7]

Uad: Unsupervised affordance distillation for generalization in robotic manipulation,

Y . Tang, W. Huang, Y . Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, “Uad: Unsupervised affordance distillation for generalization in robotic manipulation,” arXiv preprint arXiv:2506.09284, 2025. 2, 3

work page arXiv 2025
[8]

One-shot 3d object canonicalization based on geometric and semantic consistency,

L. Jin, Y . Wang, W. Chen, Q. Dai, Q. Gao, X. Qin, and B. Chen, “One-shot 3d object canonicalization based on geometric and semantic consistency,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16 850–16 859. 2

work page 2025
[9]

Tooleenet: Tool affordance 6d pose estimation,

Y . Wang, L. Zhang, Y . Tu, H. Zhang, K. Bai, Z. Chen, and J. Zhang, “Tooleenet: Tool affordance 6d pose estimation,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 10 519– 10 526. 2

work page 2024
[10]

Deep reinforcement learning for robotic manipulation with asyn- chronous off-policy updates,

S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asyn- chronous off-policy updates,” in2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396. 2

work page 2017
[11]

Perceiver-actor: A multi-task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 785–

work page 2023
[12]

Generative skill chaining: Long-horizon skill planning with diffusion models,

U. A. Mishra, S. Xue, Y . Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” inConference on Robot Learning. PMLR, 2023, pp. 2905–2925. 2

work page 2023
[13]

Cliport: What and where pathways for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” inConference on robot learning. PMLR, 2022, pp. 894–906. 2

work page 2022
[14]

Vima: Robot manipulation with multimodal prompts,

Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan, “Vima: Robot manipulation with multimodal prompts,” 2023. 2

work page 2023
[15]

Neural descriptor fields: Se (3)-equivariant object representations for manipulation,

A. Simeonov, Y . Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V . Sitzmann, “Neural descriptor fields: Se (3)-equivariant object representations for manipulation,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 6394–

work page 2022
[16]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,”arXiv preprint arXiv:2409.01652, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds,

H. Chu, X. Deng, Q. Lv, X. Chen, Y . Li, J. Hao, and L. Nie, “3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds,”arXiv preprint arXiv:2502.20041, 2025. 2

work page arXiv 2025
[18]

Grounding 3d object affordance with language instructions, visual observations and interac- tions,

H. Zhu, Q. Kong, K. Xu, X. Xia, B. Deng, J. Ye, R. Xiong, and Y . Wang, “Grounding 3d object affordance with language instructions, visual observations and interac- tions,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17 337–17 346. 2

work page 2025
[19]

Affordancellm: Grounding affordance from vision language models,

S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li, “Affordancellm: Grounding affordance from vision language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7587–7597. 2

work page 2024
[20]

Learning instruction-guided manipulation affordance via large models for embodied robotic tasks,

D. Li, C. Zhao, S. Yang, L. Ma, Y . Li, and W. Zhang, “Learning instruction-guided manipulation affordance via large models for embodied robotic tasks,” in2024 International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2024, pp. 662–667. 2

work page 2024
[21]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning ro- bust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

TripoSR: Fast 3D Object Reconstruction from a Single Image

D. Tochilkin, D. Pankratz, Z. Liu, Z. Huang, , A. Letts, Y . Li, D. Liang, C. Laforte, V . Jampani, and Y .-P. Cao, “Triposr: Fast 3d object reconstruction from a single image,”arXiv preprint arXiv:2403.02151, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Founda- tionpose: Unified 6d pose estimation and tracking of novel objects,

B. Wen, W. Yang, J. Kautz, and S. Birchfield, “Founda- tionpose: Unified 6d pose estimation and tracking of novel objects,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 868–17 879. 5, 9

work page 2024
[24]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020. 5

work page 2020
[25]

Objaverse: A universe of annotated 3d objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 13 142–13 153. 5 APPENDIX Supplementary Materials consist of the following sections: • In...

work page 2023

[1] [1]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, p. 02783649241273668,

work page

[2] [2]

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Y . Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu, “3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations,”arXiv preprint arXiv:2403.03954, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations

T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki, “3d diffuser actor: Policy diffusion with 3d scene representations,” arXiv preprint arXiv:2402.10885, 2024. 1, 2, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Spot: Se (3) pose trajectory diffusion for object-centric manipulation,

C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield, “Spot: Se (3) pose trajectory diffusion for object-centric manipulation,”arXiv preprint arXiv:2411.00965, 2024. 1, 2, 6, 7

work page arXiv 2024

[5] [5]

6-pack: Category-level 6d pose tracker with anchor-based keypoints,

C. Wang, R. Mart ´ın-Mart´ın, D. Xu, J. Lv, C. Lu, L. Fei- Fei, S. Savarese, and Y . Zhu, “6-pack: Category-level 6d pose tracker with anchor-based keypoints,” in2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 10 059–10 066. 1

work page 2020

[6] [6]

Learning 3d dynamic scene representations for robot manipulation,

Z. Xu, Z. He, J. Wu, and S. Song, “Learning 3d dynamic scene representations for robot manipulation,”arXiv preprint arXiv:2011.01968, 2020. 1

work page arXiv 2011

[7] [7]

Uad: Unsupervised affordance distillation for generalization in robotic manipulation,

Y . Tang, W. Huang, Y . Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei, “Uad: Unsupervised affordance distillation for generalization in robotic manipulation,” arXiv preprint arXiv:2506.09284, 2025. 2, 3

work page arXiv 2025

[8] [8]

One-shot 3d object canonicalization based on geometric and semantic consistency,

L. Jin, Y . Wang, W. Chen, Q. Dai, Q. Gao, X. Qin, and B. Chen, “One-shot 3d object canonicalization based on geometric and semantic consistency,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 16 850–16 859. 2

work page 2025

[9] [9]

Tooleenet: Tool affordance 6d pose estimation,

Y . Wang, L. Zhang, Y . Tu, H. Zhang, K. Bai, Z. Chen, and J. Zhang, “Tooleenet: Tool affordance 6d pose estimation,” in2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 10 519– 10 526. 2

work page 2024

[10] [10]

Deep reinforcement learning for robotic manipulation with asyn- chronous off-policy updates,

S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asyn- chronous off-policy updates,” in2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396. 2

work page 2017

[11] [11]

Perceiver-actor: A multi-task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 785–

work page 2023

[12] [12]

Generative skill chaining: Long-horizon skill planning with diffusion models,

U. A. Mishra, S. Xue, Y . Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” inConference on Robot Learning. PMLR, 2023, pp. 2905–2925. 2

work page 2023

[13] [13]

Cliport: What and where pathways for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” inConference on robot learning. PMLR, 2022, pp. 894–906. 2

work page 2022

[14] [14]

Vima: Robot manipulation with multimodal prompts,

Y . Jiang, A. Gupta, Z. Zhang, G. Wang, Y . Dou, Y . Chen, L. Fei-Fei, A. Anandkumar, Y . Zhu, and L. Fan, “Vima: Robot manipulation with multimodal prompts,” 2023. 2

work page 2023

[15] [15]

Neural descriptor fields: Se (3)-equivariant object representations for manipulation,

A. Simeonov, Y . Du, A. Tagliasacchi, J. B. Tenenbaum, A. Rodriguez, P. Agrawal, and V . Sitzmann, “Neural descriptor fields: Se (3)-equivariant object representations for manipulation,” in2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 6394–

work page 2022

[16] [16]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,”arXiv preprint arXiv:2409.01652, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds,

H. Chu, X. Deng, Q. Lv, X. Chen, Y . Li, J. Hao, and L. Nie, “3d-affordancellm: Harnessing large language models for open-vocabulary affordance detection in 3d worlds,”arXiv preprint arXiv:2502.20041, 2025. 2

work page arXiv 2025

[18] [18]

Grounding 3d object affordance with language instructions, visual observations and interac- tions,

H. Zhu, Q. Kong, K. Xu, X. Xia, B. Deng, J. Ye, R. Xiong, and Y . Wang, “Grounding 3d object affordance with language instructions, visual observations and interac- tions,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 17 337–17 346. 2

work page 2025

[19] [19]

Affordancellm: Grounding affordance from vision language models,

S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li, “Affordancellm: Grounding affordance from vision language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 7587–7597. 2

work page 2024

[20] [20]

Learning instruction-guided manipulation affordance via large models for embodied robotic tasks,

D. Li, C. Zhao, S. Yang, L. Ma, Y . Li, and W. Zhang, “Learning instruction-guided manipulation affordance via large models for embodied robotic tasks,” in2024 International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2024, pp. 662–667. 2

work page 2024

[21] [21]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning ro- bust visual features without supervision,”arXiv preprint arXiv:2304.07193, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

TripoSR: Fast 3D Object Reconstruction from a Single Image

D. Tochilkin, D. Pankratz, Z. Liu, Z. Huang, , A. Letts, Y . Li, D. Liang, C. Laforte, V . Jampani, and Y .-P. Cao, “Triposr: Fast 3d object reconstruction from a single image,”arXiv preprint arXiv:2403.02151, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

Founda- tionpose: Unified 6d pose estimation and tracking of novel objects,

B. Wen, W. Yang, J. Kautz, and S. Birchfield, “Founda- tionpose: Unified 6d pose estimation and tracking of novel objects,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 17 868–17 879. 5, 9

work page 2024

[24] [24]

Rlbench: The robot learning benchmark & learning environment,

S. James, Z. Ma, D. R. Arrojo, and A. J. Davison, “Rlbench: The robot learning benchmark & learning environment,”IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3019–3026, 2020. 5

work page 2020

[25] [25]

Objaverse: A universe of annotated 3d objects,

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 13 142–13 153. 5 APPENDIX Supplementary Materials consist of the following sections: • In...

work page 2023