pith. machine review for the scientific record.

arxiv: 2604.08410 · v2 · submitted 2026-04-09 · 💻 cs.CV · cs.RO

Recognition: unknown

BLaDA: Bridging Language to Functional Dexterous Actions within 3DGS Fields

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.RO
keywords dexterous manipulation · language grounding · 3D Gaussian Splatting · affordance detection · zero-shot learning · functional grasping · robotic control · semantic parsing

The pith

BLaDA parses open-vocabulary instructions into a sextuple of manipulation constraints, applies triangular geometric constraints in 3D Gaussian Splatting fields to localize functional regions, and decodes the result into functional dexterous actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BLaDA as a zero-shot framework that links natural language commands directly to robot hand movements for tasks like grasping handles or opening containers in real scenes. It breaks instructions into structured constraints, locates action points using geometric rules on a continuous 3D scene model, and converts those into wrist and finger commands. This matters for robots operating without task-specific training data or fixed label sets, where current systems often fail due to poor coupling between meaning and physical pose. If the method holds, it would let general instructions drive reliable manipulation across varied objects and environments while keeping each step traceable. The design favors modular steps over end-to-end networks to improve both performance and human understanding of the process.
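
The stated chain is concrete enough to sketch as an interface. Below is a minimal editorial rendering of the modular structure; every signature, type, and data shape is an assumption, since the paper's actual interfaces are not given here.

```python
# Minimal sketch of BLaDA's three-stage chain as the paper describes it.
# Every signature, type, and data shape here is an editorial assumption,
# not the authors' API.
from typing import Any


def klp_parse(instruction: str) -> dict[str, str]:
    """Knowledge-guided Language Parsing: instruction -> constraint sextuple.

    The six fields are not defined at abstract level, so the sextuple is
    kept as an opaque mapping in this sketch.
    """
    raise NotImplementedError  # an LLM-backed parser, per the paper


def trilocate(sextuple: dict[str, str], gaussian_field: Any) -> Any:
    """TriLocation: identify the functional region in the 3DGS field
    under triangular geometric constraints."""
    raise NotImplementedError


def kgt3d_plus(sextuple: dict[str, str], region: Any) -> dict[str, Any]:
    """KGT3D+: decode semantic-geometric constraints into a physically
    plausible wrist pose and finger-level commands."""
    raise NotImplementedError


def blada(instruction: str, gaussian_field: Any) -> dict[str, Any]:
    # Each intermediate is inspectable; the interpretability claim rests
    # on being able to read, and intervene on, the sextuple and region.
    sextuple = klp_parse(instruction)
    region = trilocate(sextuple, gaussian_field)
    return kgt3d_plus(sextuple, region)
```

The value of the skeleton is the seams: each hand-off is a point where a human or an automated check can inspect or override state, which is what the controllability claims below depend on.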

Core claim

BLaDA grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation by first parsing natural language into a structured sextuple via Knowledge-guided Language Parsing, then identifying functional regions in 3D Gaussian Splatting fields under triangular geometric constraints with the TriLocation module, and finally decoding the constraints into physically plausible wrist poses and finger-level commands through the KGT3D+ module, yielding higher affordance grounding precision and manipulation success rates than prior approaches across diverse categories and tasks.

What carries the argument

The TriLocation module, which identifies functional regions in 3D Gaussian Splatting fields by enforcing triangular geometric constraints derived from parsed language instructions.
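
The abstract does not say what "triangular geometric constraints" means operationally. One plausible reading, assumed here purely for illustration, is that three language-derived keypoints span a triangle, and Gaussian centers count toward the functional region only if they lie near that triangle's plane and inside its footprint:

```python
# A geometric sketch of ONE plausible reading of "triangular constraints":
# keep Gaussian centers close to the plane of a triangle spanned by three
# keypoints and inside its barycentric footprint. This is an editorial
# guess at the mechanism, not the paper's definition.
import numpy as np


def triangle_filter(centers: np.ndarray, a, b, c,
                    plane_tol: float = 0.01) -> np.ndarray:
    """Boolean mask over Gaussian centers (N, 3).

    a, b, c: the three 3D keypoints (shape (3,)) defining the triangle.
    plane_tol: max distance from the triangle's plane, in scene units.
    """
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    normal = np.cross(b - a, c - a)
    normal /= np.linalg.norm(normal)  # assumes a non-degenerate triangle

    # Signed distance of each center from the triangle's plane.
    dist = (centers - a) @ normal
    near_plane = np.abs(dist) < plane_tol

    # Project onto the plane, then test barycentric coordinates.
    proj = centers - np.outer(dist, normal)
    v0, v1, v2 = b - a, c - a, proj - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    v = (d11 * d20 - d01 * d21) / denom
    w = (d00 * d21 - d01 * d20) / denom
    return near_plane & (v >= 0) & (w >= 0) & (v + w <= 1)
```

Something like `mask = triangle_filter(gaussian_centers, kp_a, kp_b, kp_c)` would then gate which Gaussians form the functional region; whether the paper's constraint is this, a looser bounding volume, or a soft loss term is exactly what the referee report below asks to see derived.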

If this is right

  • Robots gain the ability to execute functional grasps on novel objects using only everyday language descriptions.
  • Affordance regions are located consistently with pose and semantics, reducing reliance on category-specific training labels.
  • Modular constraint chains improve controllability, allowing intervention at language parsing, localization, or execution stages.
  • Success rates rise in unstructured settings because semantic and geometric information remain coupled throughout the pipeline.
  • The approach generalizes across object categories and manipulation types without retraining for each new task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same parsing and triangular localization steps could support multi-step sequences such as pick-then-place by extending the sextuple to include temporal ordering.
  • Replacing 3D Gaussian Splatting with other dense scene representations might test whether the performance gain stems from the continuous field or from the geometric constraint logic itself.
  • Deployment on physical arms would reveal how well the decoded wrist poses and finger commands transfer when sensor noise and dynamics are present.
  • The framework's explicit constraints could serve as an interface for safety checks that verify proposed actions against physical limits before execution (a sketch follows this list).
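
Because the decoded action is explicit rather than an opaque policy output, such a gate costs a few lines. A minimal sketch, with the action format, frame, and all numeric limits invented for illustration:

```python
# Hypothetical pre-execution safety gate over an explicit decoded action.
# The action format and the numeric limits are illustrative assumptions.
import numpy as np

WORKSPACE_MIN = np.array([-0.6, -0.6, 0.0])   # metres, assumed robot frame
WORKSPACE_MAX = np.array([0.6, 0.6, 0.8])
FINGER_LIMITS = (0.0, 1.9)                    # radians, assumed joint range


def action_is_safe(wrist_position, finger_angles) -> bool:
    """Reject actions whose wrist leaves the workspace or whose finger
    commands exceed joint limits, before anything reaches the robot."""
    wrist_position = np.asarray(wrist_position, dtype=float)
    finger_angles = np.asarray(finger_angles, dtype=float)
    in_workspace = np.all((wrist_position >= WORKSPACE_MIN) &
                          (wrist_position <= WORKSPACE_MAX))
    lo, hi = FINGER_LIMITS
    in_joint_range = np.all((finger_angles >= lo) & (finger_angles <= hi))
    return bool(in_workspace and in_joint_range)
```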

Load-bearing premise

Open-vocabulary instructions can be parsed reliably into an accurate sextuple of manipulation constraints, and triangular geometric constraints applied to 3D Gaussian Splatting fields will correctly mark functional regions without any predefined affordance labels.
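
One direct probe of the parsing half of this premise: run the parser on several paraphrases of the same command and measure how often each field of the resulting sextuples agrees. A minimal scoring sketch, assuming a hypothetical `parse_instruction` that returns the sextuple as a dict with fixed keys:

```python
# Hypothetical reliability probe for the parsing premise: parse paraphrases
# of one command and measure per-field agreement. The parser itself is
# assumed; only the scoring is shown.
from collections import Counter


def field_agreement(parses: list[dict]) -> dict[str, float]:
    """For each sextuple field, the fraction of parses that agree with the
    modal value. 1.0 everywhere means the premise holds on this command."""
    agreement = {}
    for key in parses[0]:
        values = [str(p.get(key)) for p in parses]
        _, modal_count = Counter(values).most_common(1)[0]
        agreement[key] = modal_count / len(values)
    return agreement
```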

What would settle it

Run the system on instructions describing ambiguous or occluded functional parts in cluttered real-world scenes and measure whether affordance localization precision or task success rates fall below those of modular baselines that use explicit affordance maps.
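
That test reduces to two numbers per method on the same cluttered scenes: localization precision against annotated functional regions (commonly intersection-over-union) and end-to-end task success rate. A minimal scoring sketch, with mask formats assumed:

```python
# Minimal scoring sketch for the proposed test. Mask format (boolean arrays
# over points or pixels) and the comparison protocol are assumptions.
import numpy as np


def affordance_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between predicted and annotated functional-region masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union else 0.0


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of trials in which the functional task was completed."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```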

Figures

Figures reproduced from arXiv: 2604.08410 by Dongsheng Luo, Fan Yang, Guorun Yan, Jiacheng Lin, Kailun Yang, Ruize Liao, Wanjun Jia, Wenrui Chen, Yaonan Wang, Zhiyong Li.

Figure 1: Comparison of existing pipelines: (a) end-to-end VLA is …
Figure 2: Overview of BLaDA. The top illustrates the construction of knowledge-guided functionality prompting and example demonstrations …
Figure 3: Extract verbs, intent phrases, and tool types from human …
Figure 4: Overview of the TriLocation. a. We design the HSE module (highlighted in yellow), consisting of Select and Context-Aware Cropping …
Figure 5: Real-world experiment setting and 6 typical scenarios.
Figure 6: Relevance maps of given language instructions. We project the language-activated 3D Gaussian semantic features onto …
Figure 7: Visualization of the effect of the local coordinate …
Figure 8: Dexterous grasping demonstration workflow based on 3D reconstructed points. Left of the dashed line: predicted …
Figure 9: Hyperparameter analysis of α and γ. The optimal configuration is highlighted.
Original abstract

In unstructured environments, functional dexterous grasping calls for the tight integration of semantic understanding, precise 3D functional localization, and physically interpretable execution. Modular hierarchical methods are more controllable and interpretable than end-to-end VLA approaches, but existing ones still rely on predefined affordance labels and lack the tight semantic–pose coupling needed for functional dexterous manipulation. To address this, we propose BLaDA (Bridging Language to Dexterous Actions in 3DGS fields), an interpretable zero-shot framework that grounds open-vocabulary instructions as perceptual and control constraints for functional dexterous manipulation. BLaDA establishes an interpretable reasoning chain by first parsing natural language into a structured sextuple of manipulation constraints via a Knowledge-guided Language Parsing (KLP) module. To achieve pose-consistent spatial reasoning, we introduce the Triangular Functional Point Localization (TriLocation) module, which utilizes 3D Gaussian Splatting as a continuous scene representation and identifies functional regions under triangular geometric constraints. Finally, the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module decodes these semantic-geometric constraints into physically plausible wrist poses and finger-level commands. Extensive experiments on complex benchmarks demonstrate that BLaDA significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation across diverse categories and tasks. Code will be publicly available at https://github.com/PopeyePxx/BLaDA.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes BLaDA, a modular zero-shot framework for functional dexterous manipulation that parses open-vocabulary natural language instructions into a structured sextuple of constraints using the Knowledge-guided Language Parsing (KLP) module, localizes functional regions via triangular geometric constraints on 3D Gaussian Splatting fields with the Triangular Functional Point Localization (TriLocation) module, and decodes these into wrist poses and finger commands via the 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module. It claims this interpretable pipeline significantly outperforms prior methods in affordance grounding precision and manipulation success rates across diverse object categories and tasks without relying on predefined affordance labels.

Significance. If the zero-shot claims and outperformance results are substantiated, the work offers a promising interpretable alternative to end-to-end VLA models by tightly coupling semantic language understanding with geometric constraints in continuous 3DGS representations. The modular design with its explicit KLP → TriLocation → KGT3D+ reasoning chain could improve controllability and generalization in unstructured environments.

major comments (3)
  1. [Abstract] The central claim that BLaDA 'significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation' is unsupported by any reported metrics, baselines, datasets, or error analysis, which directly undermines evaluation of the outperformance assertion.
  2. [Method (KLP)] The module is said to parse instructions into a 'structured sextuple of manipulation constraints' but provides no explicit definition of the sextuple fields, the external knowledge base, or the prompting strategy, which is load-bearing for the claimed robustness to arbitrary open-vocabulary input.
  3. [Method (TriLocation)] The use of 'triangular geometric constraints' on 3DGS fields to identify functional regions lacks any derivation, justification, or analysis showing why three-point constraints suffice to isolate task-relevant geometry across categories in a zero-shot setting without predefined labels or fine-tuning.
minor comments (2)
  1. [Abstract] The abstract introduces multiple new acronyms (KLP, TriLocation, KGT3D+) without a concise one-sentence overview of their roles, which reduces immediate clarity for readers.
  2. [Method] Consider including a high-level diagram or pseudocode for the overall pipeline early in the method section to clarify the flow from language parsing to execution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment point-by-point below and indicate the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that BLaDA 'significantly outperforms existing methods in both affordance grounding precision and the success rate of functional manipulation' is unsupported by any reported metrics, baselines, datasets, or error analysis, which directly undermines evaluation of the outperformance assertion.

    Authors: The full paper includes detailed experimental evaluations in the Experiments section, reporting specific metrics such as affordance grounding precision (e.g., IoU scores) and manipulation success rates on benchmarks, with comparisons to baselines. However, to make the abstract self-contained, we will revise it to include key quantitative results, mention the datasets used, and briefly note the baselines and error analysis performed. revision: yes

  2. Referee: [Method (KLP)] The module is said to parse instructions into a 'structured sextuple of manipulation constraints' but provides no explicit definition of the sextuple fields, the external knowledge base, or the prompting strategy, which is load-bearing for the claimed robustness to arbitrary open-vocabulary input.

    Authors: We will clarify this in the revised manuscript by explicitly defining the sextuple fields (object, verb, spatial constraint, orientation, grasp configuration, and additional constraint; sketched as a record type after these responses). The external knowledge base combines LLM-inferred commonsense with a structured database of manipulation priors. The prompting strategy is a carefully designed zero-shot prompt with chain-of-thought reasoning. A new figure or table will illustrate the parsing process and prompt template. revision: yes

  3. Referee: [Method (TriLocation)] The use of 'triangular geometric constraints' on 3DGS fields to identify functional regions lacks any derivation, justification, or analysis showing why three-point constraints suffice to isolate task-relevant geometry across categories in a zero-shot setting without predefined labels or fine-tuning.

    Authors: We agree additional justification is warranted. In the revision, we will derive the triangular constraints mathematically, showing how three points (e.g., from language-parsed keypoints) define a unique plane and bounding volume in the 3DGS representation that captures functional affordances. We will provide justification based on geometric principles and include cross-category analysis demonstrating zero-shot generalization without labels or fine-tuning. revision: yes
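
Taking the rebuttal's field list at face value, the sextuple maps onto a plain record type. The sketch below is an editorial rendering of that list, not the paper's definition; the example values in the comments are invented.

```python
# The sextuple as the simulated rebuttal names its fields; a record-type
# sketch, not the paper's definition. Example values are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class Sextuple:
    object: str                 # what to act on, e.g. "kettle"
    verb: str                   # the manipulation verb, e.g. "pour"
    spatial_constraint: str     # where on the object, e.g. "upper handle"
    orientation: str            # required wrist/approach orientation
    grasp_configuration: str    # hand shape, e.g. "power grasp"
    additional_constraint: str  # task extras, e.g. "keep upright"
```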

Circularity Check

0 steps flagged

No circularity: modular framework introduces independent components without self-referential derivations

Full rationale

The paper presents BLaDA as a new zero-shot framework composed of three explicitly introduced modules (KLP for language-to-sextuple parsing, TriLocation for triangular-constraint localization in 3DGS fields, and KGT3D+ for grasp execution). No equations, fitted parameters, or derivations are shown that reduce by construction to their own inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The central claims rest on the empirical performance of these novel modules rather than tautological redefinitions or statistical forcing from subsets of the same data. This is the normal case for an engineering paper: its claims stand or fall on external benchmarks rather than on its own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on two domain assumptions about language-to-constraint mapping and geometric accuracy of 3DGS; no free parameters or new physical entities are introduced in the abstract.

axioms (2)
  • domain assumption Natural language instructions can be parsed into a structured sextuple of manipulation constraints
    Invoked by the KLP module to create perceptual and control constraints
  • domain assumption 3D Gaussian Splatting supplies a continuous scene representation sufficient for triangular geometric localization of functional regions
    Core premise of the TriLocation module
invented entities (3)
  • Knowledge-guided Language Parsing (KLP) module no independent evidence
    purpose: Parse natural language into structured sextuple of manipulation constraints
    New component introduced to bridge language to constraints
  • Triangular Functional Point Localization (TriLocation) module no independent evidence
    purpose: Identify functional regions under triangular geometric constraints in 3DGS
    New localization technique proposed
  • 3D Keypoint Grasp Matrix Transformation Execution (KGT3D+) module no independent evidence
    purpose: Decode semantic-geometric constraints into wrist poses and finger commands
    New execution decoder introduced

pith-pipeline@v0.9.0 · 5598 in / 1426 out tokens · 69244 ms · 2026-05-10T17:20:50.001814+00:00 · methodology


Reference graph

Works this paper leans on

47 extracted references · 5 canonical work pages · 3 internal anchors
