Traj2Action: A Co-Denoising Framework for Trajectory-Guided Human-to-Robot Skill Transfer

Guo-jun Qi; Han Zhou; Jinjin Cao; Liyuan Ma; Xueji Fang

arxiv: 2510.00491 · v3 · submitted 2025-10-01 · 💻 cs.RO · cs.AI

Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-jun Qi

Pith reviewed 2026-05-18 11:08 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords human-to-robot skill transfer · trajectory guidance · co-denoising · manipulation policy learning · embodiment gap · Franka robot · human video demonstrations

The pith

Traj2Action transfers human manipulation skills to robots by using 3D endpoint trajectories as an intermediate representation and a co-denoising process for actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Traj2Action, a framework for transferring manipulation skills from human videos to robots despite their very different body structures. It uses the three-dimensional trajectory of the operational endpoint as a common representation that both human and robot data can contribute to. The method first generates a coarse trajectory plan from both data sources, then refines it into detailed robot actions, such as orientation and gripper state, through a joint denoising mechanism. Experiments on a real Franka robot show performance improvements of up to 27 percent on short-horizon tasks and 22.25 percent on long-horizon ones, with further gains as more human data is added. This matters because robot demonstrations are costly to collect, while human videos can be obtained more easily at scale.

Core claim

Traj2Action bridges the morphological gap between human and robot embodiments by treating the 3D trajectory of the operational endpoint as a unified intermediate representation. The policy learns to generate a coarse trajectory that serves as a high-level motion plan from both human and robot data, and this plan then conditions the co-denoising synthesis of precise robot-specific actions including orientation and gripper state.

What carries the argument

The co-denoising framework that conditions precise action generation on a coarse trajectory plan derived from combined human and robot demonstrations.
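The conditioning idea can be illustrated with a toy joint denoising loop. This is a minimal sketch under stated assumptions, not the authors' implementation: both samples start as Gaussian noise and are refined in lockstep, and the action update reads the current trajectory sample. The functions `trajectory_target` and `action_target` are hypothetical stand-ins for learned networks, the `goal_xyz` key is invented for the example, and the Euler update along a straight noise-to-target path is swapped in for illustration because the source does not state the sampler.

```python
import random


def trajectory_target(obs):
    """Hypothetical stand-in for a learned trajectory head: the 'plan'
    is simply the 3D goal position stored in the observation."""
    return list(obs["goal_xyz"])


def action_target(traj):
    """Hypothetical stand-in for a learned action head. It reads the
    current trajectory sample and returns [x, y, z, gripper]; the
    gripper value 1.0 (closed) is an arbitrary toy choice."""
    return list(traj) + [1.0]


def euler_step(x, target, t, dt):
    """One Euler step along the straight path from noise to target.
    For x_t = (1 - t) * noise + t * target, the velocity is
    (target - x_t) / (1 - t)."""
    return [xi + dt * (ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]


def co_denoise(obs, steps=10, seed=0):
    """Refine a noisy trajectory and a noisy action over shared steps."""
    rng = random.Random(seed)
    traj = [rng.gauss(0.0, 1.0) for _ in range(3)]    # noisy coarse plan
    action = [rng.gauss(0.0, 1.0) for _ in range(4)]  # noisy robot action
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Refine the plan first, then the action conditioned on that plan.
        traj = euler_step(traj, trajectory_target(obs), t, dt)
        action = euler_step(action, action_target(traj), t, dt)
    return traj, action


plan, act = co_denoise({"goal_xyz": [0.4, 0.0, 0.2]})
print(plan, act)
```

In this toy setting the action's position components end up following the refined plan, which is the sense in which the coarse trajectory "conditions" the action; a real policy would replace both stand-in functions with networks that also consume images and language.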

If this is right

Performance on short-horizon real-world tasks improves by up to 27% compared to the π₀ baseline.
Performance on long-horizon real-world tasks improves by 22.25% over the same baseline.
Robot policy learning shows significant gains as the amount of human demonstration data increases.
The framework supports cross-task generalization in manipulation skills.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the trajectory representation works well across embodiments, it could reduce the need for robot-specific teleoperation data in favor of readily available human videos.
Scaling human data might eventually allow robots to learn complex skills without direct robot demonstrations.
Similar trajectory-guided approaches could apply to other domains where embodiment differences hinder direct transfer, such as different robot arms.

Load-bearing premise

The 3D trajectory of the operational endpoint contains enough information to preserve all necessary manipulation knowledge despite differences in human and robot body structures.

What would settle it

A direct comparison experiment where the same robot policy is trained with and without the trajectory conditioning step, checking if the performance difference disappears on tasks requiring fine orientation control.

Original abstract

Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Our work centers on two core objectives: first, the systematic verification of the Traj2Action framework's effectiveness, spanning architectural design, cross-task generalization, and data efficiency; and second, the revelation of key laws that govern robot policy learning during the integration of human hand demonstration data. This research focus enables us to provide a scalable paradigm tailored to address human-to-robot skill transfer across morphological gaps. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action boosts performance by up to 27% and 22.25% over the $\pi_0$ baseline on short- and long-horizon real-world tasks, respectively, and achieves significant gains as human data scales in robot policy learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Traj2Action shows real Franka gains by conditioning action diffusion on a learned 3D endpoint trajectory from mixed human-robot data, but the gains rest on an untested claim that position alone bridges the embodiment gap.

The main takeaway is that this paper reports concrete performance gains on a physical Franka arm by first predicting a coarse 3D trajectory from combined human and robot data, then feeding that trajectory into a co-denoising process that produces full robot actions, including orientation and gripper state. The abstract reports improvements of up to 27% over π₀ on short-horizon tasks and 22.25% on long-horizon ones, plus significant gains as human data volume increases.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Traj2Action, a co-denoising framework for human-to-robot skill transfer that employs the 3D trajectory of the operational endpoint as a unified intermediate representation to bridge the morphological gap. The policy first generates a coarse trajectory plan from mixed human and robot data, which then conditions the synthesis of robot-specific actions (orientation, gripper state) via co-denoising. The central empirical claims are performance gains of up to 27% and 22.25% over the π₀ baseline on short- and long-horizon real-world tasks with a Franka robot, plus improved scaling behavior as human demonstration data increases.

Significance. If the reported gains are robustly attributable to the trajectory-guided bridging mechanism, the work provides a practical and scalable route for incorporating human video data into robot policy learning without full teleoperation. The real-hardware experiments and explicit investigation of human-data scaling laws constitute concrete strengths that could influence imitation-learning pipelines in robotics.

major comments (2)

[Abstract] Abstract: The core assumption that the 3D operational-endpoint trajectory constitutes a 'sufficiently rich and embodiment-agnostic intermediate representation' that 'preserves all necessary manipulation knowledge' is load-bearing for the central claim yet receives no direct validation. No information-theoretic analysis, ablation on omitted signals (contact forces, fine orientation, object interaction), or failure-mode study is described; without such evidence the 27%/22.25% gains could plausibly arise from the diffusion architecture or additional robot data rather than the proposed bridging mechanism.
[Experiments] Experiments section (real-world evaluation): The performance deltas are reported as absolute percentages without accompanying trial counts, standard deviations, statistical tests, or explicit data-split details. This omission prevents assessment of whether the improvements are statistically reliable or sensitive to post-hoc choices, directly affecting confidence in the data-efficiency and cross-task generalization claims.

minor comments (2)

[Methods] The co-denoising process is described at a high level in the abstract and introduction; an explicit equation or pseudocode block early in the methods would improve reproducibility and allow readers to trace how the coarse trajectory conditions the action denoising step.
[Figures] Figure captions and axis labels in the scaling-law plots could be expanded to indicate the exact human-to-robot data ratios used at each point, aiding interpretation of the reported gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript's claims and reporting.

Referee: [Abstract] Abstract: The core assumption that the 3D operational-endpoint trajectory constitutes a 'sufficiently rich and embodiment-agnostic intermediate representation' that 'preserves all necessary manipulation knowledge' is load-bearing for the central claim yet receives no direct validation. No information-theoretic analysis, ablation on omitted signals (contact forces, fine orientation, object interaction), or failure-mode study is described; without such evidence the 27%/22.25% gains could plausibly arise from the diffusion architecture or additional robot data rather than the proposed bridging mechanism.

Authors: We agree that the sufficiency of the 3D trajectory as an intermediate representation is central to the contribution and would benefit from more direct supporting analysis. The manuscript already contains ablations in Section 4.3 that isolate the effect of trajectory conditioning versus direct mixed-data training, showing consistent gains attributable to the bridging step rather than architecture alone. However, we did not include an information-theoretic analysis or a dedicated failure-mode study on omitted signals such as contact forces. In the revised manuscript we will add a limitations subsection that discusses scenarios where the trajectory representation may be insufficient (e.g., force-sensitive tasks) and include additional ablation results comparing trajectory-guided co-denoising against variants that incorporate extra signals or omit the intermediate plan entirely. These changes will help rule out alternative explanations for the observed improvements.
Revision: yes
Referee: [Experiments] Experiments section (real-world evaluation): The performance deltas are reported as absolute percentages without accompanying trial counts, standard deviations, statistical tests, or explicit data-split details. This omission prevents assessment of whether the improvements are statistically reliable or sensitive to post-hoc choices, directly affecting confidence in the data-efficiency and cross-task generalization claims.

Authors: We acknowledge that the current presentation of results lacks the statistical details needed for full assessment of reliability. Each real-world task was evaluated over 10 independent trials; the reported percentages are mean success rates across these trials. In the revised Experiments section we will explicitly state the trial count, report standard deviations (or error bars on figures), include the outcomes of paired statistical tests (e.g., t-tests) comparing Traj2Action against the π₀ baseline, and move the data-split protocol from the supplementary material into the main text with a clear reference. These additions will directly address concerns about statistical robustness and sensitivity to experimental choices.
Revision: yes
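To make the reporting request concrete, the sketch below summarizes one task's binary trial outcomes as a success rate, a sample standard deviation, and a Wilson score interval. It is an illustration only: the outcome list is hypothetical and not taken from the paper, `success_summary` is a name invented here, and the Wilson interval is one common choice for small-sample proportions rather than the test the rebuttal promises.

```python
import math
import statistics


def success_summary(outcomes, z=1.96):
    """Summarize binary trial outcomes (1 = success, 0 = failure).

    Returns the success rate, the sample standard deviation, and a
    Wilson score interval at the confidence level implied by z
    (z = 1.96 corresponds to roughly 95%).
    """
    n = len(outcomes)
    rate = sum(outcomes) / n
    std = statistics.stdev(outcomes) if n > 1 else 0.0
    denom = 1.0 + z * z / n
    center = (rate + z * z / (2 * n)) / denom
    half = z * math.sqrt(rate * (1 - rate) / n + z * z / (4 * n * n)) / denom
    return rate, std, (center - half, center + half)


# Hypothetical outcomes for one task over 10 trials (not from the paper).
trials = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
rate, std, (low, high) = success_summary(trials)
print(f"rate={rate:.2f} std={std:.2f} wilson95=({low:.2f}, {high:.2f})")
```

With only ten trials the interval is wide (roughly 0.40 to 0.89 for seven successes), which is why the trial count matters when interpreting differences between methods.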

Circularity Check

0 steps flagged

No circularity: empirical validation on external data

full rationale

The paper introduces Traj2Action as an empirical co-denoising framework that uses 3D operational-endpoint trajectories as an intermediate representation to bridge human-robot embodiment gaps. Performance gains (up to 27% and 22.25% over π₀) are reported from real-world Franka robot experiments on short- and long-horizon tasks, trained on mixed human and robot datasets. No equations, predictions, or first-principles derivations reduce these results to internal fitted quantities or self-definitions by construction. The central assumption about trajectory richness is an architectural hypothesis tested via ablation and scaling experiments rather than a tautological renaming or self-citation chain. The derivation remains self-contained through standard diffusion-model training and external benchmarking.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that 3D trajectories capture transferable manipulation semantics and that co-denoising can effectively fuse coarse plans with fine actions; no new physical entities are postulated.

free parameters (1)

trajectory noise schedule
Hyperparameters controlling the denoising process for both trajectory and action branches are chosen to fit the training data.

axioms (1)

domain assumption: 3D endpoint trajectories are embodiment-invariant enough to transfer manipulation knowledge
Invoked as the core bridging mechanism in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1268 out tokens · 23023 ms · 2026-05-18T11:08:12.396076+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

we introduce a unified trajectory representation of human hand and robot end-effector 3D positions, which reduces embodiment discrepancies... coarse trajectory planning... co-denoising training scheme over trajectories and actions

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

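$L_{\text{traj}}(T) = \mathbb{E} \ldots \|g_T(T_\tau, \tau, \ldots)\|^2$; $L_{\text{action}}(\theta) = \mathbb{E} \ldots \|\pi_\theta(a_\tau, T_\tau, \ldots)\|^2$; total loss $L = L_{\text{traj}} + L_{\text{action}}$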
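The additive structure of the quoted loss can be sketched in a few lines. This is a rough illustration under assumptions the excerpt does not confirm: each term is taken to be a mean squared residual between a prediction and a target (the excerpt elides the targets), and human samples are assumed to carry no robot action, so they contribute only to the trajectory term. The dictionary keys and the function name `co_denoising_loss` are hypothetical.

```python
def squared_norm(pred, target):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))


def co_denoising_loss(batch):
    """Return (total, l_traj, l_action) for a batch of samples.

    Each sample is a dict with 'traj_pred' and 'traj_target'. Robot
    samples also carry 'action_pred' and 'action_target'; human samples
    set 'action_target' to None and are skipped in the action term
    (an assumption made for this sketch).
    """
    traj_terms = []
    action_terms = []
    for sample in batch:
        traj_terms.append(
            squared_norm(sample["traj_pred"], sample["traj_target"]))
        if sample.get("action_target") is not None:
            action_terms.append(
                squared_norm(sample["action_pred"], sample["action_target"]))
    l_traj = sum(traj_terms) / len(traj_terms)
    l_action = sum(action_terms) / len(action_terms) if action_terms else 0.0
    return l_traj + l_action, l_traj, l_action


# Toy batch: one robot sample and one human sample (values are made up).
batch = [
    {"traj_pred": [0.0, 0.0, 0.0], "traj_target": [1.0, 0.0, 0.0],
     "action_pred": [0.0, 0.0], "action_target": [0.0, 2.0]},
    {"traj_pred": [0.0, 0.0, 0.0], "traj_target": [0.0, 0.0, 0.0],
     "action_target": None},
]
total, l_traj, l_action = co_denoising_loss(batch)
print(total, l_traj, l_action)
```

Summing the two terms without a weighting coefficient mirrors the quoted total loss; whether the paper weights or masks the terms differently is not recoverable from the excerpt.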

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.