Iterative Compositional Data Generation for Robot Control

Anh-Quan Pham; Dani S. Bassett; Eric Eaton; Jorge Mendez-Mendez; Marcel Hussing; Shubhankar P. Patankar

arxiv: 2512.10891 · v5 · pith:GGYGFLDRnew · submitted 2025-12-11 · 💻 cs.RO · cs.LG


Pith reviewed 2026-05-21 16:59 UTC · model grok-4.3


keywords: robotics, data generation, compositional models, diffusion transformer, zero-shot learning, reinforcement learning, multi-object manipulation


The pith

A semantic compositional diffusion transformer generates high-quality transitions for unseen robotic task combinations after training on a limited subset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the expense of collecting robotic manipulation demonstrations for the vast number of possible task combinations in multi-object and multi-robot settings. It introduces a model that breaks down transitions into separate components for the robot, objects, obstacles, and objectives, then uses attention to capture how these parts interact. After training on only some tasks, this allows generating synthetic data for completely new combinations without additional real demonstrations. An iterative process then validates this data with reinforcement learning and refines the model, leading to solving nearly all previously unseen tasks. This matters for scaling robot learning because it reduces reliance on costly real-world data collection for every possible scenario.

Core claim

By factorizing transitions into robot-, object-, obstacle-, and objective-specific components and learning their interactions through attention, the semantic compositional diffusion transformer can zero-shot generate high-quality transitions for unseen task combinations once trained on a limited subset. Incorporating an iterative self-improvement procedure where synthetic data is validated via offline reinforcement learning and fed back into training substantially improves performance, ultimately solving nearly all held-out tasks and revealing emergent compositional structure in the representations.
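To make the factorize-then-attend idea concrete, here is a minimal sketch in plain Python. It is not the paper's architecture: it assumes one token per component, uses identity projections in place of learned query/key/value weights, and the token values are hypothetical placeholders. It shows only how attention lets each component's representation mix in information from the other three.

```python
import math

# One token per semantic component, as in the four-way factorization.
COMPONENTS = ["robot", "object", "obstacle", "objective"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head self-attention with identity projections (assumption).

    Returns the mixed tokens and the attention weights, where
    weights[i][j] is how much component i attends to component j.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        row = softmax([dot(query, key) / scale for key in tokens])
        weights.append(row)
        outputs.append(
            [sum(row[j] * tokens[j][d] for j in range(len(tokens)))
             for d in range(dim)]
        )
    return outputs, weights


# Hypothetical component embeddings (illustrative values only).
tokens = [
    [1.0, 0.0, 0.0, 0.5],  # robot
    [0.0, 1.0, 0.0, 0.5],  # object
    [0.0, 0.0, 1.0, 0.5],  # obstacle
    [0.5, 0.5, 0.0, 1.0],  # objective
]
mixed, attn = self_attention(tokens)
for name, row in zip(COMPONENTS, attn):
    print(name, [round(w, 3) for w in row])
```

Each row of `attn` sums to one, so every component's output is a convex combination of all four component tokens; a learned model would replace the identity projections with trained weights.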

What carries the argument

A semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention.

If this is right

High-quality transitions can be generated for task combinations not seen during training.
Control policies for these unseen tasks can be learned from the generated synthetic data.
The iterative self-improvement procedure boosts zero-shot performance over standard baselines.
Nearly all held-out tasks become solvable after the self-improvement rounds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar compositional factorization could extend to other sequential decision-making domains with modular elements.
Validating generated data in simulation before real-world use would help ensure transferability.
If attention captures interactions well, the approach might handle variations in robot or environment not seen in training.

Load-bearing premise

The robotic manipulation domain has a clean compositional structure that factors into independent robot, object, obstacle, and objective components whose interactions can be learned through attention without needing real data for every possible combination.

What would settle it

If control policies learned from the model's generated transitions for held-out tasks show low success rates in a robot simulation benchmark even after multiple self-improvement iterations, this would indicate the central claim does not hold.

Figures

Figures reproduced from arXiv: 2512.10891 by Anh-Quan Pham, Dani S. Bassett, Eric Eaton, Jorge Mendez-Mendez, Marcel Hussing, Shubhankar P. Patankar.

**Figure 2.** Four example CompoSuite tasks, each defined by selecting one element from each axis.

**Figure 3.** Overview of the 16-dimensional task indicator. For every task in CompoSuite, the model receives…

**Figure 4.** Visualization of our semantic compositional transformer architecture. We factorize each transition…

**Figure 5.** Zero-shot success rate for non-iterable RL models and RL models trained on synthetic data…

**Figure 6.** Performance of different diffusion architectures over iterations of our self-improvement procedure.

**Figure 7.** Average intervention influence over all training tasks. The heatmap shows how masking each encoder module (rows) changes the predictions produced by each decoder module (columns). Brighter values indicate a larger influence. The plot shows strong diagonal effects, meaning each factor depends most on its own encoder, but it also reveals notable cross-factor interactions. In particular, the robot encoder…

**Figure 9.** Return difference between RL policies trained on data generated by the monolithic architecture and policies trained on ground-truth data over varying number of training tasks. As the number of training tasks approaches 56 (∼ 20%), there is a steep increase in the performance gap, indicating the sub-optimality of generated data from the diffusion model. When sufficient expert data is available, standard feed-f…

**Figure 10.** Performance of the monolithic SynthER-based diffusion architecture and the semantic compo…

**Figure 11.** Visualization of the 14 training tasks used in the IIWA-only split. Tasks are shown in numerical…

**Figure 12.** Visualization of the 32 held-out test tasks used for zero-shot evaluation. Tasks are displayed…
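The 16-dimensional task indicator mentioned in the Figure 3 caption is consistent with four axes of four elements each. The sketch below encodes a task as four concatenated one-hot vectors. The object, obstacle, and objective names are taken from the task lists on this page; the robot names other than IIWA, and the ordering of the elements within each axis, are assumptions made for illustration.

```python
from itertools import product

# Axis elements. Only IIWA is named on this page; the other robot names
# and the ordering within each axis are assumptions.
AXES = {
    "robot": ["IIWA", "Jaco", "Kinova3", "Panda"],
    "object": ["Box", "Dumbbell", "Plate", "Hollowbox"],
    "obstacle": ["None", "GoalWall", "ObjectDoor", "ObjectWall"],
    "objective": ["PickPlace", "Push", "Shelf", "Trashcan"],
}


def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]


def task_indicator(robot, obj, obstacle, objective):
    """Concatenate one one-hot vector per axis into a 16-dim indicator."""
    choice = {"robot": robot, "object": obj,
              "obstacle": obstacle, "objective": objective}
    vector = []
    for axis, elements in AXES.items():
        vector.extend(one_hot(elements.index(choice[axis]), len(elements)))
    return vector


# Every combination of one element per axis is a distinct task.
all_tasks = list(product(*AXES.values()))
iiwa_only = [task for task in all_tasks if task[0] == "IIWA"]

print(len(all_tasks), len(iiwa_only))
print(task_indicator("IIWA", "Dumbbell", "ObjectWall", "Push"))
```

Under these assumptions the full space has 4 × 4 × 4 × 4 = 256 tasks and the IIWA-only slice has 64, which is why collecting demonstrations for every combination scales poorly.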

Original abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. Once trained on a limited subset of tasks, we show that our model can zero-shot generate high-quality transitions from which we can learn control policies for unseen task combinations. Then, we introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a factorized diffusion model plus RL-validated iteration can generate usable data for unseen robotic task combos, but the abstract gives no numbers to judge how well it actually works.


This paper's main result is that a diffusion model factorized by semantic components can generate data for new robot task combinations after training on a subset, and that an iterative loop with offline RL validation gets it to cover almost all held-out tasks. The new part is the four-way factorization into robot, object, obstacle, and objective, with attention handling interactions, plus the self-improvement procedure that adds validated synthetic data back into training. That combination is not in the cited prior work.

It does well at showing the idea can beat simple baselines and at suggesting that compositional structure emerges in the learned representations. The problem setup is clear and the goal of scalable data for combinatorial robotics is important.

The soft spots are mostly around evidence. The abstract mentions clear outperformance and near-complete coverage but gives no numbers, no variances, and no specifics on how the RL validation keeps errors from building up. If the domain has couplings that attention can't separate, like mass affecting multiple things at once, the zero-shot generation could fail for some cases. The stress-test concern about non-compositional effects is worth checking in the full experiments.

This is for robotics researchers focused on learning from limited data or generative models for control. Anyone thinking about multi-task or compositional generalization in physical systems could get something out of it. I would send it to peer review. The core idea is solid enough and the claims are specific enough to be worth referee time, though the authors will likely need to add quantitative details and tests for the factorization assumption.

Referee Report

3 major / 2 minor

Summary. The paper proposes a semantic compositional diffusion transformer that factorizes robotic manipulation transitions into robot-, object-, obstacle-, and objective-specific components, learning their interactions via attention. Trained on a limited subset of tasks, the model zero-shot generates high-quality transitions for unseen task combinations; an iterative self-improvement loop then validates synthetic data via offline RL and retrains, yielding substantial gains over monolithic and hard-coded baselines and solving nearly all held-out tasks while exhibiting emergent compositional structure in the representations.

Significance. If the empirical claims hold, the work would be significant for robotics data generation by tackling the combinatorial explosion in multi-object/multi-robot settings through explicit factorization and self-improvement rather than pure scaling. The incorporation of externally validated RL performance into the loop (rather than purely self-referential fitting) is a strength, as is the focus on zero-shot generalization to held-out combinations. This could reduce reliance on expensive real demonstrations if the learned representations prove robust to component recombination.

major comments (3)

[Abstract / Experiments] Abstract and Experiments section: the central claim of 'near-complete coverage of held-out tasks' and 'substantially improves zero-shot performance' is presented without quantitative metrics, success rates, error bars, or the number of held-out tasks/combinations; this is load-bearing for assessing whether the iterative procedure actually solves the generalization problem or merely reports qualitative improvement.
[Method] Method section on the semantic compositional diffusion transformer: the factorization into four independent components whose interactions are captured solely by attention is asserted to enable zero-shot synthesis, but no explicit test (e.g., component-swap invariance or dynamic consistency of generated trajectories under novel combinations) is described to rule out overfitting to seen co-occurrences or irreducible couplings such as object mass simultaneously affecting grasp and avoidance.
[Iterative Procedure] Iterative self-improvement procedure: insufficient detail is given on how offline RL validation filters or corrects low-quality synthetic data to prevent compounding errors in subsequent rounds; without this mechanism being load-bearing and quantified, the self-improvement loop risks reinforcing inconsistencies rather than reliably expanding coverage.

minor comments (2)

[Method] Clarify the precise form of the semantic embeddings for each component (robot, object, etc.) and how they are injected into the diffusion transformer to aid reproducibility.
[Abstract] The abstract could explicitly name the monolithic and hard-coded compositional baselines used for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment in turn below, clarifying our approach where possible and noting revisions to strengthen the manuscript.


Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim of 'near-complete coverage of held-out tasks' and 'substantially improves zero-shot performance' is presented without quantitative metrics, success rates, error bars, or the number of held-out tasks/combinations; this is load-bearing for assessing whether the iterative procedure actually solves the generalization problem or merely reports qualitative improvement.

Authors: We agree that explicit quantitative metrics are necessary to substantiate the claims about coverage and improvement. The original manuscript reports results primarily through qualitative descriptions and selected figures in the Experiments section. In the revised version we have updated the abstract to report concrete success rates (including the number of held-out combinations), added error bars to all relevant plots, and included a table summarizing performance across seeds and task counts. revision: yes
Referee: [Method] Method section on the semantic compositional diffusion transformer: the factorization into four independent components whose interactions are captured solely by attention is asserted to enable zero-shot synthesis, but no explicit test (e.g., component-swap invariance or dynamic consistency of generated trajectories under novel combinations) is described to rule out overfitting to seen co-occurrences or irreducible couplings such as object mass simultaneously affecting grasp and avoidance.

Authors: The primary evidence for the factorization's utility is the model's ability to generate usable transitions for held-out combinations that were never seen during training. We acknowledge that additional targeted diagnostics would make the argument more robust. We have therefore added component-swap experiments and trajectory-consistency checks in the revised manuscript. We note, however, that certain physical couplings (such as mass affecting both grasp and avoidance) are irreducible by design and are handled through the learned attention rather than explicit separation. revision: partial
Referee: [Iterative Procedure] Iterative self-improvement procedure: insufficient detail is given on how offline RL validation filters or corrects low-quality synthetic data to prevent compounding errors in subsequent rounds; without this mechanism being load-bearing and quantified, the self-improvement loop risks reinforcing inconsistencies rather than reliably expanding coverage.

Authors: We have expanded the Method section with a precise description of the validation step: synthetic transitions are scored by the value function of a policy trained via offline RL on the current dataset, and only transitions whose estimated return exceeds a fixed threshold are retained. We have also added an ablation that removes this filtering step and quantifies the resulting degradation in later iterations, demonstrating that the validation mechanism measurably limits error accumulation. revision: yes
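The filtering step described in this last response can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Transition` type, the toy `value_fn`, and the threshold value are all hypothetical, and in practice the scores would come from the value function of a policy trained via offline RL.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Transition:
    """Hypothetical container for one (s, a, r, s') transition."""
    state: Sequence[float]
    action: Sequence[float]
    reward: float
    next_state: Sequence[float]


def filter_transitions(
    candidates: List[Transition],
    value_fn: Callable[[Transition], float],
    threshold: float,
) -> List[Transition]:
    """Keep only transitions whose estimated return exceeds the threshold."""
    return [t for t in candidates if value_fn(t) > threshold]


def self_improvement_round(
    dataset: List[Transition],
    synthetic: List[Transition],
    value_fn: Callable[[Transition], float],
    threshold: float,
) -> Tuple[List[Transition], List[Transition]]:
    """Return the dataset for the next training round and the kept samples."""
    kept = filter_transitions(synthetic, value_fn, threshold)
    return list(dataset) + kept, kept


def toy_value_fn(t: Transition) -> float:
    # Stand-in for a learned value estimate (assumption): one-step reward
    # plus a discounted placeholder for the value of the next state.
    return t.reward + 0.9 * sum(t.next_state)


real = [Transition((0.0,), (0.1,), 1.0, (0.5,))]
synthetic = [
    Transition((0.0,), (0.2,), 0.8, (0.4,)),   # plausible sample
    Transition((0.0,), (0.9,), -1.0, (0.1,)),  # low-quality sample
]
next_dataset, kept = self_improvement_round(real, synthetic, toy_value_fn, 0.5)
print(len(next_dataset), len(kept))
```

A fixed threshold is the simplest choice; the ablation the authors mention would correspond to calling `self_improvement_round` with a threshold low enough that nothing is rejected.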

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical held-out evaluation and external RL validation


The paper presents an empirical method: a semantic compositional diffusion transformer trained on a limited subset of task combinations, followed by zero-shot generation for unseen combinations and an iterative loop that filters synthetic data using offline RL performance. No derivation chain reduces a claimed prediction or first-principles result to its own inputs by construction. The factorization into robot/object/obstacle/objective components is an explicit architectural ansatz implemented via attention, not a self-definitional mapping. The iterative self-improvement incorporates externally measured RL success on held-out tasks rather than purely internal fitting. No self-citations are invoked as load-bearing uniqueness theorems, and no fitted parameters are relabeled as independent predictions. The central results are therefore self-contained against external benchmarks (held-out task success rates) and do not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that robotic tasks admit a clean four-way factorization whose interactions can be captured by attention, plus standard diffusion-model hyperparameters that are fitted during training.

free parameters (2)

diffusion noise schedule and step count
Standard diffusion hyperparameters chosen to match the data distribution; their specific values affect generation quality.
attention layer dimensions and number of heads
Model capacity choices that are tuned on the limited training subset.
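For readers unfamiliar with the first of these free parameters, the sketch below shows what a noise schedule and step count control in a standard DDPM-style diffusion model. The linear schedule, its endpoints, and the 1000-step count are common defaults used here as assumptions; the values the paper actually uses are not stated on this page.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced per-step noise variances (assumed defaults)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_noise(x0, alpha_bar_t, eps):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + noise * e for x, e in zip(x0, eps)]


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
print(alpha_bars[0], alpha_bars[-1])
```

The remaining signal fraction `alpha_bars[t]` shrinks toward zero as `t` grows, so the schedule and step count together determine how quickly a transition is corrupted during training and how many denoising steps generation needs.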

axioms (1)

domain assumption: Robotic manipulation domains possess compositional structure factorizable into robot-, object-, obstacle-, and objective-specific components.
Directly invoked when the model is defined to factorize transitions and learn interactions through attention.

invented entities (1)

semantic compositional diffusion transformer (no independent evidence)
purpose: To generate high-quality transitions for unseen task combinations by factorizing and attending over components.
New architecture introduced by the paper; no independent evidence outside the reported experiments.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "We assume a functionally compositional task graph... transformer as a GNN"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

[1] IIWA, Box, ObjectDoor, Trashcan
[2] IIWA, Hollowbox, ObjectDoor, PickPlace
[3] IIWA, Dumbbell, ObjectDoor, PickPlace
[4] IIWA, Dumbbell, ObjectWall, Push
[5] IIWA, Plate, None, Shelf
[6] IIWA, Box, GoalWall, Trashcan
[7] IIWA, Plate, ObjectWall, Shelf
[8] IIWA, Hollowbox, GoalWall, Trashcan
[9] IIWA, Box, ObjectWall, Shelf
[10] IIWA, Box, None, Trashcan
[11] IIWA, Plate, ObjectWall, PickPlace
[12] IIWA, Box, GoalWall, PickPlace
[13] IIWA, Box, None, Push
[14] IIWA, Box, ObjectDoor, Shelf
[15] IIWA, Dumbbell, GoalWall, Shelf
[16] IIWA, Box, None, PickPlace
[17] IIWA, Box, GoalWall, Shelf
[18] IIWA, Hollowbox, None, PickPlace
[19] IIWA, Dumbbell, ObjectDoor, Push
[20] IIWA, Box, None, Shelf
[21] IIWA, Plate, None, PickPlace
[22] IIWA, Dumbbell, None, Shelf
[23] IIWA, Dumbbell, ObjectDoor, Shelf
[24] IIWA, Hollowbox, GoalWall, PickPlace
[25] IIWA, Dumbbell, GoalWall, Trashcan
[26] IIWA, Plate, ObjectDoor, Push
[27] IIWA, Plate, ObjectDoor, Shelf
[28] IIWA, Hollowbox, None, Trashcan
[29] IIWA, Box, ObjectDoor, PickPlace
[30] IIWA, Box, ObjectDoor, Push
[31] IIWA, Hollowbox, None, Shelf
[32] IIWA, Dumbbell, ObjectWall, Shelf
[33] IIWA, Hollowbox, GoalWall, Shelf
[34] IIWA, Box, ObjectWall, Push
[35] IIWA, Hollowbox, ObjectWall, Shelf
[36] IIWA, Hollowbox, None, Push
[37] IIWA, Plate, GoalWall, Shelf
[38] IIWA, Plate, ObjectDoor, PickPlace
[39] IIWA, Plate, GoalWall, Trashcan
[40] IIWA, Dumbbell, GoalWall, PickPlace
[41] IIWA, Hollowbox, ObjectDoor, Trashcan
[42] IIWA, Dumbbell, ObjectWall, Trashcan
[43] IIWA, Plate, None, Push
[44] IIWA, Plate, GoalWall, Push
[45] IIWA, Dumbbell, None, Push
[46] IIWA, Plate, GoalWall, PickPlace. Figure 11: Visualization of the 14 training tasks used in the IIWA-only split. Tasks are shown in numerical order (1–14), arranged left-to-right and top-to-bottom. Each image depicts one unique combination of Object, Obstacle, and Objective paired with the IIWA robot. Figure 12: Visualization of the 32 held-out test tasks ...
