SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

Justin Yu; Mac Schwager; Philipp Wu; Pieter Abbeel; Qianzhong Chen; Yide Shentu

arxiv: 2509.25358 · v4 · submitted 2025-09-29 · 💻 cs.RO

Qianzhong Chen, Justin Yu, Mac Schwager, Pieter Abbeel, Yide Shentu, Philipp Wu

Pith reviewed 2026-05-18 12:17 UTC · model grok-4.3

keywords: robot manipulation, reward modeling, behavior cloning, deformable objects, long-horizon tasks, imitation learning, stage-aware supervision

The pith

Stage-aware reward modeling with natural language subtask annotations provides consistent progress labels for long-horizon robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reward modeling method for long sequences of robot actions on deformable objects, where demonstration quality varies. It uses natural language descriptions of subtasks to derive labels, then trains a video model to predict both the current stage of the task and the fine-grained progress within that stage. This produces more reliable training signals than labeling by raw frame index, which breaks down when demonstrations differ in length or execution speed. The resulting rewards then support a filtered and reweighted form of behavior cloning that focuses training on higher-quality examples. In real-robot T-shirt folding experiments, reported success rises from 8% to 83% when starting from a flattened shirt and from 0% to 67% when starting from a crumpled one.
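To make the contrast with frame-index labeling concrete, here is a minimal sketch. It assumes (this is not taken from the paper) that each annotated subtask owns an equal share of the [0, 1] progress range and that progress is interpolated linearly inside a subtask. The function names and the toy stage boundaries are hypothetical.

```python
from bisect import bisect_right


def frame_index_progress(num_frames):
    """Naive label: progress is the normalized frame index."""
    return [t / (num_frames - 1) for t in range(num_frames)]


def stage_aware_progress(num_frames, stage_starts):
    """Label each frame with (stage, overall progress).

    stage_starts[k] is the first frame of annotated subtask k, with
    stage_starts[0] == 0. Assumption: every stage spans an equal slice
    of [0, 1], and progress is linear within a stage.
    """
    num_stages = len(stage_starts)
    bounds = list(stage_starts) + [num_frames]
    labels = []
    for t in range(num_frames):
        k = bisect_right(stage_starts, t) - 1      # stage containing frame t
        span = bounds[k + 1] - bounds[k]           # length of that stage in frames
        within = (t - bounds[k]) / span            # within-stage progress in [0, 1)
        labels.append((k, (k + within) / num_stages))
    return labels


# Two hypothetical demos of the same 3-stage task; the slow one lingers in stage 0.
fast = stage_aware_progress(30, [0, 10, 20])
slow = stage_aware_progress(60, [0, 40, 50])

# At the frame where stage 1 begins, both demos receive the same stage-aware label,
print(fast[10], slow[40])
# while the frame-index labels for the same moment disagree.
print(frame_index_progress(30)[10], frame_index_progress(60)[40])
```

Under these assumptions, the label depends on where a frame sits relative to the annotated subtask boundaries and not on how long the demonstrator took to get there.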

Core claim

We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame index based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates.

What carries the argument

Stage-aware reward model that jointly predicts task stage and fine-grained progress from video observations using natural language subtask annotations to generate consistent supervision signals.
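One way to read "jointly predicts task stage and fine-grained progress" is as a two-term training loss. The sketch below is an illustration under stated assumptions, not the paper's objective: it combines a cross-entropy term on the stage with a squared-error term on within-stage progress. The weight `progress_weight`, the equal-share combination in `overall_progress`, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(stage_logits, progress_pred, stage_label, progress_label,
               progress_weight=1.0):
    """Cross-entropy on the stage plus squared error on within-stage progress."""
    probs = softmax(stage_logits)
    stage_ce = -math.log(probs[stage_label])
    progress_se = (progress_pred - progress_label) ** 2
    return stage_ce + progress_weight * progress_se


def overall_progress(stage_logits, progress_pred):
    """Fold predicted stage and within-stage progress into one scalar in [0, 1].

    Assumption: each stage covers an equal share of the task.
    """
    probs = softmax(stage_logits)
    stage = max(range(len(probs)), key=probs.__getitem__)
    clipped = min(max(progress_pred, 0.0), 1.0)
    return (stage + clipped) / len(probs)


# Same progress prediction, but only the first call puts most mass on the true stage.
loss_good = joint_loss([0.1, 3.0, 0.2], 0.45, stage_label=1, progress_label=0.5)
loss_bad = joint_loss([3.0, 0.1, 0.2], 0.45, stage_label=1, progress_label=0.5)
print(loss_good < loss_bad)
```

In a real model the logits and progress prediction would come from a video encoder; here they are plain numbers so the arithmetic of the objective is easy to follow.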

If this is right

Reward estimates enable filtering and reweighting of demonstrations to improve policy training via Reward-Aligned Behavior Cloning.
The method achieves 83 percent success on T-shirt folding from the flattened state and 67 percent from the crumpled state in real-world rollouts.
The reward model generalizes to out-of-distribution scenarios, and the overall method significantly outperforms baselines in both real-world rollouts and human validation.
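The filter-and-reweight idea behind RA-BC can be sketched as follows. Everything specific here is assumed for illustration: scoring a segment by how much estimated progress it gains, the threshold `min_gain`, and the weight normalization are hypothetical choices, not the paper's stated rule.

```python
def segment_weights(progress_gains, min_gain=0.0):
    """Map per-segment progress gains (from a reward model) to BC loss weights.

    Segments whose gain is at or below min_gain are filtered out (weight 0).
    The remaining segments are weighted in proportion to their gain above
    the threshold, scaled so the kept weights average to 1.
    """
    kept = [g - min_gain if g > min_gain else 0.0 for g in progress_gains]
    n_kept = sum(1 for w in kept if w > 0)
    if n_kept == 0:
        return [0.0] * len(progress_gains)
    scale = n_kept / sum(kept)
    return [w * scale for w in kept]


def weighted_bc_loss(per_segment_losses, weights):
    """Weighted mean of per-segment imitation losses."""
    total_w = sum(weights)
    if total_w == 0:
        return 0.0
    return sum(l * w for l, w in zip(per_segment_losses, weights)) / total_w


# Hypothetical progress gains for four demonstration segments.
gains = [0.05, -0.02, 0.10, 0.0]
weights = segment_weights(gains)
loss = weighted_bc_loss([1.0, 4.0, 0.5, 2.0], weights)
print(weights, loss)
```

Segments that regress or stall contribute nothing to the loss in this sketch, while segments that advance the task faster count for more.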

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Consistent language-based subtask labels could reduce the need for precisely timed annotations when collecting robot demonstration data.
Similar stage-aware reward modeling may extend to other sequential robot skills that involve variable execution speeds or object states.
The approach points toward using lightweight natural language input to structure rewards for longer-horizon tasks without dense manual labeling.

Load-bearing premise

Natural language subtask annotations can be provided consistently across demonstrations of varying length and quality to produce stable, non-brittle progress labels.

What would settle it

Measuring whether the reward model produces accurate stage and progress predictions on new video sequences with inconsistent lengths or qualities, and whether policies trained using the filtered RA-BC process maintain the reported success rates on those sequences.

Original abstract

Large-scale robot learning has made progress on complex manipulation tasks, yet long horizon, contact rich problems, especially those involving deformable objects, remain challenging due to inconsistent demonstration quality. We propose a stage-aware, video-based reward modeling framework that jointly predicts task stage and fine-grained progress, using natural language subtask annotations to derive consistent labels across variable-length demonstrations. This avoids the brittleness of frame index based labeling and provides stable supervision even in tasks like T-shirt folding. Our reward model is robust to demonstration variability, generalizes to out-of-distribution scenarios, and improves downstream policy training. Building on it, we introduce Reward-Aligned Behavior Cloning (RA-BC), which filters and reweights demonstrations based on reward estimates. Experiments show that our method significantly outperforms baselines in both real-world rollouts and human validation. On T-shirt folding, we achieve 83% success from the flattened state and 67% from the crumpled state, compared to 8% and 0% with vanilla BC. Overall, our results highlight reward modeling as a scalable and annotation-efficient solution for long horizon robotic manipulation. Project website: https://qianzhong-chen.github.io/sarm.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SARM reports large gains on T-shirt folding via language-derived stage and progress labels for reward modeling, but thin experimental details leave the robustness open to question.

The main point is that this paper trains a reward model to jointly predict task stage and fine-grained progress from video, using natural language subtask annotations to label demonstrations of varying length and quality. They then filter and reweight those demos with the predicted rewards in a method they call RA-BC. On real-robot T-shirt folding they report 83% success from flattened and 67% from crumpled states, against 8% and 0% for vanilla behavior cloning. That is the headline result worth checking.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SARM, a stage-aware reward modeling framework for long-horizon robot manipulation tasks involving deformable objects. It derives consistent task stage and fine-grained progress labels from natural language subtask annotations on variable-length demonstrations to avoid frame-index brittleness, jointly predicts these quantities in a video-based reward model, and uses the resulting rewards in Reward-Aligned Behavior Cloning (RA-BC) to filter and reweight demonstrations. Real-robot experiments on T-shirt folding report 83% success from the flattened state and 67% from the crumpled state, versus 8% and 0% for vanilla behavior cloning, with claims of robustness to demonstration variability and out-of-distribution generalization.

Significance. If the empirical claims are substantiated, the work could meaningfully advance scalable reward modeling for contact-rich, long-horizon manipulation by replacing brittle indexing with annotation-derived progress signals. The RA-BC reweighting mechanism offers a practical way to leverage imperfect demonstrations, and the focus on deformable objects addresses an under-served area. Strengths include the explicit handling of variable demonstration quality and the annotation-efficient design, though the absence of supporting experimental controls limits current assessment of impact.

major comments (2)

Abstract and Experiments: The headline results (83% and 67% success on T-shirt folding versus 8% and 0% for vanilla BC) are presented without any information on the number of real-robot trials, statistical significance, baseline implementation details, or failure-mode analysis. These omissions are load-bearing for the central empirical claim, as the quantitative support cannot be evaluated without them.
Methods (label derivation) and Experiments: Progress and stage labels are derived directly from the same natural-language subtask annotations used to define the task stages. No inter-annotator agreement metrics, annotation protocol details, or ablation that removes the language component are reported. This leaves open the possibility that downstream policy gains partly reflect reweighting of already-labeled data rather than discovery of an independent progress signal, which is central to attributing the reported improvements to the proposed framework.
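To illustrate why the trial count matters for the first comment, the sketch below computes a Wilson score interval for a success rate. The trial counts are hypothetical, chosen only so that both rates round to 83%; the text above does not state how many rollouts were run.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical trial counts that both give a success rate of about 83%.
for successes, trials in [(10, 12), (50, 60)]:
    lo, hi = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: [{lo:.2f}, {hi:.2f}]")
```

The same headline percentage supports a much tighter interval with more rollouts, which is why the referee treats the missing trial count as load-bearing.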

minor comments (2)

The description of how the reward model is trained on the derived labels could be expanded with a short pseudocode or diagram to clarify the joint stage-and-progress prediction objective.
Consider reporting the exact number of demonstrations used for training the reward model and for RA-BC filtering, as this directly affects claims of annotation efficiency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where revisions will be incorporated to strengthen the manuscript.

Referee: Abstract and Experiments: The headline results (83% and 67% success on T-shirt folding versus 8% and 0% for vanilla BC) are presented without any information on the number of real-robot trials, statistical significance, baseline implementation details, or failure-mode analysis. These omissions are load-bearing for the central empirical claim, as the quantitative support cannot be evaluated without them.

Authors: We agree that these details are necessary to properly substantiate the empirical claims. In the revised manuscript we will expand the Experiments section to report the precise number of real-robot trials conducted for each condition, include statistical significance testing appropriate to the sample sizes, provide fuller implementation details for all baselines to ensure reproducibility, and add a dedicated failure-mode analysis of observed rollouts.
Revision: yes

Referee: Methods (label derivation) and Experiments: Progress and stage labels are derived directly from the same natural-language subtask annotations used to define the task stages. No inter-annotator agreement metrics, annotation protocol details, or ablation that removes the language component are reported. This leaves open the possibility that downstream policy gains partly reflect reweighting of already-labeled data rather than discovery of an independent progress signal, which is central to attributing the reported improvements to the proposed framework.

Authors: We appreciate the referee's point on the source of supervision. The language annotations are used solely to produce consistent stage and progress labels for training; the reward model is a video-only predictor that must infer these quantities from visual observations at test time. This separation is what permits RA-BC to filter and reweight previously unseen demonstrations. We will add a detailed description of the annotation protocol and label derivation procedure to the Methods section. We will also include an ablation that isolates the contribution of the learned visual progress signal. Inter-annotator agreement can be reported if multiple annotators participated in the original labeling.
Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper derives stage and progress labels from natural language subtask annotations, trains a video-based reward model to predict those labels from observations, and applies the resulting reward estimates to filter/reweight demonstrations in RA-BC. This is a standard supervised learning setup in which the model learns an independent mapping from visual input to the annotation-derived targets; the downstream policy gains arise from the model's generalization rather than any redefinition or direct reuse of the original labels. No equation, prediction step, or self-citation reduces the claimed result to its inputs by construction. The framework is self-contained against external benchmarks (real-world success rates versus vanilla BC) and does not rely on load-bearing self-citations or ansatzes smuggled from prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the availability of consistent natural-language subtask annotations and on the assumption that a video model trained on those annotations will produce rewards that improve policy learning beyond standard behavior cloning.

axioms (1)

Domain assumption: Natural language subtask annotations can be obtained consistently across variable-length demonstrations without introducing new labeling noise.
The method derives progress labels directly from these annotations; inconsistent or subjective labels would propagate into the reward model.
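One way to probe this assumption is to have two annotators mark stages on the same demonstration and measure chance-corrected agreement. The sketch below computes Cohen's kappa on per-frame stage labels. The label sequences are made up for illustration, and nothing above says whether more than one annotator was involved.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-frame stage labels for one 30-frame demonstration.
annotator_a = [0] * 10 + [1] * 10 + [2] * 10
annotator_b = [0] * 12 + [1] * 9 + [2] * 9   # places the stage boundaries slightly later
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

Small disagreements about where a subtask ends lower kappa only slightly; values far below 1 would suggest the annotation noise this ledger entry warns about.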

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations
cs.RO · 2026-05 · unverdicted · novelty 6.0
A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.

Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO · 2026-04 · unverdicted · novelty 6.0
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
cs.RO · 2026-03 · unverdicted · novelty 6.0
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
cs.RO · 2026-01 · unverdicted · novelty 6.0
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.