arxiv: 2508.06206 · v5 · pith:BF4S6654new · submitted 2025-08-08 · 💻 cs.RO · cs.CV

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Pith reviewed 2026-05-22 00:03 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords affordance grounding · reinforcement learning · GRPO · multimodal large language models · zero-shot generalization · chain-of-thought reasoning · embodied AI · robot perception

The pith

Reinforcement learning with a custom reward function enables zero-shot generalization and emergent reasoning in affordance grounding for multimodal models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a multimodal large language model can learn to predict action-relevant regions on objects by training exclusively through reinforcement learning with Group Relative Policy Optimization, guided by a reward function that scores format, perception, and cognition. This setup uses no explicit reasoning examples yet produces robust performance on objects and scenes outside the training distribution along with step-by-step reasoning that appears only at test time. A reader would care because affordance prediction is central to robots acting on everyday objects, and methods that improve out-of-domain performance without large labeled reasoning datasets could reduce data collection costs in embodied robotics.

Core claim

Affordance-R1 integrates cognitive Chain-of-Thought guided Group Relative Policy Optimization within a reinforcement learning paradigm for affordance grounding. A custom affordance reward function supplies format, perception, and cognition rewards to steer optimization, and training occurs on the ReasonAff dataset. When trained solely via this RL procedure and without explicit reasoning data, the model achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities.

What carries the argument

Group Relative Policy Optimization (GRPO) directed by a custom affordance reward function that scores format compliance, perceptual accuracy, and cognitive reasoning quality to shape the model's outputs on object-action regions.
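To make the "group relative" part concrete, the following is a minimal sketch of how GRPO-style advantages are commonly computed: several responses are sampled for the same prompt, each is scored by the reward function, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each response's reward against its own sampling group.

    `rewards` holds one scalar per response sampled for a single prompt.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four responses to one image-instruction prompt.
rewards = [2.0, 1.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below the mean are discouraged. Because the baseline comes from the group itself, no separate value network is needed.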

If this is right

  • The model outperforms established methods on standard affordance grounding benchmarks.
  • It demonstrates open-world generalization across unseen objects and environments.
  • It supports improved performance in human-robot interaction and embodied manipulation tasks.
  • By the authors' account, it is the first approach to combine GRPO-based reinforcement learning with reasoning for affordance problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the emergent reasoning persists across domains, the same RL recipe could be tested on other grounding or planning tasks that currently require supervised reasoning traces.
  • Deploying the model on physical robot hardware would test whether predicted affordance regions translate into successful real-world actions.
  • The separation of training from explicit reasoning data suggests a route to elicit capabilities in multimodal models for tasks where labeled reasoning examples are scarce.

Load-bearing premise

The custom affordance function is assumed to deliver effective and unbiased optimization signals that produce genuine generalization and reasoning rather than overfitting to the training distribution.

What would settle it

Measure accuracy and reasoning quality on a new collection of objects drawn from categories absent from training and check whether both metrics remain high; a sharp drop would indicate the generalization claim does not hold.
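The check described above can be sketched as a split of evaluation records into categories seen during training and categories held out, followed by a comparison of accuracy on the two splits. The record layout, the category names, and the correctness flags below are toy placeholders chosen for illustration; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_split(records, train_categories):
    """Return accuracy on seen vs. unseen object categories.

    `records` is an iterable of (category, is_correct) pairs, where
    is_correct is whatever pass/fail criterion the evaluation uses.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, is_correct in records:
        split = "seen" if category in train_categories else "unseen"
        totals[split] += 1
        hits[split] += int(is_correct)
    return {split: hits[split] / totals[split] for split in totals}


# Toy placeholder data, not measurements.
train_categories = {"cup", "knife"}
records = [
    ("cup", True), ("cup", False), ("knife", True),
    ("drone", True), ("kayak", False), ("kayak", False),
]

scores = accuracy_by_split(records, train_categories)
gap = scores["seen"] - scores["unseen"]
print(scores, f"gap={gap:.3f}")
```

A large positive gap would be the "sharp drop" that undercuts the generalization claim. A gap near zero on genuinely held-out categories would support it.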

Original abstract

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack the Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset is released on https://github.com/hq-King/Affordance-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Affordance-R1, a multimodal LLM framework for affordance grounding that integrates Chain-of-Thought reasoning via Group Relative Policy Optimization (GRPO) reinforcement learning. It introduces a custom affordance reward function with format, perception, and cognition components, constructs the ReasonAff reasoning dataset, and trains exclusively with RL without supervised reasoning traces. The central claim is that this yields robust zero-shot out-of-domain generalization, emergent test-time reasoning, and superior performance over prior methods in comprehensive experiments on embodied tasks.

Significance. If substantiated, the result would be significant for embodied AI and robotics by demonstrating that GRPO-based RL can induce transferable affordance reasoning without explicit supervision, addressing limitations in OOD generalization for human-robot interaction and manipulation. The public release of code and the ReasonAff dataset strengthens reproducibility and enables follow-up work.

major comments (2)
  1. [Section 3 (reward function and GRPO training)] The central claim of genuine zero-shot OOD generalization and emergent reasoning rests on the custom affordance reward function (format + perception + cognition) providing unbiased optimization signals. However, the manuscript does not include ablations or distributional analysis showing that the cognition or perception rewards are independent of patterns in the ReasonAff training distribution (e.g., object/action correlations or LLM-judge biases). This is load-bearing: without such evidence, the policy may optimize for reward-maximizing output formats that succeed on train-like data rather than acquiring transferable reasoning, directly undermining the no-explicit-reasoning-data generalization assertion.
  2. [Section 4 (experiments and results)] While the abstract and introduction assert comprehensive experiments and outperformance, the quantitative support for robust zero-shot generalization (e.g., specific OOD metrics, error analysis, or comparisons isolating the RL component) is not presented in sufficient detail to evaluate the claim against the skeptical concern that the model is pattern-matching on ReasonAff. This weakens the evidential basis for the strongest claim.
minor comments (2)
  1. [Section 3.2] Notation for the affordance reward components could be clarified with explicit equations or pseudocode to distinguish how each term is computed (e.g., whether perception reward uses ground-truth annotations or model predictions).
  2. [Section 4.1] The manuscript would benefit from a clearer statement of the baseline methods and exact evaluation protocols (e.g., IoU thresholds for affordance grounding) to facilitate direct comparison.
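As an illustration of the kind of pseudocode the minor comments ask for, here is a hedged sketch of a three-part reward with a thresholded IoU term. Everything specific in it is an assumption made for the example: the `<think>`/`<answer>` tag layout, the box representation, the 0.5 threshold, the use of ground-truth boxes for the perception term, the unweighted sum, and the externally supplied `cognition_score`. The paper's actual reward definitions may differ on every one of these points.

```python
import re


def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def format_reward(text):
    """1.0 if the output follows the assumed <think>/<answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0


def perception_reward(pred_box, gt_box, threshold=0.5):
    """1.0 if the predicted box overlaps the ground truth enough (assumed rule)."""
    return 1.0 if box_iou(pred_box, gt_box) >= threshold else 0.0


def total_reward(text, pred_box, gt_box, cognition_score):
    """Unweighted sum of the three terms; cognition_score is a placeholder
    for whatever judge or rule scores the reasoning."""
    return format_reward(text) + perception_reward(pred_box, gt_box) + cognition_score


# Hypothetical model output and boxes.
output = "<think>The handle is where a hand would grip.</think><answer>[10, 10, 50, 50]</answer>"
print(total_reward(output, (10, 10, 50, 50), (12, 12, 50, 50), cognition_score=0.5))
```

Spelling the terms out this way makes the referee's question answerable by inspection: in this sketch the perception term depends on ground-truth annotations, and the IoU threshold is an explicit parameter rather than an implicit protocol detail.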

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We have carefully reviewed the major concerns and provide detailed point-by-point responses below. We agree that additional evidence and detail will strengthen the presentation of our claims regarding zero-shot generalization and emergent reasoning in Affordance-R1.

Point-by-point responses
  1. Referee: [Section 3 (reward function and GRPO training)] The central claim of genuine zero-shot OOD generalization and emergent reasoning rests on the custom affordance reward function (format + perception + cognition) providing unbiased optimization signals. However, the manuscript does not include ablations or distributional analysis showing that the cognition or perception rewards are independent of patterns in the ReasonAff training distribution (e.g., object/action correlations or LLM-judge biases). This is load-bearing: without such evidence, the policy may optimize for reward-maximizing output formats that succeed on train-like data rather than acquiring transferable reasoning, directly undermining the no-explicit-reasoning-data generalization assertion.

    Authors: We agree that verifying the independence of the reward components from training distribution patterns is essential to support the claims of emergent, transferable reasoning. In the revised manuscript, we will add ablations isolating the format, perception, and cognition rewards and report their individual effects on OOD generalization performance. We will also include a distributional analysis comparing reward signals and output patterns on in-distribution versus OOD samples, along with checks for correlations with object-action pairs or potential LLM-judge biases in ReasonAff. These additions will help demonstrate that the rewards encourage generalizable cognitive processes rather than dataset-specific optimization. revision: yes

  2. Referee: [Section 4 (experiments and results)] While the abstract and introduction assert comprehensive experiments and outperformance, the quantitative support for robust zero-shot generalization (e.g., specific OOD metrics, error analysis, or comparisons isolating the RL component) is not presented in sufficient detail to evaluate the claim against the skeptical concern that the model is pattern-matching on ReasonAff. This weakens the evidential basis for the strongest claim.

    Authors: We acknowledge that the experiments section would benefit from expanded quantitative detail to more convincingly address concerns about potential pattern-matching. In the revision, we will augment the results with finer-grained OOD metrics stratified by domain shift categories, a systematic error analysis with categorized failure modes, and targeted ablations that isolate the GRPO reinforcement learning component (e.g., comparisons against supervised fine-tuning baselines and non-RL variants). These enhancements will provide stronger evidential support for the zero-shot generalization and emergent reasoning assertions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in claimed generalization from RL training

Full rationale

The paper trains via GRPO reinforcement learning using an externally hand-designed affordance reward function (format + perception + cognition components) on the constructed ReasonAff dataset, without supervised reasoning traces. The claimed zero-shot OOD generalization and emergent test-time reasoning are presented as empirical outcomes of this optimization process rather than quantities defined in terms of the inputs or fitted directly to evaluation metrics. No self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text; the reward signal is independent of the target generalization metric.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the three-part reward function and the assumption that RL alone can produce emergent reasoning without supervised CoT data.

axioms (1)
  • domain assumption: The affordance function with format, perception, and cognition rewards guides optimization directions effectively.
    Invoked as the mechanism that enables successful GRPO training and generalization.

pith-pipeline@v0.9.0 · 5815 in / 1229 out tokens · 43131 ms · 2026-05-22T00:03:07.538439+00:00 · methodology


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  2. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV 2025-12 unverdicted novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  3. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...