Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model
Pith reviewed 2026-05-22 00:03 UTC · model grok-4.3
The pith
Reinforcement learning with a custom reward function enables zero-shot generalization and emergent reasoning in affordance grounding for multimodal models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Affordance-R1 integrates cognitive Chain-of-Thought guided Group Relative Policy Optimization within a reinforcement learning paradigm for affordance grounding. A custom affordance function supplies format, perception, and cognition rewards to steer optimization, and training occurs on the ReasonAff dataset. When trained solely via this RL procedure and without explicit reasoning data, the model achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities.
What carries the argument
Group Relative Policy Optimization (GRPO) directed by a custom affordance reward function that scores format compliance, perceptual accuracy, and cognitive reasoning quality to shape the model's outputs on object-action regions.
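The mechanism can be sketched in a few lines. The sketch below assumes an R1-style `<think>`/`<answer>` output template, equal component weights, and a population standard deviation for normalization; none of these details are given on this page, and every name (`WEIGHTS`, `format_reward`, `total_reward`, `group_relative_advantages`) is hypothetical. It shows only the group-relative step that gives GRPO its name: each sampled response is scored, then its reward is standardized against the other responses sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Hypothetical equal weights; the paper's actual weighting is not stated here.
WEIGHTS = {"format": 1.0, "perception": 1.0, "cognition": 1.0}

def format_reward(text: str) -> float:
    """Return 1.0 if the response matches an assumed think/answer template."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0

def total_reward(text: str, perception: float, cognition: float) -> float:
    """Weighted sum of the three components; perception and cognition
    scores are taken as given (their computation is not sketched here)."""
    parts = {
        "format": format_reward(text),
        "perception": perception,
        "cognition": cognition,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantage is relative to the group, a response is reinforced only when it scores above its siblings for the same prompt, which is why the quality of the reward components matters so much for what the policy ends up learning.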
If this is right
- The model outperforms established methods on standard affordance grounding benchmarks.
- It demonstrates open-world generalization across unseen objects and environments.
- It supports improved performance in human-robot interaction and embodied manipulation tasks.
- It is the first approach to combine GRPO-based reinforcement learning with reasoning for affordance problems.
Where Pith is reading between the lines
- If the emergent reasoning persists across domains, the same RL recipe could be tested on other grounding or planning tasks that currently require supervised reasoning traces.
- Deploying the model on physical robot hardware would test whether predicted affordance regions translate into successful real-world actions.
- The separation of training from explicit reasoning data suggests a route to elicit capabilities in multimodal models for tasks where labeled reasoning examples are scarce.
Load-bearing premise
The custom affordance function is assumed to deliver effective and unbiased optimization signals that produce genuine generalization and reasoning rather than overfitting to the training distribution.
What would settle it
Measure accuracy and reasoning quality on a new collection of objects drawn from categories absent from training and check whether both metrics remain high; a sharp drop would indicate the generalization claim does not hold.
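One way to run that check, as a minimal sketch: split evaluation records by whether their object category appeared in training and compare mean scores. The record layout `(category, score)` and the function name `seen_unseen_gap` are assumptions made for illustration, not part of the paper's evaluation code.

```python
from collections import defaultdict

def seen_unseen_gap(records, train_categories):
    """records: iterable of (category, score) pairs with score in [0, 1].

    Returns per-bucket mean scores and the seen-minus-unseen gap
    (None when either bucket is empty).
    """
    buckets = defaultdict(list)
    for category, score in records:
        key = "seen" if category in train_categories else "unseen"
        buckets[key].append(score)
    means = {key: sum(vals) / len(vals) for key, vals in buckets.items()}
    if "seen" in means and "unseen" in means:
        gap = means["seen"] - means["unseen"]
    else:
        gap = None
    return means, gap
```

A large positive gap would point toward the sharp drop described above; a gap near zero on genuinely novel categories would support the generalization claim.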
Original abstract
Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack the Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code of our method and our dataset are released at https://github.com/hq-King/Affordance-R1.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Affordance-R1, a multimodal LLM framework for affordance grounding that integrates Chain-of-Thought reasoning via Group Relative Policy Optimization (GRPO) reinforcement learning. It introduces a custom affordance reward function with format, perception, and cognition components, constructs the ReasonAff reasoning dataset, and trains exclusively with RL without supervised reasoning traces. The central claim is that this yields robust zero-shot out-of-domain generalization, emergent test-time reasoning, and superior performance over prior methods in comprehensive experiments on embodied tasks.
Significance. If substantiated, the result would be significant for embodied AI and robotics by demonstrating that GRPO-based RL can induce transferable affordance reasoning without explicit supervision, addressing limitations in OOD generalization for human-robot interaction and manipulation. The public release of code and the ReasonAff dataset strengthens reproducibility and enables follow-up work.
major comments (2)
- [Section 3 (reward function and GRPO training)] The central claim of genuine zero-shot OOD generalization and emergent reasoning rests on the custom affordance reward function (format + perception + cognition) providing unbiased optimization signals. However, the manuscript does not include ablations or distributional analysis showing that the cognition or perception rewards are independent of patterns in the ReasonAff training distribution (e.g., object/action correlations or LLM-judge biases). This is load-bearing: without such evidence, the policy may optimize for reward-maximizing output formats that succeed on train-like data rather than acquiring transferable reasoning, directly undermining the no-explicit-reasoning-data generalization assertion.
- [Section 4 (experiments and results)] While the abstract and introduction assert comprehensive experiments and outperformance, the quantitative support for robust zero-shot generalization (e.g., specific OOD metrics, error analysis, or comparisons isolating the RL component) is not presented in enough detail to evaluate the claim against the skeptical concern that the model is pattern-matching on ReasonAff. This weakens the evidential basis for the strongest claim.
minor comments (2)
- [Section 3.2] Notation for the affordance reward components could be clarified with explicit equations or pseudocode to distinguish how each term is computed (e.g., whether perception reward uses ground-truth annotations or model predictions).
- [Section 4.1] The manuscript would benefit from a clearer statement of the baseline methods and exact evaluation protocols (e.g., IoU thresholds for affordance grounding) to facilitate direct comparison.
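To make the evaluation-protocol request concrete, here is a minimal sketch of a thresholded IoU metric. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` form and a default threshold of 0.5; the page does not say whether the paper scores boxes or masks, or which thresholds it uses, so both choices and the function names are illustrative only.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with ground truth meets the threshold."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)
```

Stating the threshold explicitly matters because the same predictions can look strong at 0.3 and weak at 0.7, which makes cross-paper comparisons unreliable when the protocol is left implicit.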
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments on our manuscript. We have carefully reviewed the major concerns and provide detailed point-by-point responses below. We agree that additional evidence and detail will strengthen the presentation of our claims regarding zero-shot generalization and emergent reasoning in Affordance-R1.
Point-by-point responses
- Referee: [Section 3 (reward function and GRPO training)] The central claim of genuine zero-shot OOD generalization and emergent reasoning rests on the custom affordance reward function (format + perception + cognition) providing unbiased optimization signals. However, the manuscript does not include ablations or distributional analysis showing that the cognition or perception rewards are independent of patterns in the ReasonAff training distribution (e.g., object/action correlations or LLM-judge biases). This is load-bearing: without such evidence, the policy may optimize for reward-maximizing output formats that succeed on train-like data rather than acquiring transferable reasoning, directly undermining the no-explicit-reasoning-data generalization assertion.
  Authors: We agree that verifying the independence of the reward components from training distribution patterns is essential to support the claims of emergent, transferable reasoning. In the revised manuscript, we will add ablations isolating the format, perception, and cognition rewards and report their individual effects on OOD generalization performance. We will also include a distributional analysis comparing reward signals and output patterns on in-distribution versus OOD samples, along with checks for correlations with object-action pairs or potential LLM-judge biases in ReasonAff. These additions will help demonstrate that the rewards encourage generalizable cognitive processes rather than dataset-specific optimization.
  Revision: yes
- Referee: [Section 4 (experiments and results)] While the abstract and introduction assert comprehensive experiments and outperformance, the quantitative support for robust zero-shot generalization (e.g., specific OOD metrics, error analysis, or comparisons isolating the RL component) is not presented in enough detail to evaluate the claim against the skeptical concern that the model is pattern-matching on ReasonAff. This weakens the evidential basis for the strongest claim.
  Authors: We acknowledge that the experiments section would benefit from expanded quantitative detail to more convincingly address concerns about potential pattern-matching. In the revision, we will augment the results with finer-grained OOD metrics stratified by domain shift categories, a systematic error analysis with categorized failure modes, and targeted ablations that isolate the GRPO reinforcement learning component (e.g., comparisons against supervised fine-tuning baselines and non-RL variants). These enhancements will provide stronger evidential support for the zero-shot generalization and emergent reasoning assertions.
  Revision: yes
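The promised ablations and distributional analysis could be organized along the following lines. This is a sketch under stated assumptions: a leave-one-out grid over the three reward components with binary on/off weights, and a per-component comparison of mean reward on in-distribution versus OOD samples. The names `ablation_configs` and `component_gaps` and the dict-per-sample layout are hypothetical and do not come from the authors' code.

```python
COMPONENTS = ("format", "perception", "cognition")

def ablation_configs(components=COMPONENTS):
    """Build a leave-one-out grid: the full reward plus one run per dropped component."""
    configs = {"full": {c: 1.0 for c in components}}
    for dropped in components:
        configs[f"no_{dropped}"] = {
            c: (0.0 if c == dropped else 1.0) for c in components
        }
    return configs

def component_gaps(id_rewards, ood_rewards, components=COMPONENTS):
    """Mean in-distribution reward minus mean OOD reward, per component.

    Each argument is a non-empty list of dicts mapping component name to reward.
    """
    def avg(rows, name):
        return sum(row[name] for row in rows) / len(rows)

    return {c: avg(id_rewards, c) - avg(ood_rewards, c) for c in components}
```

A component whose reward stays high in-distribution but collapses on OOD samples would be the first place to look for the dataset-specific shortcuts the referee describes.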
Circularity Check
No significant circularity in claimed generalization from RL training
full rationale
The paper trains via GRPO reinforcement learning using an externally hand-designed affordance reward function (format + perception + cognition components) on the constructed ReasonAff dataset, without supervised reasoning traces. The claimed zero-shot OOD generalization and emergent test-time reasoning are presented as empirical outcomes of this optimization process rather than quantities defined in terms of the inputs or fitted directly to evaluation metrics. No self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear in the provided text; the reward signal is independent of the target generalization metric.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The affordance function with format, perception, and cognition rewards guides optimization directions effectively.
Forward citations
Cited by 3 Pith papers
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
- OneThinker: All-in-one Reasoning Model for Image and Video
  OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
- Affordance Agent Harness: Verification-Gated Skill Orchestration
  Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...