RobotEQ: Transitioning from Passive Intelligence to Active Intelligence in Embodied AI
Pith reviewed 2026-05-08 09:07 UTC · model grok-4.3
The pith
The first benchmark for active intelligence shows embodied AI models still cannot reliably follow social norms without explicit instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RobotEQ is introduced as the first benchmark for active intelligence, which enables robots to judge permissible actions based on social norms in embodied settings absent explicit instructions. The accompanying RobotEQ-Data contains 1,900 egocentric images across 10 categories and 56 subcategories, annotated with 5,353 action judgment questions and 1,286 spatial grounding questions. RobotEQ-Bench applies this to assess state-of-the-art models, finding they fall short particularly in spatial grounding while benefiting from retrieval-augmented generation with social norm knowledge.
What carries the argument
The RobotEQ benchmark, built on the RobotEQ-Data dataset of manually annotated egocentric images and questions about permissible robot actions and spatial grounding, together with the RobotEQ-Bench evaluation protocol.
If this is right
- Existing models cannot yet achieve reliable active intelligence in embodied scenarios.
- Performance is weakest on spatial grounding tasks that require understanding physical constraints in context.
- Incorporating external social norm knowledge via retrieval techniques generally improves adherence to permissible actions.
- This benchmark can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
Where Pith is reading between the lines
- Robots with effective active intelligence could manage unexpected situations in homes or public spaces with less human oversight.
- Expanding the benchmark to include dynamic video sequences or multi-turn interactions would better test real-time norm following.
- The spatial grounding weakness points to a broader need for tighter coupling between visual perception and normative reasoning in embodied models.
Load-bearing premise
The manually annotated action judgments and spatial questions in RobotEQ-Data accurately and comprehensively represent real-world social norms and permissible robot behaviors across diverse embodied scenarios.
What would settle it
A physical robot running a model that scores high on RobotEQ-Bench is deployed in varied human environments and observed for the frequency of actions that violate social norms when no instructions are given.
Original abstract
Embodied AI is a prominent research topic in both academia and industry. Current research centers on completing tasks based on explicit user instructions. However, for robots to integrate into human society, they must understand which actions are permissible and which are prohibited, even without explicit commands. We refer to the user-guided AI as passive intelligence and the unguided AI as active intelligence. This paper introduces RobotEQ, the first benchmark for active intelligence, aiming to assess whether existing models can comprehend and adhere to social norms in embodied scenarios. First, we construct RobotEQ-Data, a dataset consisting of 1,894 egocentric images, spanning 10 representative embodied categories and 56 subcategories. Through extensive manual annotation, we provide 4,944 action judgment questions and 1,157 spatial grounding questions, specifying appropriate robot actions across diverse scenarios. Furthermore, we establish RobotEQ-Bench to evaluate the performance of state-of-the-art models on this task. Experimental results demonstrate that current models still fall short in achieving reliable active intelligence, particularly in spatial grounding. Meanwhile, leveraging RAG techniques to incorporate external social norm knowledge bases can generally enhance performance. This work can facilitate the transition of robotics from user-guided passive manipulation to active social compliance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RobotEQ as the first benchmark for 'active intelligence' in embodied AI, defined as the ability of robots to comprehend and adhere to social norms without explicit user instructions (contrasted with 'passive intelligence' for user-guided tasks). It constructs RobotEQ-Data from 1,900 egocentric images across 10 categories and 56 subcategories, providing 5,353 manually annotated action judgment questions and 1,286 spatial grounding questions. RobotEQ-Bench evaluates state-of-the-art models, reporting underperformance (especially in spatial grounding) that can be mitigated by RAG with external social norm knowledge bases. The work positions this as facilitating a transition to active social compliance in robotics.
Significance. If the annotations reliably capture generalizable social norms across embodied scenarios, RobotEQ could provide a valuable standardized benchmark for evaluating and improving social awareness in embodied AI, addressing a gap beyond explicit task completion. The empirical findings on model limitations and RAG benefits offer concrete directions for future work. The introduction of the 'active intelligence' framing, while novel, would benefit from stronger ties to existing literature on ethical robotics and value alignment.
major comments (2)
- [RobotEQ-Data construction] §3 (RobotEQ-Data construction): The dataset relies on 'extensive manual annotation' to create the 5,353 action judgment and 1,286 spatial grounding questions, but reports no inter-annotator agreement metrics, annotator demographics, or external validation against established ethical corpora or incident databases. This is load-bearing for the central claim that RobotEQ measures active intelligence, as social norms are culturally variable and the benchmark's validity as a faithful proxy depends on annotation reliability and generalizability.
- [RobotEQ-Bench evaluation] §4 (RobotEQ-Bench evaluation): The results claim that current models 'fall short' in active intelligence and that RAG 'can generally enhance performance,' but provide no specific quantitative metrics (e.g., accuracy or F1 scores per category), baseline model details, or error analysis across the question sets. This limits verification of the underperformance extent and the improvement magnitude, weakening the empirical support for the benchmark's utility.
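The per-category breakdown requested in the second major comment can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: it assumes action judgment is scored as a binary label ("permissible" vs. "prohibited", the two terms used in the abstract), and the function name `per_category_scores`, the record layout, and the category names in the usage example are hypothetical.

```python
from collections import defaultdict


def per_category_scores(records, positive="permissible"):
    """Compute accuracy and F1 per category.

    records: iterable of (category, gold_label, predicted_label) tuples.
    Assumption: labels are binary, with `positive` as the positive class.
    """
    tally = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for category, gold, pred in records:
        cell = tally[category]
        if gold == positive and pred == positive:
            cell["tp"] += 1
        elif gold != positive and pred == positive:
            cell["fp"] += 1
        elif gold == positive and pred != positive:
            cell["fn"] += 1
        else:
            cell["tn"] += 1

    scores = {}
    for category, cell in tally.items():
        total = sum(cell.values())
        accuracy = (cell["tp"] + cell["tn"]) / total
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0.
        denom = 2 * cell["tp"] + cell["fp"] + cell["fn"]
        f1 = 2 * cell["tp"] / denom if denom else 0.0
        scores[category] = {"accuracy": accuracy, "f1": f1, "n": total}
    return scores


# Hypothetical records; the category names are placeholders, not the paper's.
example = [
    ("kitchen", "permissible", "permissible"),
    ("kitchen", "prohibited", "permissible"),
    ("kitchen", "prohibited", "prohibited"),
    ("hospital", "permissible", "prohibited"),
]
example_scores = per_category_scores(example)
```

Reporting `n` next to each score makes it visible when a category has too few questions for its accuracy or F1 to be meaningful.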
minor comments (1)
- [Abstract and Introduction] The abstract and introduction assert RobotEQ is the 'first benchmark' for active intelligence without citing or contrasting against prior datasets on social norms, ethical decision-making, or value alignment in robotics/AI.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating revisions where appropriate to strengthen the work.
- Referee: The dataset relies on 'extensive manual annotation' to create the 5,353 action judgment and 1,286 spatial grounding questions, but reports no inter-annotator agreement metrics, annotator demographics, or external validation against established ethical corpora or incident databases. This is load-bearing for the central claim that RobotEQ measures active intelligence, as social norms are culturally variable and the benchmark's validity as a faithful proxy depends on annotation reliability and generalizability.
  Authors: We agree that inter-annotator agreement metrics are essential to demonstrate annotation reliability, given the cultural variability of social norms. In the revised manuscript, we will report Fleiss' kappa scores computed on a 10% re-annotated subset of the questions. We will also add a description of annotator demographics, noting that the team consisted of researchers with expertise in robotics and AI ethics. For external validation, we will expand the discussion to explicitly map our 10 categories and 56 subcategories to established social norm frameworks from ethical robotics literature (e.g., value alignment studies), while acknowledging this as an area for future work rather than claiming full external corpus validation.
  Revision: yes
- Referee: The results claim that current models 'fall short' in active intelligence and that RAG 'can generally enhance performance,' but provide no specific quantitative metrics (e.g., accuracy or F1 scores per category), baseline model details, or error analysis across the question sets. This limits verification of the underperformance extent and the improvement magnitude, weakening the empirical support for the benchmark's utility.
  Authors: We will revise §4 to include a detailed results table reporting accuracy and F1 scores broken down by the 10 categories (and where feasible, subcategories) for both action judgment and spatial grounding tasks. We will explicitly list the evaluated models (including versions and prompting details) and add a dedicated error analysis subsection identifying common failure modes, such as spatial mis-grounding and norm misinterpretation. These changes will provide verifiable quantitative support for the reported underperformance and RAG benefits.
  Revision: yes
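The Fleiss' kappa mentioned in the first response measures agreement among a fixed number of annotators beyond what chance would produce. A minimal sketch of the standard formula follows; the function name `fleiss_kappa`, the input layout, and the example labels are assumptions made for illustration and do not describe the authors' annotation pipeline.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each labeled by the same number of raters.

    ratings: list of items; each item is a list of labels, one per rater.
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who assigned category j to item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall proportion of each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical example: three annotators, two questions, full agreement.
labels = ["permissible", "prohibited"]
full_agreement = [["permissible"] * 3, ["prohibited"] * 3]
kappa = fleiss_kappa(full_agreement, labels)
```

A value of 1 indicates perfect agreement, values near 0 indicate agreement no better than chance, and negative values indicate agreement worse than chance.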
Circularity Check
No circularity: empirical benchmark construction with independent annotations
The paper presents RobotEQ as an empirical benchmark for active intelligence via manual annotation of 1,900 images into 5,353 action judgments and 1,286 spatial questions across 10 categories. No mathematical derivations, equations, fitted parameters, or predictions appear in the provided text. The central claims rest on dataset construction and model evaluation, which do not reduce to self-citations, self-definitions, or inputs by construction. This is a standard benchmark-creation effort whose validity can be assessed externally against real-world norms or inter-annotator metrics, with no load-bearing step that collapses into its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Social norms can be consistently defined and manually annotated as appropriate or inappropriate robot actions in embodied scenarios.
invented entities (1)
- active intelligence: no independent evidence