hub

Moka: Open-world robotic manipulation through mark-based visual prompting

Fangchen Liu, Kuan Fang, Pieter Abbeel, Sergey Levine · 2024 · arXiv 2403.03174

13 Pith papers cite this work. Polarity classification is still indexing.

13 Pith papers citing it

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Point Tracking Improves World Action Models

cs.RO · 2026-05-22 · unverdicted · novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

cs.RO · 2026-04-08 · unverdicted · novelty 7.0

KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

cs.RO · 2024-09-03 · conditional · novelty 7.0

ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance

cs.RO · 2026-05-22 · unverdicted · novelty 6.0

Afford-VLA internalizes task-conditioned affordance as an explicit visual planning interface within VLA models via learnable <AFF> tokens, achieving SOTA on LIBERO and SimplerEnv benchmarks.

TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

cs.RO · 2026-04-10 · unverdicted · novelty 6.0

V-CAGE automates the creation of scalable, high-quality robotic manipulation datasets through context-aware scene construction, closed-loop visual verification, and perceptually-driven compression.

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

cs.RO · 2026-02-14 · unverdicted · novelty 6.0

SCFields fuses semantics and contact data in a sim-to-real pipeline to enable category-level generalization for tactile tool manipulation with diffusion policies.

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

cs.RO · 2026-01-11 · unverdicted · novelty 6.0

PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

cs.RO · 2025-08-19 · conditional · novelty 6.0

Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

cs.RO · 2024-12-13 · conditional · novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

cs.RO · 2026-05-11 · conditional · novelty 4.0

SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.

From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

cs.RO · 2026-04-04

citing papers explorer

Showing 1 of 1 citing paper after filters.

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation cs.RO · 2025-08-19 · conditional · none · ref 23
Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.

Moka: Open-world robotic manipulation through mark-based visual prompting

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer