pith. machine review for the scientific record. sign in

arxiv: 2604.22238 · v1 · submitted 2026-04-24 · 💻 cs.RO

Recognition: unknown

CodeGraphVLP: Code-as-Planner Meets Semantic-Graph State for Non-Markovian Vision-Language-Action Models

Authors on Pith no claims yet

Pith reviewed 2026-05-08 11:37 UTC · model grok-4.3

classification 💻 cs.RO
keywords non-Markovian taskssemantic-graph statecode-based plannervision-language-action modelslong-horizon manipulationpartial observabilityprogress checksrobot task planning
0
0 comments X

The pith

CodeGraphVLP pairs a persistent semantic-graph state with a code planner to succeed more often on non-Markovian robot tasks than standard VLA models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models typically treat the latest camera image as enough for the next action, but this breaks when evidence is hidden, appears only earlier, or is buried in clutter. CodeGraphVLP keeps a running semantic graph of objects and relations that survives partial views, then runs an executable code planner over the graph to check progress and emit a focused subtask plus the key objects involved. The planner output is used to build a cleaned observation that directs the VLA executor toward what actually matters. The result on real long-horizon tasks is higher completion rates than both plain VLA baselines and history-augmented variants, plus much lower planning latency than methods that loop back to a vision-language model at every step.

Core claim

The framework maintains task-relevant entities and relations in a semantic-graph state under partial observability. An executable code planner runs progress checks on this graph and produces subtask instructions together with relevant objects; these outputs drive construction of clutter-suppressed observations that focus the VLA executor on critical evidence.

What carries the argument

Persistent semantic-graph state combined with an executable code-based planner that performs progress checks and generates focused subtask instructions.

If this is right

  • On real-world non-Markovian tasks the method improves task completion over strong VLA baselines and history-enabled variants.
  • It substantially lowers planning latency compared with VLM-in-the-loop planning.
  • Ablation studies confirm that the graph state, code planner, and progress-guided prompting each contribute to the gains.
  • The hierarchical loop lets the VLA executor operate on observations that suppress irrelevant clutter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit graph-based memory may let other short-horizon learned policies operate in partially observable environments without retraining.
  • Replacing repeated large-model queries with code execution over a compact state could reduce both latency and compute cost in deployed systems.
  • The same graph-plus-code structure might extend to tasks such as sequential assembly or multi-room navigation where order and persistence matter.

Load-bearing premise

The semantic-graph state can reliably track task-relevant entities and relations even when parts of the scene are occluded or cluttered.

What would settle it

An experiment in which the semantic graph is replaced by a simple buffer of recent observations while keeping the rest of the pipeline identical, and which then shows no gain in task completion, would indicate that the graph representation itself is not carrying the reported benefit.

Figures

Figures reproduced from arXiv: 2604.22238 by Anh Nguyen, Anthony Gunderman, Bui Duy Quoc Nghi, Chase Rainwater, Duy Nguyen, Khoa Vo, Minh Vu, Ngan Le, Sieu Tran, Taisei Hanyu, Yuki Ikebe.

Figure 1
Figure 1. Figure 1: Architectures for non-Markovian long-horizon manipulation. (a) Memory-augmented VLA equips a short-horizon policy with memory context, offering moderately efficient progress checks and limited robustness for action reasoning in clutter. (b) Hierarchical VLM–VLA uses a VLM planner to reason about subtasks and guide a VLA policy with subtask-level cues, improving clutter robustness but incurring highlatency … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of CodeGraphVLP view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative rollouts of CodeGraphVLP on our three real-world tasks (Pick-and-Place Twice, Place-and-Stack, and Swap Cups). For each task, we show multi-view RGB inputs with the overall instruction, the semantic-graph state Gt, and the progress-guided prompts used by the VLA policy: clutter-free visual cues that retain only subtask-relevant objects and the planner-produced subtask language cues. inspired by… view at source ↗
Figure 4
Figure 4. Figure 4: Robot experimental setup on a UR10e manipulator with a parallel view at source ↗
read the original abstract

Vision-Language-Action (VLA) models promise generalist robot manipulation, but are typically trained and deployed as short-horizon policies that assume the latest observation is sufficient for action reasoning. This assumption breaks in non-Markovian long-horizon tasks, where task-relevant evidence can be occluded or appear only earlier in the trajectory, and where clutter and distractors make fine-grained visual grounding brittle. We present CodeGraphVLP, a hierarchical framework that enables reliable long-horizon manipulation by combining a persistent semantic-graph state with an executable code-based planner and progress-guided visual-language prompting. The semantic-graph maintains task-relevant entities and relations under partial observability. The synthesized planner executes over this semantic-graph to perform efficient progress checks and outputs a subtask instruction together with subtask-relevant objects. We use these outputs to construct clutter-suppressed observations that focus the VLA executor on critical evidence. On real-world non-Markovian tasks, CodeGraphVLP improves task completion over strong VLA baselines and history-enabled variants while substantially lowering planning latency compared to VLM-in-the-loop planning. We also conduct extensive ablation studies to confirm the contributions of each component.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CodeGraphVLP, a hierarchical framework that augments vision-language-action (VLA) models with a persistent semantic-graph state representation, an executable code-based planner for progress checks and subtask decomposition, and progress-guided visual-language prompting to focus the VLA executor on relevant evidence. It claims that this integration enables reliable performance on non-Markovian long-horizon manipulation tasks under partial observability by maintaining task-relevant entities and relations, yielding higher task completion rates than strong VLA baselines and history-enabled variants while reducing planning latency relative to VLM-in-the-loop approaches, with supporting ablation studies.

Significance. If the performance claims are substantiated, the work would be significant for bridging symbolic planning with neural VLA policies in robotics, offering a practical way to handle non-Markovian dependencies and clutter that currently limit short-horizon VLA deployment. The extensive ablation studies are a clear strength, as they directly test component contributions rather than relying on end-to-end black-box gains.

major comments (2)
  1. [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.
  2. [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.
minor comments (2)
  1. [Method] Notation for the semantic-graph (entities, relations, update function) could be formalized with a small table or pseudocode to improve clarity for readers unfamiliar with the exact representation.
  2. [Ablation studies] The ablation studies are mentioned but would benefit from a dedicated table summarizing the contribution of each component (graph, code planner, prompting) with the same metrics used in the main results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have reviewed the major comments carefully and provide point-by-point responses below. Where the comments identify gaps in quantitative detail or analysis, we agree and commit to revisions that will strengthen the paper without altering its core contributions.

read point-by-point responses
  1. Referee: [Results section] Results section (and abstract performance claims): the manuscript reports qualitative improvements in task completion and latency but provides no exact quantitative metrics (e.g., success rates, latency values in seconds), baseline implementations with citations or code, error bars, or statistical significance tests. This absence is load-bearing because the central claim of superiority over VLA baselines and history variants cannot be verified or sized without these details.

    Authors: We acknowledge that the current manuscript presents performance improvements in summarized form in the abstract and results section without a dedicated table of exact metrics, error bars, or statistical tests, which limits independent verification. We will revise the results section to include a table with precise quantitative values from our real-world experiments (task success rates as percentages with standard deviations across repeated trials, planning latency in seconds with comparisons), full citations to the baseline VLA models and history variants used, descriptions of their implementations, and statistical significance tests (e.g., p-values from appropriate tests). This will allow the claims of superiority to be fully sized and verified. revision: yes

  2. Referee: [Method / Semantic-graph construction] Semantic-graph state description (likely §3.2 or Method): no error rates, drift measurements, update rules, or failure-mode analysis are supplied for how the graph is constructed and maintained from VLM observations under partial observability, occlusions, and distractors. This directly undermines the weakest assumption that the graph reliably supports accurate progress checks by the code planner.

    Authors: We agree that the method section describes semantic-graph construction from VLM observations but does not supply quantitative error rates, drift analysis, explicit update rules, or failure-mode discussion under partial observability and distractors. We will revise §3.2 (and add an appendix if needed) to include the update rules for graph maintenance (e.g., how new observations are merged or used to correct entities/relations), empirical error rates measured during our experiments (such as entity detection precision/recall in occluded or cluttered scenes), and a failure-mode analysis with concrete examples. This will directly support the reliability of the graph for the code planner's progress checks. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework integration with ablations

full rationale

The paper introduces CodeGraphVLP as a hierarchical integration of a persistent semantic-graph state, executable code planner, and progress-guided VLA prompting for non-Markovian tasks. All claims rest on real-world task completion rates, latency measurements, and ablation studies that isolate component contributions. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear; the semantic-graph maintenance and planner outputs are presented as engineered components whose reliability is evaluated externally rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that semantic graphs can be maintained accurately despite occlusion and that code execution can efficiently monitor progress; no free parameters or invented entities with independent evidence are detailed.

axioms (1)
  • domain assumption Semantic-graph state maintains task-relevant entities and relations under partial observability
    Invoked as the foundation for handling non-Markovian tasks and enabling progress checks in the hierarchical framework.
invented entities (2)
  • Semantic-graph state no independent evidence
    purpose: Persistent tracking of entities and relations for subtask planning and observation focusing
    Introduced as a core new component to overcome limitations of Markovian VLA assumptions.
  • Code-based planner no independent evidence
    purpose: Synthesizes executable plans over the graph to output subtasks and relevant objects
    Presented as the mechanism for efficient progress monitoring and instruction generation.

pith-pipeline@v0.9.0 · 5545 in / 1269 out tokens · 66580 ms · 2026-05-08T11:37:49.770256+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 10 canonical work pages · 9 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finnet al., “Rt-1: Robotics transformer for real-world control at scale,” inarXiv preprint arXiv:2212.06817, 2022

  2. [2]

    Rt-2: Vision- language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xiaet al., “Rt-2: Vision- language-action models transfer web knowledge to robotic control,” in CoRL, 2023, pp. 2165–2183

  3. [3]

    Octo: An open-source generalist robot policy,

    Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees et al., “Octo: An open-source generalist robot policy,” inProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

  4. [4]

    Vision-language foundation models as effective robot imitators,

    X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wuet al., “Vision-language foundation models as effective robot imitators,” inICLR, 2024

  5. [5]

    OpenVLA: An open-source vision-language-action model,

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair et al., “OpenVLA: An open-source vision-language-action model,” in CoRL, 2024

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finnet al., “π0: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong et al., “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

  8. [8]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driesset al., “π 0.5: a vision-language-action model with open- world generalization,”arXiv preprint arXiv:2504.16054, 2025

  9. [9]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fanet al., “Gr00t n1: An open foundation model for generalist humanoid robots,” arXiv preprint arXiv:2503.14734, 2025

  10. [10]

    Bridgedata v2: A dataset for robot learning at scale,

    H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen- Estruchet al., “Bridgedata v2: A dataset for robot learning at scale,” inCoRL. PMLR, 2023, pp. 1723–1736

  11. [11]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0,

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Leeet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0,” inICRA. IEEE, 2024, pp. 6892–6903

  12. [12]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karam- chetiet al., “Droid: A large-scale in-the-wild robot manipulation dataset,”arXiv preprint arXiv:2403.12945, 2024

  13. [13]

    Visually grounding lanuage instruction for history-dependent manipulation,

    H. Ahn, O. Kwon, K. Kim, J. Jeong, H. Jun, H. Leeet al., “Visually grounding lanuage instruction for history-dependent manipulation,” in ICRA, May 2022

  14. [14]

    Memer: Scaling up memory for robot control via experience retrieval,

    A. Sridhar, J. Pan, S. Sharma, and C. Finn, “Memer: Scaling up memory for robot control via experience retrieval,” 2025

  15. [15]

    HAMLET: Switch your vision-language-action model into a history-aware policy,

    M. Koo, D. Choi, T. Kim, K. Lee, C. Kim, Y . Seoet al., “HAMLET: Switch your vision-language-action model into a history-aware policy,” inICLR, 2026

  16. [16]

    MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation,

    H. Shi, B. Xie, Y . Liu, L. Sun, F. Liu, T. Wanget al., “MemoryVLA: Perceptual-cognitive memory in vision-language-action models for robotic manipulation,” inICLR, 2026

  17. [17]

    Hi robot: Open-ended instruction following with hierarchical vision-language-action models,

    L. X. Shi, brian ichter, M. R. Equi, L. Ke, K. Pertsch, Q. Vuong et al., “Hi robot: Open-ended instruction following with hierarchical vision-language-action models,” inICML, 2025

  18. [18]

    Thinkact: Vision-language-action reasoning via reinforced visual la- tent planning,

    C.-P. Huang, Y .-H. Wu, M.-H. Chen, Y .-C. F. Wang, and F.-E. Yang, “Thinkact: Vision-language-action reasoning via reinforced visual la- tent planning,” inNeurIPS, 2025

  19. [19]

    arXiv preprint arXiv:2601.09708 (2026) 3

    C.-P. Huang, Y . Man, Z. Yu, M.-H. Chen, J. Kautz, Y .-C. F. Wang et al., “Fast-thinkact: Efficient vision-language-action reasoning via verbalizable latent planning,”arXiv preprint arXiv:2601.09708, 2026

  20. [20]

    Towards long-horizon vision-language-action system: Reasoning, acting and memory,

    D. Li, Y . Zhang, M. Cao, D. Liu, W. Xie, T. Huiet al., “Towards long-horizon vision-language-action system: Reasoning, acting and memory,” inICCV, 2025, pp. 6839–6848

  21. [21]

    Code as policies: Language model programs for embodied control,

    J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichteret al., “Code as policies: Language model programs for embodied control,” inICRA, 2023, pp. 9493–9500

  22. [22]

    Flamingo: a visual language model for few-shot learning,

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson et al., “Flamingo: a visual language model for few-shot learning,” inNeurIPS, vol. 35, 2022

  23. [23]

    Palm-e: an embodied multimodal language model,

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter et al., “Palm-e: an embodied multimodal language model,” inICML, ser. ICML’23. JMLR.org, 2023

  24. [24]

    On scaling up a multilingual vision and language model,

    X. Chen, J. Djolonga, P. Padlewski, B. Mustafa, S. Changpinyo, J. Wu et al., “On scaling up a multilingual vision and language model,” in CVPR, June 2024, pp. 14 432–14 444

  25. [25]

    Prismatic vlms: Investigating the design space of visually- conditioned language models,

    S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually- conditioned language models,” inICML, 2024

  26. [26]

    PaliGemma: A versatile 3B VLM for transfer

    L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz et al., “Paligemma: A versatile 3b vlm for transfer,”arXiv preprint arXiv:2407.07726, 2024

  27. [27]

    Flow matching for generative modeling,

    Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” inICLR, 2023

  28. [28]

    Moka: Open-world robotic manipulation through mark-based visual prompting,

    K. Fang, F. Liu, P. Abbeel, and S. Levine, “Moka: Open-world robotic manipulation through mark-based visual prompting,”Robotics: Science and Systems (RSS), 2024

  29. [29]

    Inner monologue: Embodied reasoning through planning with language models,

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florenceet al., “Inner monologue: Embodied reasoning through planning with language models,” inCoRL, 2022

  30. [30]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,”arXiv preprint arXiv:2407.08693, 2024

  31. [31]

    Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,

    K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suen- derhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable task planning,” inCoRL, 2023

  32. [32]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,

    W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” inCoRL. PMLR, 2025

  33. [33]

    Learning fine-grained bimanual manipulation with low-cost hardware,

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” inRobotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023

  34. [34]

    Diffusion policy: Visuomotor policy learning via action diffusion,

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfielet al., “Diffusion policy: Visuomotor policy learning via action diffusion,” The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

  35. [35]

    Yoloe: Real- time seeing anything,

    A. Wang, L. Liu, H. Chen, Z. Lin, J. Han, and G. Ding, “Yoloe: Real- time seeing anything,” inICCV, October 2025, pp. 24 591–24 602

  36. [36]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,”arXiv preprint arXiv:2310.11441, 2023

  37. [37]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal et al., “Learning transferable visual models from natural language supervision,” inICML. PMLR, 2021

  38. [38]

    Putting the object back into video object segmentation,

    H. K. Cheng, S. W. Oh, B. Price, J.-Y . Lee, and A. Schwing, “Putting the object back into video object segmentation,” inCVPR, 2024, pp. 3151–3161

  39. [39]

    Progprompt: Generating situated robot task plans using large language models,

    I. Singh, V . Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay et al., “Progprompt: Generating situated robot task plans using large language models,” inICRA, 2023, pp. 11 523–11 530

  40. [40]

    Run-time observation interventions make vision-language-action models more visually ro- bust,

    A. J. Hancock, A. Z. Ren, and A. Majumdar, “Run-time observation interventions make vision-language-action models more visually ro- bust,” inICRA, 2025, pp. 9499–9506

  41. [41]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wanget al., “LoRA: Low-rank adaptation of large language models,” inICLR, 2022

  42. [42]

    Gpt-4 technical report,

    OpenAI, “Gpt-4 technical report,” 2024