Plan Right, Then Plan Tight: Symbolic RL for Efficient Embodied Reasoning

Lujie Yin; Xiangli Shi; Xiaomeng Zhu; Ye Tian; Yuchun Guo; Yufei Huang; Yuxuan Zhou; Ziyang Sun

arxiv: 2606.31260 · v1 · pith:DOYPZTPJnew · submitted 2026-06-30 · 💻 cs.RO


Pith reviewed 2026-07-01 05:18 UTC · model grok-4.3


keywords: embodied task planning · BDDL specification · symbolic verification · reinforcement learning · GroupAdapt schedule · BEHAVIOR-1000 · plan verification · video-to-BDDL parser

The pith

A single BDDL specification serves as the shared interface for data construction, plan verification, and reward design in embodied task planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that embodied planners can obtain cheap, deterministic supervision by deriving one formal BDDL specification from video evidence or curated tasks. This specification feeds a video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine that returns dense feedback in milliseconds. A difficulty-aware length schedule called GroupAdapt then uses the current in-batch pass rate to give harder prompts more token budget and tightens it automatically as success improves. Together these components let an 8B model reach 97.3 Strict-Pass on BEHAVIOR-1000 while shortening outputs by 79 percent.

Core claim

A single BDDL specification, automatically constructed from open-world video evidence or curated tasks, can serve as a shared interface for data construction, plan verification, and reward design. A video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine together supply dense feedback at millisecond latency. GroupAdapt, a difficulty-aware length schedule that uses the in-batch group pass rate as a zero-cost signal, grants hard prompts wider length tolerance that tightens as their pass rate improves. Under this guidance the 8B planner attains a Strict-Pass score of 97.3 on BEHAVIOR-1000, a 25.9 percent relative improvement over the Qwen3-8B baseline that also exceeds the stro

What carries the argument

The BDDL specification, used as a shared interface together with the LLM verifier and the GroupAdapt length schedule, supplies deterministic rewards and adaptive token budgets without full simulation.
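To make the idea of a millisecond-scale symbolic check concrete, here is a minimal sketch of a replay engine. It is not the paper's engine: the state representation (a set of ground predicate tuples), the three action names (`navigate`, `grasp`, `place_inside`), their preconditions, and the object names are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple

Predicate = Tuple[str, ...]            # e.g. ("ontop", "candle_1", "table_1")
Action = Tuple[str, Tuple[str, ...]]   # e.g. ("grasp", ("candle_1",))


@dataclass
class World:
    facts: Set[Predicate]
    holding: Optional[str] = None   # assumed single-arm robot: one held object
    at: Optional[str] = None        # object the robot is currently next to


def step(world: World, name: str, args: Tuple[str, ...]) -> Optional[str]:
    """Apply one action in place; return an error string, or None on success."""
    if name == "navigate":
        world.at = args[0]
        return None
    if name == "grasp":
        (obj,) = args
        if world.holding is not None:
            return f"hand is not free (holding {world.holding})"
        support = next((f for f in world.facts
                        if f[0] in ("ontop", "inside") and f[1] == obj), None)
        if support is None or support[2] != world.at:
            return f"{obj} is not reachable from {world.at}"
        world.facts.discard(support)
        world.holding = obj
        return None
    if name == "place_inside":
        obj, container = args
        if world.holding != obj:
            return f"not holding {obj}"
        if world.at != container:
            return f"not next to {container}"
        world.facts.add(("inside", obj, container))
        world.holding = None
        return None
    return f"unknown action {name}"


def verify(init: Iterable[Predicate], goal: List[Predicate],
           plan: List[Action]) -> dict:
    """Replay the plan symbolically and report goal coverage."""
    world = World(facts=set(init))
    error = None
    for i, (name, args) in enumerate(plan):
        error = step(world, name, args)
        if error is not None:
            error = f"step {i}: {error}"
            break
    satisfied = sum(g in world.facts for g in goal)
    fraction = satisfied / len(goal) if goal else 1.0
    return {"passed": error is None and satisfied == len(goal),
            "goal_fraction": fraction, "error": error}


# Hypothetical one-object task.
init = {("ontop", "candle_1", "table_1")}
goal = [("inside", "candle_1", "basket_1")]
good_plan = [("navigate", ("table_1",)), ("grasp", ("candle_1",)),
             ("navigate", ("basket_1",)), ("place_inside", ("candle_1", "basket_1"))]
bad_plan = [("grasp", ("candle_1",))]  # never navigated to table_1

good = verify(init, goal, good_plan)
bad = verify(init, goal, bad_plan)
```

Because the check is a pure function of the specification and the plan, the same call can serve as a data filter, a pass/fail verifier, and (through `goal_fraction`) a dense reward term.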

If this is right

Planners receive dense deterministic rewards at millisecond latency instead of waiting for full simulation rollouts.
Response length can be compressed by nearly 80 percent while raising Strict-Pass rates above both same-size and larger baselines.
Smaller 8B models can surpass the strongest large-model baselines on household-task benchmarks.
In-batch pass rate supplies a zero-cost difficulty signal that automatically widens or tightens length tolerance during training.
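The page does not give the GroupAdapt update rule, so the following is only a sketch of one way a pass-rate-driven length schedule could look. The linear interpolation, the `tight` and `loose` budgets, and the per-token overflow penalty are assumptions chosen for illustration, not values from the paper.

```python
from typing import Sequence


def group_pass_rate(passes: Sequence[bool]) -> float:
    """Fraction of rollouts for one prompt that the verifier accepted."""
    if not passes:
        raise ValueError("need at least one rollout per prompt")
    return sum(passes) / len(passes)


def length_budget(pass_rate: float, tight: int = 256, loose: int = 1024) -> int:
    """Assumed rule: hard prompts (low pass rate) get a budget near `loose`,
    easy prompts (high pass rate) get a budget near `tight`."""
    return round(tight + (loose - tight) * (1.0 - pass_rate))


def shaped_reward(passed: bool, n_tokens: int, budget: int,
                  penalty_per_token: float = 0.001) -> float:
    """Verifier reward minus a penalty on tokens beyond the budget."""
    base = 1.0 if passed else 0.0
    overflow = max(0, n_tokens - budget)
    return base - penalty_per_token * overflow


# A hard prompt: 2 of 8 rollouts pass, so the budget stays wide.
hard_budget = length_budget(group_pass_rate([True, True] + [False] * 6))
# The same prompt later in training: all 8 pass, so the budget tightens.
easy_budget = length_budget(group_pass_rate([True] * 8))
```

The signal costs nothing extra in this sketch because the pass flags are already produced when the verifier scores each rollout in the group.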

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same BDDL interface could support continual online refinement if the video-to-BDDL parser is run on robot camera streams.
Length compression may directly lower end-to-end latency when the planner runs on resource-limited robot hardware.
Replacing BDDL with other formalisms such as PDDL or linear temporal logic would test whether the verification-plus-adaptation pattern generalizes beyond the current specification language.
Combining the symbolic verifier with real-world execution traces could expose and correct systematic gaps between simulated and physical task requirements.

Load-bearing premise

A single automatically constructed BDDL specification can faithfully serve as the shared interface for data construction, plan verification, and reward design without introducing systematic mismatches between video evidence, the formal spec, and physical task requirements.

What would settle it

Execute a large sample of plans that pass the BDDL verifier inside the full BEHAVIOR simulator. If a substantial fraction fail to complete the task because of unrepresented physical constraints or spec inaccuracies, the shared-interface premise does not hold.
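The test above reduces to estimating one proportion: among plans the verifier accepts, how many fail in the simulator. A minimal sketch follows, assuming paired boolean outcomes are available per plan; the Wilson score interval is a standard choice for a binomial proportion, and the record layout and sample counts are hypothetical.

```python
import math
from typing import Iterable, Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for k successes out of n trials."""
    if n == 0:
        raise ValueError("no trials")
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def false_pass_rate(records: Iterable[Tuple[bool, bool]]):
    """records: (verifier_passed, simulator_passed) per plan.
    Returns (failures, verifier_passes, interval) over verifier-accepted plans."""
    sim_outcomes = [sim for ver, sim in records if ver]
    n = len(sim_outcomes)
    k = sum(not s for s in sim_outcomes)
    return k, n, wilson_interval(k, n)


# Hypothetical sample: 100 verifier-accepted plans, 10 of which fail in simulation.
records = [(True, True)] * 90 + [(True, False)] * 10 + [(False, False)] * 5
k, n, (low, high) = false_pass_rate(records)
```

Plans the verifier rejects are ignored here on purpose: the premise at risk is that a symbolic pass implies physical success, not the reverse.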

Figures

Figures reproduced from arXiv: 2606.31260 by Lujie Yin, Xiangli Shi, Xiaomeng Zhu, Ye Tian, Yuchun Guo, Yufei Huang, Yuxuan Zhou, Ziyang Sun.

**Figure 1.** A running example of SymPlan based on a real attach_a_camera_to_a_tripod guided sample. Open-world video evidence is parsed into grounded objects and initial predicates, verified into BDDL, rewritten as planner-facing dialogue, turned into executable action code, and checked by the symbolic engine. The same BDDL interface therefore builds data, verifies plans, and supplies training rewards.

**Figure 2.** BDDL-centric data construction and symbolic verification.

**Figure 3.** SFT-initialized verifiable RL for compact planning.

**Figure 4.** Compact symbolic-RL trajectories on in-domain embodied validation, averaged over three decoding …

**Figure 5.** Summary of the construction pipeline on the camera-and-tripod example (task prior: "Place the camera on top of the tripod."). Recoverable panel labels: Video + Task (evidence at 5s initial, 55s tripod visible, 88s camera on tripod), Task prior, Search targets, Visual evidence, Object grounding, BDDL Draft.

**Figure 6.** Side-by-side comparison of the video-derived BDDL with the official BEHAVIOR BDDL for the …

**Figure 7.** Illustration of our data construction format.

**Figure 8.** Symbolic replay for buy_dog_food. The engine executes generated action code from the initial grocery-store state to the verified checkout state, using the same BDDL specification that was used to construct the planner input.

**Figure 9.** Additional symbolic-RL trajectories, averaged over three decoding seeds.

**Figure 10.** Histogram of executable command count

Original abstract

Embodied task planning asks an agent to turn a natural-language instruction into an executable sequence of actions in a physical scene, and is a building block for household, assistive, and service robots. Recent prompting-based and reinforcement-learning planners generate fluent action text but lack a cheap deterministic check that the produced plan is valid in the target world, while high-fidelity simulation is too slow to serve as an inner-loop training signal. The general problem is therefore how to obtain verifiable supervision and rewards for embodied planners without relying on string-level matching or full simulation. Here we show that a single BDDL specification, automatically constructed from open-world video evidence or curated tasks, can serve as a shared interface for data construction, plan verification, and reward design. A video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine together supply dense feedback at millisecond latency. We further introduce GroupAdapt, a difficulty-aware length schedule that uses the in-batch group pass rate as a zero-cost signal so that hard prompts get wider length tolerance and automatically tighten as their pass rate improves. Under the guidance of the proposed verifier and GroupAdapt schedule, the 8B planner attains a Strict-Pass score of 97.3 on BEHAVIOR-1000, yielding a 25.9 percent relative improvement over the Qwen3-8B baseline. This result exceeds the strongest large-model baseline by 3.5 percent, while simultaneously compressing the response length by 79 percent to 207 tokens, demonstrating both effectiveness and efficiency.
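The abstract reports relative figures, so it can help to back out the absolute numbers they imply. The short calculation below uses only values stated in the abstract. Whether "3.5 percent" over the strongest large-model baseline is relative or in absolute points is not stated, so both readings are computed; the variable names are ours.

```python
strict_pass = 97.3        # reported Strict-Pass of the 8B planner
relative_gain = 0.259     # reported relative improvement over Qwen3-8B
final_tokens = 207        # reported response length after compression
compression = 0.79        # reported fractional reduction in length

# Baseline score implied by a 25.9% relative improvement.
implied_baseline = strict_pass / (1 + relative_gain)

# Response length before compression implied by a 79% reduction to 207 tokens.
implied_original_tokens = final_tokens / (1 - compression)

# Two readings of "exceeds the strongest large-model baseline by 3.5 percent".
large_baseline_if_relative = strict_pass / 1.035
large_baseline_if_absolute = strict_pass - 3.5

print(f"implied Qwen3-8B baseline: {implied_baseline:.1f}")
print(f"implied original length: {implied_original_tokens:.0f} tokens")
print(f"large-model baseline: {large_baseline_if_relative:.1f} (relative) "
      f"or {large_baseline_if_absolute:.1f} (absolute)")
```

Under these readings the Qwen3-8B baseline sits near 77.3 Strict-Pass and the uncompressed responses near 986 tokens; both are derived values, not numbers quoted from the paper.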

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a workable way to swap slow simulation for fast BDDL-based feedback in embodied RL, but the gains rest on an unvalidated parser that could be feeding the system false signals.

The punchline is that this approach gets an 8B planner to 97.3 strict-pass on BEHAVIOR-1000 with 79% shorter outputs by using one BDDL spec for data, verification, and rewards, plus a simple GroupAdapt scheduler that widens length tolerance for hard batches.

The new piece is the tight loop: video-to-BDDL parsing feeds an LLM verifier and a symbolic engine that runs in milliseconds, then GroupAdapt uses the batch pass rate to adjust plan lengths without extra cost. That combination is not in the baselines they cite, and it directly tackles the simulation bottleneck for household robotics.

It does the engineering cleanly enough to produce those numbers while keeping plans short. The efficiency claim looks real on its face.

The soft spot is exactly what the stress test flags. The parser is the only source of the shared spec, yet the abstract gives no human validation, no simulator cross-check, and no error rates on predicate accuracy or object relations. If the BDDL systematically misses task constraints, the verifier rewards plans that look good on paper but fail in the world, and the 25.9% lift becomes hard to trust. No ablations on the parser or on GroupAdapt are mentioned either.

This is for people building embodied planners who already work with BDDL or similar formalisms and want faster inner-loop signals. A reader who needs reproducible gains on BEHAVIOR-1000 would find the method worth trying, but they would have to add their own validation first.

It deserves peer review because the practical problem is clear and the reported efficiency is large enough to check.

Referee Report

2 major / 0 minor

Summary. The paper claims that a single automatically constructed BDDL specification from a video-to-BDDL parser can serve as a shared interface for data construction, LLM-based plan verification, and symbolic reward design in embodied task planning. Combined with a GroupAdapt difficulty-aware length schedule that uses in-batch pass rate as a signal, this enables an 8B planner to reach 97.3 Strict-Pass on BEHAVIOR-1000 (25.9% relative gain over Qwen3-8B baseline, exceeding strongest large-model baseline by 3.5%), while reducing response length by 79% to 207 tokens.

Significance. If the BDDL faithfulness assumption holds and the reported gains are reproducible, the work would demonstrate a practical route to dense, low-latency verifiable supervision for embodied planners that avoids both string matching and full simulation, enabling efficient training of smaller models with measurable gains in both accuracy and token efficiency.

major comments (2)

[Abstract] The central performance numbers (97.3 Strict-Pass, 25.9% relative improvement) rest on the video-to-BDDL parser supplying a faithful shared interface, yet the manuscript supplies no quantitative validation, error analysis, or ablation of the parser against human judgment or simulator ground truth. This is load-bearing for the claim that the verifier supplies reliable feedback.
[Abstract] No derivation, ablation, or sensitivity analysis is provided for the GroupAdapt schedule or its use of in-batch pass rate as the difficulty signal; the reported compression to 207 tokens and the Strict-Pass gains cannot be assessed for robustness without these controls.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for validation of the BDDL parser and analysis of the GroupAdapt schedule. Both points identify gaps in the current manuscript that we will address through additional experiments and exposition in the revision.

Referee: [Abstract] The central performance numbers (97.3 Strict-Pass, 25.9% relative improvement) rest on the video-to-BDDL parser supplying a faithful shared interface, yet the manuscript supplies no quantitative validation, error analysis, or ablation of the parser against human judgment or simulator ground truth. This is load-bearing for the claim that the verifier supplies reliable feedback.

Authors: We agree that quantitative validation of the video-to-BDDL parser is essential to substantiate the shared-interface claim. The current manuscript relies on the parser for data construction and verification but does not report error rates or agreement metrics. In the revision we will add a dedicated section with parser accuracy against human annotations on 200 BEHAVIOR-1000 tasks and against simulator ground truth, including failure-mode categorization. This will directly support the reliability of the verifier feedback.

revision: yes

Referee: [Abstract] No derivation, ablation, or sensitivity analysis is provided for the GroupAdapt schedule or its use of in-batch pass rate as the difficulty signal; the reported compression to 207 tokens and the Strict-Pass gains cannot be assessed for robustness without these controls.

Authors: We concur that the absence of derivation and controls for GroupAdapt limits assessment of robustness. The manuscript introduces the schedule but provides no ablations on the in-batch pass-rate signal or sensitivity to its hyperparameters. In the revision we will include (i) a formal derivation of the length-tolerance update rule, (ii) an ablation replacing the pass-rate signal with fixed or oracle difficulty, and (iii) sensitivity plots over batch-size and pass-rate thresholds, showing impact on both Strict-Pass and token length.

revision: yes
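The parser validation discussed in the first response could be scored at the predicate level. The sketch below is one possible metric, not something the manuscript specifies: it treats a BDDL section as a set of ground predicate tuples and compares the parser output with a reference annotation by exact match. The predicates in the example are hypothetical, and exact matching assumes object names have already been aligned between the two specifications.

```python
from typing import Dict, Iterable, Tuple

Predicate = Tuple[str, ...]


def predicate_agreement(predicted: Iterable[Predicate],
                        reference: Iterable[Predicate]) -> Dict[str, object]:
    """Precision, recall, and F1 of predicted predicates against a reference."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "missing": sorted(ref - pred),     # in reference, not parsed
            "spurious": sorted(pred - ref)}    # parsed, not in reference


# Hypothetical initial-state predicates for a camera-and-tripod task.
parsed = {("onfloor", "camera_tripod", "floor"),
          ("ontop", "digital_camera", "floor"),
          ("inroom", "floor", "living_room")}
annotated = {("onfloor", "camera_tripod", "floor"),
             ("ontop", "digital_camera", "table_1"),
             ("inroom", "floor", "living_room")}

scores = predicate_agreement(parsed, annotated)
```

The `missing` and `spurious` lists give a starting point for the failure-mode categorization the authors mention, since each entry is a single predicate that a human can inspect.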

Circularity Check

0 steps flagged

No significant circularity; empirical result stands on external benchmark

The paper reports an empirical Strict-Pass score of 97.3 on the external BEHAVIOR-1000 benchmark after training with a BDDL-based verifier and GroupAdapt schedule. GroupAdapt conditions length tolerance on in-batch pass rate, but this signal is not definitionally identical to the final Strict-Pass metric, so the reported 25.9% relative gain is not forced by construction. No equations, self-citations, or ansatzes are shown that reduce the central claim to a renaming or a fitted input of the same data. The derivation chain is self-contained against the stated benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the named components; BDDL is treated as an external standard.

pith-pipeline@v0.9.1-grok · 5830 in / 1149 out tokens · 25483 ms · 2026-07-01T05:18:05.049050+00:00 · methodology

