Training the Orchestrator: A Supervised Approach to End-to-End PDDL Planning with LLM Agents

Pranay Dugar; Prasad Tadepalli; Rajesh Mangannavar; Zachary Coalson

arxiv: 2606.21740 · v1 · pith:DC6UMKT5new · submitted 2026-06-19 · 💻 cs.AI

Pith reviewed 2026-06-26 13:54 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents · PDDL planning · orchestrator training · supervised learning · cost reduction · PlanBench · verifier-guided training

The pith

A trained lightweight policy can orchestrate LLM planning agents as effectively as a frontier model but at a fraction of the cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that an orchestrator for PDDL planning can be trained from trajectories certified by an external verifier rather than relying on repeated prompts to a large language model. HALO combines a small QLoRA-tuned policy with hardcoded rules to select among 21 specialized agents in a refinement loop. This yields success rates that match or exceed those of prompted GPT-5-mini and stay close to Gemini-3-Flash across several benchmarks. Readers would care because the method slashes orchestration costs by more than an order of magnitude and reduces the number of LLM calls needed per planning task by 40 to 50 percent.

Core claim

HALO trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans across 11 PDDL domains. It pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections and operates over an expanded 21-agent action space. The verifier provides strong guidance because every accepted trajectory is a sequence of demonstrably correct state-agent decisions directly usable as supervision. This allows HALO to match or exceed the GPT-5-mini prompted baseline on success rate while sitting within three percentage points of the stronger Gemini-3-Flash baseline.
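The supervision step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Step` and `Trajectory` types, the agent names, and the `toy_verifier` stand-in are all hypothetical, and a real verifier would validate the plan against the PDDL domain and problem.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    state: str  # assumed: a textual summary of the current draft plus verifier feedback
    agent: str  # name of the agent the teacher orchestrator picked at this step


@dataclass
class Trajectory:
    steps: List[Step]
    final_plan: str


def build_supervision(
    trajectories: List[Trajectory],
    verifier: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Keep verifier-accepted trajectories and flatten them into (state, agent) pairs."""
    pairs: List[Tuple[str, str]] = []
    for traj in trajectories:
        if not verifier(traj.final_plan):
            continue  # hard filter: drop rollouts that do not end in a valid plan
        pairs.extend((step.state, step.agent) for step in traj.steps)
    return pairs


def toy_verifier(plan: str) -> bool:
    # Hypothetical stand-in for an external plan validator.
    return plan.startswith("VALID")


rollouts = [
    Trajectory([Step("s0", "syntax_fixer"), Step("s1", "goal_checker")], "VALID: (move a b)"),
    Trajectory([Step("s0", "goal_checker")], "INVALID"),
]
dataset = build_supervision(rollouts, toy_verifier)
```

Every step of an accepted trajectory becomes one labeled example, which is why the signal is denser than a single end-of-episode reward.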

What carries the argument

HALO, a hybrid agent-learned orchestrator that uses a QLoRA-tuned policy trained on verified trajectories to select agents, supplemented by hardcoded rules.
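One way to picture the hybrid design is a rule pass followed by a learned fallback. The sketch below is illustrative only: the source does not list the three hardcoded rules or the agent names, so the two rules, the state keys, and `toy_policy` here are invented placeholders.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, object]
Rule = Callable[[State], Optional[str]]
Policy = Callable[[State], str]


def rule_no_domain_yet(state: State) -> Optional[str]:
    # Hypothetical rule: with no PDDL draft yet, only a drafting agent makes sense.
    return "domain_drafter" if not state.get("has_domain") else None


def rule_syntax_error(state: State) -> Optional[str]:
    # Hypothetical rule: a parser error goes straight to a syntax-repair agent.
    return "syntax_fixer" if state.get("parse_error") else None


RULES: List[Rule] = [rule_no_domain_yet, rule_syntax_error]


def select_agent(state: State, rules: List[Rule], policy: Policy) -> str:
    """Return the first rule's decision, or fall back to the learned policy."""
    for rule in rules:
        choice = rule(state)
        if choice is not None:
            return choice
    return policy(state)


def toy_policy(state: State) -> str:
    # Stand-in for the QLoRA-tuned policy; a real one would score all 21 agents.
    return "precondition_repairer"


picked = select_agent({"has_domain": True, "parse_error": None}, RULES, toy_policy)
```

Routing the trivially decidable cases through rules means the learned policy is only consulted where a judgment call is actually needed.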

If this is right

HALO achieves success rates matching or exceeding the GPT-5-mini prompted baseline.
HALO stays within three percentage points of the Gemini-3-Flash prompted baseline.
Orchestration cost drops from $0.18 to $0.004 per task against GPT-5-mini, roughly a 45× reduction.
Total LLM calls per episode are cut by 40 to 50 percent.
These results hold across PlanBench, Natural Plan, and classical planning benchmarks.
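The cost figures above can be checked with simple arithmetic. In this sketch the two per-task costs and the 15× ratio come from the abstract; the implied Gemini-3-Flash cost is our back-calculation from that ratio rather than a reported number, and the baseline call count is a hypothetical value chosen only to show what a 40 to 50 percent cut looks like.

```python
import math

gpt5_mini_cost = 0.18  # USD per task, orchestration only (from the abstract)
halo_cost = 0.004      # USD per task, orchestration only (from the abstract)

# Ratio of baseline cost to HALO cost; the abstract rounds this to "roughly 45x".
reduction_vs_gpt = gpt5_mini_cost / halo_cost

# Back-calculated, NOT reported: the per-task cost implied by "roughly 15x cheaper".
gemini_ratio = 15
implied_gemini_cost = halo_cost * gemini_ratio


def calls_after_cut(baseline_calls: float, cut: float) -> float:
    """LLM calls left per episode after cutting a fraction `cut` of them."""
    return baseline_calls * (1 - cut)


hypothetical_baseline_calls = 20  # assumption for illustration only
low = calls_after_cut(hypothetical_baseline_calls, 0.5)
high = calls_after_cut(hypothetical_baseline_calls, 0.4)
print(f"{reduction_vs_gpt:.0f}x cheaper, {low:.0f} to {high:.0f} calls per episode")
```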

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This supervised training method could be applied to other multi-agent orchestration problems where a verifier can certify successful trajectories.
Lowering the cost of orchestration might make end-to-end LLM planning feasible for a wider range of users and applications.
Future work could explore whether the trained policy generalizes to new domains without additional verifier data.

Load-bearing premise

The verifier already provides strong guidance because every accepted trajectory consists of demonstrably correct decisions that can serve directly as supervision signals.

What would settle it

A test showing that the trained HALO policy selects incorrect agents at a rate high enough to drop success rates significantly below the prompted baselines on the same benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.21740 by Pranay Dugar, Prasad Tadepalli, Rajesh Mangannavar, Zachary Coalson.

**Figure 2.** Training the orchestrator inside HALO. (a) Following GABAR’s template (Mangannavar et al. [2025]), training problems drawn from 11 PDDL domains are passed through a strong prompted teacher (GPT-5-mini). The teacher’s rollouts go through a three-stage filter, consisting of a hard verifier filter that discards trajectories not ending in a valid plan, spec-level augmentation, and an LLM-as-judge soft filter, pr…

Original abstract

Translating natural-language planning intent into verified plans is a longstanding challenge: people communicate goals in language, while classical planners require formal PDDL specifications. Recent agentic frameworks bridge this gap by orchestrating a pool of specialized repair agents inside a verifier-checked refinement loop, but the orchestrator at the centre is itself a prompted frontier LLM, paying a frontier-LLM API call at every refinement step. We present HALO (Hybrid Agent-Learned Orchestrator), which trains the orchestrator from refinement trajectories that an external verifier has certified as ending in valid plans, across 11 PDDL domains. HALO pairs a small QLoRA-tuned policy with three hardcoded rules for trivially decidable selections, and operates over an expanded 21-agent action space. Unlike approaches that prompt a frontier LLM at every step or learn an orchestrator from sparse end-of-episode rewards, our key observation is that the verifier already provides strong guidance: every accepted trajectory is a sequence of demonstrably correct (state, agent) decisions, directly usable as supervision. Across PlanBench, Natural Plan, and classical planning benchmarks, HALO matches or exceeds the GPT-5-mini prompted baseline on success rate, sits within three percentage points of the stronger Gemini-3-Flash prompted baseline, reduces orchestration cost by more than an order of magnitude ($0.18 to $0.004 per task against GPT-5-mini, roughly 45× cheaper; roughly 15× cheaper than Gemini-3-Flash), and cuts total LLM calls per episode by 40 to 50 percent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HALO trains a small policy on verifier-certified trajectories to orchestrate planning agents, matching big-model success rates at far lower cost.

The central takeaway here is that training a small QLoRA policy on verifier-certified trajectories lets you replace expensive frontier-LLM prompting for the orchestrator with something much cheaper, while keeping success rates roughly the same across those planning benchmarks.

What stands out as new is the direct use of successful trajectories as supervision. Instead of prompting a big model every time or doing RL from final rewards, they take the sequences where the verifier accepted the plan and train the policy to imitate the agent choices in those states. That makes sense because the verifier already filters for correct decisions.

The paper does a reasonable job showing the practical payoff: success rates match or beat the GPT-5-mini baseline and stay close to Gemini-3-Flash, with orchestration cost dropping from $0.18 to $0.004 per task and fewer total LLM calls. Adding the three hardcoded rules and expanding to 21 agents seems like a sensible engineering choice.

The soft spots are mostly around missing details. The abstract reports the numbers but doesn't give domain stats, how many trajectories were used, or any error analysis. It's not clear how the baselines were implemented or whether the small model generalizes beyond the training domains. The QLoRA adaptation rank is listed as a free parameter, so some tuning was probably involved. These are the usual things that need fleshing out in a full paper.

This is the kind of work that would interest people building agentic planning systems who care about cost at scale. A reader working on LLM agents or classical planning hybrids would get value from the cost numbers and the supervision approach.

I would send it to peer review. The core idea holds up on the evidence given, and the results are concrete enough to warrant a closer look even if revisions are needed for the experimental section.

Referee Report

0 major / 2 minor

Summary. The paper introduces HALO, a hybrid orchestrator for LLM-based PDDL planning that replaces a prompted frontier LLM with a small QLoRA-tuned policy trained via supervised imitation on trajectories certified as valid by an external verifier. The policy operates over a 21-agent action space alongside three hardcoded rules and is evaluated on PlanBench, Natural Plan, and classical planning benchmarks, where it matches or exceeds the GPT-5-mini baseline on success rate, stays within 3 points of Gemini-3-Flash, reduces orchestration cost by 15-45×, and cuts total LLM calls per episode by 40-50%.

Significance. If the empirical results hold under the reported experimental protocol, the work demonstrates that verifier-certified trajectories supply sufficiently dense supervision to train a lightweight policy that can replace repeated frontier-LLM calls for orchestration. This yields a concrete, reproducible cost reduction while preserving end-to-end success rates across 11 domains, strengthening the case for hybrid learned-orchestrator designs in agentic planning systems.

minor comments (2)

The abstract states concrete success-rate and cost figures but does not define the precise baselines (e.g., exact prompting templates or temperature settings for GPT-5-mini and Gemini-3-Flash) or report per-domain statistics; the full paper should include these in §4 or a dedicated experimental appendix so readers can reproduce the comparison.
The description of the 21-agent action space and the three hardcoded rules is given only at a high level; a table or pseudocode listing the exact agent names and the decision logic for the hardcoded cases would improve clarity in §3.2.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No major comments were raised in the report, so we have no specific points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity

The derivation relies on supervised training of a small policy (QLoRA-tuned) from trajectories that an external verifier has already certified as ending in valid plans across 11 domains. This is standard imitation learning: the verifier supplies the (state, agent) labels directly, independent of the learned model or any fitted parameters within the paper. The abstract explicitly contrasts this with both frontier-LLM prompting at every step and sparse end-of-episode reward learning, confirming the supervision signal is external rather than self-generated. No equations, self-citations, or uniqueness claims reduce the result to its own inputs by construction.
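Read as standard imitation learning, the training objective would be the usual behavior-cloning log-likelihood. The notation here is ours and is only a sketch, since the abstract gives no equations: $\mathcal{D}$ is the set of verifier-accepted trajectories, $(s_t, a_t)$ is the state and selected agent at step $t$ of trajectory $\tau$, and $\pi_\theta$ is the QLoRA-tuned policy.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{|\mathcal{D}|}
    \sum_{\tau \in \mathcal{D}} \sum_{t=1}^{|\tau|}
    \log \pi_\theta\left(a_t \mid s_t\right)
```

The labels $a_t$ come from rollouts that the verifier filtered, not from $\pi_\theta$ itself, which is the sense in which the supervision signal is external.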

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; the method rests on the assumption that verifier trajectories constitute high-quality, generalizable supervision and that the 11 domains plus 21-agent space are representative.

free parameters (1)

QLoRA adaptation rank and scaling
The small policy is obtained via QLoRA tuning whose rank, alpha, and learning-rate choices are free parameters not specified in the abstract.

axioms (1)

Domain assumption: Verifier-certified trajectories consist of demonstrably correct (state, agent) decisions usable as direct supervision
Central premise stated in the abstract that enables the supervised-learning approach.

pith-pipeline@v0.9.1-grok · 5832 in / 1311 out tokens · 25979 ms · 2026-06-26T13:54:52.699557+00:00 · methodology

Reference graph

Works this paper leans on

64 extracted references · 29 canonical work pages · 5 internal anchors

[1] LLM+P: Empowering Large Language Models with Optimal Planning Proficiency. arXiv preprint arXiv:2304.11477.
[2] PDDLEGO: Iterative Planning in Textual Environments. arXiv preprint arXiv:2405.19793.
[3] Dynamic Planning with a LLM. arXiv preprint arXiv:2308.06391.
[4] TRIP-PAL: Travel Planning with Guarantees by Combining Large Language Models and Automated Planners. arXiv preprint arXiv:2406.10196.
[5] Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning. NeurIPS.
[6] Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text. arXiv preprint arXiv:2307.07696.
[7] Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools. 2025.
[8] LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks. arXiv preprint arXiv:2402.01817, 2024.
[9] Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. ICLR.
[10] Generative Agents: Interactive Simulacra of Human Behavior. UIST.
[11] CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization. arXiv preprint arXiv:2310.10134.
[12] You Only Look at Screens: Multimodal Chain-of-Action Agents. arXiv preprint arXiv:2309.11436.
[13] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models. arXiv preprint arXiv:2403.12881.
[14] FireAct: Toward Language Agent Fine-tuning. arXiv preprint arXiv:2310.05915.
[15] AgentTuning: Enabling Generalized Agent Abilities for LLMs. arXiv preprint arXiv:2310.12823.
[16] Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents. arXiv preprint arXiv:2403.02502.
[17] OpenCodeReasoning: Advancing Data Distillation for Competitive Coding. arXiv preprint.
[18] QLoRA: Efficient Finetuning of Quantized LLMs. 2023.
[19] WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. arXiv preprint arXiv:2411.02337.
[20] Training Language Models to Self-Correct via Reinforcement Learning. arXiv preprint arXiv:2409.12917.
[21] Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning. arXiv preprint arXiv:2406.14283.
[22] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.
[23] HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. NeurIPS.
[24] RaDA: Retrieval-augmented Web Agent Planning with LLMs. arXiv preprint.
[25] OSCAR: Operating System Control via State-Aware Reasoning and Re-Planning. arXiv preprint arXiv:2410.18963.
[26] Can Language Models Serve as Text-Based World Simulators? arXiv preprint arXiv:2406.06485.
[27] Plan-RAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. arXiv preprint.
[28] UrbanLLM: Autonomous Urban Activity Planning and Management with Large Language Models. arXiv preprint.
[29] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.
[30] SearChain: Adaptive Information Retrieval Chain for Multi-Turn Question Answering. arXiv preprint.
[31] Graph of Thoughts: Solving Elaborate Problems with Large Language Models. AAAI.
[32] Tree-Planner: Efficient Close-loop Task Planning with Large Language Models. ICLR.
[33] ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search. ICLR.
[34] Reasoning with Language Model is Planning with World Model. EMNLP.
[35] AlphaMath Almost Zero: Process Supervision Without Process. arXiv preprint arXiv:2405.03553.
[36] SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement. arXiv preprint.
[37] Proc2PDDL: Open-Domain Planning Representations from Texts. arXiv preprint arXiv:2403.00092.
[38] Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages. NeurIPS Workshop.
[39] AutoPlanBench: Automatically Generating Benchmarks for LLM Planners from PDDL. arXiv preprint arXiv:2311.09830.
[40] Unlocking the Future: Exploring Look-Ahead Planning Mechanistic Interpretability in Large Language Models. EMNLP.
[41] Chain of Thoughtlessness: An Analysis of CoT in Planning. arXiv preprint arXiv:2405.04776.
[42] ToolOrchestra: An End-to-End RL Training Framework for Orchestrating Tools and Agentic Workflows. GitHub.
[43] TinyAgent: Function Calling at the Edge. arXiv preprint.
[44] Hammer: Robust Function-Calling for On-Device Language Models via Function Masking. arXiv preprint.
[45] Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code. arXiv preprint arXiv:2503.18809.
[46] Generalized Planning in PDDL Domains with Pretrained Large Language Models. AAAI.
[47] LLM-Generated Heuristics for AI Planning: Do We Even Need Domain-Independence Anymore? arXiv preprint arXiv:2501.18784.
[48] Mathematical Discoveries from Program Search with Large Language Models. Nature.
[49] Automated Planning: Theory and Practice. 2004.
[50] PDDL: The Planning Domain Definition Language. Technical Report.
[51] KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents. arXiv preprint arXiv:2403.03101.
[52] End-to-end PDDL Planning with Hardcoded and Dynamic Agents. arXiv preprint arXiv:2512.09629.
[53] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. Advances in Neural Information Processing Systems.
[54] NL2Plan: Robust LLM-Driven Planning from Minimal Text Descriptions. arXiv preprint arXiv:2405.04215.
[55] How Far Are LLMs from Symbolic Planners? An NLP-Based Perspective. 2025.
[56] Graph Neural Network Based Action Ranking for Planning. Advances in Neural Information Processing Systems.
[57] GammaZero: Learning to Guide Belief-Space Search for Long-Horizon POMDPs with Generalizable Graph Representations. Proceedings of the International Conference on Automated Planning and Scheduling.
[58] The Fast Downward Planning System. Journal of Artificial Intelligence Research.
[59] Howey, Richard and Long, Derek and Fox, Maria.
[60] Forward-Chaining Partial-Order Planning. Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS).
[61] A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS).
[62] Efficient Training of Artificial Neural Networks for Autonomous Navigation. Neural Computation.
[63] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[64] Zheng, Huaixiu Steven and Mishra, Swaroop and Zhang, Hugh and Chen, Xinyun and Chen, Minmin and Nova, Azade and Hou, Le and Cheng, Heng-Tze and Le, Quoc V. and Chi, Ed H. and Zhou, Denny. arXiv:2406.04520.
