In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks

Hao Guo; Kevin Shabahang; Michael Diamond; Rivaan Patil; Simon Dennis

arxiv: 2604.27891 · v2 · submitted 2026-04-30 · 💻 cs.AI · cs.LG

Pith reviewed 2026-05-08 03:02 UTC · model grok-4.3

keywords: in-context prompting · agent orchestration · procedural tasks · large language models · multi-turn conversations · system prompts · task automation

The pith

For procedural tasks, embedding the full procedure in the system prompt lets the model self-orchestrate more reliably than external orchestration systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two approaches to handling multi-turn procedural tasks with large language models. One places the entire procedure directly into the model's system prompt so it can manage its own steps and state. The other relies on an external system to track progress and inject routing instructions at each turn. Across travel booking, technical support, and insurance claims domains, the prompt-based method achieves higher quality scores and lower failure rates. This suggests that recent improvements in model capabilities have reduced the need for complex external orchestration in these settings.

Core claim

The central claim is that for procedural tasks, in-context prompting (placing the entire procedure in the system prompt so the LLM can self-orchestrate) outperforms agent orchestration frameworks that place an external orchestrator above the LLM to track state and inject routing instructions. In evaluations across three domains with 200 conversations per condition, the in-context approach scored between 4.53 and 5.00 on a 5-point scale compared to 4.17 to 4.84 for the orchestrated system. In-context failure rates were 11.5% (travel), 0.5% (Zoom), and 5% (insurance), versus 24%, 9%, and 17% for the orchestrated system in the same domains.

What carries the argument

The in-context system prompt containing the complete procedure definition, which enables the model to manage its own state and transitions without external intervention.
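As a rough illustration of what "the complete procedure definition in the system prompt" could look like, the sketch below renders a small procedure graph into one prompt string. The routing labels (`missing_info`, `info_complete`, `user_abandoning`, `option_selected`) and the `PRESENT_OPTIONS` node name appear in the extracted prompt fragments further down this page; the `Node` class, the `CLOSE` node, the edge targets, and the prompt wording are assumptions made for illustration, not the authors' actual prompt.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    instruction: str
    # Maps a routing label to the name of the next node; empty for terminal nodes.
    edges: dict[str, str] = field(default_factory=dict)


def build_system_prompt(nodes: list[Node], start: str) -> str:
    """Render the whole procedure as one system prompt the model can self-route on."""
    lines = [
        "You are a travel booking agent. Follow the procedure below.",
        "Track the current node yourself and move only along the listed edges.",
        f"Start at node {start}.",
        "",
    ]
    for node in nodes:
        lines.append(f"NODE {node.name}: {node.instruction}")
        for label, target in node.edges.items():
            lines.append(f"  [{label}] -> {target}")
        if not node.edges:
            lines.append("  (terminal state)")
    return "\n".join(lines)


# Hypothetical three-node fragment; the paper's travel graph has 14 nodes.
PROCEDURE = [
    Node("ASSESS", "Check whether the required trip details are known.",
         {"missing_info": "ASSESS", "info_complete": "PRESENT_OPTIONS",
          "user_abandoning": "CLOSE"}),
    Node("PRESENT_OPTIONS", "Present options that fit the request.",
         {"option_selected": "CLOSE"}),
    Node("CLOSE", "Close the conversation gracefully."),
]

prompt = build_system_prompt(PROCEDURE, start="ASSESS")
print(prompt)
```

The point of the sketch is only that state tracking is delegated to the model: nothing outside the prompt records which node the conversation is in.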

If this is right

Developers can simplify agent systems by replacing external orchestrators with detailed system prompts for procedural tasks.
Resources spent on building and maintaining orchestration frameworks may yield diminishing returns for many applications.
Multi-turn conversations following defined procedures can achieve high reliability using only frontier model prompting.
Earlier models may have required orchestration, but current capabilities make it unnecessary for these tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This finding may extend to other structured tasks beyond the tested domains, such as customer service or medical diagnosis flows.
Future model improvements could further widen the gap in favor of in-context methods.
Testing on non-frontier models might reveal when orchestration remains necessary.

Load-bearing premise

That the LLM-as-judge scoring provides an accurate and unbiased measure of actual conversation quality and task success across the chosen domains.

What would settle it

Re-evaluating the same conversations with human judges instead of LLM-as-judge to check if the quality and failure rate differences hold.
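One way such a check might be quantified is chance-corrected agreement between the LLM judge and a human rater on the same conversations. The sketch below computes Cohen's kappa over pass/fail labels; the choice of kappa and the ten labels are illustrative assumptions (the labels are made up to exercise the function and are not data from the paper).

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for ten conversations, purely to exercise the function.
llm_judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "fail", "pass"]
print(f"kappa = {cohens_kappa(llm_judge, human):.2f}")
```

A rank correlation over the 1-5 scores would be an equally reasonable choice; kappa is used here only because failure is reported as a binary outcome.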

Figures

Figures reproduced from arXiv: 2604.27891 by Hao Guo, Kevin Shabahang, Michael Diamond, Rivaan Patil, Simon Dennis.

**Figure 1.** Travel booking flowchart (14 nodes, 3 decision hubs, 3 terminal states). The Assess node...

**Figure 2.** Zoom technical support flowchart (14 nodes, 3 decision hubs, 3 terminal states). Triage...

**Figure 3.** Insurance claims processing flowchart (55 nodes, 6 decision hubs, 5 terminal states).

Original abstract

Agent orchestration frameworks -- LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, and others -- place an external orchestrator above the LLM, tracking state and injecting routing instructions at every turn. We present a controlled comparison showing that for procedural tasks, this architecture is dominated by a simpler alternative: putting the entire procedure in the system prompt and letting the model self-orchestrate. Across three domains -- travel booking (14 nodes), Zoom technical support (14 nodes), and insurance claims processing (55 nodes) -- we evaluate 200 conversations per condition using LLM-as-judge scoring on five quality criteria. The in-context approach scores 4.53--5.00 on a 5-point scale while a LangGraph orchestrator using the same model scores 4.17--4.84. The orchestrated system fails on 24% of travel, 9% of Zoom, and 17% of insurance conversations, compared to 11.5%, 0.5%, and 5% for the in-context baseline. While external orchestration may have been necessary for earlier models, advances in frontier model capabilities have made it unnecessary for multi-turn conversations following a defined procedure.
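For contrast with the in-context approach, the orchestrated architecture the abstract describes (an external component that tracks state and injects a routing instruction every turn) can be sketched in plain Python. This is not LangGraph code and does not use any framework API; the `Orchestrator` class, the `EDGES` table, the `DONE` state, and the keyword classifier standing in for an LLM routing call are all assumptions for illustration. The node and label names are borrowed from the extracted prompt fragments on this page.

```python
from typing import Callable

# Hypothetical hub table: current node -> {routing label: next node}.
EDGES = {
    "ASSESS": {"missing_info": "ASSESS", "info_complete": "PRESENT_OPTIONS"},
    "PRESENT_OPTIONS": {"option_selected": "FINAL_ROUTING"},
    "FINAL_ROUTING": {"confirmed": "DONE"},
}


class Orchestrator:
    """Keeps conversation state outside the model and routes every turn."""

    def __init__(self, classify: Callable[[str, str], str], start: str = "ASSESS"):
        self.classify = classify  # (node, user_message) -> routing label
        self.node = start

    def step(self, user_message: str) -> str:
        """Advance the state and return the instruction injected for this turn."""
        label = self.classify(self.node, user_message)
        # An unknown label keeps the state unchanged rather than raising.
        self.node = EDGES.get(self.node, {}).get(label, self.node)
        return f"[router] You are now at node {self.node}. Respond for this node only."


def keyword_classifier(node: str, message: str) -> str:
    """Toy stand-in for the LLM call that would normally pick the label."""
    text = message.lower()
    if node == "ASSESS":
        return "info_complete" if "paris" in text and "june" in text else "missing_info"
    if node == "PRESENT_OPTIONS":
        return "option_selected"
    return "confirmed"


orch = Orchestrator(keyword_classifier)
for msg in ["I want a trip", "Paris in June", "Option 2 please", "Yes, book it"]:
    print(orch.step(msg))
```

In this design a single misrouted label moves the external state to the wrong node, and the model is then instructed to respond for that node; whether that is the failure mode behind the reported numbers is not something the abstract states.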

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

In-context prompting beats the LangGraph baseline on these three procedural tasks according to the reported scores, but the LLM-as-judge evaluation lacks the validation needed to support the broader claim.

The main thing to take away is that the paper runs a head-to-head test and finds the in-context version scores higher (4.53–5.00) and fails less often than the LangGraph orchestrator (4.17–4.84) across travel booking, Zoom support, and insurance claims, each with 200 conversations per condition. That gives a concrete data point against the default assumption that external orchestration is required for multi-turn procedural work. The comparison uses the same base model in both arms, which keeps the test focused on architecture rather than model differences. The domains are all defined-node procedures, so the setup matches the claim being tested. This is the sort of practical comparison that can help teams decide whether to add another layer of routing code.

The soft spot is the measurement. All the quality scores and failure rates come from an LLM judge, yet the abstract gives no details on the judge prompt, the judging model, failure criteria definitions, or any human correlation check. If the judge model favors the style or length of the in-context responses, the reported gap could be partly artifactual. The paper would be more convincing with even a small human validation set or inter-rater numbers. The domains are also narrow enough that the result may not carry over to less linear tasks.

This deserves a serious referee read from anyone working on agent tooling or automation pipelines. The question is timely and the experiment is direct, but the evaluation method needs tightening before the conclusion that orchestration is obsolete can be taken as settled. Referees should ask for prompt details, judge validation, and perhaps a broader set of tasks.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that for procedural tasks, embedding the full procedure in the system prompt enables frontier LLMs to self-orchestrate more effectively than external orchestration frameworks such as LangGraph. This is supported by a controlled comparison across three domains (travel booking with 14 nodes, Zoom technical support with 14 nodes, and insurance claims processing with 55 nodes), evaluating 200 conversations per condition via LLM-as-judge scoring on five quality criteria. The in-context approach achieves scores of 4.53–5.00 versus 4.17–4.84 for the LangGraph baseline, with lower failure rates (11.5%/0.5%/5% versus 24%/9%/17%). The authors conclude that external orchestration is no longer necessary given current model capabilities.

Significance. If the empirical results hold after addressing evaluation concerns, the work would have substantial practical significance by demonstrating that simpler in-context methods can replace complex agent orchestration for well-defined procedural tasks. This could reduce engineering overhead, latency, and costs in deploying multi-turn agents, while challenging the default adoption of frameworks like LangGraph or CrewAI. The paper's strength lies in its direct head-to-head comparison with concrete quantitative outcomes, but its impact depends on establishing that the LLM judge provides unbiased, reliable measurements of task success.

major comments (3)

[Evaluation and Results sections] The evaluation depends entirely on LLM-as-judge scoring, yet the manuscript provides no details on the judge prompt, the judging model, inter-rater agreement metrics, or any human correlation study. This is load-bearing for the central claim, as any systematic bias (e.g., favoring in-context response style or length) could artifactually inflate the reported gaps of 0.36–0.69 points and the failure-rate differences.
[Experimental Setup] Failure criteria, conversation generation protocol, and exact definitions of the five quality criteria are not specified. Without these, it is impossible to verify that the 200 conversations per condition constitute a fair, controlled comparison or that the failure rates (e.g., 24% vs 11.5% in travel) reflect objective task completion rather than judge-specific thresholds.
[Results] No statistical tests, confidence intervals, or variance analysis accompany the reported scores and failure rates. The differences could arise from sampling variability across the 200 conversations, weakening support for the claim that in-context prompting dominates orchestration.
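The sampling-variability concern in the third major comment can be checked roughly from the reported figures alone. The sketch below runs a pooled two-proportion z-test on failure counts derived from the reported percentages and the 200 conversations per condition (for example, 24% of 200 is 48). The choice of test is an assumption, not something the manuscript reports; with only one failure in the Zoom in-context arm the normal approximation is crude, and an exact test would be a safer choice there.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p) for H0: both arms share one failure rate."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


N = 200  # conversations per condition, as reported
# (orchestrated failures, in-context failures), derived from the reported percentages
COUNTS = {"travel": (48, 23), "zoom": (18, 1), "insurance": (34, 10)}

for domain, (orch_fail, ctx_fail) in COUNTS.items():
    z, p = two_proportion_z(orch_fail, N, ctx_fail, N)
    print(f"{domain}: z = {z:.2f}, p = {p:.2g}")
```

This addresses only sampling noise in the failure counts. It says nothing about whether the judge's failure labels are themselves valid, which is the concern of the first two comments.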

minor comments (2)

[Abstract] The abstract and introduction should explicitly name the frontier model(s) used for both the agent and the judge to allow reproducibility.
[Figures/Tables] Figure or table captions could more clearly distinguish the in-context baseline from the LangGraph condition and list the exact five quality criteria.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the transparency and rigor of our evaluation. We agree that the suggested additions will strengthen the manuscript and will incorporate revisions accordingly. Below we respond point by point to each major comment.

Referee: [Evaluation and Results sections] The evaluation depends entirely on LLM-as-judge scoring, yet the manuscript provides no details on the judge prompt, the judging model, inter-rater agreement metrics, or any human correlation study. This is load-bearing for the central claim, as any systematic bias (e.g., favoring in-context response style or length) could artifactually inflate the reported gaps of 0.36–0.69 points and the failure-rate differences.

Authors: We agree that the manuscript requires substantially more detail on the LLM-as-judge methodology. In the revised version we will add a dedicated subsection that includes the complete judge prompt, the exact judging model used, and any inter-rater agreement statistics obtained from multiple independent judgments. Although we did not perform a formal human correlation study, we will include qualitative examples of scored conversations, an explicit discussion of how the five quality criteria were constructed to reduce style or length bias, and an acknowledgment of this as a limitation. These changes will allow readers to assess potential systematic biases directly. revision: yes
Referee: [Experimental Setup] Failure criteria, conversation generation protocol, and exact definitions of the five quality criteria are not specified. Without these, it is impossible to verify that the 200 conversations per condition constitute a fair, controlled comparison or that the failure rates (e.g., 24% vs 11.5% in travel) reflect objective task completion rather than judge-specific thresholds.

Authors: We acknowledge the omission. The revised manuscript will expand the Experimental Setup section to provide: (1) precise failure criteria (e.g., failure to complete all procedure nodes, introduction of incorrect information, or premature termination), (2) the full conversation generation protocol including how user simulators were prompted and the distribution of conversation lengths, and (3) detailed rubrics with score anchors and examples for each of the five quality criteria. These additions will demonstrate that the 200 conversations per condition were generated under identical conditions and that failure rates reflect objective task outcomes rather than judge-specific artifacts. revision: yes
Referee: [Results] No statistical tests, confidence intervals, or variance analysis accompany the reported scores and failure rates. The differences could arise from sampling variability across the 200 conversations, weakening support for the claim that in-context prompting dominates orchestration.

Authors: We agree that statistical support is necessary. In the revision we will add confidence intervals (computed via bootstrap resampling with 10,000 iterations) for all reported mean quality scores and failure rates, together with appropriate statistical tests: two-sample t-tests for the continuous quality scores and chi-squared or proportion z-tests for the binary failure rates. With 200 independent conversations per condition, these analyses are directly computable from our existing data and will quantify the likelihood that the observed gaps (0.36–0.69 points and 8.5–12.5 percentage points in failure rates) are due to sampling variability. revision: yes
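A minimal sketch of the bootstrap interval mentioned in this response, applied to the travel-domain failure rates. The counts (48 and 23 failures out of 200) are derived from the reported 24% and 11.5%; the percentile method, the fixed seed, and the function name are assumptions, since the rebuttal does not specify the bootstrap variant.

```python
import random


def bootstrap_diff_ci(fail_a, n_a, fail_b, n_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (rate_a - rate_b) from binary outcomes."""
    rng = random.Random(seed)
    # Rebuild per-conversation outcomes: 1 = failed, 0 = succeeded.
    a = [1] * fail_a + [0] * (n_a - fail_a)
    b = [1] * fail_b + [0] * (n_b - fail_b)
    diffs = []
    for _ in range(n_boot):
        # Resample each arm with replacement and record the rate difference.
        rate_a = sum(rng.choices(a, k=n_a)) / n_a
        rate_b = sum(rng.choices(b, k=n_b)) / n_b
        diffs.append(rate_a - rate_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Travel domain: orchestrated (48/200) minus in-context (23/200).
lo, hi = bootstrap_diff_ci(48, 200, 23, 200)
print(f"95% CI for the failure-rate difference: [{lo:.3f}, {hi:.3f}]")
```

Resampling conversations treats them as independent draws, which matches the response's own wording. If the user simulator reuses scenarios across conditions, a paired resampling scheme would be more appropriate.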

Circularity Check

0 steps flagged

No circularity: empirical comparison with no derivations or self-referential reductions

The paper advances an empirical claim by running controlled experiments (200 conversations per condition across three procedural domains) that directly measure quality scores and failure rates for in-context prompting versus LangGraph orchestration. No equations, fitted parameters, uniqueness theorems, or ansatzes appear in the provided text. The central result is a head-to-head performance comparison using LLM-as-judge scoring; it does not reduce to any input by construction, self-citation chain, or renaming of prior results. The derivation chain is therefore self-contained and consists solely of new experimental data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described. The evaluation implicitly assumes reliability of LLM-as-judge scoring without independent validation in the provided text.

pith-pipeline@v0.9.0 · 5517 in / 1271 out tokens · 99565 ms · 2026-05-08T03:02:12.480816+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] If MISSING required info → ask for it [missing_info]

[2] If info is UNCLEAR → ask for clarification [needs_clarification]

[3] If all required info gathered → present options [info_complete]

[4] If user wants to ABANDON → close gracefully [user_abandoning]. HANDLE_RESPONSE (Hub 2): Respond to the user's reaction to the options

[5] If they PICKED AN OPTION: Confirm their choice, summarize key details, ask if ready to finalize [option_selected]

[6] If they WANT MODIFICATIONS: Acknowledge feedback, suggest adjustments [needs_revision]

[7] If they have QUESTIONS: Answer helpfully [answered_question]

[8] If CHANGING REQUIREMENTS: Go back to assessment [changing_requirements]

[9] If ABANDONING: Close gracefully [user_abandoning]

[10] If NEEDS HUMAN: Escalate [needs_human]. FINAL_ROUTING (Hub 3): Handle the user's final response

[11] If they CONFIRMED: Express enthusiasm, provide tips, close warmly [confirmed]

[12] If they have a QUESTION: Answer it, ask if ready to confirm [answered_question]

[13] If they want to CHANGE something: Address the change [wants_changes]

[14] If having SECOND THOUGHTS: Close gracefully [second_thoughts]. Non-hub nodes (e.g., OPENING, PRESENT_OPTIONS) have simpler templates with a single outgoing edge and no routing decision. The Zoom and insurance domains follow the same pattern with domain-appropriate prompts; for example, the Zoom TRIAGE hub routes among [needs_details], [ready_for_solutions],...

[15] Task Success (Procedural Completion): Did the agent follow the booking procedure through to a terminal state? 5: Complete procedure; agent moved through all stages and reached a clear terminal state. 4: Nearly complete; reached the final stage but the conclusion was slightly weak. 3: Partial; completed middle stages but fizzled without a clear endpoint. 2...

[16] Information Accuracy: Did the agent correctly use and retain all user-provided information? 5: Flawless; every detail correctly reflected. 4: Mostly correct, one minor detail off. 3: Generally right but ignored a stated preference. 2: Multiple errors or contradicts user input. 1: Fabricates details or ignores input

[17] Consistency: Did the agent maintain coherent state across the conversation? 5: No contradictions, no repeated questions, smooth flow. 4: Mostly coherent, one minor issue. 3: Noticeable issues like re-asking for provided information. 2: Contradicts itself or forgets important details. 1: Incoherent

[18] Graceful Handling: How well does the agent handle changes, ambiguity, edge cases? 5: Masterful; smoothly handles every challenge. 4: Handles most challenges well. 3: Handles straightforward cases but struggles with complexity. 2: Fails to adapt to changes. 1: Any deviation breaks the flow. Note: If no challenging moments, cap at 3

[19] Naturalness: Does this read like a conversation with a skilled human agent? 5: Indistinguishable from a skilled human. 4: Very natural with occasional formulaic moments. 3: Clearly an AI, rigid question-answer pattern. 2: Robotic phrasing, unnatural transitions. 1: Obviously scripted.

[20] Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
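Entries [15] to [19] above read as the five judging criteria, each scored 1 to 5, with a note that Graceful Handling is capped at 3 when a conversation contains no challenging moments. The sketch below shows one way a per-conversation score could be assembled from them. The equal-weight average, the snake_case criterion keys, and the `had_challenge` flag are assumptions; the extracted text does not say how the criteria are combined or how failure is thresholded.

```python
CRITERIA = (
    "task_success",
    "information_accuracy",
    "consistency",
    "graceful_handling",
    "naturalness",
)


def overall_score(scores: dict[str, int], had_challenge: bool) -> float:
    """Average the five 1-5 criterion scores after applying the rubric's cap."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise KeyError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    adjusted = dict(scores)
    if not had_challenge:
        # Rubric note: with no challenging moments, Graceful Handling is capped at 3.
        adjusted["graceful_handling"] = min(adjusted["graceful_handling"], 3)
    # Equal weighting is an assumption; the source does not state how criteria combine.
    return sum(adjusted[c] for c in CRITERIA) / len(CRITERIA)


example = {c: 5 for c in CRITERIA}
print(overall_score(example, had_challenge=False))
```

Under this reading, a conversation with no challenging moments cannot reach a perfect average, which is worth keeping in mind when interpreting a reported top score of 5.00.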
