PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Byeongjin Kim; Gyuwan Kim; Seo Yeon Park

arxiv: 2601.11908 · v2 · submitted 2026-01-17 · 💻 cs.CL

PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

Byeongjin Kim , Gyuwan Kim , Seo Yeon Park This is my paper

Pith reviewed 2026-05-16 13:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords long-context reasoningLLM planningpitfall avoidancenegative constraintsplan-and-executequestion answeringreasoning reliability

0 comments

The pith

PPA-Plan improves long-context LLM reasoning by formulating potential pitfalls as negative constraints before plan generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often produce unreliable plans for long contexts because they rely on surface cues that mask false assumptions. Once formed, these flawed plans are hard to diagnose and correct through reactive fixes. PPA-Plan counters this by first detecting likely logical pitfalls and incorrect assumptions in the input, then expressing them as explicit negative constraints. The planning step is conditioned on avoiding those constraints, which leads to plans that execute more successfully than those from standard plan-and-execute pipelines or direct prompting. Experiments on long-context question-answering benchmarks confirm the performance gain.

Core claim

The paper claims that identifying potential logical pitfalls and false assumptions in advance, casting them as negative constraints, and requiring the generated plan to respect those constraints produces more reliable reasoning plans for long contexts where relevant facts are sparsely distributed.

What carries the argument

The mechanism of detecting pitfalls and false assumptions then converting them into negative constraints that condition the subsequent plan generation step.

If this is right

Plans rest less often on incorrect assumptions drawn from surface-level patterns in the context.
The need for after-the-fact plan revision drops because many errors are blocked before they enter the plan.
Execution accuracy rises on benchmarks where information is distributed sparsely across long inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-constraint approach could be tested on other multi-step reasoning formats such as multi-hop question answering or long-document summarization.
Automated tools for pitfall detection might be combined with the method to reduce dependence on the model's own ability to spot risks.

Load-bearing premise

Potential logical pitfalls and false assumptions can be reliably and comprehensively identified in the long context before any plan is generated.

What would settle it

A controlled test on a long-context QA task in which PPA-Plan either misses a critical pitfall that produces a wrong plan or in which the added constraints cause lower accuracy than the un-augmented baseline.

Figures

Figures reproduced from arXiv: 2601.11908 by Byeongjin Kim, Gyuwan Kim, Seo Yeon Park.

**Figure 2.** Figure 2: Overview of PPA-Plan, a proactive planning framework designed to generate reliable plans and execute them for long-context reasoning. The figure illustrates the full planning process through a concrete example. (1) If the document is not expected to contain explicit temporal markers based on the query, Mpred generates negative constraints to suppress the assumption of concrete dates. (2) Guided by these co… view at source ↗

**Figure 4.** Figure 4: Distribution of negative constraint types gener [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 3.** Figure 3: Impact of PPA-Plan components on plan executability and reasoning accuracy. 4.3 Ablation Study This section analyzes how each component (the Pitfall Predictor Mpred, the Constraint-Aware Planner Mplan, and the Context-Aware Corrector Mcorr) contributes to the overall performance of PPA-Plan [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 5.** Figure 5: Strategic shift in action distributions induced by negative constraints and strategy reasoning. The baseline [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: NLI transition analysis of Stotal to Score. (a) represents the recovery rate in the low-score group, while (b) shows the logical density and evidence retention in the high-score group. Note that PPA-Plan successfully bypasses potential pitfalls through multifaceted reasoning. domly sampled instances from the ConditionalQA dataset, we compared the original answer scores (Stotal) with the scores of core co… view at source ↗

read the original abstract

Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PPA-Plan adds a proactive negative-constraint step before plan generation but the pitfall-detection mechanism itself looks vulnerable to the same long-context failures the paper starts from.

read the letter

PPA-Plan's main move is to identify logical pitfalls and false assumptions in the long input first, cast them as negative constraints, and then condition the plan on avoiding those constraints explicitly. This is a distinct framing from the usual plan-and-execute loop that waits until a plan is already formed and then tries to repair it reactively. The paper does a clean job spelling out why reactive fixes are limited once a bad assumption is baked in, and the reported experiments show consistent gains over both direct prompting and existing plan-and-execute baselines on long-context QA tasks. That empirical signal is the part worth paying attention to if you're working in this area. The soft spot is the detection step. The approach still relies on an LLM to surface the pitfalls from the same kind of sparse long context that the paper says LLMs already handle poorly. Without details on the prompting template for pitfall enumeration, any ablations on constraint quality, or evidence that this meta-task escapes the original failure mode, it's hard to know whether the benchmark improvements come from the proactive avoidance or simply from longer prompts and extra reasoning steps. The abstract gives no equations or fitted quantities, so the method stands or falls on whether that upfront enumeration is actually comprehensive and unbiased. This is aimed at people iterating on LLM prompting and agent planning for document-scale or agentic tasks. A reader who cares about practical reliability tweaks in long-context settings would get value from the comparisons, even if the mechanism needs more unpacking. I'd send it for peer review. The framing is fresh enough and the results are concrete enough that a referee could usefully pressure the detection details and run additional controls.

Referee Report

3 major / 2 minor

Summary. The paper proposes PPA-Plan, a proactive planning method for long-context LLM reasoning. It claims that LLMs fail at plan generation due to surface-level cues and incorrect assumptions in sparse long contexts; PPA-Plan addresses this by using an LLM to identify potential logical pitfalls and false assumptions upfront, formulating them as negative constraints, and conditioning subsequent plan generation on explicitly avoiding those constraints. Experiments on long-context QA benchmarks reportedly show that plans generated this way, when executed, outperform both standard plan-and-execute baselines and direct prompting.

Significance. If the core mechanism is shown to work as described, the approach could meaningfully improve reliability in plan-and-execute pipelines by shifting from reactive error correction to proactive constraint-based avoidance. This would be particularly relevant for tasks where information is sparsely distributed, and the absence of free parameters or fitted quantities in the described method is a potential strength if the prompting strategy proves robust.

major comments (3)

[Abstract] Abstract: the central claim that PPA-Plan 'identifies potential logical pitfalls and false assumptions' and 'formulates them as negative constraints' supplies no description of the detection process, prompting template, or constraint-generation procedure. This omission is load-bearing because the subsequent performance gains cannot be evaluated or attributed to proactive avoidance without knowing how the meta-task of pitfall enumeration is performed.
[Abstract] The skeptic concern is valid on the manuscript as presented: the method assumes the LLM can reliably enumerate pitfalls in the same long-context regime where the paper states LLMs already fail at sparse reasoning. No construction detail (e.g., multi-step prompting, verification step, or example-based guidance) is supplied to show that the pitfall-identification step overcomes the very unreliability it is meant to mitigate.
[Experiments (implied)] No ablation or control is described that isolates the contribution of the negative constraints from confounding factors such as increased prompt length or implicit chain-of-thought. Without such evidence, benchmark improvements cannot be confidently linked to the proactive-avoidance mechanism rather than other prompt-engineering effects.

minor comments (2)

[Abstract] The abstract and title use 'PPA-Plan' without expanding the acronym on first use; this should be corrected for clarity.
[Methods] The manuscript would benefit from a dedicated methods section that includes the exact prompting templates used for pitfall identification and plan generation, even if only in an appendix.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving clarity and empirical rigor. We address each major comment point by point below. Where the original manuscript was insufficiently detailed, we have revised the paper to incorporate the suggested clarifications and additional experiments.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that PPA-Plan 'identifies potential logical pitfalls and false assumptions' and 'formulates them as negative constraints' supplies no description of the detection process, prompting template, or constraint-generation procedure. This omission is load-bearing because the subsequent performance gains cannot be evaluated or attributed to proactive avoidance without knowing how the meta-task of pitfall enumeration is performed.

Authors: We agree that the abstract is too concise to convey the procedural details. In the revised manuscript we have expanded the abstract with a brief description of the process: 'PPA-Plan employs a dedicated multi-step prompting template that first extracts key entities and relations from the long context, then enumerates candidate logical pitfalls and false assumptions, and finally formulates them as explicit negative constraints.' The full prompting templates, step-by-step procedure, and illustrative examples are now provided in Section 3 and Appendix A. revision: yes
Referee: [Abstract] The skeptic concern is valid on the manuscript as presented: the method assumes the LLM can reliably enumerate pitfalls in the same long-context regime where the paper states LLMs already fail at sparse reasoning. No construction detail (e.g., multi-step prompting, verification step, or example-based guidance) is supplied to show that the pitfall-identification step overcomes the very unreliability it is meant to mitigate.

Authors: This concern is well-founded and was under-specified in the original submission. The revised method section now details a three-stage prompting strategy: (1) context grounding to list verifiable facts, (2) hypothesis generation of potential pitfalls, and (3) self-verification against the original context to filter unreliable assumptions. We have added an example walkthrough and a small-scale human evaluation of pitfall quality (new Table 2) showing that the verification stage reduces hallucinated pitfalls by 62% relative to single-step prompting. revision: yes
Referee: [Experiments (implied)] No ablation or control is described that isolates the contribution of the negative constraints from confounding factors such as increased prompt length or implicit chain-of-thought. Without such evidence, benchmark improvements cannot be confidently linked to the proactive-avoidance mechanism rather than other prompt-engineering effects.

Authors: We acknowledge that the original experiments did not isolate these factors. For the revision we have added a controlled ablation study (new Section 5.3 and Table 4) that compares (a) PPA-Plan, (b) a length-matched prompt that replaces negative constraints with neutral filler text, and (c) standard chain-of-thought without proactive avoidance. The results indicate that the negative-constraint component contributes an additional 4.7–7.2% absolute improvement on the long-context QA benchmarks beyond prompt-length and implicit-CoT effects alone, supporting the specific value of proactive pitfall avoidance. revision: yes

Circularity Check

0 steps flagged

No circularity: PPA-Plan is a prompting procedure without derivation or self-referential reduction

full rationale

The paper describes PPA-Plan as a procedural prompting technique: an LLM first surfaces potential logical pitfalls and false assumptions from the long input, formulates them as negative constraints, and then conditions subsequent plan generation on avoiding those constraints. No equations, fitted parameters, or predictive quantities appear in the provided text. The method is presented as an additive strategy rather than a derivation that reduces to its inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled via prior work, and no renaming of known results occurs. The central claim rests on empirical benchmark comparisons, which are independent of any internal reduction and therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The approach rests on the unstated premise that an LLM can be prompted to surface its own likely failure modes in advance; no free parameters, formal axioms, or new physical entities are introduced in the abstract.

invented entities (1)

negative constraints no independent evidence
purpose: Explicit rules that force the planner to avoid identified pitfalls
Introduced as the conditioning mechanism; no independent evidence or falsifiable prediction supplied in abstract.

pith-pipeline@v0.9.0 · 5447 in / 1233 out tokens · 36976 ms · 2026-05-16T13:22:04.799983+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The pitfall predictor M_pred analyzes a query q to identify potential logical pitfalls... C_neg = {c1,...ck} = M_pred(q)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1]

InPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 1: Long Papers), pages 3627–3637, Dublin, Ireland

ConditionalQA: A complex reading compre- hension dataset with conditional answers. InPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 1: Long Papers), pages 3627–3637, Dublin, Ireland. Associa- tion for Computational Linguistics. Simeng Sun, Yang Liu, Shuohang Wang, Dan Iter, Chen- guang Zhu, and Mohit I...

work page 2024
[2]

Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities.preprint arXiv:2503.17979,

Planbench: An extensible benchmark for eval- uating large language models on planning and reason- ing about change.Advances in Neural Information Processing Systems, 36:38975–38987. Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan- and-solve prompting: Improving zero-shot chain-of- thought reasoning by l...

work page arXiv 2023
[3]

InProceedings of the 37th Interna- tional Conference on Neural Information Processing Systems, pages 46595–46623

Judging llm-as-a-judge with mt-bench and chatbot arena. InProceedings of the 37th Interna- tional Conference on Neural Information Processing Systems, pages 46595–46623. A Appendix A.1 Experimental Setup Details All local experiments were conducted on a single NVIDIA A6000 48GB GPU. For all models, in- cluding GPT-4o (LLM-as-a-Judge), we employed greedy d...

work page
[4]

Adopting the setup from PEARL (Sun et al., 2024), we utilized the human annotation scores to distinguish task difficulty

Since all method is training-free and requires no parameter updates, we utilized the entire set of available samples for evaluation, comprising the original training, validation, and test splits, to en- sure statistical robustness. Adopting the setup from PEARL (Sun et al., 2024), we utilized the human annotation scores to distinguish task difficulty. An ...

work page 2024
[5]

split into two

No Hallucination: Do not assume specific text structures (e.g., "split into two")

work page
[6]

Do NOT provide actionable plans here

No Solutions: Identify the trap only. Do NOT provide actionable plans here

work page
[7]

assumption_pitfalls

No Repetition: Identify assumptions *unique* to this question, not just copying examples. Return the result as a concise JSON list. Format: {"assumption_pitfalls": [ "<Pitfall 1: A brief explanation of the pitfall>", "<Pitfall 2: (Optional)>", "<Pitfall 3: (Optional)>" ]} --- ### Example 1 (Multiple Inferences Required) [Question] "Why did the author writ...

work page
[9]

Why is Si retirement so significant to the Space Exploration Team?

output_2 = action_2(here goes arguments) : [one-sentence explanation] ... ``` The following are a few examples: --- Question: "Why is Si retirement so significant to the Space Exploration Team?" Input Pitfalls: - Assuming the significance is stated in a single sentence explicitly linking retirement to the team. - Ignoring the separate chain of events: the...

work page
[10]

cause",

retire_reason = FIND_ELEMENT(CTX, "cause", "Si retirement") : Find and summarize the cause or reason of Si retirement from the input article

work page
[11]

Si retirement

retire_outcome = FIND_IMPACTS(CTX, "Si retirement") : Find and summarize the impact or outcome or consequences of Si retirement from the input article

work page
[12]

Space Exploration Team

connect_reason = FIND_RELATION(CTX, retire_reason, "Space Exploration Team") : Find and summarize how the reason of Si retirement is related to the Space Exploration Team

work page
[13]

Space Exploration Team

connect_outcome = FIND_RELATION(CTX, retire_outcome, "Space Exploration Team") : Find and summarize how the outcome of Si retirement is related to the Space Exploration Team

work page
[14]

What is the “space cafard

ans = CONCAT(connect_reason, connect_outcome) : Combine the previous two steps to form the final answer --- Question: "What is the “space cafard” that Si describes?" Input Pitfalls: - Assuming any general definition of ’space cafard’ is correct. - Failing to restrict the search to only Si’s specific description provided in the text. [Strategy Reasoning] T...

work page
[15]

Si’s description

space_cafard = FIND_ELEMENT(CTX, "Si’s description", "space cafard") : Find and summarize all relevant information about the "space cafard" strictly as described by Si

work page
[16]

space cafard

space_cafard_cmprh = COMPREHEND(CTX, space_cafard) : Provide a comprehension about the "space cafard" based on the findings

work page
[17]

How many times has Critten been a Nilly?

ans = CONCAT(space_cafard, space_cafard_cmprh) : Combine to form the final answer --- Question: "How many times has Critten been a Nilly?" Input Pitfalls: - Assuming the total count (e.g., ’3 times’) is explicitly stated in the text. - Assuming the plan can just ’search’ for a number. [Strategy Reasoning] The pitfall indicates that a simple search for a n...

work page
[18]

Critten been a Nilly

all_nilly = FIND_ALL_ISSUES(CTX, "Critten been a Nilly") : Find and summarize all individual events/mentions where Critten has been a Nilly

work page
[19]

Out of the choices below, predict which future career Eddie would most likely pick given his interests present in the article

num_nilly = COUNT_X(CTX, all_nilly) : Count the number of times that Critten has been a Nilly given the collected events above --- Question: "Out of the choices below, predict which future career Eddie would most likely pick given his interests present in the article." Input Pitfalls: - Assuming only explicitly stated ’interests’ matter for the prediction...

work page
[20]

eddie = IDENTIFY_ELEMENT(CTX, "Eddie") : Identify who Eddie is in the input article

work page
[21]

interests

eddie_interests = FIND_ELEMENT(CTX, "interests", eddie) : Find and summarize all the interests of Eddie

work page
[22]

skills and aptitudes

eddie_skills = FIND_ELEMENT(CTX, "skills and aptitudes", eddie) : Find demonstrated skills or aptitudes, as required to avoid the pitfall of missing implied traits

work page
[23]

dislikes and avoids

eddie_dislikes = FIND_ELEMENT(CTX, "dislikes and avoids", eddie) : Find tasks Eddie dislikes, as required to filter out unlikely careers

work page
[24]

eddie_goals = FIND_INTENT(CTX, eddie) : Find and summarize the intent/purpose/goal of Eddie

work page
[25]

eddie_profile = CONCAT(eddie_interests, eddie_skills, eddie_dislikes, eddie_goals) : Combine interests, skills, dislikes, and goals to build a complete profile

work page
[26]

Eddie", eddie_profile) : Predict the future career based on the comprehensive profile --- Question:

ans = PREDICT_CAREER(CTX, "Eddie", eddie_profile) : Predict the future career based on the comprehensive profile --- Question: "Which word doesn’t describe the security guard?" Input Pitfalls: - Assuming the plan should search for words that *do not* describe the guard directly. - Failing to understand this is a ’NOT’ (exclusion) question requiring a list...

work page
[27]

security guard

security_guard = FIND_CHARACTER(CTX, "security guard") : Find and summarize the character traits of the security guard

work page
[28]

descriptive words

guard_descriptions = FIND(CTX, "descriptive words", "security guard") : Find the words that ARE used to describe the security guard in the text

work page
[29]

Of the following options, which seems to be Tremaine’s biggest asset in his investigation?

ans = CONCAT(security_guard, guard_descriptions) : Combine the traits and descriptions to form a basis for exclusion --- Question: "Of the following options, which seems to be Tremaine’s biggest asset in his investigation?" Input Pitfalls: - Assuming ’asset’ refers only to physical tools. - Assuming the ’biggest’ asset is explicitly labeled as such. [Stra...

work page
[30]

Tremaine

tremaine = IDENTIFY_ELEMENT(CTX, "Tremaine") : Identify who Tremaine is in the input article

work page
[31]

assets (physical and abstract)

tremaine_assets = FIND_ELEMENT(CTX, "assets (physical and abstract)", tremaine) : Find all assets, explicitly including abstract ones like intuition or connections

work page
[32]

-None" if there no need to add new actions - new_action_2(arguments) : [one-sentence general explanation] or

ranked_assets = SORT(CTX, tremaine_assets) : Sort the assets in ascending order of importance/impact based on the text [Question] Now you are given a question about an article: {question} You MUST avoid these core pitfalls identified for this question: {assumption_pitfall} Please provide a plan (sequence of actions) that can arrive to the answer after rea...

work page
[33]

output_1 = action_1(here goes arguments) : [one-sentence explanation]

work page
[34]

What is the primary diet of the spectacled bear?

output_2 = action_2(here goes arguments) : [one-sentence explanation] ... ``` The following are examples of how to correct an invalid plan based on error messages: --- ### Example 1 (Error: Unknown Action) Question: "What is the primary diet of the spectacled bear?" Invalid Plan:

work page
[36]

Error parsing action COMPREHEND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list

ans = COMPREHEND(CTX, bear_info) : Understand the info Error Message: "Error parsing action COMPREHEND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list." Input Pitfalls: - Assuming the diet consists of only one type of food. [Strategy Reasoning] The parser reports that ‘COMPREHEND‘ is an unk...

work page
[37]

diet", "spectacled bear

bear_info = FIND_ELEMENT(CTX, "diet", "spectacled bear") : Find diet info

work page
[38]

How did the protagonist escape the room?

ans = SUMMARIZE(CTX, bear_info) : Summarize the findings to form the answer --- ### Example 2 (Error: Undefined Variable) Question: "How did the protagonist escape the room?" Invalid Plan:

work page
[40]

Error parsing action GENERATE_ANSWER. Argument room_info is not defined

ans = GENERATE_ANSWER(CTX, room_info) : Generate the final answer Error Messages: "Error parsing action GENERATE_ANSWER. Argument room_info is not defined." Input Pitfalls: - Assuming the escape happened in a single step. [Strategy Reasoning] The error states that ‘room_info‘ is undefined. Looking at the previous step (step 1), the output variable was nam...

work page
[41]

escape method

room_desc = FIND_ELEMENT(CTX, "escape method", "protagonist") : Find escape details

work page
[42]

List all the awards won by the author

ans = GENERATE_ANSWER(CTX, room_desc) : Generate the final answer --- ### Example 3 (Error: Incorrect Argument Count) Question: "List all the awards won by the author." Invalid Plan:

work page
[43]

awards",

awards = FIND_ALL_ISSUES("awards", "author") : Find all awards

work page
[44]

Error parsing action FIND_ALL_ISSUES. Number of arguments is incorrect

ans = LIST_ITEMS(CTX, awards) : List them Error Message: "Error parsing action FIND_ALL_ISSUES. Number of arguments is incorrect" Input Pitfalls: - Assuming the awards are listed in a distinct ’awards’ section. [Strategy Reasoning] The action ‘FIND_ALL_ISSUES‘ caused an argument count error. Standard actions usually require ‘CTX‘ as the first argument. I ...

work page
[45]

awards",

awards = FIND_ALL_ISSUES(CTX, "awards", "author") : Find all awards

work page
[46]

Based on the historical data provided, predict the stock price for next month

ans = LIST_ITEMS(CTX, awards) : List them --- ### Example 4 (Error: Missing Action Definition) Question: "Based on the historical data provided, predict the stock price for next month." Invalid Plan:

work page
[48]

prediction = PREDICT_TREND(CTX, history) : Predict future price

work page
[49]

Error parsing action PREDICT_TREND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list

ans = GENERATE_ANSWER(CTX, prediction) : Formulate answer Error Message: "Error parsing action PREDICT_TREND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list." Input Pitfalls: "Assuming a linear trend without considering volatility mentioned in the text." [Strategy Reasoning] The parser indi...

work page
[50]

stock price history

history = FIND_DATA(CTX, "stock price history", "last 5 years") : Retrieve data

work page
[51]

prediction = PREDICT_TREND(CTX, history) : Predict future price based on the retrieved history

work page
[52]

ans = GENERATE_ANSWER(CTX, prediction) : Formulate the final answer [Question] Given the following question, Question: {question} you just came up with the following sequence of actions as well as potential new actions: {invalid_plan} However, the above answer is invalid according to a parser, which returned an error message: {error_message} You MUST avoi...

work page

[1] [1]

InPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 1: Long Papers), pages 3627–3637, Dublin, Ireland

ConditionalQA: A complex reading compre- hension dataset with conditional answers. InPro- ceedings of the 60th Annual Meeting of the Associa- tion for Computational Linguistics (V olume 1: Long Papers), pages 3627–3637, Dublin, Ireland. Associa- tion for Computational Linguistics. Simeng Sun, Yang Liu, Shuohang Wang, Dan Iter, Chen- guang Zhu, and Mohit I...

work page 2024

[2] [2]

Trade-offs in large reasoning models: An empirical analysis of deliberative and adaptive reasoning over foundational capabilities.preprint arXiv:2503.17979,

Planbench: An extensible benchmark for eval- uating large language models on planning and reason- ing about change.Advances in Neural Information Processing Systems, 36:38975–38987. Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan- and-solve prompting: Improving zero-shot chain-of- thought reasoning by l...

work page arXiv 2023

[3] [3]

InProceedings of the 37th Interna- tional Conference on Neural Information Processing Systems, pages 46595–46623

Judging llm-as-a-judge with mt-bench and chatbot arena. InProceedings of the 37th Interna- tional Conference on Neural Information Processing Systems, pages 46595–46623. A Appendix A.1 Experimental Setup Details All local experiments were conducted on a single NVIDIA A6000 48GB GPU. For all models, in- cluding GPT-4o (LLM-as-a-Judge), we employed greedy d...

work page

[4] [4]

Adopting the setup from PEARL (Sun et al., 2024), we utilized the human annotation scores to distinguish task difficulty

Since all method is training-free and requires no parameter updates, we utilized the entire set of available samples for evaluation, comprising the original training, validation, and test splits, to en- sure statistical robustness. Adopting the setup from PEARL (Sun et al., 2024), we utilized the human annotation scores to distinguish task difficulty. An ...

work page 2024

[5] [5]

split into two

No Hallucination: Do not assume specific text structures (e.g., "split into two")

work page

[6] [6]

Do NOT provide actionable plans here

No Solutions: Identify the trap only. Do NOT provide actionable plans here

work page

[7] [7]

assumption_pitfalls

No Repetition: Identify assumptions *unique* to this question, not just copying examples. Return the result as a concise JSON list. Format: {"assumption_pitfalls": [ "<Pitfall 1: A brief explanation of the pitfall>", "<Pitfall 2: (Optional)>", "<Pitfall 3: (Optional)>" ]} --- ### Example 1 (Multiple Inferences Required) [Question] "Why did the author writ...

work page

[8] [9]

Why is Si retirement so significant to the Space Exploration Team?

output_2 = action_2(here goes arguments) : [one-sentence explanation] ... ``` The following are a few examples: --- Question: "Why is Si retirement so significant to the Space Exploration Team?" Input Pitfalls: - Assuming the significance is stated in a single sentence explicitly linking retirement to the team. - Ignoring the separate chain of events: the...

work page

[9] [10]

cause",

retire_reason = FIND_ELEMENT(CTX, "cause", "Si retirement") : Find and summarize the cause or reason of Si retirement from the input article

work page

[10] [11]

Si retirement

retire_outcome = FIND_IMPACTS(CTX, "Si retirement") : Find and summarize the impact or outcome or consequences of Si retirement from the input article

work page

[11] [12]

Space Exploration Team

connect_reason = FIND_RELATION(CTX, retire_reason, "Space Exploration Team") : Find and summarize how the reason of Si retirement is related to the Space Exploration Team

work page

[12] [13]

Space Exploration Team

connect_outcome = FIND_RELATION(CTX, retire_outcome, "Space Exploration Team") : Find and summarize how the outcome of Si retirement is related to the Space Exploration Team

work page

[13] [14]

What is the “space cafard

ans = CONCAT(connect_reason, connect_outcome) : Combine the previous two steps to form the final answer --- Question: "What is the “space cafard” that Si describes?" Input Pitfalls: - Assuming any general definition of ’space cafard’ is correct. - Failing to restrict the search to only Si’s specific description provided in the text. [Strategy Reasoning] T...

work page

[14] [15]

Si’s description

space_cafard = FIND_ELEMENT(CTX, "Si’s description", "space cafard") : Find and summarize all relevant information about the "space cafard" strictly as described by Si

work page

[15] [16]

space cafard

space_cafard_cmprh = COMPREHEND(CTX, space_cafard) : Provide a comprehension about the "space cafard" based on the findings

work page

[16] [17]

How many times has Critten been a Nilly?

ans = CONCAT(space_cafard, space_cafard_cmprh) : Combine to form the final answer --- Question: "How many times has Critten been a Nilly?" Input Pitfalls: - Assuming the total count (e.g., ’3 times’) is explicitly stated in the text. - Assuming the plan can just ’search’ for a number. [Strategy Reasoning] The pitfall indicates that a simple search for a n...

work page

[17] [18]

Critten been a Nilly

all_nilly = FIND_ALL_ISSUES(CTX, "Critten been a Nilly") : Find and summarize all individual events/mentions where Critten has been a Nilly

work page

[18] [19]

Out of the choices below, predict which future career Eddie would most likely pick given his interests present in the article

num_nilly = COUNT_X(CTX, all_nilly) : Count the number of times that Critten has been a Nilly given the collected events above --- Question: "Out of the choices below, predict which future career Eddie would most likely pick given his interests present in the article." Input Pitfalls: - Assuming only explicitly stated ’interests’ matter for the prediction...

work page

[19] [20]

eddie = IDENTIFY_ELEMENT(CTX, "Eddie") : Identify who Eddie is in the input article

work page

[20] [21]

interests

eddie_interests = FIND_ELEMENT(CTX, "interests", eddie) : Find and summarize all the interests of Eddie

work page

[21] [22]

skills and aptitudes

eddie_skills = FIND_ELEMENT(CTX, "skills and aptitudes", eddie) : Find demonstrated skills or aptitudes, as required to avoid the pitfall of missing implied traits

work page

[22] [23]

dislikes and avoids

eddie_dislikes = FIND_ELEMENT(CTX, "dislikes and avoids", eddie) : Find tasks Eddie dislikes, as required to filter out unlikely careers

work page

[23] [24]

eddie_goals = FIND_INTENT(CTX, eddie) : Find and summarize the intent/purpose/goal of Eddie

work page

[24] [25]

eddie_profile = CONCAT(eddie_interests, eddie_skills, eddie_dislikes, eddie_goals) : Combine interests, skills, dislikes, and goals to build a complete profile

work page

[25] [26]

Eddie", eddie_profile) : Predict the future career based on the comprehensive profile --- Question:

ans = PREDICT_CAREER(CTX, "Eddie", eddie_profile) : Predict the future career based on the comprehensive profile --- Question: "Which word doesn’t describe the security guard?" Input Pitfalls: - Assuming the plan should search for words that *do not* describe the guard directly. - Failing to understand this is a ’NOT’ (exclusion) question requiring a list...

work page

[26] [27]

security guard

security_guard = FIND_CHARACTER(CTX, "security guard") : Find and summarize the character traits of the security guard

work page

[27] [28]

descriptive words

guard_descriptions = FIND(CTX, "descriptive words", "security guard") : Find the words that ARE used to describe the security guard in the text

work page

[28] [29]

Of the following options, which seems to be Tremaine’s biggest asset in his investigation?

ans = CONCAT(security_guard, guard_descriptions) : Combine the traits and descriptions to form a basis for exclusion --- Question: "Of the following options, which seems to be Tremaine’s biggest asset in his investigation?" Input Pitfalls: - Assuming ’asset’ refers only to physical tools. - Assuming the ’biggest’ asset is explicitly labeled as such. [Stra...

work page

[29] [30]

Tremaine

tremaine = IDENTIFY_ELEMENT(CTX, "Tremaine") : Identify who Tremaine is in the input article

work page

[30] [31]

assets (physical and abstract)

tremaine_assets = FIND_ELEMENT(CTX, "assets (physical and abstract)", tremaine) : Find all assets, explicitly including abstract ones like intuition or connections

work page

[31] [32]

-None" if there no need to add new actions - new_action_2(arguments) : [one-sentence general explanation] or

ranked_assets = SORT(CTX, tremaine_assets) : Sort the assets in ascending order of importance/impact based on the text [Question] Now you are given a question about an article: {question} You MUST avoid these core pitfalls identified for this question: {assumption_pitfall} Please provide a plan (sequence of actions) that can arrive to the answer after rea...

work page

[32] [33]

output_1 = action_1(here goes arguments) : [one-sentence explanation]

work page

[33] [34]

What is the primary diet of the spectacled bear?

output_2 = action_2(here goes arguments) : [one-sentence explanation] ... ``` The following are examples of how to correct an invalid plan based on error messages: --- ### Example 1 (Error: Unknown Action) Question: "What is the primary diet of the spectacled bear?" Invalid Plan:

work page

[34] [36]

Error parsing action COMPREHEND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list

ans = COMPREHEND(CTX, bear_info) : Understand the info Error Message: "Error parsing action COMPREHEND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list." Input Pitfalls: - Assuming the diet consists of only one type of food. [Strategy Reasoning] The parser reports that ‘COMPREHEND‘ is an unk...

work page

[35] [37]

diet", "spectacled bear

bear_info = FIND_ELEMENT(CTX, "diet", "spectacled bear") : Find diet info

work page

[36] [38]

How did the protagonist escape the room?

ans = SUMMARIZE(CTX, bear_info) : Summarize the findings to form the answer --- ### Example 2 (Error: Undefined Variable) Question: "How did the protagonist escape the room?" Invalid Plan:

work page

[37] [40]

Error parsing action GENERATE_ANSWER. Argument room_info is not defined

ans = GENERATE_ANSWER(CTX, room_info) : Generate the final answer Error Messages: "Error parsing action GENERATE_ANSWER. Argument room_info is not defined." Input Pitfalls: - Assuming the escape happened in a single step. [Strategy Reasoning] The error states that ‘room_info‘ is undefined. Looking at the previous step (step 1), the output variable was nam...

work page

[38] [41]

escape method

room_desc = FIND_ELEMENT(CTX, "escape method", "protagonist") : Find escape details

work page

[39] [42]

List all the awards won by the author

ans = GENERATE_ANSWER(CTX, room_desc) : Generate the final answer --- ### Example 3 (Error: Incorrect Argument Count) Question: "List all the awards won by the author." Invalid Plan:

work page

[40] [43]

awards",

awards = FIND_ALL_ISSUES("awards", "author") : Find all awards

work page

[41] [44]

Error parsing action FIND_ALL_ISSUES. Number of arguments is incorrect

ans = LIST_ITEMS(CTX, awards) : List them Error Message: "Error parsing action FIND_ALL_ISSUES. Number of arguments is incorrect" Input Pitfalls: - Assuming the awards are listed in a distinct ’awards’ section. [Strategy Reasoning] The action ‘FIND_ALL_ISSUES‘ caused an argument count error. Standard actions usually require ‘CTX‘ as the first argument. I ...

work page

[42] [45]

awards",

awards = FIND_ALL_ISSUES(CTX, "awards", "author") : Find all awards

work page

[43] [46]

Based on the historical data provided, predict the stock price for next month

ans = LIST_ITEMS(CTX, awards) : List them --- ### Example 4 (Error: Missing Action Definition) Question: "Based on the historical data provided, predict the stock price for next month." Invalid Plan:

work page

[44] [48]

prediction = PREDICT_TREND(CTX, history) : Predict future price

work page

[45] [49]

Error parsing action PREDICT_TREND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list

ans = GENERATE_ANSWER(CTX, prediction) : Formulate answer Error Message: "Error parsing action PREDICT_TREND. Unknown action. Please define it in the ’New actions’ section if needed, or choose from the existing action list." Input Pitfalls: "Assuming a linear trend without considering volatility mentioned in the text." [Strategy Reasoning] The parser indi...

work page

[46] [50]

stock price history

history = FIND_DATA(CTX, "stock price history", "last 5 years") : Retrieve data

work page

[47] [51]

prediction = PREDICT_TREND(CTX, history) : Predict future price based on the retrieved history

work page

[48] [52]

ans = GENERATE_ANSWER(CTX, prediction) : Formulate the final answer [Question] Given the following question, Question: {question} you just came up with the following sequence of actions as well as potential new actions: {invalid_plan} However, the above answer is invalid according to a parser, which returned an error message: {error_message} You MUST avoi...

work page