pith · machine review for the scientific record

arxiv: 2605.08477 · v1 · submitted 2026-05-08 · 💻 cs.CL

Recognition: no theorem link

Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:49 UTC · model grok-4.3

classification: 💻 cs.CL
keywords: LLM agents · tool calling · planning horizon · knowledge base QA · multi-hop QA · full-horizon planning · single-step planning · lazy replanning

The pith

Full-horizon planning with lazy replanning achieves accuracy parity with single-step planning in data-centric tool calling while using 2-3x fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the default assumption that LLM agents must interleave planning and execution step by step to handle complex data-centric tasks. It isolates planning horizon as the variable in a controlled comparison of full-horizon plans generated upfront versus single-step incremental reasoning on knowledge base question answering and multi-hop QA. Results show that full-horizon planning with on-demand replanning matches single-step accuracy across different task depths, breadths, and tool reliability levels. The same performance holds while consuming substantially fewer tokens, which suggests that constant eager monitoring is often unnecessary when tasks are well-defined around external data sources. The study therefore positions full-horizon planning with lazy replanning as a viable and more efficient default for such problems.

Core claim

Across Knowledge Base Question Answering and Multi-hop QA, full-horizon planning with lazy replanning reaches accuracy parity with single-step horizon planning across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.

What carries the argument

Planning horizon choice between full-horizon (complete plan generated before any tool calls) and single-step horizon (incremental reasoning and execution), together with lazy replanning that triggers revisions only when needed rather than after every step.
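
A minimal sketch of the two control loops may help fix the terms. It assumes generic `plan` and `execute` callables standing in for the LLM planner and the tool runner; the prompts, failure trigger, and replan cap shown here are illustrative, not the paper's implementation.

```python
from typing import Callable, Dict, List

Planner = Callable[[str], List[Dict]]   # task description -> ordered list of tool-call steps
Executor = Callable[[Dict], Dict]       # tool-call step -> observation (raises on tool failure)


def run_full_horizon(task: str, plan: Planner, execute: Executor, max_replans: int = 1) -> List[Dict]:
    """Full-horizon: one upfront plan, executed to completion, with lazy replanning on failure only."""
    steps = plan(task)                   # single planning call before any tool is invoked
    observations: List[Dict] = []
    replans = 0
    i = 0
    while i < len(steps):
        try:
            observations.append(execute(steps[i]))
            i += 1
        except Exception:
            if replans >= max_replans:   # illustrative cap; the paper's exact policy is not reproduced here
                raise
            replans += 1
            # Lazy replanning: only an execution failure triggers a fresh plan for the remaining steps.
            steps = steps[:i] + plan(f"{task}\nstep {i} failed: {steps[i]}\nobservations: {observations}")
    return observations


def run_single_step(task: str, next_step: Planner, execute: Executor, max_steps: int = 20) -> List[Dict]:
    """Single-step horizon: interleave one planning call with every tool call (ReAct-style)."""
    observations: List[Dict] = []
    for _ in range(max_steps):
        step = next_step(f"{task}\nobservations so far: {observations}")[0]
        if step.get("tool") == "finish": # the planner signals completion with a sentinel step
            break
        observations.append(execute(step))
    return observations
```

In this sketch the token asymmetry is structural: the single-step loop pays a planning call, and re-reads the growing observation history, at every step, while the full-horizon loop pays once plus the occasional replan.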

If this is right

  • Eager step-wise monitoring is often unnecessary for maintaining adaptability in well-defined data-centric tasks.
  • Full-horizon planning with lazy replanning can serve as a more efficient default strategy without accuracy loss.
  • Performance parity between the two horizons holds across different topological complexities and tool robustness levels.
  • Token consumption drops by a factor of 2-3 while accuracy remains comparable.
  • Simpler agent architectures that avoid constant interleaving become viable for these task classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The efficiency advantage may extend to other structured data tasks where plans can be checked against known schemas before execution.
  • Agent frameworks could default to full plans and invoke replanning only on explicit failure signals to cut inference cost.
  • Hybrid systems that switch to single-step mode only when uncertainty exceeds a threshold could combine the benefits of both approaches (see the sketch after this list).
  • Production deployments of tool-calling agents might reduce token budgets by adopting full-horizon defaults when tasks are data-centric and well-specified.
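
As a concrete illustration of the hybrid idea above, a controller might default to the cheaper full-horizon mode and fall back to single-step execution only when some uncertainty signal crosses a threshold. Everything here is hypothetical (the confidence estimator, the threshold, and the function names); the paper does not describe such a system.

```python
from typing import Callable, Dict, List


def run_hybrid(task: str,
               estimate_confidence: Callable[[str], float],  # hypothetical self-rated plan confidence in [0, 1]
               full_horizon: Callable[[str], List[Dict]],    # cheaper default: plan once, replan lazily
               single_step: Callable[[str], List[Dict]],      # eager fallback: interleave planning and execution
               threshold: float = 0.7) -> List[Dict]:
    """Dispatch to full-horizon planning by default; use single-step mode only under high uncertainty."""
    if estimate_confidence(task) >= threshold:
        return full_horizon(task)
    return single_step(task)
```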

Load-bearing premise

The studied tasks are sufficiently well-defined data-centric problems for which eager step-wise monitoring is often unnecessary, and the controlled experiments isolate planning horizon without confounding effects from prompt design or model behavior.

What would settle it

A controlled experiment on tasks with high ambiguity or frequent unexpected tool failures showing statistically significant accuracy loss for full-horizon planning relative to single-step planning would falsify the parity claim.

Figures

Figures reproduced from arXiv: 2605.08477 by Dan Zhang, Estevam Hruschka, Hannah Kim, Naoki Otani, Nikita Bhutani.

Figure 1. Planning horizon in data-centric tool calling. …
Figure 2. An example plan DAG for the query: "Which com…"
Figure 3. Distribution of task instances across datasets. …
Original abstract

Explicit planning is a critical capability for LLM-based agents solving complex data-centric tasks, which require precise tool calling over external data sources. Existing strategies fall into two paradigms based on planning horizon: (1) full-horizon (FH), which generates a complete plan before execution, and (2) single-step horizon (SH), which interleaves each action (tool call) with incremental reasoning and observation. While step-by-step execution is a common default under the assumption that eager execution monitoring is necessary for adaptability, we revisit this assumption for well-defined data-centric tasks. Our controlled empirical study isolates planning horizon as the key architectural feature and systematically analyzes the effects of topological complexity and tool robustness on both paradigms. Our experiments across Knowledge Base Question Answering and Multi-hop QA show that FH planning with lazy replanning achieves accuracy parity with SH across varying depths, breadths, and robustness levels, while using 2-3x fewer tokens. These findings suggest that for well-defined data-centric tasks, eager step-wise monitoring is often unnecessary, and full-horizon planning with on-demand replanning can offer a more efficient default.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that for well-defined data-centric tasks, full-horizon (FH) planning with lazy replanning achieves accuracy parity with single-step horizon (SH) planning in LLM-based agents on Knowledge Base Question Answering and Multi-hop QA, while using 2-3x fewer tokens. It systematically varies topological complexity (depth, breadth) and tool robustness, concluding that eager step-wise monitoring is often unnecessary and that FH with on-demand replanning can be a more efficient default.

Significance. If the controlled comparison holds, the result would meaningfully challenge the default assumption in LLM agent design that step-by-step interleaving is required for adaptability in tool-calling settings. The empirical focus on data-centric tasks, combined with analysis across complexity and robustness axes, provides a concrete, falsifiable basis for preferring FH paradigms in this domain and could influence practical agent implementations toward greater token efficiency.

major comments (2)
  1. [§4 (Experiments) and abstract] The central claim that the study isolates planning horizon as the sole variable (abstract and §4) is load-bearing for the parity and efficiency conclusions, yet the manuscript provides insufficient detail on prompt standardization between FH and SH conditions, the precise trigger and frequency of lazy replanning, and whether SH interleaving introduces additional reasoning steps absent from FH. LLM performance is known to be highly sensitive to these implementation choices; without explicit controls or ablations demonstrating that these factors were equalized, the observed 2-3x token savings and accuracy parity could be confounded by prompt or replanning differences rather than horizon length itself.
  2. [Table 2, Figure 3] Table 2 and Figure 3 report accuracy parity across depths/breadths/robustness levels, but the manuscript does not include statistical significance tests, error bars, or per-run variance for the FH vs. SH comparisons. Given that the parity claim is the primary empirical support for rethinking the default planning horizon, the absence of these measures leaves the strength of the equivalence claim difficult to assess.
minor comments (2)
  1. [§3] The term 'lazy replanning' is introduced without a formal definition or pseudocode in §3; a concise algorithmic description would improve reproducibility.
  2. [Figure 4] Some figure captions (e.g., Figure 4) use abbreviations (KBQA, MHQA) without first spelling them out in the caption itself, even though they appear in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We have revised the manuscript to provide additional implementation details and statistical reporting that strengthen the controlled comparison and the parity claims.

point-by-point responses
  1. Referee: [§4 (Experiments) and abstract] The central claim that the study isolates planning horizon as the sole variable (abstract and §4) is load-bearing for the parity and efficiency conclusions, yet the manuscript provides insufficient detail on prompt standardization between FH and SH conditions, the precise trigger and frequency of lazy replanning, and whether SH interleaving introduces additional reasoning steps absent from FH. LLM performance is known to be highly sensitive to these implementation choices; without explicit controls or ablations demonstrating that these factors were equalized, the observed 2-3x token savings and accuracy parity could be confounded by prompt or replanning differences rather than horizon length itself.

    Authors: We appreciate the referee's emphasis on experimental controls. In the original implementation, both FH and SH conditions used identical base LLM calls, tool schemas, task descriptions, and system prompt prefixes; the sole difference was the planning instruction (generate complete plan vs. generate next single step). Lazy replanning in FH is triggered on-demand only upon execution failure (tool error or output schema mismatch) and is limited to at most one replan per query to preserve efficiency. SH interleaving incorporates the observation after each tool call by design, but all reasoning and observation tokens are included in the reported totals for both conditions. To eliminate any ambiguity, we have added a dedicated subsection (now §4.2) with verbatim prompt templates for both paradigms, a precise pseudocode description of the lazy replanning trigger, and an ablation confirming that removing the single replan option does not alter the parity result. These additions make the isolation of horizon length explicit. revision: yes

  2. Referee: [Table 2, Figure 3] Table 2 and Figure 3 report accuracy parity across depths/breadths/robustness levels, but the manuscript does not include statistical significance tests, error bars, or per-run variance for the FH vs. SH comparisons. Given that the parity claim is the primary empirical support for rethinking the default planning horizon, the absence of these measures leaves the strength of the equivalence claim difficult to assess.

    Authors: We agree that formal statistical support strengthens the parity conclusion. In the revised manuscript we have added standard error bars (computed over 5 independent runs with different random seeds) to all bars in Figure 3 and included per-condition means, standard deviations, and paired t-test p-values directly in Table 2. The updated results show that accuracy differences between FH and SH remain statistically non-significant (p > 0.05) across all depth, breadth, and robustness settings, while the 2-3× token reduction is significant. Per-run variance is now also tabulated in the appendix for full transparency. revision: yes
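
For readers who want to reproduce this kind of check, a paired test over per-run accuracies takes a few lines; the `fh_acc` and `sh_acc` arrays are stand-ins to be filled with the actual per-run results, which are not included in the text available here.

```python
import numpy as np
from scipy import stats


def parity_check(fh_acc, sh_acc, alpha=0.05):
    """Paired t-test over per-run accuracies (one entry per seed) for the FH vs. SH comparison."""
    fh_acc, sh_acc = np.asarray(fh_acc, dtype=float), np.asarray(sh_acc, dtype=float)
    t_stat, p_value = stats.ttest_rel(fh_acc, sh_acc)  # paired: same seeds and task splits in both conditions
    return {
        "mean_diff": float(fh_acc.mean() - sh_acc.mean()),
        "t": float(t_stat),
        "p": float(p_value),
        "reject_parity_at_alpha": bool(p_value < alpha),
    }
```

Note that a non-significant paired t-test does not by itself establish equivalence; a pre-specified equivalence test (e.g., two one-sided tests with an explicit accuracy margin) would be the stronger way to support the parity claim.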

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of planning horizons

full rationale

The paper presents a controlled empirical study comparing full-horizon (FH) planning with lazy replanning against single-step horizon (SH) interleaving on KBQA and multi-hop QA tasks. It reports accuracy parity and 2-3x token savings across depths, breadths, and robustness levels. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text; the central claim rests on experimental isolation of planning horizon rather than any derivation that reduces to its own inputs by construction. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparison of two planning strategies rather than new mathematical axioms or invented entities. The key background assumption is that the tasks are well-defined enough that lazy replanning suffices.

axioms (1)
  • domain assumption: For well-defined data-centric tasks, eager step-wise monitoring is often unnecessary for adaptability.
    Invoked when the authors revisit the assumption that step-by-step execution is required and conclude that FH with lazy replanning is sufficient.

pith-pipeline@v0.9.0 · 5508 in / 1211 out tokens · 41722 ms · 2026-05-12T02:49:54.981978+00:00 · methodology


    Mismatch -> incorrect: - None of the expected answer values appear in the System Output, or any provided value contradicts the Correct Answer. **Definitions and matching guidance:** - Answer value: an atomic item such as an entity (ID or name), number, date/time, or yes/no. - Entity matching: give credit only when the expected entity (name/ID or a clear a...