Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
Pith reviewed 2026-05-14 19:59 UTC · model grok-4.3
The pith
PyRAG reformulates multi-hop RAG as synthesis and execution of Python programs over retrieval tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-hop RAG is reformulated as program synthesis and execution: the model produces an executable Python program that chains retrieval and QA tool calls, exposing every intermediate state as a named variable. Execution supplies deterministic signals that drive self-repair when the program fails to compile or run, and that guide adaptive retrieval of missing facts. The resulting framework requires no additional training yet outperforms prior methods on PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle, especially on compositional multi-hop questions.
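As a concrete illustration of this framing, a two-hop question might compile to a short program whose every hop is a named variable. This is a minimal sketch under stated assumptions: the tool names `retrieve` and `qa` and the tiny in-memory corpus are illustrative stand-ins, not the paper's actual API.

```python
# Sketch of an executable multi-hop program in the PyRAG style.
# `retrieve` and `qa` are hypothetical tool stubs over a toy corpus,
# not the framework's real interface.

CORPUS = {
    "director of Inception": "Inception was directed by Christopher Nolan.",
    "birthplace of Christopher Nolan": "Christopher Nolan was born in London.",
}

def retrieve(query: str) -> str:
    """Stub retriever: return the passage stored under the query key."""
    return CORPUS.get(query, "")

def qa(question: str, context: str) -> str:
    """Stub QA tool: a real system would call a reader model; this toy
    version just extracts the final word of the context."""
    return context.rstrip(".").split()[-1] if context else ""

# Each hop binds its result to a named variable, so the whole
# reasoning trace is inspectable after execution.
passage_1 = retrieve("director of Inception")
director = qa("Who directed Inception?", passage_1)        # -> "Nolan"
passage_2 = retrieve(f"birthplace of Christopher {director}")
birthplace = qa(f"Where was {director} born?", passage_2)  # -> "London"

print(birthplace)
```

The point of the sketch is the representation, not the toy tools: because `passage_1`, `director`, and `passage_2` are concrete values, any downstream step (or a human auditor) can inspect exactly where a chain went wrong.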
What carries the argument
The executable Python program over retrieval and QA tools, which replaces free-form text trajectories with explicit variables, deterministic execution feedback, and an inspectable trace.
If this is right
- Reasoning traces become fully inspectable because every step is a concrete variable assignment.
- Self-repair is grounded in compiler errors rather than the model's own unreliable reflection.
- Retrieval can be triggered adaptively from execution results instead of fixed queries.
- The same program representation works in both training-free and reinforcement-learning settings.
- Performance gains are largest on questions that require chaining multiple facts.
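The second bullet, self-repair grounded in compiler errors, can be sketched as a small control loop. This is an assumed control flow, not the paper's implementation: the list of candidate programs stands in for an LLM regenerating code conditioned on the previous traceback.

```python
# Sketch of a compiler-grounded self-repair loop (assumed control flow).
# A candidate list substitutes for LLM regeneration conditioned on the
# last error message.

def run_program(source: str) -> dict:
    """Execute a candidate program and return its variable bindings."""
    scope: dict = {}
    exec(source, scope)  # deterministic feedback: raises on any error
    return scope

def self_repair(candidates: list[str], max_rounds: int = 3):
    """Try candidates in order, carrying the concrete error forward."""
    last_error = None
    for source in candidates[:max_rounds]:
        try:
            return run_program(source), last_error
        except Exception as exc:  # compiler/runtime signal, not self-reflection
            last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"unrepaired after {max_rounds} rounds: {last_error}")

# The first candidate raises a NameError; the "repaired" second one runs.
broken = "answer = lookup('capital of France')"
fixed = "answer = {'capital of France': 'Paris'}['capital of France']"
scope, err = self_repair([broken, fixed])
print(scope["answer"], "| previous error:", err)
```

The design point is that the repair trigger is an objective signal (an exception with a type and message), not the model's own judgment that something went wrong.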
Where Pith is reading between the lines
- The same executable-program framing could be applied to other step-by-step tasks such as planning or tool-use chains.
- Execution feedback might reduce hallucination rates by rejecting programs that cannot run to a valid answer.
- Integration with an external code interpreter would make the self-repair loop fully automatic and scalable.
- The approach suggests that any reasoning task whose intermediate states can be represented as program variables may benefit from the same deterministic grounding.
Load-bearing premise
Code-specialized language models can reliably write correct executable programs for multi-hop reasoning, and execution feedback alone is enough to repair their errors.
What would settle it
A controlled test in which the model repeatedly produces programs that fail to execute or return incorrect answers on a held-out multi-hop dataset even after several rounds of compiler-driven repair.
Original abstract
Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data, and models are publicly available at https://github.com/GasolSun36/PyRAG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution of Python programs over retrieval and QA tools. This exposes intermediate states as variables, provides deterministic execution feedback, and enables compiler-grounded self-repair and adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle) show consistent outperformance over strong baselines, with especially large gains on compositional multi-hop datasets. Code, data, and models are publicly released.
Significance. If the central claim holds, the work is significant for offering a structured, inspectable alternative to free-form natural language reasoning in RAG, with potential for improved reliability on multi-hop tasks. The training-free design, compiler-grounded repair, and public code release are clear strengths that support reproducibility and extension.
Major comments (1)
- [self-repair and execution-driven adaptive retrieval] In the section describing the self-repair mechanism, the claim that execution feedback alone enables reliable self-repair without training is load-bearing for the central contribution, yet execution signals only runtime exceptions and type errors. It supplies no signal for semantic drift, such as a syntactically valid retrieval call that returns the wrong entity because of an incorrect generated query string. This leaves open the possibility that a runnable but incorrect trace proceeds to a wrong final answer undetected, undermining the asserted advantage over free-form reasoning.
Minor comments (1)
- [Abstract] The abstract reports consistent outperformance but provides no details on exact baselines, metrics, statistical significance, or ablation studies, which limits immediate verification of the results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying a key aspect of the self-repair mechanism. We address the concern directly below and describe the revisions we will make to the manuscript.
Point-by-point responses
Referee: [self-repair and execution-driven adaptive retrieval] In the section describing the self-repair mechanism, the claim that execution feedback alone enables reliable self-repair without training is load-bearing for the central contribution, yet execution signals only runtime exceptions and type errors. It supplies no signal for semantic drift, such as a syntactically valid retrieval call that returns the wrong entity because of an incorrect generated query string. This leaves open the possibility that a runnable but incorrect trace proceeds to a wrong final answer undetected, undermining the asserted advantage over free-form reasoning.
Authors: We agree that execution feedback is limited to runtime exceptions, type errors, and other execution failures, rather than directly detecting semantic drift in query strings. The self-repair component is explicitly triggered on such execution signals to regenerate faulty code segments in a compiler-grounded loop. For semantic issues, the framework relies on the explicit program structure: intermediate retrieval results are bound to named variables, allowing subsequent code steps to condition on the actual returned values and adapt retrieval calls accordingly. This provides a verifiable trace that free-form natural language reasoning lacks. Our experiments on compositional multi-hop datasets demonstrate that this structure yields measurable gains, consistent with reduced undetected error propagation. To address the concern, we will revise the self-repair section to explicitly delineate the scope of execution feedback (runtime vs. semantic), add a dedicated limitations paragraph, include qualitative examples of semantic-drift cases, and report an ablation isolating the contribution of execution-driven adaptation.
Revision: partial
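The authors' claim that later steps can condition on actual returned values admits a simple sketch. The `retrieve` stub and toy corpus below are illustrative assumptions, not the paper's API; the point is that a bound variable can be checked and the retrieval adapted.

```python
# Sketch of execution-driven adaptive retrieval: a later step inspects
# the value bound to an intermediate variable and reissues the retrieval
# with a refined query when the result looks wrong. The corpus and
# `retrieve` stub are hypothetical.

CORPUS = {
    "Mercury": "Mercury is a chemical element with symbol Hg.",
    "Mercury planet": "Mercury is the smallest planet in the Solar System.",
}

def retrieve(query: str) -> str:
    return CORPUS.get(query, "")

# Hop 1: the initial query drifts to the wrong entity (the element).
passage = retrieve("Mercury")

# Because `passage` is a concrete variable, the program can test it and
# adapt: if the expected keyword is absent, refine the query and retry.
if "planet" not in passage:
    passage = retrieve("Mercury planet")

print(passage)
```

Note the hedge implicit in the referee's objection still applies: this check only works when the program author anticipated a keyword to test for, which is exactly the semantic-drift gap the rebuttal concedes.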
Circularity Check
New framework proposal with empirical validation; no circular derivation or self-referential reduction
Full rationale
The paper introduces PyRAG as a fresh reformulation of multi-hop RAG into executable Python program synthesis and execution, exposing variables and enabling compiler feedback. This is framed as a methodological shift rather than a mathematical derivation from prior equations or fitted parameters. No load-bearing claims reduce by construction to self-citations, ansatzes smuggled in via prior work, or renamings of known results; experiments report direct comparisons on five public benchmarks (PopQA, HotpotQA, etc.) with released code. The central advantage of execution-driven self-repair is presented as an empirical outcome, not forced by definition or internal fitting. The skeptic's concern about semantic errors versus runtime errors is a question of empirical validity, not of circularity in the derivation chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Multi-hop question answering is a typical form of step-by-step computation that aligns closely with how code-specialized language models are trained to operate.