pith. machine review for the scientific record.

arxiv: 2605.12975 · v1 · submitted 2026-05-13 · 💻 cs.AI

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Pith reviewed 2026-05-14 19:59 UTC · model grok-4.3

classification 💻 cs.AI
keywords retrieval-augmented generation · multi-hop reasoning · program synthesis · executable programs · self-repair · RAG · question answering

The pith

PyRAG reformulates multi-hop RAG as synthesis and execution of Python programs over retrieval tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that multi-hop question answering in retrieval-augmented generation can be recast as writing and running Python code that calls retrieval and QA functions. This turns implicit natural-language steps into explicit variables and deterministic execution traces. The approach supplies compiler feedback for self-repair and execution results for adaptive retrieval, all without extra training. It delivers consistent gains over strong baselines on five QA benchmarks, with the largest improvements on datasets that require compositional chaining.

Core claim

Multi-hop RAG is reformulated as program synthesis and execution: the model produces an executable Python program that chains retrieval and QA tool calls, exposing every intermediate state as a named variable. Execution supplies deterministic signals that drive self-repair when the program fails to compile or run, and that guide adaptive retrieval of missing facts. The resulting framework requires no additional training yet outperforms prior methods on PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle, especially on compositional multi-hop questions.
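
A minimal sketch of what one such program could look like, assuming only the two tool primitives the paper names, retrieve(query, topk) and answer(query, docs); the question comes from the paper's Figure 1, and the variable names and exact call strings here are illustrative, not the paper's output:

    # Sketch of a PyRAG-style program (our construction, not the paper's output).
    # `retrieve(query, topk)` and `answer(query, docs)` are the two tool
    # primitives the paper defines; everything else here is illustrative.
    def solve(retrieve, answer):
        # Hop 1: one retrieval + QA call per entity, each intermediate
        # result bound to a named, inspectable variable.
        hoyer_docs = retrieve("Jed Hoyer date of birth", topk=5)
        hoyer_birth = answer("When was Jed Hoyer born?", hoyer_docs)

        henry_docs = retrieve("John William Henry II date of birth", topk=5)
        henry_birth = answer("When was John William Henry II born?", henry_docs)

        # Final hop: aggregation over the named intermediates; per Figure 11,
        # the final synthesis call passes an empty docs argument.
        return answer(
            f"Who is older, a person born {hoyer_birth} or one born {henry_birth}?",
            [],
        )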

What carries the argument

The executable Python program over retrieval and QA tools, which replaces free-form text trajectories with explicit variables, deterministic execution feedback, and an inspectable trace.

If this is right

  • Reasoning traces become fully inspectable because every step is a concrete variable assignment.
  • Self-repair is grounded in compiler errors rather than the model's own unreliable reflection (see the control-flow sketch after this list).
  • Retrieval can be triggered adaptively from execution results instead of fixed queries.
  • The same program representation works in both training-free and reinforcement-learning settings.
  • Performance gains are largest on questions that require chaining multiple facts.
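
A minimal control-flow sketch of compiler-grounded repair, assuming a hypothetical llm callable that maps a prompt to Python source; the paper's actual repair templates (Figures 8 and 9) operate at the prompt level, so only the loop shape is shown here:

    import traceback

    def run_with_self_repair(llm, question, max_rounds=3):
        # `llm` is a hypothetical prompt -> Python-source callable (our assumption);
        # the generated program is assumed to bind `final_answer` in its scope.
        source = llm(f"Write a Python program that answers: {question}")
        for _ in range(max_rounds):
            try:
                code = compile(source, "<generated>", "exec")  # syntax-level signal
                scope = {}
                exec(code, scope)                              # runtime-level signal
                return scope.get("final_answer")
            except Exception:
                # Deterministic error text replaces ungrounded self-reflection.
                error = traceback.format_exc()
                source = llm(f"The program failed with:\n{error}\nRepair it:\n{source}")
        return None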

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same executable-program framing could be applied to other step-by-step tasks such as planning or tool-use chains.
  • Execution feedback might reduce hallucination rates by rejecting programs that cannot run to a valid answer.
  • Integration with an external code interpreter would make the self-repair loop fully automatic and scalable.
  • The approach suggests that any reasoning task whose intermediate states can be represented as program variables may benefit from the same deterministic grounding.

Load-bearing premise

Code-specialized language models can reliably write correct executable programs for multi-hop reasoning, and execution feedback alone is enough to repair errors.

What would settle it

A controlled test on a held-out multi-hop dataset: if the model repeatedly produces programs that fail to execute or return incorrect answers even after several rounds of compiler-driven repair, the load-bearing premise fails.
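
One way such a test could be operationalized, assuming a hypothetical pipeline callable that reports whether the final program executed and what it predicted:

    def residual_failure_rate(pipeline, dataset, repair_rounds=3):
        # `pipeline(question, max_repair_rounds)` -> (executed_ok, prediction)
        # is a hypothetical interface; `dataset` is (question, gold) pairs.
        failures = 0
        for question, gold in dataset:
            executed_ok, prediction = pipeline(question, max_repair_rounds=repair_rounds)
            if not executed_ok or prediction != gold:
                failures += 1
        return failures / len(dataset)

A persistently high rate on held-out multi-hop questions would undercut the premise; a consistently low one would support it.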

Figures

Figures reproduced from arXiv: 2605.12975 by Ge Liu, Jash Rajesh Parekh, Jiajun Fan, Jiashuo Sun, Jiawei Han, Jimeng Shi, Peiran Li, Pengcheng Jiang, Qinglong Zheng, Saizhuo Wang, Shaowen Wang, Yixuan Xie, Zhiyi Shi.

Figure 1. Comparison across Vanilla RAG, Search Agents, and PyRAG (Ours). Given the multi-hop question “Who is older, Jed Hoyer or John William Henry II?”, (a) Vanilla RAG performs single-shot retrieval and is prone to incomplete or noisy evidence; (b) Search Agents follow an unstructured iterative trajectory where vague queries and entity drift (e.g., retrieving “Henry II of England” instead of “John William Henry …

Figure 2. The PyRAG framework. Given a multi-hop question, PyRAG proceeds in three stages: (1) Decompose: an LLM breaks the question into atomic, independently answerable sub-queries; (2) Plan: a code-specialized LLM synthesizes an executable Python program over two tool primitives, retrieve(query, topk) and answer(query, docs), where intermediate results are bound to variables and composed through explicit data dep…

Figure 5. Prompts used by the Decompose Agent. The system prompt enforces a parseable JSON-list …

Figure 6. Plan Agent system prompt. Codifies the executable interface as a contract — function …

Figure 7. Plan Agent user-side context. Top: the per-question template filled with the original …

Figure 8. Runtime-level self-repair template. Triggered when an executed program raises a Python …

Figure 9. Syntax-level self-repair template. Triggered when the generated code fails to compile; the …

Figure 10. Answer Agent system prompt — evidence mode. Used when at least one retrieved passage is supplied. The schema fixes question-type matching and reserves “unknown” as the sentinel that drives adaptive retrieval.

Figure 11. Answer Agent system prompt — aggregation mode. Used when the docs argument is empty, i.e. in the final synthesis call. Forbids yes/no responses to wh-questions, eliminating the failure mode where the synthesis call collapses into fact verification. …

Figure 12. A representative correct example. Variables produced at one step are explicitly consumed …

Figure 13. When Step 4 returns the sentinel "unknown", execution-guided refinement triggers a broader re-retrieval (Step 5–6, highlighted). The plan structure is preserved; only the under-evidenced sub-step is repaired. Adaptive retrieval recovers from an under-evidenced sub-step without modifying the overall plan, illustrating the benefit of execution-grounded refinement.

Figure 14. Boolean conjunction over a 2×2 grid of predicates. The plan reduces a “both X and Y” question to a Cartesian grid of yes/no probes whose conjunction is decided by the Python keyword all. The boolean structure is enforced by the program; the answer agent never has to perform multi-clause logical reasoning over a free-form prompt. The decision rule is expressed as a Python expression rather than delegated t…

Figure 15. Arithmetic over retrieved values. The final answer is not contained in any retrieved …

Figure 16. Decomposition-stage entity drift. The Step 3 query should have been …

Figure 17. Retrieval failure that propagates because a sentinel value is treated as a content string. …

Figure 18. Final aggregation misreads its own variable bindings. Written as a Python program, the …

Figure 19. Boolean conjunction misexecuted by the answer agent. Both …

Figure 20. Type confusion in a for-loop. The bug is a single-line type confusion: clients is a comma-joined string, not a list, so iterating it character-by-character is silently legal Python and the executor fans out into hundreds of nonsensical retrievals on "L", "e", "B", … The fix is the one-line cast shown above, after which the original for-loop iterates over actual client names. Such failures are uniquely…
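
The Figure 20 failure class is easy to reconstruct in isolation; the client names below are invented for illustration, and the fix shown is one plausible reading of the caption's one-line cast:

    # Reconstruction of the Figure 20 bug class (names are ours, not the paper's).
    clients = "LeBron James, Kevin Love, Ben Simmons"  # comma-joined string, not a list

    for client in clients:
        # BUG: iterating a string yields characters "L", "e", "B", ..., so each
        # character would fan out into its own nonsensical retrieval.
        pass

    # One plausible form of the one-line cast: split the string into real names.
    for client in [name.strip() for name in clients.split(",")]:
        print(client)  # "LeBron James", "Kevin Love", "Ben Simmons"
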
Original abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data, and models are publicly available at https://github.com/GasolSun36/PyRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution of Python programs over retrieval and QA tools. This exposes intermediate states as variables, provides deterministic execution feedback, and enables compiler-grounded self-repair and adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, Bamboogle) show consistent outperformance over strong baselines, with especially large gains on compositional multi-hop datasets. Code, data, and models are publicly released.

Significance. If the central claim holds, the work is significant for offering a structured, inspectable alternative to free-form natural language reasoning in RAG, with potential for improved reliability on multi-hop tasks. The training-free design, compiler-grounded repair, and public code release are clear strengths that support reproducibility and extension.

major comments (1)
  1. [self-repair and execution-driven adaptive retrieval] The claim that execution feedback alone enables reliable self-repair without training is load-bearing for the central contribution, yet execution only signals runtime exceptions or type errors. It supplies no signal for semantic drift, such as a syntactically valid retrieval call that returns the wrong entity because of an incorrectly generated query string. This leaves open the possibility that a runnable but incorrect trace proceeds to a wrong final answer undetected, undermining the asserted advantage over free-form reasoning.
minor comments (1)
  1. [Abstract] The abstract reports consistent outperformance but provides no details on exact baselines, metrics, statistical significance, or ablation studies, which limits immediate verification of the results.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying a key aspect of the self-repair mechanism. We address the concern directly below and describe the revisions we will make to the manuscript.

point-by-point responses
  1. Referee: [self-repair and execution-driven adaptive retrieval] The claim that execution feedback alone enables reliable self-repair without training is load-bearing for the central contribution, yet execution only signals runtime exceptions or type errors. It supplies no signal for semantic drift, such as a syntactically valid retrieval call that returns the wrong entity because of an incorrectly generated query string. This leaves open the possibility that a runnable but incorrect trace proceeds to a wrong final answer undetected, undermining the asserted advantage over free-form reasoning.

    Authors: We agree that execution feedback is limited to runtime exceptions, type errors, and other execution failures rather than directly detecting semantic drift in query strings. The self-repair component is explicitly triggered on such execution signals to regenerate faulty code segments in a compiler-grounded loop. For semantic issues, the framework relies on the explicit program structure: intermediate retrieval results are bound to named variables, allowing subsequent code steps to condition on the actual returned values and adapt retrieval calls accordingly. This provides a verifiable trace that free-form natural language reasoning lacks. Our experiments on compositional multi-hop datasets demonstrate that this structure yields measurable gains, consistent with reduced undetected error propagation. To address the concern, we will revise the self-repair section to explicitly delineate the scope of execution feedback (runtime vs. semantic), add a dedicated limitations paragraph, include qualitative examples of semantic-drift cases, and report an ablation isolating the contribution of execution-driven adaptation. revision: partial
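
The adaptive-retrieval behavior the rebuttal appeals to can be sketched around the "unknown" sentinel of Figures 10 and 13; the doubling schedule and parameter names here are our assumptions, not the paper's:

    def answer_with_adaptive_retrieval(retrieve, answer, query, topk=5, max_topk=40):
        # Execution-guided refinement: if a sub-step answers "unknown", broaden
        # retrieval and retry; the overall plan is left untouched (cf. Figure 13).
        while topk <= max_topk:
            docs = retrieve(query, topk=topk)
            result = answer(query, docs)
            if result != "unknown":
                return result  # sub-step is sufficiently evidenced
            topk *= 2          # under-evidenced: re-retrieve more broadly
        return "unknown"       # propagate the sentinel to the caller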

Circularity Check

0 steps flagged

New framework proposal with empirical validation; no circular derivation or self-referential reduction

full rationale

The paper introduces PyRAG as a fresh reformulation of multi-hop RAG into executable Python program synthesis and execution, exposing variables and enabling compiler feedback. This is framed as a methodological shift rather than a mathematical derivation from prior equations or fitted parameters. No load-bearing claims reduce by construction to self-citations, ansatzes smuggled via prior work, or renaming of known results; experiments report direct comparisons on five public benchmarks (PopQA, HotpotQA, etc.) with released code. The central advantage of execution-driven self-repair is presented as an empirical outcome, not forced by definition or internal fitting. The skeptic concern about semantic errors versus runtime errors is a question of empirical validity, not circularity in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that multi-hop QA aligns with step-by-step code computation and that execution provides superior feedback to natural language self-reflection.

axioms (1)
  • domain assumption Multi-hop question answering is a typical form of step-by-step computation that aligns closely with how code-specialized language models are trained
    Directly stated in the abstract motivation for reformulating RAG as program synthesis.

pith-pipeline@v0.9.0 · 5630 in / 1143 out tokens · 50467 ms · 2026-05-14T19:59:18.013477+00:00 · methodology

