Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
Pith reviewed 2026-05-16 09:47 UTC · model grok-4.3
The pith
David-GRPO lets small agents improve multi-hop QA by injecting expert trajectories and scoring evidence coverage during low-batch RL.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
David-GRPO improves small-batch learning for multi-turn reasoning agents by using expert bootstrapping to inject a few off-policy expert trajectories into RL updates and evidence-guided exploration to turn on-policy partial successes into evidence-coverage scores and additional continuations, producing higher retrieval depth and better performance on multi-hop QA tasks than prior low-budget RL baselines.
What carries the argument
David-GRPO, which combines expert bootstrapping for off-policy injection with evidence-guided exploration that scores partial paths by evidence coverage to decide on continuations.
If this is right
- Agents shift from skipping retrieval or stopping after shallow searches to increasing retrieval depth and evidence coverage.
- Performance exceeds prior RL baselines on six multi-hop QA benchmarks for agents up to 1.5B parameters trained on four RTX 3090 GPUs.
- The hybrid use of external expert data and internal coverage scoring overcomes the bottleneck of few useful reasoning paths in small RL batches.
- Training remains feasible under realistic constraints where dense on-policy exploration is impossible.
Where Pith is reading between the lines
- The same mixture of limited expert data and coverage-based continuation could apply to other sparse-reward settings such as tool-use agents or long-horizon planning.
- Lowering the compute threshold for effective reasoning training may let more groups experiment with multi-hop agents without access to large clusters.
- Varying the number of expert trajectories or testing models below 1.5B parameters would show how far the hybrid approach scales before gains plateau.
Load-bearing premise
A small number of injected expert trajectories plus evidence-coverage scoring will reliably increase retrieval depth and evidence coverage without introducing bias or overfitting in the low-batch regime.
What would settle it
An ablation experiment that removes either the expert trajectories or the evidence-coverage scoring and checks whether the accuracy gains on the six multi-hop QA benchmarks disappear under the same four-GPU training budget.
Figures
read the original abstract
Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, for accumulating supporting evidence across turns. Meanwhile, with reinforcement learning (RL), training these agents rely on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines that often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes David-GRPO, an RL method for multi-hop reasoning agents under resource constraints. It augments standard on-policy updates with (i) expert bootstrapping that injects a small number of off-policy expert trajectories and (ii) evidence-guided exploration that converts partial on-policy successes into evidence-coverage scores to spawn additional continuations. Experiments on agents up to 1.5B parameters trained on four RTX 3090 GPUs report gains over prior RL baselines on six multi-hop QA benchmarks, accompanied by a behavioral shift toward greater retrieval depth and evidence coverage.
Significance. If the low-budget comparison holds, the result would be significant for efficient training of reasoning agents. It directly targets the sparse-useful-path problem in small-batch RL by combining limited external expert signals with an internal coverage metric, offering a practical route to deeper exploration without large-scale compute. The reported shift in agent behavior (increased depth rather than early stopping) provides a falsifiable, observable outcome that prior low-budget baselines lack.
major comments (3)
- [§3.2] The claim of operating under identical low-budget constraints (1.5B model, four RTX 3090 GPUs) versus prior RL baselines is load-bearing for the central contribution, yet the source and acquisition cost of the injected expert trajectories are not specified (see §3.2 on expert bootstrapping). If these trajectories require a larger model, additional GPUs, or pre-existing high-quality annotations, the effective training budget exceeds the stated limit and the fairness of the comparison is undermined.
- [§4] §4 (Experiments) provides no ablation isolating the evidence-coverage scoring component from the expert-injection component, nor does it report the exact batch sizes, number of rollouts per update, or statistical tests used for the six benchmarks. Without these, it is impossible to attribute the observed increase in retrieval depth specifically to the proposed evidence-guided exploration rather than to differences in effective data or optimization.
- [Table 2] Table 2 (or equivalent results table) reports aggregate benchmark gains but supplies neither per-baseline hyperparameter details nor variance across random seeds. This makes it difficult to judge whether the reported improvements are robust or sensitive to the low-batch regime highlighted in the abstract.
minor comments (2)
- [Abstract] The abstract would benefit from naming the six benchmarks and reporting at least one quantitative delta (e.g., average F1 or exact-match improvement) to allow readers to gauge effect size immediately.
- [§3.3] Notation for the evidence-coverage score (Eq. (X) in §3.3) uses an undefined normalization constant; a short appendix derivation or explicit formula would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the clarity of our experimental setup and results. We address each major comment below and have revised the manuscript to incorporate the requested details and analyses.
read point-by-point responses
-
Referee: [§3.2] The claim of operating under identical low-budget constraints (1.5B model, four RTX 3090 GPUs) versus prior RL baselines is load-bearing for the central contribution, yet the source and acquisition cost of the injected expert trajectories are not specified (see §3.2 on expert bootstrapping). If these trajectories require a larger model, additional GPUs, or pre-existing high-quality annotations, the effective training budget exceeds the stated limit and the fairness of the comparison is undermined.
Authors: We agree that the source and cost details were insufficiently explicit and could undermine the low-budget claim. The expert trajectories consist of a small fixed set (fewer than 50 total across all benchmarks) drawn from publicly available high-quality annotations in the original QA datasets; no additional model inference, larger models, or extra GPUs were used to generate them. These are injected as a one-time, low-volume off-policy supplement (5–10 trajectories per update) whose compute overhead is negligible relative to the on-policy RL training on four RTX 3090 GPUs. We have revised §3.2 to state the exact source, count, and compute accounting explicitly. revision: yes
-
Referee: [§4] §4 (Experiments) provides no ablation isolating the evidence-coverage scoring component from the expert-injection component, nor does it report the exact batch sizes, number of rollouts per update, or statistical tests used for the six benchmarks. Without these, it is impossible to attribute the observed increase in retrieval depth specifically to the proposed evidence-guided exploration rather than to differences in effective data or optimization.
Authors: We acknowledge the absence of component ablations and hyperparameter transparency. The revised §4 now includes a dedicated ablation study that isolates evidence-guided exploration (comparing full David-GRPO against a variant with expert bootstrapping only). We also report the precise settings used: batch size of 32, 4 rollouts per update, and statistical significance via paired t-tests (p < 0.05) across the six benchmarks. These additions allow direct attribution of the retrieval-depth gains to the evidence-coverage mechanism. revision: yes
-
Referee: [Table 2] Table 2 (or equivalent results table) reports aggregate benchmark gains but supplies neither per-baseline hyperparameter details nor variance across random seeds. This makes it difficult to judge whether the reported improvements are robust or sensitive to the low-batch regime highlighted in the abstract.
Authors: We have expanded Table 2 and added a new appendix table listing per-baseline hyperparameters (learning rate, rollout count, etc.) for all compared methods. We also report standard deviations over three independent random seeds for the main results and include error bars in the figures. The revised numbers confirm that the gains remain consistent in the low-batch regime. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The paper presents an empirical RL method (David-GRPO) for resource-constrained multi-hop agents, relying on expert bootstrapping and evidence-guided exploration. No equations, derivations, or self-referential reductions appear in the abstract or description. Claims rest on external benchmark comparisons under stated compute limits, with no fitted parameters renamed as predictions or load-bearing self-citations that collapse the result to its inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URL https://doi.org/10.1145/ 2959100.2959190
URL https://openreview.net/forum? id=SJgVHkrYDH. Asai, A., Wu, Z., Wang, Y ., Sil, A., and Hajishirzi, H. Self- RAG: Learning to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=hSyW5go0v8. Besiroglu, T., Bergerson, S. A., Michael, A., H...
-
[2]
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.710. URL https:// aclanthology.org/2020.emnlp-main.710. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. Guo, D., Yang, D., Zhan...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2020.emnlp-main.710 2020
-
[3]
Jacovi, A., Caciularu, A., Goldman, O., and Goldberg, Y
URL https://aclanthology.org/2020. coling-main.580. Ji, Y ., Ma, Z., Wang, Y ., Chen, G., Chu, X., and Wu, L. Tree search for llm agent reinforcement learning.arXiv preprint arXiv:2509.21240, 2025a. Ji, Y ., Meng, R., Li, Z., and He, D. Curriculum guided reinforcement learning for efficient multi hop retrieval augmented generation.arXiv preprint arXiv:250...
-
[4]
URL https://aclanthology.org/2023. emnlp-main.495/. 9 On Multi-Hop Reasoning with Resource-Constrained Agents Jin, B. An empirical study on reinforcement learning for reasoning-search interleaved LLM agents. InThe First Structured Knowledge for Large Language Mod- els Workshop, 2025. URL https://openreview. net/forum?id=IQNZIBspz5. Jin, B., Zeng, H., Yue,...
-
[5]
URL https://openreview.net/forum? id=feiAVaSXdb. Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T.,...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2025 2025
-
[6]
Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B
URL https://openreview.net/forum? id=WE_vluYUL-X. Zhang, Y ., Yang, Y ., Shu, J., Wen, X., and Sang, J. Agent models: Internalizing chain-of-action generation into rea- soning models.arXiv preprint arXiv:2503.06580, 2025. Zheng, X., An, K., Wang, Z., Wang, Y ., and Wu, Y . StepSearch: Igniting LLMs search ability via step-wise proximal policy optimization...
-
[7]
URL https://aclanthology.org/2025. emnlp-main.1106/. 11 On Multi-Hop Reasoning with Resource-Constrained Agents A. Further Analysis: Dynamics of Grounded Expansion 0 20 40 60 80 100 120 140 160 180 200 Training Step 0 5 10 15 20 25 30Grounded Expansion Ratio (%) Raw 3-Step Moving Average Figure 6.Grounded expansion ratio (%) over training steps based on Q...
work page 2025
-
[8]
Now, I need to check if Laleli is part of the same neighborhood
The building includes a bar, a</search result> From the first result, the Esma Sultan Mansion is located at Ortak¨oy neighborhood of Istanbul. Now, I need to check if Laleli is part of the same neighborhood. I'll search for ’’Laleli neighborhood Istanbul’’ to confirm its location. <begin search>Laleli neighborhood Istanbul</end search> <search result>resu...
work page 1911
-
[9]
1991 - First edition
work page 1991
-
[10]
So yes, the seventh would be 2015
2015 - Seventh Wait, does that add up correctly? From 1991 to 1995 is four years, then each subsequent is another four. So yes, the seventh would be 2015. But I'm not entirely sure, so I should double-check this information. Additionally, I need to confirm whether Sukanya Chor Charoenying participated in the 2015 tournament. I'm not certain about her part...
work page 2015
-
[11]
Confirm the year of the seventh FIFA Women's World Cup
-
[12]
Starting with the first point, I think the seventh edition was in 2015, but I want to make sure
Verify if Sukanya Chor Charoenying played in that edition. Starting with the first point, I think the seventh edition was in 2015, but I want to make sure. I'll perform a web search to confirm the years of each Women's World Cup. <begin search>FIFA Women's World Cup editions years</end search> <search result>result 1: ’’FIFA Women's World Cup’’ The FIFA W...
work page 2015
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.