arxiv: 2605.09038 · v2 · submitted 2026-05-09 · 💻 cs.AI

Recognition: no theorem link

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

Jinchao Hu , Meizhi Zhong , Kehai Chen , Min Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:02 UTC · model grok-4.3

classification 💻 cs.AI

keywords search skillsLLM tool usequery planningskill bankopen-domain QAretrieval behaviorsupervised fine-tuningfailure-driven learning

0 comments

The pith

SearchSkill trains LLMs to first select a reusable skill from an evolving bank and then generate a conditioned search query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SearchSkill to improve how language models handle search tools during open-domain question answering. Instead of issuing searches as a single undifferentiated action, the model selects a skill card from a growing collection and then produces a query or answer grounded in that card. The collection updates itself by detecting repeated failure patterns, refining or adding skills, and rebuilding the affected training trajectories. Supervised fine-tuning then follows the same select-then-execute sequence used at inference time. This produces more focused queries, fewer copied initial searches, and higher exact-match accuracy on knowledge-intensive benchmarks while staying within limited retrieval budgets.

Core claim

SearchSkill maintains an evolving SkillBank of search skills. At each step the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. Recurrent failure patterns trigger automatic expansions or refinements to the SkillBank, after which affected trajectories are reconstructed for a two-stage supervised fine-tuning process that aligns training with the inference-time protocol of skill selection followed by skill-grounded execution.

What carries the argument

The evolving SkillBank, a dynamic collection of skill cards from which the model selects before generating each search or answer action.

If this is right

Exact match improves on knowledge-intensive QA benchmarks for both open-source and closed-source models.
The first query is copied from the original question less often.
Subsequent queries become more atomic and focused on single reasoning hops.
Correct answers are reached more frequently within a small fixed search budget.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same select-then-execute pattern with an evolving bank could be applied to other tool-use domains beyond search.
Explicit skill planning may reduce total retrieval cost in retrieval-augmented generation systems by avoiding low-value queries.
Failure-driven skill refinement offers a route to self-improving tool-using agents without additional human annotation.

Load-bearing premise

Recurrent failure patterns can be automatically identified and turned into useful skill expansions or refinements that improve generalization rather than introduce noise.

What would settle it

Apply the full SearchSkill pipeline to a held-out knowledge QA benchmark and observe whether exact-match accuracy fails to rise or falls compared with a fixed-skill baseline that never updates the SkillBank.

Figures

Figures reproduced from arXiv: 2605.09038 by Jinchao Hu, Kehai Chen, Meizhi Zhong, Min Zhang.

**Figure 2.** Figure 2: Effect of replacing the SkillBank with an empty bank. Left: EM under full [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: SkillBank controls under the same SFT policy. Left: removing selected card content. Right: [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Skill-category contribution after activation. Left: activation versus judged necessity. Right: [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Query-planning diagnostics on four multi-hop benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Closed-source transfer with frozen B4. Bars show exact-match percentages; parentheses give gains over search-only prompting. The resulting drops show that activated skills contribute to answer construction, especially when the task requires structured decomposition or evidence grounding. 5.4 Query planning and evidence efficiency We further test whether SEARCHSKILL addresses the query-quality failure that … view at source ↗

**Figure 7.** Figure 7: GRPO training diagnostics for 7B models. Panels show train reward for 7B-Instruct and [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: RL execution diagnostics on 7B-Instruct examples corrected by GRPO. [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: Training and validation loss curves for the two-stage SFT runs across four Qwen2.5 [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: GRPO training dynamics for the 7B-Instruct SFT-initialized policy. The plot shows train [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: Cumulative GRPO training reward for the 7B-Base SFT-initialized policy. The plot shows [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

read the original abstract

Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose \Ours, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SearchSkill shows that an evolving skill bank updated from failures, combined with two-stage training, improves LLM search query planning on QA tasks.

read the letter

The main thing to know is that SearchSkill adds an explicit skill selection step and an evolving SkillBank that grows from failure patterns like copied queries or missed hops. This is presented as a lightweight way to improve query planning without treating search as a single undifferentiated action. They do this by detecting recurrent issues, creating or updating skill cards, rebuilding the affected training data, and then running a two-stage SFT. The first stage teaches skill choice, the second teaches execution under that skill. This matches the inference setup and seems to produce more atomic, targeted queries on the multi-hop benchmarks. The paper shows gains on HotpotQA and 2WikiMultiHop for both open and closed models, with fewer bad first queries and more correct answers in limited budgets. Ablations separate the initial inventory from the evolution step, and held-out tests show no degradation on out-of-distribution questions. A minor soft spot is that the failure detection still uses some fixed rules rather than a fully learned process. That works for these benchmarks but could be a limit when error types shift. The results look reproducible from the details given, though more runs would strengthen the case. This is useful for anyone working on tool-augmented LLMs for QA or research agents. It deserves peer review because the method is concrete, the experiments are structured, and the central claim holds up in the reported tests.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces SearchSkill, a framework for improving LLM search-tool use in open-domain QA by making query planning explicit: the model selects a reusable skill from an evolving SkillBank and then generates a skill-conditioned search or answer action. The SkillBank is dynamically expanded or refined from recurrent failure patterns (e.g., query copying, hop misses), affected trajectories are reconstructed, and a two-stage SFT aligns training with this inference protocol. Experiments across open- and closed-source models report gains in exact-match accuracy on knowledge-intensive QA benchmarks together with behavioral improvements (fewer copied first queries, more atomic hop-focused queries, higher success within limited retrieval budgets).

Significance. If the empirical results hold, the work supplies a lightweight, interpretable alternative to undifferentiated tool-use training by factoring search into reusable, evolvable skills. The two-stage SFT plus failure-driven SkillBank evolution is shown to improve both accuracy and retrieval efficiency on HotpotQA and 2WikiMultiHop, with ablations separating the contribution of the initial inventory from the evolution step and held-out trajectory checks confirming generalization.

major comments (2)

[§4.2] §4.2: the automatic identification of recurrent failure patterns (query-copying and hop-miss cases) and their conversion into SkillBank expansions is load-bearing for the central claim; the manuscript should supply the exact detection heuristics, similarity thresholds, and frequency cutoffs used, together with an ablation that measures how sensitive final performance is to these choices.
[Table 2 / §5.3] Table 2 / §5.3: the reported exact-match gains on HotpotQA and 2WikiMultiHop are presented without per-run standard deviations or statistical significance tests; given that the central claim rests on consistent improvement across model families, these statistics are required to establish that the observed deltas exceed run-to-run variance.

minor comments (3)

[§3.1] §3.1: the notation for skill cards (e.g., the distinction between skill description, trigger conditions, and execution template) is introduced informally; a compact tabular summary of the card schema would improve readability.
[Figure 3] Figure 3: the trajectory-reconstruction diagram is helpful but the arrows indicating which trajectories are regenerated after a SkillBank update are not labeled; adding explicit labels would clarify the data-flow.
[§5.4] §5.4: the out-of-distribution generalization experiment is mentioned only briefly; a short additional paragraph summarizing the OOD question set and the magnitude of the retained gains would strengthen the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comments below and will update the manuscript accordingly.

read point-by-point responses

Referee: [§4.2] §4.2: the automatic identification of recurrent failure patterns (query-copying and hop-miss cases) and their conversion into SkillBank expansions is load-bearing for the central claim; the manuscript should supply the exact detection heuristics, similarity thresholds, and frequency cutoffs used, together with an ablation that measures how sensitive final performance is to these choices.

Authors: We agree that the detection process is central and that more details are needed for reproducibility. The current manuscript describes the high-level approach in §4.2 but omits the precise implementation details. In the revision, we will add the exact heuristics: query-copying is detected when the generated query has Jaccard similarity > 0.8 with the input question or prior queries; hop-miss is identified by checking if all key entities from the question are covered in the retrieved documents after the planned hops. The frequency cutoff is patterns occurring in >10 failed trajectories. We will also include a sensitivity ablation varying the similarity threshold (0.7-0.9) and cutoff (5-15 trajectories), demonstrating that the performance gains are robust to these choices within the tested ranges. revision: yes
Referee: [Table 2 / §5.3] Table 2 / §5.3: the reported exact-match gains on HotpotQA and 2WikiMultiHop are presented without per-run standard deviations or statistical significance tests; given that the central claim rests on consistent improvement across model families, these statistics are required to establish that the observed deltas exceed run-to-run variance.

Authors: We recognize that reporting standard deviations and significance tests would strengthen the empirical claims. Our experiments were conducted with fixed seeds for reproducibility, but to address this, we will perform additional runs with three random seeds for the main results on HotpotQA and 2WikiMultiHop. The revised Table 2 will include means ± standard deviations, and we will add a note on statistical significance using a paired t-test, confirming that the improvements are significant (p < 0.05) across the model families. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical training recipe

full rationale

The paper presents SearchSkill as an empirical framework: an evolving SkillBank that detects recurrent failure patterns (e.g., query copying), expands/refines skills, reconstructs trajectories, and applies two-stage SFT to align skill selection with execution. All reported gains (exact match on HotpotQA/2WikiMultiHop, reduced first-query copying, more atomic queries) are measured via held-out experiments and ablations that separate initial inventory from evolution. No equations, first-principles derivations, or predictions appear; the central claims rest on experimental outcomes rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is necessarily high-level. The framework rests on the domain assumption that explicit skill conditioning and failure-driven bank updates will produce better query behavior than undifferentiated tool calls.

free parameters (1)

SkillBank expansion and refinement rules
Rules for detecting recurrent failures and updating the bank are not specified in the abstract.

axioms (1)

domain assumption Explicit selection of a reusable skill before query generation improves retrieval quality over treating search as a single undifferentiated action
This premise is invoked to justify the two-stage training and inference protocol described in the abstract.

invented entities (1)

SkillBank no independent evidence
purpose: Dynamic repository of reusable search skills that expands or refines from failure patterns
New component introduced by the paper to manage and evolve skills across training iterations.

pith-pipeline@v0.9.0 · 5498 in / 1378 out tokens · 66274 ms · 2026-05-15T06:02:02.005094+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 12 internal anchors

[1]

Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learn- ing to retrieve, generate, and critique through self-reflection. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[2]

Rae, Erich Elsen, and Laurent Sifre

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Ori...

work page 2022
[3]

Unified active retrieval for retrieval augmented generation

Qinyuan Cheng, Xiaonan Li, Shimin Li, Qin Zhu, Zhangyue Yin, Yunfan Shao, Linyang Li, Tianxiang Sun, Hang Yan, and Xipeng Qiu. Unified active retrieval for retrieval augmented generation. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 17153–17166, 2024

work page 2024
[4]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

work page 2020
[5]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. InProceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

work page 2020
[6]

Cascade: Cumulative agentic skill creation through autonomous development and evolution

Xu Huang, Junwu Chen, Yuxing Fei, Zhuohan Li, Philippe Schwaller, and Gerbrand Ceder. Cascade: Cumulative agentic skill creation through autonomous development and evolution. arXiv preprint arXiv:2512.23880, 2025

work page arXiv 2025
[7]

Leveraging passage retrieval with generative models for open domain question answering

Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume, pages 874–880, 2021

work page 2021
[8]

Atlas: Few-shot learning with retrieval augmented language models.Journal of Machine Learning Research, 24(251): 1–43, 2023

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models.Journal of Machine Learning Research, 24(251): 1–43, 2023

work page 2023
[9]

Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning, 2025

Pengcheng Jiang, Jiacheng Lin, Lang Cao, Runchu Tian, SeongKu Kang, Zifeng Wang, Jimeng Sun, and Jiawei Han. Deepretrieval: Hacking real search engines and retrievers with large language models via reinforcement learning, 2025. URL https://arxiv.org/abs/2503. 00223

work page 2025
[10]

Active retrieval augmented generation

Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023

work page 2023
[11]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.09516

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017. 10

work page 2017
[13]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

work page 2020
[14]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...

work page 2019
[15]

Retrieval-augmented generation for knowledge-intensive nlp tasks, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2020

work page 2020
[16]

Organizing, orchestrating, and benchmarking agent skills at ecosystem scale, 2026

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale, 2026. URLhttps://arxiv.org/abs/2603.02176

work page arXiv 2026
[17]

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models, 2025. URL https://arxiv.org/abs/2501.05366

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Agent skills: A data-driven analysis of claude skills for extending large language model functionality.arXiv preprint arXiv:2602.08004,

George Ling, Shanshan Zhong, and Richard Huang. Agent skills: A data-driven analysis of claude skills for extending large language model functionality, 2026. URL https://arxiv. org/abs/2602.08004

work page arXiv 2026
[19]

Agent skills in the wild: An empirical study of security vulnerabilities at scale,

Yi Liu, Weizhe Wang, Ruitao Feng, Yao Zhang, Guangquan Xu, Gelei Deng, Yuekang Li, and Leo Zhang. Agent skills in the wild: An empirical study of security vulnerabilities at scale,

work page
[20]

URLhttps://arxiv.org/abs/2601.10338

work page internal anchor Pith review Pith/arXiv arXiv
[21]

When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Ha- jishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers), pages 9802–9822, 2023

work page 2023
[22]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo- pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022. URL https:/...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

Measuring and narrowing the compositionality gap in language models

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023

work page 2023
[24]

CoRR , volume =

Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei Huang, Chaojun Xiao, Chi Han, Yi Ren Fung, Yusheng Su, Huadong Wang, Cheng Qian, Runchu Tian, Kunlun Zhu, Shihao Liang, Xingyu Shen, Bokai Xu, Zhen Zhang, Yining Ye, Bowen Li, Ziwei Tang, Jing Yi, Yuzhang Zhu, Zhenning Dai, Lan Yan, Xin Cong, Yaxi Lu, Weilin Zhao, Yux...

work page arXiv 2024
[25]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

work page 2023
[27]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Zerosearch: Incentivize the search capability of llms without searching, 2025

Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, and Jingren Zhou. Zerosearch: Incentivize the search capability of llms without searching, 2025. URLhttps://arxiv.org/abs/2505.04588

work page arXiv 2025
[29]

MuSiQue: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multihop questions via single-hop question composition.Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[30]

Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014–10037, 2023

work page 2023
[31]

Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

Jiongxiao Wang, Qiaojing Yan, Yawei Wang, Yijun Tian, Soumya Smruti Mishra, Zhichao Xu, Megha Gandhi, Panpan Xu, and Lin Lee Cheong. Reinforcement learning for self-improving agent with skill library.arXiv preprint arXiv:2512.17102, 2025

work page arXiv 2025
[32]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[33]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2022. URLhttps://arxiv.org/abs/2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026. URL https: //arxiv.org/abs/2602.08234

work page internal anchor Pith review Pith/arXiv arXiv 2026
[35]

Hotpotqa: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380, 2018

work page 2018
[36]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. Memskill: Learning and evolving memory skills for self-evolving agents.arXiv preprint arXiv:2602.02474, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[38]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng, Michael Y . Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, and Yu Su. Skillweaver: Web agents can self-improve by discovering and honing skills, 2025. URL https://arxiv.org/ abs/2504.07079. A Experimental setups A.1 Data process We construct the training pool with a cov...

work page internal anchor Pith review Pith/arXiv arXiv 2025