arxiv: 2501.05366 · v1 · submitted 2025-01-09 · 💻 cs.AI · cs.CL· cs.IR

Recognition: 2 theorem links

· Lean Theorem

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Xiaoxi Li , Guanting Dong , Jiajie Jin , Yuyao Zhang , Yujia Zhou , Yutao Zhu , Peitian Zhang , Zhicheng Dou

Authors on Pith no claims yet

Pith reviewed 2026-05-13 17:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.IR

keywords agentic searchlarge reasoning modelsretrieval-augmented generationcomplex reasoningopen-domain QAknowledge insufficiencyreason-in-documents

0 comments

The pith

Search-o1 embeds agentic search and document reasoning into large reasoning models to fetch and clean external knowledge mid-chain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models frequently hit knowledge gaps during extended step-by-step reasoning, which produces uncertainties and errors. Search-o1 inserts an agentic workflow that triggers retrieval when the model detects missing facts, followed by a separate Reason-in-Documents step that distills the returned material before it rejoins the reasoning trace. Experiments on science, mathematics, coding tasks and six open-domain QA benchmarks show gains over baseline LRMs. The approach keeps the original long-chain flow intact while reducing reliance on incomplete internal knowledge.

Core claim

By embedding an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module into the reasoning process, Search-o1 enables large reasoning models to dynamically fetch and refine external knowledge at points of uncertainty, leading to improved performance on complex tasks without disrupting the flow of long-step reasoning.

What carries the argument

Agentic search workflow that pauses at uncertain knowledge points to retrieve documents, combined with the Reason-in-Documents module that analyzes and distills those documents to minimize noise before re-injection into the chain.

If this is right

Higher accuracy on science, mathematics, and coding benchmarks that currently expose knowledge gaps in LRMs.
Better results on open-domain QA by supplementing internal knowledge with retrieved facts at the right moments.
Greater trustworthiness because external retrieval reduces propagation of internal uncertainties through long chains.
A path toward more versatile LRMs that handle problems requiring facts beyond their training cutoff without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same workflow could be layered onto other reasoning models to test whether gains require the specific o1-style base.
Measuring the exact drop in hallucinated intermediate steps per chain would quantify how much the distillation step improves coherence.
Real-time domains such as medical or legal reasoning might adopt the modules if the overhead of retrieval stays low.
Future scaling laws for LRMs may need to include search frequency as a controllable variable alongside model size.

Load-bearing premise

The Reason-in-Documents module must reliably extract useful facts from retrieved documents while preserving coherent reasoning flow and without introducing new errors that outweigh the benefit of added knowledge.

What would settle it

Compare performance on the same science, math, coding, and QA benchmarks when the agentic search and Reason-in-Documents components are disabled versus enabled; lack of improvement or added errors would falsify the integration claim.

read the original abstract

Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce \textbf{Search-o1}, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at \url{https://github.com/sunnynexus/Search-o1}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Search-o1 embeds agentic search inside o1-style reasoning and adds a Reason-in-Documents step, but the abstract gives no numbers so the actual gains stay unverified.

read the letter

Search-o1 puts an agentic search loop inside large reasoning models like o1, pulling in external knowledge exactly when the chain hits an uncertain spot, and then runs a Reason-in-Documents pass to trim noise from the retrieved pages before they go back into the reasoning. The new piece is that search trigger based on uncertainty during the step-by-step process, combined with the dedicated document analysis module. The paper shows this setup on science, math, coding, and open QA benchmarks, and they open-sourced the code, which lets others test the workflow directly. It does a decent job laying out how the agent decides to search and how the refinement step aims to keep the reasoning coherent. That addresses a real issue with knowledge cutoffs in these models. The soft spot is the evaluation. The abstract talks about strong performance but gives no numbers, no specific baselines, and no breakdown of whether the Reason-in-Documents step actually reduces errors or just adds overhead. Without those details or an ablation on the module alone, it's hard to know if the gains come from the search or if the distillation works as claimed. The concern about noise injection at uncertain points is fair and needs checking in the full results. This paper is for researchers building or extending reasoning models with retrieval. Anyone experimenting with o1-like systems or RAG in long chains would find the architecture description and code useful. It deserves a serious referee. The core idea is practical and the framework is spelled out enough to review properly. I would recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Search-o1, a framework that augments large reasoning models (LRMs) such as OpenAI-o1 with an agentic retrieval-augmented generation workflow. When the LRM encounters uncertain knowledge points during long-horizon reasoning, the system triggers external document retrieval; a dedicated Reason-in-Documents module then analyzes the verbose retrieved content to distill relevant information and minimize noise before injecting it back into the reasoning chain. The authors claim that this integration yields strong performance on complex reasoning tasks in science, mathematics, coding, and six open-domain QA benchmarks, with code released at https://github.com/sunnynexus/Search-o1.

Significance. If the reported gains are robust and the Reason-in-Documents module demonstrably improves signal-to-noise without net error injection, the work would meaningfully advance reliable long-form reasoning by closing knowledge gaps via agentic search. The open-source release is a concrete strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the central performance claim rests on 'extensive experiments' and 'strong performance,' yet the manuscript supplies no quantitative results, baseline comparisons, or error analysis in the abstract and does not isolate the contribution of the Reason-in-Documents module (e.g., pre/post error rates on knowledge points or coherence scores). This prevents verification that retrieval improves rather than degrades the LRM baseline.
[Reason-in-Documents module] Reason-in-Documents module description (likely §3): the claim that the module 'deeply analyze[s] the retrieved information before injecting it … minimizing noise and preserving coherent reasoning flow' is load-bearing for the overall argument, but no isolated ablation or metric (e.g., knowledge-point accuracy before vs. after distillation) is provided to confirm net benefit at the precise points where retrieval is triggered.

minor comments (2)

[Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., accuracy deltas on the six QA benchmarks) to support the 'strong performance' assertion.
[Experiments] Ensure all six open-domain QA benchmarks are explicitly named with citations in the experimental setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to include quantitative results in the abstract and an isolated ablation for the Reason-in-Documents module.

read point-by-point responses

Referee: [Abstract and Experiments] Abstract and Experiments section: the central performance claim rests on 'extensive experiments' and 'strong performance,' yet the manuscript supplies no quantitative results, baseline comparisons, or error analysis in the abstract and does not isolate the contribution of the Reason-in-Documents module (e.g., pre/post error rates on knowledge points or coherence scores). This prevents verification that retrieval improves rather than degrades the LRM baseline.

Authors: We agree the abstract lacks specific numbers and that the module's isolated contribution is not quantified. In the revision we have added concrete performance figures (accuracy deltas on science/math/coding/QA benchmarks versus base LRM and standard RAG) plus error analysis to the abstract. We have also inserted a dedicated ablation subsection reporting knowledge-point accuracy and coherence scores before versus after the Reason-in-Documents module, confirming net error reduction at retrieval trigger points. revision: yes
Referee: [Reason-in-Documents module] Reason-in-Documents module description (likely §3): the claim that the module 'deeply analyze[s] the retrieved information before injecting it … minimizing noise and preserving coherent reasoning flow' is load-bearing for the overall argument, but no isolated ablation or metric (e.g., knowledge-point accuracy before vs. after distillation) is provided to confirm net benefit at the precise points where retrieval is triggered.

Authors: We accept that an isolated metric at the exact retrieval trigger points was missing. The revised manuscript now contains a targeted ablation that measures knowledge-point accuracy and noise reduction immediately before and after the distillation step, demonstrating a clear net benefit without error injection. These results are reported at the points where the agentic search is invoked. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in empirical framework

full rationale

The paper describes an empirical agentic framework (Search-o1) that augments LRMs with retrieval and a Reason-in-Documents module, evaluated on external benchmarks in science, math, coding, and QA. No equations, derivations, fitted parameters, or first-principles predictions are present in the abstract or described workflow. Claims rest on experimental results rather than any self-referential reduction of outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is therefore self-contained against external benchmarks, consistent with a normal non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that LRMs suffer from knowledge insufficiency and that retrieved documents can be refined without breaking reasoning coherence; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption LRMs suffer from knowledge insufficiency leading to uncertainties and potential errors during extended reasoning
Explicitly stated in the abstract as the core limitation being addressed.

pith-pipeline@v0.9.0 · 5536 in / 1144 out tokens · 46844 ms · 2026-05-13T17:30:54.104077+00:00 · methodology

discussion (0)

Forward citations

Cited by 28 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 7.0

SearchSkill improves exact match scores and retrieval efficiency on open-domain QA by conditioning LLM actions on skills from an evolving SkillBank updated from failure patterns via two-stage SFT.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 7.0

Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
Inference-Time Budget Control for LLM Search Agents
cs.AI 2026-05 unverdicted novelty 7.0

A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
cs.CL 2026-04 unverdicted novelty 7.0

ReflectMT internalizes reflection via two-stage RL to enable direct high-quality machine translation that outperforms explicit reasoning models like DeepSeek-R1 on WMT24 while using 94% fewer tokens.
AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning
cs.IR 2026-04 unverdicted novelty 7.0

A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
cs.IR 2026-04 unverdicted novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
Scalable Token-Level Hallucination Detection in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

SLIM dynamically optimizes active external skills in agentic RL via leave-one-skill-out marginal contribution estimates and three lifecycle operations, outperforming baselines by 7.1% on ALFWorld and SearchQA while sh...
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks
cs.AI 2026-05 unverdicted novelty 6.0

SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 6.0

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
Pause or Fabricate? Training Language Models for Grounded Reasoning
cs.CL 2026-04 conditional novelty 6.0

GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task succe...
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 6.0

TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
Procedural Knowledge at Scale Improves Reasoning
cs.CL 2026-04 unverdicted novelty 6.0

Reasoning Memory decomposes reasoning trajectories into 32 million subquestion-subroutine pairs and retrieves them via in-thought prompts to improve language model performance on math, science, and coding benchmarks b...
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases
cs.AI 2026-05 unverdicted novelty 5.0

AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation
cs.CL 2026-04 unverdicted novelty 5.0

CroSearch-R1 applies search-augmented RL with cross-lingual integration and multilingual rollouts to improve RAG effectiveness on multilingual collections.
SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
cs.IR 2026-04 unverdicted novelty 5.0

SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning
cs.AI 2026-04 unverdicted novelty 5.0

E3-TIR integrates expert prefixes, guided branches, and self-exploration via mix policy optimization to deliver 6% better tool-use performance with under 10% of the usual synthetic data and 1.46x ROI.
Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
cs.AI 2026-04 unverdicted novelty 5.0

HiExp extracts hierarchical experience knowledge from reasoning trajectories via contrastive analysis and clustering to regularize RL training, turning stochastic exploration into strategic search with reported gains ...
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
EigentSearch-Q+: Enhancing Deep Research Agents with Structured Reasoning Tools
cs.AI 2026-04 unverdicted novelty 4.0

Structured query and evidence tools added to an AI research agent improve benchmark accuracy by 0.6 to 3.8 percentage points.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · cited by 25 Pith papers · 16 internal anchors

[1]

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page 2020
[3]

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Junying Chen, Zhenyang Cai, Ke Ji, Xidong Wang, Wanlong Liu, Rongsheng Wang, Jianye Hou, and Benyou Wang. Huatuogpt-o1, towards medical complex reasoning with llms. arXiv preprint arXiv:2412.18925, 2024

work page internal anchor Pith review arXiv 2024
[4]

Do not think that much for 2+3=? on the overthinking of o1-like llms, 2024

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Do not think that much for 2+3=? on the overthinking of o1-like llms, 2024

work page 2024
[5]

Mindsearch: Mimicking human minds elicits deep ai searcher

Zehui Chen, Kuikun Liu, Qiuchen Wang, Jiangning Liu, Wenwei Zhang, Kai Chen, and Feng Zhao. Mindsearch: Mimicking human minds elicits deep ai searcher. arXiv preprint arXiv:2407.20183, 2024

work page arXiv 2024
[6]

Thinking, fast and slow

Kahneman Daniel. Thinking, fast and slow. 2017

work page 2017
[7]

Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power!, November 2024

DeepSeek-AI. Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power!, November 2024

work page 2024
[8]

Self-play with execution feedback: Improving instruction-following capabilities of large language models

Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, and Jingren Zhou. Self-play with execution feedback: Improving instruction-following capabilities of large language models. CoRR, abs/2406.13542, 2024

work page arXiv 2024
[9]

Toward general instruction-following alignment for retrieval-augmented generation

Guanting Dong, Xiaoshuai Song, Yutao Zhu, Runqi Qiao, Zhicheng Dou, and Ji-Rong Wen. Toward general instruction-following alignment for retrieval-augmented generation. CoRR, abs/2410.09584, 2024

work page arXiv 2024
[10]

How abilities in large language models are affected by supervised fine-tuning data composition

Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. How abilities in large language models are affected by supervised fine-tuning data composition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computat...

work page 2024
[11]

Progressive multimodal reasoning via active retrieval

Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, and Ji-Rong Wen. Progressive multimodal reasoning via active retrieval. arXiv preprint arXiv:2412.14835, 2024

work page arXiv 2024
[12]

Understand what LLM needs: Dual preference alignment for retrieval-augmented generation

Guanting Dong, Yutao Zhu, Chenghao Zhang, Zechen Wang, Zhicheng Dou, and Ji-Rong Wen. Understand what LLM needs: Dual preference alignment for retrieval-augmented generation. CoRR, abs/2406.18676, 2024

work page arXiv 2024
[13]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. 11

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

From local to global: A graph rag approach to query-focused summarization, 2024

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2024

work page 2024
[15]

Towards revealing the mystery behind chain of thought: a theoretical perspective

Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[16]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Joaquin Vanschoren and Sai-Kit Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 202...

work page 2021
[17]

Scaling Laws for Autoregressive Generative Modeling

Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[18]

Constructing A multi- hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing A multi- hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Núria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6609...

work page 2020
[19]

Yoshitaka Inoue, Tianci Song, and Tianfan Fu

Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024

work page arXiv 2024
[20]

Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report. CoRR, abs/2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. CoRR, abs/2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Jiang, Q

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023

work page arXiv 2023
[25]

Technical report: Enhancing llm reasoning with reward-guided tree search

Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, et al. Technical report: Enhancing llm reasoning with reward-guided tree search. arXiv preprint arXiv:2411.11694, 2024

work page arXiv 2024
[26]

Bowen Jin, Jinsung Yoon, Jiawei Han, and Sercan Ö. Arik. Long-context llms meet RAG: overcoming challenges for long inputs in RAG. CoRR, abs/2410.05983, 2024

work page arXiv 2024
[27]

Bider: Bridging knowledge incon- sistency for efficient retrieval-augmented llms via key supporting evidence

Jiajie Jin, Yutao Zhu, Yujia Zhou, and Zhicheng Dou. Bider: Bridging knowledge incon- sistency for efficient retrieval-augmented llms via key supporting evidence. arXiv preprint arXiv:2402.12174, 2024

work page arXiv 2024
[28]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL, pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. 12

work page 2017
[29]

Natural questions: a benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019

work page 2019
[30]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020

work page 2020
[31]

Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra

Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V . Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning prob- lems with language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A....

work page 2022
[32]

Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning

Chengpeng Li, Guanting Dong, Mingfeng Xue, Ru Peng, Xiang Wang, and Dayiheng Liu. Dotamath: Decomposition of thought with code assistance and self-correction for mathematical reasoning. CoRR, abs/2407.04078, 2024

work page arXiv 2024
[33]

Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks

Xiaoxi Li, Zhicheng Dou, Yujia Zhou, and Fangchao Liu. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang, editors, Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024...

work page 2024
[34]

Retrollm: Empowering large language models to retrieve fine-grained evidence within genera- tion

Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yongkang Wu, Zhonghua Li, Qi Ye, and Zhicheng Dou. Retrollm: Empowering large language models to retrieve fine-grained evidence within genera- tion. arXiv preprint arXiv:2412.11919, 2024

work page arXiv 2024
[35]

From matching to generation: A survey on generative information retrieval

Xiaoxi Li, Jiajie Jin, Yujia Zhou, Yuyao Zhang, Peitian Zhang, Yutao Zhu, and Zhicheng Dou. From matching to generation: A survey on generative information retrieval. CoRR, abs/2404.14851, 2024

work page arXiv 2024
[36]

Unigen: A unified generative framework for retrieval and question answering with large language models

Xiaoxi Li, Yujia Zhou, and Zhicheng Dou. Unigen: A unified generative framework for retrieval and question answering with large language models. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Inte...

work page 2024
[37]

Kag: Boosting llms in professional domains via knowledge augmented generation, 2024

Lei Liang, Mengshu Sun, Zhengke Gui, Zhongshu Zhu, Zhouyu Jiang, Ling Zhong, Yuan Qu, Peilong Zhao, Zhongpu Bo, Jin Yang, Huaidong Xiong, Lin Yuan, Jun Xu, Zaoyang Wang, Zhiqiang Zhang, Wen Zhang, Huajun Chen, Wenguang Chen, and Jun Zhou. Kag: Boosting llms in professional domains via knowledge augmented generation, 2024

work page 2024
[38]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[39]

The unlocking spell on base llms: Rethinking alignment via in-context learning

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khy- athi Raghavi Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base llms: Rethinking alignment via in-context learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[40]

Deductive verification of chain-of-thought reasoning

Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. Deductive verification of chain-of-thought reasoning. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems...

work page 2023
[42]

How much can rag help the reasoning of llm?arXiv preprint arXiv:2410.02338,

Jingyu Liu, Jiaen Lin, and Yong Liu. How much can RAG help the reasoning of llm? CoRR, abs/2410.02338, 2024

work page arXiv 2024
[43]

arXiv preprint arXiv:2305.14283 , year=

Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283, 2023

work page arXiv 2023
[44]

Selfcheck: Using llms to zero-shot check their own step-by-step reasoning

Ning Miao, Yee Whye Teh, and Tom Rainforth. Selfcheck: Using llms to zero-shot check their own step-by-step reasoning. In The Twelfth International Conference on Learning Representa- tions, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[45]

Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems.arXiv preprint arXiv:2412.09413, 2024

Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, et al. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. arXiv preprint arXiv:2412.09413, 2024

work page arXiv 2024
[46]

Learning to reason with llms, 2024

OpenAI. Learning to reason with llms, 2024

work page 2024
[47]

Smith, and Mike Lewis

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguis- tics: EMNLP 2023, Singapore, December 6-10, 2023 , pages 5687–5711. Association for Computationa...

work page 2023
[48]

We-math: Does your large multimodal model achieve human-like mathematical reasoning?arXiv preprint arXiv:2407.01284, 2024

Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma Gongque, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning? CoRR, abs/2407.01284, 2024

work page arXiv 2024
[49]

O1 replication journey: A strategic progress report–part 1

Yiwei Qin, Xuefeng Li, Haoyang Zou, Yixiu Liu, Shijie Xia, Zhen Huang, Yixin Ye, Weizhe Yuan, Hector Liu, Yuanzhi Li, et al. O1 replication journey: A strategic progress report–part 1. arXiv preprint arXiv:2410.18982, 2024

work page arXiv 2024
[50]

Qwen2.5 technical report, 2024

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page 2024
[51]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020

work page 2020
[52]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. CoRR, abs/2311.12022, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms

Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, and Ji-Rong Wen. Small models, big insights: Leveraging slim proxy models to decide when and what to retrieve for llms. arXiv preprint arXiv:2402.12052, 2024

work page arXiv 2024
[54]

Qwq: Reflect deeply on the boundaries of the unknown, November 2024

Qwen Team. Qwq: Reflect deeply on the boundaries of the unknown, November 2024

work page 2024
[55]

musique: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[56]

Planxrag: Planning-guided retrieval augmented generation

Prakhar Verma, Sukruta Prakash Midigeshi, Gaurav Sinha, Arno Solin, Nagarajan Natarajan, and Amit Sharma. Planxrag: Planning-guided retrieval augmented generation. arXiv preprint arXiv:2410.20753, 2024. 14

work page arXiv 2024
[57]

Drt-o1: Optimized deep reasoning translation via long chain-of-thought

Jiaan Wang, Fandong Meng, Yunlong Liang, and Jie Zhou. Drt-o1: Optimized deep reasoning translation via long chain-of-thought. arXiv preprint arXiv:2412.17498, 2024

work page arXiv 2024
[58]

Query2doc: Query expansion with large language models

Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023

work page arXiv 2023
[59]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

work page 2022
[60]

Ken C. L. Wong, Hongzhi Wang, Etienne E. V os, Bianca Zadrozny, Campbell D. Watson, and Tanveer F. Syeda-Mahmood. Addressing deep learning model uncertainty in long-range climate forecasting with late fusion. CoRR, abs/2112.05254, 2021

work page arXiv 2021
[61]

A comparative study on reasoning patterns of openai’s o1 model

Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, and Jiaheng Liu. A comparative study on reasoning patterns of openai’s o1 model. CoRR, abs/2410.13639, 2024

work page arXiv 2024
[62]

How easily do irrelevant inputs skew the responses of large language models?, 2024

Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models?, 2024

work page 2024
[63]

Evaluating mathematical reasoning beyond accuracy

Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, and Pengfei Liu. Evaluating mathematical reasoning beyond accuracy. CoRR, abs/2404.05692, 2024

work page arXiv 2024
[64]

Recomp: Improving retrieval- augmented lms with compression and selective augmentation,

Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408, 2023

work page arXiv 2023
[65]

Llava-o1: Let vision language models reason step-by-step

Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

work page arXiv 2024
[66]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. CoRR, abs/2409.12122, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop ques- tion answering. In EMNLP, pages 2369–2380, Brussels, Belgium, October-November 2018. Association for Computational Linguistics

work page 2018
[69]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte 10 carlo tree search.arXiv preprint arXiv:2412.18319, 2024

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, et al. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search.arXiv preprint arXiv:2412.18319, 2024

work page arXiv 2024
[70]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[71]

Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems

Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu. Physics of language models: Part 2.2, how to learn from mistakes on grade-school math problems. arXiv preprint arXiv:2408.16293, 2024. 15

work page arXiv 2024
[72]

Making retrieval-augmented language models robust to irrelevant context

Ori Yoran, Tomer Wolfson, Ori Ram, and Jonathan Berant. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024 . OpenReview.net, 2024

work page 2024
[73]

Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

work page 2024
[74]

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling relationship on learning mathematical reasoning with large language models. CoRR, abs/2308.01825, 2023

work page internal anchor Pith review arXiv 2023
[75]

Quiet-star: Language models can teach themselves to think before speaking.arXiv preprint arXiv:2403.09629,

Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. arXiv preprint arXiv:2403.09629, 2024

work page arXiv 2024
[76]

Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective

Zhiyuan Zeng, Qinyuan Cheng, Zhangyue Yin, Bo Wang, Shimin Li, Yunhua Zhou, Qipeng Guo, Xuanjing Huang, and Xipeng Qiu. Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective. arXiv preprint arXiv:2412.14135, 2024

work page arXiv 2024
[77]

Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning

Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, and Dongzhan Zhou. Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning. CoRR, abs/2410.02884, 2024

work page arXiv 2024
[78]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. Aflow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

work page internal anchor Pith review arXiv 2024
[79]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren’s song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[81]

o1-coder: an o1 replication for coding

Yuxiang Zhang, Shangxi Wu, Yuqi Yang, Jiangming Shu, Jinlin Xiao, Chao Kong, and Jitao Sang. o1-coder: an o1 replication for coding. arXiv preprint arXiv:2412.00154, 2024

work page arXiv 2024
[82]

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024

work page internal anchor Pith review arXiv 2024

Showing first 80 references.