arxiv: 2312.07559 · v2 · pith:PJGDHFNRnew · submitted 2023-12-08 · 💻 cs.CL · cs.AI · cs.LG

PaperQA: Retrieval-Augmented Generative Agent for Scientific Research

classification 💻 cs.CL · cs.AI · cs.LG
keywords scientific · agent · literature · paperqa · across · models · answering · full-text

Large Language Models (LLMs) generalize well across language tasks, but suffer from hallucinations and uninterpretability, making it difficult to assess their accuracy without ground truth. Retrieval-Augmented Generation (RAG) models have been proposed to reduce hallucinations and provide provenance for how an answer was generated. Applying such models to the scientific literature may enable large-scale, systematic processing of scientific knowledge. We present PaperQA, a RAG agent for answering questions over the scientific literature. PaperQA performs information retrieval across full-text scientific articles, assesses the relevance of sources and passages, and uses RAG to provide answers. Viewing this agent as a question answering model, we find it exceeds the performance of existing LLMs and LLM agents on current science QA benchmarks. To push the field closer to how humans perform research on scientific literature, we also introduce LitQA, a more complex benchmark that requires retrieval and synthesis of information from full-text scientific papers across the literature. Finally, we demonstrate that PaperQA matches expert human researchers on LitQA.
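
The retrieve, assess, and answer loop described in the abstract can be illustrated with a toy sketch. This is not PaperQA's implementation: the `Passage` type, the token-overlap `relevance` score, the `min_score` threshold, and the citation keys are all assumptions made for illustration, and the final step only assembles a prompt with provenance rather than calling an LLM.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical citation key, not a real reference
    text: str


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(question: str, passage: Passage) -> float:
    """Toy relevance score: fraction of question tokens found in the passage."""
    q = set(tokenize(question))
    p = set(tokenize(passage.text))
    return len(q & p) / len(q) if q else 0.0


def retrieve(question: str, corpus: list[Passage], k: int = 2,
             min_score: float = 0.2) -> list[tuple[float, Passage]]:
    """Score every passage, drop those below min_score, return the top k."""
    scored = [(relevance(question, p), p) for p in corpus]
    kept = [sp for sp in scored if sp[0] >= min_score]
    kept.sort(key=lambda sp: sp[0], reverse=True)
    return kept[:k]


def build_prompt(question: str, evidence: list[tuple[float, Passage]]) -> str:
    """Assemble a generation prompt that keeps a citation key per passage."""
    lines = [f"[{p.source}] {p.text}" for _, p in evidence]
    context = "\n".join(lines)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations:"


# Illustrative corpus; the sources and sentences are made up for this sketch.
corpus = [
    Passage("smith2021", "Retrieval augmented generation reduces hallucinations in language models."),
    Passage("lee2020", "Protein folding is driven by hydrophobic interactions."),
    Passage("chen2022", "Language models can cite sources when given retrieved passages."),
]
question = "How does retrieval reduce hallucinations in language models?"
evidence = retrieve(question, corpus)
prompt = build_prompt(question, evidence)
print(prompt)
```

In a real system the overlap score would typically be replaced by embedding similarity or an LLM-based relevance judgment, and the prompt would be passed to a generator; the point here is only that every passage reaching the answer step carries its source, which is what gives the answer provenance.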

This paper has not been read by Pith yet.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

    cs.CL 2026-06 unverdicted novelty 7.0

    NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly ...

  2. SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

    cs.CR 2026-06 accept novelty 7.0

    SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

  3. Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

  4. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  5. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.

  6. RAG over Thinking Traces Can Improve Reasoning Tasks

    cs.IR 2026-05 unverdicted novelty 7.0

    Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.

  7. PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

    cs.IR 2026-04 unverdicted novelty 7.0

    PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.

  8. FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

    cs.AI 2026-04 conditional novelty 7.0

    FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

  9. OrchestrXR: A Multi-Agent System for Idea-to-Prototype XR Study Authoring

    cs.HC 2026-07 unverdicted novelty 6.0

    OrchestrXR uses multi-agent orchestration with structured schemas to generate Unity XR study prototypes from ideas, supported by a user study with 12 researchers indicating effective support and intent preservation.

  10. Lacuna: A Research Map for Machine Learning

    cs.DL 2026-06 unverdicted novelty 6.0

    Lacuna is an LLM-powered research map for ML that outperforms OpenScholar on retrieval benchmarks and GPT-Researcher on multi-stage report generation tasks.

  11. Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

    cs.CL 2026-06 conditional novelty 6.0

    A systematic analysis of 284 manually reviewed papers plus 1.8k+ others from 2023-2025 reveals under-reporting of human evaluation study design details, creating ambiguity in what was measured and how.

  12. NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

    cs.AI 2026-05 unverdicted novelty 6.0

    NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.

  13. FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

    cs.LG 2026-05 unverdicted novelty 6.0

    FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

  14. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.

  15. RAG over Thinking Traces Can Improve Reasoning Tasks

    cs.IR 2026-05 unverdicted novelty 6.0

    RAG over structured thinking traces boosts LLM reasoning on AIME, LiveCodeBench, and GPQA, with relative gains up to 56% and little added cost.

  16. XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

    cs.CL 2025-05 conditional novelty 6.0

    XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.

  17. Supervising the search process produces reliable and generalizable information-seeking agents

    cs.CL 2025-02 unverdicted novelty 6.0

    Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

  18. A Technical Taxonomy of LLM Agent Communication Protocols

    cs.MA 2026-06 unverdicted novelty 5.0

    Creates a five-dimension taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) from nine protocols and identifies architectural patterns plus convergence trends.

  19. ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

    cs.AI 2026-05 unverdicted novelty 5.0

    ForeSci is a temporally controlled benchmark with 500 tasks for assessing LLM agents on forward-looking AI research judgments in four domains using cutoff-aligned knowledge bases.

  20. Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

    cs.AI 2026-05 unverdicted novelty 5.0

    Compass is an expert-guided LLM agent framework that extracts 3,751 marine Pb records from 230k papers to build the largest integrated database, achieving 92% accuracy via multi-layered validation.

  21. Eliot: Interactively $\underline{E}$xploring Fast-Changing Scientific $\underline{Li}$terature Trends with $\underline{O}$nline Da$\underline{t}$a and Learning

    cs.IR 2026-05 unverdicted novelty 5.0

    Eliot is a query-time clustering and temporal visualization system for arXiv literature, evaluated via offline metrics on eight domains and a user survey showing 85% meaningful cluster labels.

  22. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

    cs.AI 2026-05 unverdicted novelty 5.0

    pArticleMap combines article embeddings, graph-based frontier extraction, and agentic LLMs to map nanomedicine literature and generate hypotheses, achieving 10.8% gold recovery and 61% future-neighborhood rate in retr...

  23. Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

    eess.SY 2026-05 unverdicted novelty 5.0

    The paper introduces Experiment-as-Code Labs as a declarative stack synthesizing AI agents, systems orchestration, and physical lab control for AI-driven discovery.

  24. Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

    eess.SY 2026-05 unverdicted novelty 5.0

    Experiment-as-Code Labs encodes experiments as declarative configurations that AI agents generate, systems software analyzes and orchestrates, and device APIs execute on physical lab hardware.

  25. Plasma GraphRAG: Physics-Grounded Parameter Selection for Gyrokinetic Simulations

    physics.plasm-ph 2026-04 unverdicted novelty 5.0

    Plasma GraphRAG automates physics-grounded parameter selection for gyrokinetic simulations via a domain-specific knowledge graph and LLMs, reporting over 10% better quality and up to 25% fewer hallucinations than stan...

  26. Diverse LLMs or Diverse Question Interpretations? That is the Ensembling Question

    cs.CL 2025-07 unverdicted novelty 5.0

    Question interpretation diversity outperforms model diversity for LLM ensembling on binary QA tasks using majority voting.

  27. In Context Learning and Reasoning for Symbolic Regression with Large Language Models

    cs.CL 2024-10 unverdicted novelty 5.0

    GPT-4 models rediscover Langmuir isotherms and produce fits on Nikuradse pipe-flow data via iterative chain-of-thought prompting with scientific context and external code feedback.

  28. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

  29. Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval

    cs.CV 2026-01 unverdicted novelty 4.0

    Large-scale profiling of recent AI literature shows growth in safety, multimodal reasoning, and agent studies alongside stabilization in neural machine translation and graph methods.

  30. Retrieval-Augmented Generation for Large Language Models: A Survey

    cs.CL 2023-12 unverdicted novelty 3.0

    A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

  31. Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

    cs.CL 2025-02 unverdicted novelty 2.0

    Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.