RCSB PDB AI Help Desk: retrieval-augmented generation for protein structure deposition support

arXiv: 2604.22800 · v1 · submitted 2026-04-13 · cs.IR · cs.AI · cs.CL · q-bio.QM

Vivek Reddy Chithari (1), Jasmine Y. Young (1), Irina Persikova (1), Yuhe Liang (1), Gregg V. Crichlow (1), Justin W. Flatt (1), Sutapa Ghosh (1), Brian P. Hudson (1)

Ezra Peisach (1), Monica Sekharan (1), Chenghua Shao (1), Stephen K. Burley (1, 2)

(1) RCSB Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ, USA; (2) RCSB Protein Data Bank, San Diego Supercomputer Center, University of California San Diego, CA, USA

Pith reviewed 2026-05-10 15:30 UTC · model grok-4.3

keywords: RAG, retrieval-augmented generation, Protein Data Bank, PDB, AI help desk, deposition support, biocuration assistance, LangChain

The pith

A retrieval-augmented generation system now supplies citation-backed answers to Protein Data Bank depositors at any hour.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an AI Help Desk built to handle the thousands of annual inquiries from researchers depositing experimental structures into the Protein Data Bank. A small team of biocurators currently manages this volume while also validating data, so the system retrieves relevant sections from official guides and uses a language model to produce responses that cite their sources. It adds safeguards such as two-stage document processing, relevance-ranked retrieval, an off-topic filter, and a prompt that avoids internal terminology. A reader would care because successful deployment would let biocurators focus on core scientific work instead of routine correspondence while still giving depositors immediate, reliable help. The approach shows how retrieval-augmented methods can scale support for large public scientific repositories.

Core claim

The authors built and deployed a production RAG system on LangChain with a PostgreSQL vector store and a compact GPT model. Documents are extracted from PDFs while preserving layout, split in two stages, and retrieved via maximal marginal relevance; a topical guardrail rejects unrelated queries and a dual-model setup first condenses the question before generating a streaming answer that includes source citations. The full pipeline runs on Kubernetes and is intended to provide continuous assistance without exposing internal biocuration details.
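Maximal marginal relevance, the retrieval step named above, can be illustrated with a minimal standard-library sketch. This is not the authors' code and not LangChain's implementation: the function names, the `lambda_mult` trade-off parameter, and the toy two-dimensional vectors are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult = 1.0 ranks purely by relevance to the query; lower values
    penalize documents that resemble ones already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        # Redundancy is the highest similarity to any already-selected document.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings (hypothetical): documents 0 and 1 are near-duplicates.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [1.0, 0.05], [0.5, 0.85]]
by_relevance = mmr_select(query, docs, k=2, lambda_mult=1.0)
by_mmr = mmr_select(query, docs, k=2, lambda_mult=0.5)
```

In this toy case pure relevance ranking returns both near-duplicates, while the balanced setting swaps the second one for the more distinct document, which is the behaviour that motivates using MMR when several guide sections say nearly the same thing.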

What carries the argument

The dual-LLM retrieval-augmented generation pipeline that combines two-stage chunking, maximal marginal relevance retrieval, a topical guardrail, and a specialized system prompt to keep answers accurate and domain-appropriate.
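The order in which these pieces act can be sketched as plain function composition. The sketch below is an assumption about control flow only: every callable is a hypothetical stand-in (the real system uses LangChain components and a GPT model), the refusal wording is invented, and streaming is not modelled.

```python
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, str]  # (source label, passage text)

# Hypothetical refusal wording, not taken from the deployed system.
REFUSAL = "I can only help with questions about PDB deposition."


def answer_question(
    question: str,
    history: Sequence[str],
    is_on_topic: Callable[[str], bool],
    condense: Callable[[Sequence[str], str], str],
    retrieve: Callable[[str], List[Doc]],
    generate: Callable[[str, List[Doc]], str],
) -> Tuple[str, List[str]]:
    """Run guardrail, condensing, retrieval, and generation in sequence."""
    # 1. Topical guardrail: reject unrelated queries before any retrieval.
    if not is_on_topic(question):
        return REFUSAL, []
    # 2. First model configuration: fold chat history into a standalone question.
    standalone = condense(history, question) if history else question
    # 3. Retrieval over the document index (MMR in the described system).
    docs = retrieve(standalone)
    # 4. Second model configuration: generate the answer from retrieved passages.
    answer = generate(standalone, docs)
    # Deduplicate source labels while keeping retrieval order for citations.
    sources = list(dict.fromkeys(src for src, _ in docs))
    return answer, sources


# Toy stand-ins so the flow can be exercised without any model or database.
def toy_on_topic(q: str) -> bool:
    return any(word in q.lower() for word in ("deposit", "pdb", "validation"))


def toy_condense(history: Sequence[str], q: str) -> str:
    return f"{history[-1]} {q}"


def toy_retrieve(q: str) -> List[Doc]:
    return [
        ("Deposition guide", "passage one"),
        ("Deposition guide", "passage two"),
        ("Validation FAQ", "passage three"),
    ]


def toy_generate(q: str, docs: List[Doc]) -> str:
    return f"Answer to '{q}' based on {len(docs)} passages."


reply, cited = answer_question(
    "How do I deposit a structure?", [],
    toy_on_topic, toy_condense, toy_retrieve, toy_generate,
)
```

Placing the guardrail first means an off-topic query never reaches the retriever or the generating model, which is one plausible reading of why the paper lists it as a separate safeguard.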

If this is right

Depositors receive immediate assistance with submission questions without waiting for human review.
Biocurators spend less time on routine correspondence and more on data validation and curation.
Every response includes direct links or citations to the source documents used.
Support becomes available continuously across time zones for global depositors.
The same architecture could reduce response times for the more than 40 percent of worldwide PDB entries processed by this site.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-plus-guardrail pattern could be reused for help desks at other large biological databases that face similar query loads.
Logging accepted and rejected queries over time would allow the retrieval index to be expanded or the prompt to be refined without changing the core code.
If accuracy remains high, the system could later incorporate additional document types such as video transcripts or interactive deposition tutorials.
Wider adoption might shift the role of biocurators toward overseeing and updating the knowledge base rather than answering individual messages.

Load-bearing premise

The retrieval, filtering, and prompting steps together will keep the model from producing inaccurate claims or revealing non-public biocuration information.

What would settle it

A depositor query about internal validation steps that returns either a factual error about deposition rules or a mention of private curation procedures would show the safeguards have failed.
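One half of that check, the leak of private procedures, lends itself to a simple automated probe. The sketch below is hypothetical throughout: the term list is invented for illustration, and a blocklist scan cannot detect factual errors about deposition rules, which would still need expert review.

```python
import re
from typing import Iterable, List


def find_leaks(response: str, internal_terms: Iterable[str]) -> List[str]:
    """Return the internal terms that appear in a response (case-insensitive)."""
    leaks = []
    for term in internal_terms:
        # Word boundaries avoid flagging a term embedded in a longer word.
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, response, flags=re.IGNORECASE):
            leaks.append(term)
    return leaks


# Invented placeholder terms; the paper does not list its internal vocabulary.
HYPOTHETICAL_TERMS = ["annotation queue", "triage board"]

clean = find_leaks("Please upload your coordinates and data files.", HYPOTHETICAL_TERMS)
leaky = find_leaks("Your entry is waiting in the Annotation Queue.", HYPOTHETICAL_TERMS)
```

A probe of this kind could be run over logged responses to internal-process questions, with any non-empty result counted as a failed safeguard.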

Motivation: Structural biologists have contributed more than 245,000 experimentally determined three-dimensional structures of biological macromolecules to the Protein Data Bank (PDB). Incoming data are validated and biocurated by ~20 expert biocurators across the wwPDB. RCSB PDB biocurators, who process more than 40% of global depositions, face increasing challenges in maintaining efficient Help Desk operations, with approximately 19,000 messages in approximately 8,000 entries received from depositors in 2025.

Results: We developed an AI-powered Help Desk using Retrieval-Augmented Generation (RAG) built on LangChain with a pgvector store (PostgreSQL) and GPT-4.1-mini. The system employs pymupdf4llm for Markdown-preserving PDF extraction, two-stage document chunking, Maximal Marginal Relevance retrieval, a topical guardrail that filters off-topic queries, and a specialized system prompt that prevents exposure of internal terminology. A dual-LLM architecture uses separate model configurations for question condensing and response generation. Deployed in production on Kubernetes with PostgreSQL (pgvector), it provides around-the-clock depositor assistance with citation-backed, streaming responses.

Availability and implementation: Freely available at https://rcsb-deposit-help.rcsb.org.
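The abstract does not say what the two chunking stages are. A common pattern for Markdown extracted from PDFs, shown here purely as an assumption, is to split by heading first and then cut long sections into overlapping, size-bounded windows. All function names and the `max_chars` and `overlap` defaults below are hypothetical.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Stage 1 (assumed): one section per Markdown heading."""
    sections: List[Dict[str, str]] = []
    title, lines = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if any(l.strip() for l in lines):
                sections.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append({"title": title, "text": "\n".join(lines).strip()})
    return sections


def split_by_size(text: str, max_chars: int = 200, overlap: int = 20) -> List[str]:
    """Stage 2 (assumed): overlapping fixed-size windows within a section."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def two_stage_chunks(
    markdown: str, max_chars: int = 200, overlap: int = 20
) -> List[Dict[str, str]]:
    """Apply both stages, keeping the section title on every chunk."""
    chunks = []
    for section in split_by_heading(markdown):
        for piece in split_by_size(section["text"], max_chars, overlap):
            chunks.append({"title": section["title"], "text": piece})
    return chunks


# Toy document: one short section and one section longer than max_chars.
sample = "# Intro\nshort text\n\n## Upload\n" + "x" * 250
chunks = two_stage_chunks(sample)
```

Carrying the section title on each chunk is one way to keep enough context for a citation to point back to a recognizable part of the guide.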

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a clear description of a deployed RAG help desk for PDB depositors using standard tools plus domain guardrails, but it offers no test results or metrics to show the system actually works as intended.

The paper reports a production RAG system built by the RCSB PDB team to answer depositor questions. They use LangChain with pgvector, GPT-4.1-mini, two-stage chunking, MMR retrieval, a topical guardrail, and a dual-LLM setup that separates question condensing from answer generation. A specialized prompt keeps internal biocuration language out of responses. The system is live at a public URL and serves a help desk that received approximately 19,000 depositor messages in 2025. That specific combination for this high-volume scientific archive is new enough to note, even if the underlying pieces are established.

Referee Report

1 major / 0 minor

Summary. The manuscript describes the development and production deployment of an AI-powered Help Desk for RCSB PDB depositors. It uses a RAG pipeline built on LangChain with a pgvector PostgreSQL store and GPT-4.1-mini, incorporating pymupdf4llm PDF extraction, two-stage chunking, Maximal Marginal Relevance retrieval, a topical guardrail, a specialized system prompt to avoid internal terminology, and a dual-LLM setup for question condensing and response generation. The system is reported to deliver citation-backed, streaming responses and is made available at https://rcsb-deposit-help.rcsb.org.

Significance. If the described components reliably suppress hallucinations and enforce domain-appropriate responses, the system could meaningfully reduce the workload on the ~20 biocurators handling ~19,000 annual depositor messages while providing 24/7 support. The public deployment and open URL constitute a concrete, usable artifact that supports reproducibility and potential adoption by other wwPDB sites.

major comments (1)

[Results] Results section: The central claim that the two-stage chunking, MMR retrieval, topical guardrail, and specialized prompt produce accurate, citation-backed, non-hallucinated responses that respect domain terminology is presented without any supporting evidence. No held-out test queries, expert-rated accuracy figures, hallucination rates, ablation results, or production usage metrics are reported, leaving the performance assertions unverified.
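As a sketch of the kind of evidence this comment asks for, expert ratings of held-out queries could be reduced to a few summary rates. The record fields and metric names below are assumptions for illustration; the manuscript reports no such data, and the ratings in the example are invented placeholders rather than results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RatedResponse:
    """One expert judgement of a system response (hypothetical schema)."""
    query_id: str
    correct: bool            # answer judged factually correct
    unsupported_claim: bool  # at least one claim not backed by the cited sources
    cited_source: bool       # response included at least one citation


def summarize(ratings: List[RatedResponse]) -> Dict[str, float]:
    """Compute accuracy, hallucination rate, and citation rate over a test set."""
    n = len(ratings)
    if n == 0:
        raise ValueError("no ratings to summarize")
    return {
        "n": float(n),
        "accuracy": sum(r.correct for r in ratings) / n,
        "hallucination_rate": sum(r.unsupported_claim for r in ratings) / n,
        "citation_rate": sum(r.cited_source for r in ratings) / n,
    }


# Invented placeholder ratings, used only to exercise the function.
example = [
    RatedResponse("q1", correct=True, unsupported_claim=False, cited_source=True),
    RatedResponse("q2", correct=True, unsupported_claim=False, cited_source=True),
    RatedResponse("q3", correct=False, unsupported_claim=True, cited_source=True),
    RatedResponse("q4", correct=True, unsupported_claim=False, cited_source=True),
]
summary = summarize(example)
```

Running the same summary with individual components disabled (for example, retrieval without MMR) would give the ablation comparison the comment also mentions.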

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review of our manuscript on the RCSB PDB AI Help Desk. We address the single major comment below.

Referee: [Results] Results section: The central claim that the two-stage chunking, MMR retrieval, topical guardrail, and specialized prompt produce accurate, citation-backed, non-hallucinated responses that respect domain terminology is presented without any supporting evidence. No held-out test queries, expert-rated accuracy figures, hallucination rates, ablation results, or production usage metrics are reported, leaving the performance assertions unverified.

Authors: We agree that the manuscript presents the system components and their intended effects without quantitative supporting evidence such as held-out test sets, expert-rated accuracy, hallucination rates, ablation studies, or production metrics. The work is a systems description of a production deployment rather than an empirical benchmark paper; no such formal evaluation was performed during development. In the revised manuscript we will add an 'Evaluation and Limitations' subsection that (1) reports any available operational statistics from the live deployment at https://rcsb-deposit-help.rcsb.org, (2) explicitly states that formal quantitative validation was outside the scope of this effort, and (3) moderates language in the Results section to describe the design choices as engineering measures intended to mitigate the listed issues rather than as proven performance guarantees. This revision directly addresses the referee's concern while preserving the manuscript's focus on implementation and availability.

revision: yes

Circularity Check

0 steps flagged

No circularity; straightforward system description with no derivations

The paper is a descriptive engineering report on the implementation and deployment of a RAG-based AI Help Desk. It specifies the use of off-the-shelf components (LangChain, pgvector, GPT-4.1-mini, pymupdf4llm) and standard techniques (two-stage chunking, MMR retrieval, topical guardrail, specialized prompt, dual-LLM architecture) without any equations, fitted parameters, predictions, or first-principles derivations. No load-bearing claims reduce to self-citations, self-definitions, or renamed inputs. The central assertions are architectural and deployment facts, not derived results that could be circular by construction. This is a self-contained systems paper with no derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering implementation report with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5667 in / 1056 out tokens · 33592 ms · 2026-05-10T15:30:28.553051+00:00 · methodology
