pith. machine review for the scientific record.

arxiv: 2602.15019 · v5 · submitted 2026-02-16 · 💻 cs.AI · cs.IR

Recognition: 1 theorem link · Lean Theorem

Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence


Pith reviewed 2026-05-15 21:28 UTC · model grok-4.3

classification 💻 cs.AI cs.IR
keywords drug asset scouting · AI agents · global search · bio-pharmaceutical · benchmark evaluation · multilingual sources · competitive intelligence · F1 score

The pith

The Bioptic Agent, a tree-based self-learning system, achieves a 79.7% F1 score when scouting global drug assets, outperforming major AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a specialized AI agent can locate bio-pharmaceutical assets from worldwide sources more completely than current general-purpose research tools. Most new drug patents and candidates now originate outside the United States, especially in non-English channels, so missing them creates large financial exposure for investors and business development teams. The authors built a benchmark from real expert screening queries using a multilingual multi-agent pipeline to generate test cases with ground-truth assets. Their Bioptic Agent reaches 79.7% F1 on this test, well ahead of Gemini 3.1 Deep Think at 59.2%, Gemini 3.1 Pro Deep Research at 58.6%, Claude Opus 4.6 at 56.2%, OpenAI GPT-5.2 Pro at 46.6%, Perplexity Deep Research at 44.2%, and Exa Websets at 26.9%. Results improve with added compute, indicating that wider search scales with resources.

Core claim

A tuned, tree-based self-learning Bioptic Agent enables complete, non-hallucinated scouting of bio-pharmaceutical assets from heterogeneous, multilingual sources. On a benchmark of complex investor queries paired with ground-truth assets largely outside U.S.-centric radar, generated via a multilingual multi-agent pipeline from expert priors, the agent scores 79.7% F1. This exceeds Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance rises steeply with additional compute.
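
For orientation, the F1 score here is the harmonic mean of precision and recall over the set of assets an agent returns versus the ground-truth set, as the Figure 1 caption notes. A minimal sketch in Python, assuming exact string matching between asset names; in the paper, matching is decided by a calibrated LLM judge, so this is a simplification:

```python
def f1_score(predicted: set[str], ground_truth: set[str]) -> float:
    """F1 over set-valued answers: harmonic mean of precision and recall."""
    if not predicted or not ground_truth:
        return 0.0
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: an agent finds 4 assets, 3 of which appear in a
# 5-asset ground-truth list -> precision 0.75, recall 0.60, F1 ~0.667.
found = {"asset-a", "asset-b", "asset-c", "asset-d"}
truth = {"asset-a", "asset-b", "asset-c", "asset-e", "asset-f"}
print(round(f1_score(found, truth), 3))  # 0.667
```

Because F1 penalizes both missed assets and spurious ones, it is a natural single number for a task where completeness and non-hallucination are both load-bearing.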

What carries the argument

The Bioptic Agent is a tuned tree-based self-learning AI system that performs wide global searches to discover drug assets without hallucinations.

If this is right

  • Drug investors and business development teams can reduce multi-billion-dollar coverage gaps by surfacing under-the-radar candidates from non-English and regional sources.
  • Scouting speed and completeness increase as more compute is allocated, allowing faster responses to shifts in global innovation pipelines.
  • The benchmarking methodology supplies a repeatable standard for evaluating future AI agents on realistic pharma intelligence tasks.
  • Lower hallucination rates in generated asset reports increase reliability for automated initial screening.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same wide-search design could transfer to tracking emerging technologies or regulatory filings in other multilingual domains.
  • Pairing the agent with periodic human review loops might further raise accuracy in high-value investment decisions.
  • Widespread adoption could move industry practice from US-centric screening toward systematic global monitoring of drug development.

Load-bearing premise

That the LLM-as-judge evaluation, calibrated to expert opinions, accurately measures real-world scouting performance, and that the generated benchmark queries faithfully represent the complexity of actual investor screening tasks.
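
This premise can be made concrete. Grading reduces to asking a judge model, for each returned asset, whether it names the same asset as a ground-truth entry and satisfies the query. A hedged sketch of that check, where the prompt wording and the injected `call_llm` callback are illustrative assumptions rather than the paper's protocol:

```python
from typing import Callable

def judge_match(call_llm: Callable[[str], str], query: str,
                candidate: str, ground_truth: str) -> bool:
    """Ask a judge model whether a returned asset matches a ground-truth
    entry under the query. Prompt wording is an illustrative assumption,
    not the paper's protocol."""
    prompt = (
        f"Screening query: {query}\n"
        f"Candidate asset: {candidate}\n"
        f"Ground-truth asset: {ground_truth}\n"
        "Do these describe the same drug asset, and does the candidate "
        "satisfy the query's constraints? Answer YES or NO."
    )
    return call_llm(prompt).strip().upper().startswith("YES")
```

Everything downstream, including the 79.7% F1 and the gaps to the baselines, inherits whatever bias this verdict function carries, which is why the calibration to expert opinions is load-bearing.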

What would settle it

Running the Bioptic Agent and the competing systems on a fresh set of live screening queries drawn from current investor deals, then measuring recall against verified closed transactions or expert-confirmed asset lists, would settle whether the reported performance advantage holds.

Figures

Figures reproduced from arXiv: 2602.15019 by Alisa Vinogradova, Andrey Doronichev, Dmitrii Radkevich, Dmitry Kobyzev, Ilya Yasny, Kong Nguyen, Luba Greenwood, Roman Doronin, Shoman Kasbekar, Vlad Vinogradov.

Figure 1
Figure 1. Quality-time tradeoff for asset scouting. y-axis: F1 score (harmonic mean of precision and recall; higher is better). x-axis: wall-clock time (log scale; larger indicates longer compute). "DR" stands for Deep Research; "lang-free" stands for no language parallelism. view at source ↗
Figure 2
Figure 2. Completeness Benchmark construction pipeline. Top: Assets Mining. The Regional News Miner Agent surfaces regional-stage drug assets from non-English sources; the Attributes Enrichment Agent validates and structures each asset; the Google Search Agent prioritizes under-the-radar assets via an English vs. origin-language discoverability check. Bottom: Query Generation. Real investor queries are clustered by in… view at source ↗ (a structural sketch of this pipeline follows the figure list)
Figure 3
Figure 3. Distribution of asset origin language and therapeutic areas in the benchmark test split. Left: proportion of assets by origin language. Right: proportion of therapeutic-area labels assigned across assets. view at source ↗
Figure 4
Figure 4. Benchmark query composition. Left: distribution of queries across difficulty tiers (Broad, Tight, Complex/multi-hop). Right: prevalence of high-level constraint categories across queries (multi-label). A single query can trigger multiple categories; a category is counted once per query if the query contains that type of constraint. view at source ↗
Figure 7
Figure 7. Part of the Coach Agent's generated tree of directives for the query "Find RNA-targeting therapeutics for chronic hepatitis B". view at source ↗
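
To make the Figure 2 asset-mining pipeline concrete, here is a minimal structural sketch of its three stages as Python stubs. All class fields, function names, and the discoverability rule below are illustrative assumptions; in the paper each stage is an LLM-backed agent, not a plain function:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Structured record for one surfaced drug asset (fields illustrative)."""
    name: str
    origin_language: str
    attributes: dict = field(default_factory=dict)
    under_the_radar: bool = False

def mine_regional_news(sources: list[str]) -> list[Asset]:
    """Stage 1 (Regional News Miner Agent): surface regional-stage drug
    assets from non-English sources. Stub; agent-backed in the paper."""
    return []

def enrich_attributes(asset: Asset) -> Asset:
    """Stage 2 (Attributes Enrichment Agent): validate each asset and
    structure its attributes. Stub; agent-backed in the paper."""
    return asset

def flag_under_the_radar(asset: Asset, english_hits: int,
                         origin_language_hits: int) -> Asset:
    """Stage 3 (Google Search Agent): an asset discoverable in its origin
    language but not in English counts as under the radar. This exact
    decision rule is an assumption, not the paper's."""
    asset.under_the_radar = origin_language_hits > 0 and english_hits == 0
    return asset

# Assemble the benchmark's asset pool: mine, enrich, then prioritize
# assets an English-only search would miss.
pool = [
    flag_under_the_radar(enrich_attributes(a),
                         english_hits=0, origin_language_hits=3)
    for a in mine_regional_news(["<regional feeds>"])
]
```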
read the original abstract

Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total; a growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high-recall discovery across heterogeneous, multilingual sources without hallucinations. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real deal complexity, we collected screening queries from expert investors, BD, and VC professionals, and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a benchmarking methodology for drug asset scouting in biopharma, constructed via a multilingual multi-agent pipeline that generates complex queries from expert-derived priors paired with ground-truth assets largely outside U.S.-centric sources. It presents a tuned tree-based self-learning Bioptic Agent that achieves 79.7% F1 on this benchmark, outperforming Gemini 3.1 Deep Think (59.2%), Gemini 3.1 Pro Deep Research (58.6%), Claude Opus 4.6 (56.2%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%), with performance scaling with additional compute.

Significance. If the benchmark construction and LLM-as-judge evaluation are shown to be robust and externally validated, the work could meaningfully advance specialized AI agents for high-recall, low-hallucination scouting across multilingual global sources, with direct relevance to investment risk reduction and business development in biopharma. The scaling observation aligns with broader AI trends and is a positive element.

major comments (2)
  1. [Benchmark construction] Benchmark construction paragraph: the description of the multilingual multi-agent pipeline for generating queries from expert priors and establishing ground-truth assets supplies no specifics on pipeline steps, query complexity controls, or external validation against real investor screening tasks, leaving open whether the benchmark reproduces the recall/hallucination profile of actual BD workflows.
  2. [Evaluation] Evaluation section: the LLM-as-judge is described as calibrated to expert opinions, yet no inter-rater agreement metrics (Cohen’s kappa or equivalent), calibration sample size, or hold-out human validation results are reported; this directly undermines confidence in the 79.7% F1 claim and the reported gaps versus baselines, as judge bias or benchmark mismatch cannot be ruled out.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'tuned, tree-based self-learning Bioptic Agent' is introduced without a forward reference to the architecture section; adding a brief parenthetical on key components would improve readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and will revise the paper to incorporate additional details where the current description is insufficient.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction paragraph: the description of the multilingual multi-agent pipeline for generating queries from expert priors and establishing ground-truth assets supplies no specifics on pipeline steps, query complexity controls, or external validation against real investor screening tasks, leaving open whether the benchmark reproduces the recall/hallucination profile of actual BD workflows.

    Authors: We agree that the manuscript's description of the benchmark construction is high-level and lacks the requested specifics on pipeline steps, query complexity controls, and external validation. In the revised version, we will expand the relevant section to detail the multilingual multi-agent pipeline steps (including agent roles, conditional generation logic from expert priors, and ground-truth asset sourcing), introduce explicit controls for query complexity (such as reasoning depth, multilingual coverage, and diversity metrics), and report results from external validation comparing benchmark queries to anonymized real-world investor screening tasks provided by our expert collaborators. This will strengthen the demonstration that the benchmark aligns with actual BD workflow profiles. revision: yes

  2. Referee: [Evaluation] Evaluation section: the LLM-as-judge is described as calibrated to expert opinions, yet no inter-rater agreement metrics (Cohen’s kappa or equivalent), calibration sample size, or hold-out human validation results are reported; this directly undermines confidence in the 79.7% F1 claim and the reported gaps versus baselines, as judge bias or benchmark mismatch cannot be ruled out.

    Authors: We acknowledge that the manuscript does not report inter-rater agreement metrics, calibration sample size, or hold-out human validation results for the LLM-as-judge. In the revision, we will add a new subsection detailing the evaluation protocol, including the calibration sample size (number of expert-annotated examples), Cohen’s kappa (or equivalent) scores measuring agreement between the LLM judge and human experts, and performance on a hold-out validation set where independent human experts scored a subset of outputs. These additions will provide quantitative support for the judge's reliability and help rule out bias concerns. revision: yes
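
For reference, the agreement statistic promised here is straightforward to compute once paired verdicts exist. A minimal sketch assuming binary correct/incorrect labels from the LLM judge and a human expert on the same graded outputs; scikit-learn's `cohen_kappa_score` is a real function with this signature, while the label vectors are made up:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical paired verdicts on the same graded outputs:
# 1 = "correct asset", 0 = "incorrect or hallucinated".
llm_judge_verdicts = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
human_expert_verdicts = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

kappa = cohen_kappa_score(llm_judge_verdicts, human_expert_verdicts)
print(f"Judge-expert agreement (Cohen's kappa): {kappa:.2f}")
```

By the common Landis and Koch reading, values above roughly 0.8 indicate almost-perfect agreement; substantially lower values would make the reported baseline gaps harder to attribute to the agents rather than to the judge.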

Circularity Check

0 steps flagged

No circularity: benchmark and evaluation are externally constructed from expert priors without reduction to fitted parameters or self-referential definitions

full rationale

The paper's central claim is an empirical F1 score on a benchmark built from expert-collected screening queries used as priors for conditional generation, followed by LLM-as-judge grading calibrated to expert opinions. This chain does not reduce by construction to quantities defined inside the agent's own outputs or fitted parameters; the benchmark queries and ground-truth assets are presented as independent of the Bioptic Agent's internal mechanisms. No equations, self-citations, uniqueness theorems, or ansatzes are invoked to force the performance result. The evaluation therefore remains a standard (if imperfect) external benchmark comparison rather than a self-definitional or fitted-input prediction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the reliability of the LLM-as-judge calibration and the representativeness of the expert-derived benchmark queries; the agent itself is presented as a tuned system whose internal parameters are not enumerated.

free parameters (1)
  • tuning parameters of Bioptic Agent
    The agent is described as tuned and self-learning, implying parameters adjusted to achieve the reported F1 score.
axioms (1)
  • domain assumption: LLM-as-judge evaluation calibrated to expert opinions accurately reflects true scouting quality
    Used to grade all model outputs on the benchmark.
invented entities (1)
  • Bioptic Agent (no independent evidence)
    purpose: Complete non-hallucinated scouting of drug assets across multilingual sources
    New specialized agent introduced and benchmarked in the paper.

pith-pipeline@v0.9.0 · 5689 in / 1476 out tokens · 24841 ms · 2026-05-15T21:28:06.518818+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    The R&D productivity challenge: transforming the pharmaceutical ecosystem

    Alexander Schuhmacher, Oliver Gassmann, Sebastian Kwisda, Malte Kremer, Markus Hinder, and Dominik Hartl. The R&D productivity challenge: transforming the pharmaceutical ecosystem. Drug Discovery Today.

  2. [2]

    External innovation: Biopharma dealmaking to boost R&D productivity, 2025

    McKinsey & Company. External innovation: Biopharma dealmaking to boost R&D productivity, 2025. URL https://www.mckinsey.com/industries/life-sciences/our-insights/external-innovation-biopharma-dealmaking-to-boost-r-and-d-productivity

  3. [3]

    World intellectual property indicators 2025

    World Intellectual Property Organization. World intellectual property indicators 2025. https://www.wipo.int/edocs/pubdocs/en/wipo-pub-941-17-2025-en-world-intellectual-property-indicators-2025.pdf, 2025.

  4. [4]

    Pfizer CEO Says U.S. Pharma Industry Needs to Collaborate with China

    Reuters. Pfizer CEO Says U.S. Pharma Industry Needs to Collaborate with China. https://www.reuters.com/business/healthcare-pharmaceuticals/pfizer-ceo-says-us-pharma-industry-needs-collaborate-with-china-2025-10-15/, October 2025.

  5. [5]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. BrowseComp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

  6. [6]

    ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents

    Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, and Bing Liu. ResearchRubrics: A benchmark of prompts and rubrics for evaluating deep research agents, 2025. URL https://arx...

  7. [7]

    DRACO: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026

    Joey Zhong, Hao Zhang, Clare Southern, Jeremy Yang, Thomas Wang, Kate Jung, Shu Zhang, Denis Yarats, Johnny Ho, and Jerry Ma. DRACO: a cross-domain benchmark for deep research accuracy, completeness, and objectivity, 2026. URL https://arxiv.org/abs/2602.11685

  8. [8]

    DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents, 2026

    Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, and Dipanjan Das. DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents, 2026. URL https://arxiv.org/abs/2601.20975

  9. [9]

    WideSearch: Benchmarking Agentic Broad Info-Seeking, 2025

    Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. WideSearch: Benchmarking Agentic Broad Info-Seeking, 2025.

  10. [10]

    An Illusion of Progress? Assessing the Current State of Web Agents, 2025

    Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, and Yu Su. An Illusion of Progress? Assessing the Current State of Web Agents, 2025. URL https://arxiv.org/abs/2504.01382

  11. [11]

    LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

    Alisa Vinogradova, Vlad Vinogradov, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, and Andrey Doronichev. LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence.

  12. [12]

    LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

    URL https://arxiv.org/abs/2508.16571