Constructing large scale biomedical knowledge bases from scratch with rapid annotation of interpretable patterns

Ashok Thillaisundaram; Julien Fauqueur; Theodosia Togia

arxiv: 1907.01417 · v2 · pith:VHSJENT3new · submitted 2019-07-02 · 💻 cs.CL

Pith reviewed 2026-05-25 11:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords biomedical knowledge base construction · relation extraction · pattern-based annotation · zero-shot extraction · interpretable patterns · drug discovery · knowledge base completion

The pith

Domain experts can label thousands of biomedical relationship pairs in minutes by marking interpretable patterns instead of individual facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a method that extracts facts for a chosen binary relationship from biomedical text without any training data or hand-crafted rules. It automatically finds and ranks the most salient patterns, then shows them to experts in readable form so that marking a pattern as compatible labels every candidate pair that uses that pattern. Even starting with zero seed examples, experts can thereby collect thousands of high-quality pairs within minutes. When a small number of pairs already exist, the system uses them either to improve the ranking of patterns or to generate additional weak labels automatically. The resulting sets are shown to be useful both for direct knowledge-base population and for downstream knowledge-base completion tasks.

Core claim

By discovering, ranking and presenting the most salient patterns to domain experts in an interpretable form, and allowing experts to mark patterns as compatible with the desired relationship type, the system enables indirect batch-annotation of candidate pairs, allowing discovery of thousands of high-quality pairs within minutes even with no seed data.

What carries the argument

Pattern discovery and ranking system that surfaces interpretable patterns for expert compatibility marking, which transfers the relationship label to all matching candidate pairs.
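
The label-transfer mechanism can be sketched as follows, assuming candidate pairs have already been extracted together with the pattern that connects them. The placeholder data, the function names (`index_by_pattern`, `rank_patterns`, `transfer_labels`) and the frequency-based ranking are illustrative assumptions; this page does not state how the paper scores salience.

```python
from collections import defaultdict

# Hypothetical extracted mentions: (entity, entity, pattern) triples.
# Entity names are placeholders, not real biomedical facts.
MENTIONS = [
    ("GENE_A", "DISEASE_1", "X is a therapeutic target for Y"),
    ("GENE_B", "DISEASE_2", "X is a therapeutic target for Y"),
    ("GENE_C", "DISEASE_3", "X is a therapeutic target for Y"),
    ("GENE_D", "DISEASE_4", "X is associated with Y"),
    ("GENE_E", "DISEASE_5", "X is associated with Y"),
    ("GENE_F", "DISEASE_6", "X was measured in Y"),
]


def index_by_pattern(mentions):
    """Map each pattern to the set of distinct pairs it connects."""
    index = defaultdict(set)
    for left, right, pattern in mentions:
        index[pattern].add((left, right))
    return index


def rank_patterns(index):
    """Order patterns by number of distinct pairs (assumed proxy for salience)."""
    return sorted(index, key=lambda p: (-len(index[p]), p))


def transfer_labels(index, compatible_patterns):
    """Label every pair expressed by a pattern the expert marked compatible."""
    labelled = set()
    for pattern in compatible_patterns:
        labelled |= index.get(pattern, set())
    return labelled


index = index_by_pattern(MENTIONS)
ranking = rank_patterns(index)
# One expert decision on one pattern labels three pairs at once.
labelled_pairs = transfer_labels(index, {"X is a therapeutic target for Y"})
```

The point of the sketch is the cost structure: the expert makes one judgement per pattern, and the number of labelled pairs grows with how many pairs each accepted pattern covers.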

If this is right

Knowledge bases for a chosen relationship can be constructed when no relevant facts exist at the start.
A small number of existing pairs, even with a more general relationship, can be used to improve pattern ranking or to generate additional weakly labelled pairs automatically.
The resulting labelled sets support both direct knowledge-base population and downstream knowledge-base completion tasks.
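
The two uses of seed pairs listed above can be illustrated with a small sketch. The scoring rule (fraction of a pattern's pairs that are seeds), the threshold, and all names and data are assumptions made for illustration; the page does not describe the paper's actual ranking model.

```python
# Hypothetical pattern index: pattern -> set of placeholder (gene, disease) pairs.
INDEX = {
    "X is a therapeutic target for Y": {("G1", "D1"), ("G2", "D2"), ("G3", "D3")},
    "X is associated with Y": {("G4", "D4"), ("G5", "D5")},
    "X was measured in Y": {("G6", "D6"), ("G7", "D7"), ("G8", "D8"), ("G9", "D9")},
}

# A handful of known pairs, possibly holding a more general relationship.
SEEDS = {("G1", "D1"), ("G2", "D2"), ("G4", "D4")}


def seed_score(pairs, seeds):
    """Fraction of a pattern's pairs that appear in the seed set."""
    return len(pairs & seeds) / len(pairs) if pairs else 0.0


def rerank_with_seeds(index, seeds):
    """Use i): show experts the patterns with the highest seed overlap first."""
    return sorted(
        index,
        key=lambda p: (-seed_score(index[p], seeds), -len(index[p]), p),
    )


def weak_labels(index, seeds, threshold=0.5):
    """Use ii): auto-label unseen pairs of patterns whose seed overlap is high."""
    weak = set()
    for pairs in index.values():
        if seed_score(pairs, seeds) >= threshold:
            weak |= pairs - seeds
    return weak


reranked = rerank_with_seeds(INDEX, SEEDS)
weak = weak_labels(INDEX, SEEDS, threshold=0.6)
```

In this toy data the most frequent pattern ("X was measured in Y") has no seed overlap, so seed-based ranking moves it to the end even though a purely frequency-based ranking would have shown it first.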

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern-marking workflow could be applied in other scientific domains that lack seed data for relation extraction.
The batches of labelled pairs could serve as training data for supervised models that further scale extraction beyond the initial expert session.
Because patterns remain human-readable, experts retain direct control over which linguistic expressions receive the relationship label.

Load-bearing premise

Marking a pattern as compatible correctly transfers the relationship label to every candidate pair that expresses that pattern without adding substantial noise.

What would settle it

A random sample of the pairs produced from the marked patterns is manually checked for precision; if precision falls substantially below the level claimed for high-quality pairs, the central claim does not hold.
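
A precision check of this kind could be run along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the reviewer judgements are supplied by hand, and the Wilson score interval is one common choice for a binomial proportion rather than anything the paper is known to use.

```python
import math
import random


def sample_for_review(pairs, k, seed=0):
    """Draw up to k distinct pairs for manual checking, reproducibly."""
    rng = random.Random(seed)
    pool = sorted(pairs)  # fixed order so the same seed gives the same sample
    return rng.sample(pool, min(k, len(pool)))


def precision_with_interval(judgements, z=1.96):
    """Return (precision, low, high) using the Wilson score interval.

    judgements: one boolean per reviewed pair, True if the pair is correct.
    z: normal quantile; 1.96 corresponds to roughly 95% confidence.
    """
    n = len(judgements)
    if n == 0:
        raise ValueError("no judgements supplied")
    p = sum(judgements) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, centre - half, centre + half


# Placeholder pairs standing in for the output of the marked patterns.
pairs = {(f"GENE_{i}", f"DISEASE_{i}") for i in range(50)}
to_review = sample_for_review(pairs, k=10)

# Hypothetical outcome of a manual review of 100 pairs: 90 judged correct.
precision, low, high = precision_with_interval([True] * 90 + [False] * 10)
```

Reporting the interval alongside the point estimate makes it clear whether a small review sample can actually distinguish "high-quality" from noisy output.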

Original abstract

Knowledge base construction is crucial for summarising, understanding and inferring relationships between biomedical entities. However, for many practical applications such as drug discovery, the scarcity of relevant facts (e.g. gene X is therapeutic target for disease Y) severely limits a domain expert's ability to create a usable knowledge base, either directly or by training a relation extraction model. In this paper, we present a simple and effective method of extracting new facts with a pre-specified binary relationship type from the biomedical literature, without requiring any training data or hand-crafted rules. Our system discovers, ranks and presents the most salient patterns to domain experts in an interpretable form. By marking patterns as compatible with the desired relationship type, experts indirectly batch-annotate candidate pairs whose relationship is expressed with such patterns in the literature. Even with a complete absence of seed data, experts are able to discover thousands of high-quality pairs with the desired relationship within minutes. When a small number of relevant pairs do exist - even when their relationship is more general (e.g. gene X is biologically associated with disease Y) than the relationship of interest - our system leverages them in order to i) learn a better ranking of the patterns to be annotated or ii) generate weakly labelled pairs in a fully automated manner. We evaluate our method both intrinsically and via a downstream knowledge base completion task, and show that it is an effective way of constructing knowledge bases when few or no relevant facts are already available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical pattern-discovery and expert batch-annotation workflow that lets users bootstrap specific biomedical relations from zero or very few seed pairs.

The main thing here is a system that finds salient patterns in the literature, ranks them, and lets domain experts mark which ones match a target relation; that mark then labels every candidate pair expressing the pattern. The zero-seed case is the clearest contribution, along with the option to use a handful of more general pairs either to improve ranking or to create weak labels automatically. The workflow is simple and directly targets the scarcity problem in domains like drug discovery, where hand-labeling or training data are expensive to obtain. The claim that experts can surface thousands of pairs in minutes is the sort of concrete, usable result that matters for actual KB building. The paper also reports both intrinsic checks and a downstream KB completion task, which is the right way to evaluate this kind of annotation aid.

The soft spot is exactly the one the stress-test flags: marking a pattern as compatible assumes the pattern is specific enough that the label transfers cleanly. Biomedical text frequently re-uses surface or syntactic patterns under negation, modality, or for different relations, and nothing in the abstract shows how the method detects or filters those cases. Without the actual precision numbers or error analysis it is hard to know how large the noise problem becomes at scale.

This is for readers who build or maintain biomedical KBs and need low-data annotation tools rather than end-to-end neural models. It is worth sending to referees so the evaluation details and pattern-specificity handling can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a pattern-discovery system for pre-specified binary biomedical relations that ranks and surfaces interpretable patterns to domain experts; experts mark patterns as compatible, which batch-labels all candidate entity pairs expressing those patterns. With zero seed data the method claims experts can obtain thousands of high-quality pairs in minutes; with a few (even more general) seed pairs it can improve pattern ranking or produce weak labels automatically. Intrinsic and downstream KB-completion evaluations are asserted.

Significance. If the label-transfer step via pattern marking is shown to be high-precision, the approach supplies a practical, low-data, interpretable route to KB construction in data-scarce biomedical domains. The emphasis on expert pattern annotation rather than instance-by-instance labeling is a genuine strength when seed facts are absent.

major comments (2)

[Abstract] The central claim that experts discover 'thousands of high-quality pairs' in minutes with zero seed data is asserted without any reported counts, precision figures, inter-annotator agreement, or comparison to baselines. The downstream KB-completion evaluation is likewise mentioned but supplies no metrics or dataset details, so the evidence for the claim cannot be assessed.
[Method (pattern compatibility step)] The method's correctness hinges on the assumption that marking a surface or syntactic pattern as compatible transfers the target relation label to every candidate pair expressing it. Biomedical text routinely realizes the same pattern under negation, modality, or for a different relation; without quantitative measurement of false-positive rate on a held-out sample of annotated pairs (or explicit handling of such contexts), the scale-up claim is at risk.

minor comments (2)

Define 'high-quality' explicitly and state how it is measured (manual review? overlap with existing KB? downstream task performance?).
Clarify the exact input representation of patterns (surface strings, dependency paths, etc.) and how ranking is performed when no seeds are available.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive view of the method's potential for low-data KB construction. We address the major comments below, proposing revisions to strengthen the manuscript.

Referee: [Abstract] The central claim that experts discover 'thousands of high-quality pairs' in minutes with zero seed data is asserted without any reported counts, precision figures, inter-annotator agreement, or comparison to baselines. The downstream KB-completion evaluation is likewise mentioned but supplies no metrics or dataset details, so the evidence for the claim cannot be assessed.

Authors: We agree that the abstract lacks specific quantitative support for the claims. The body of the manuscript includes intrinsic evaluations reporting the number of high-quality pairs discovered (thousands in minutes), precision figures from expert annotations, and details on the downstream KB-completion task including datasets and performance metrics. We will revise the abstract to include key results such as the scale of pairs obtained and evaluation outcomes to make the evidence clear.
revision: yes

Referee: [Method (pattern compatibility step)] The method's correctness hinges on the assumption that marking a surface or syntactic pattern as compatible transfers the target relation label to every candidate pair expressing it. Biomedical text routinely realizes the same pattern under negation, modality, or for a different relation; without quantitative measurement of false-positive rate on a held-out sample of annotated pairs (or explicit handling of such contexts), the scale-up claim is at risk.

Authors: We acknowledge this important point regarding potential false positives from negation, modality, or relation ambiguity. The system presents patterns in an interpretable form to allow experts to judge compatibility carefully, which in practice helps mitigate such issues. However, the manuscript does not include a dedicated quantitative measurement of the false-positive rate on held-out annotated pairs. We will add an explicit discussion of this limitation and its implications for the scale-up claim in the revised version.
revision: yes

Circularity Check

0 steps flagged

No circularity: practical annotation workflow with no fitted predictions or self-referential derivations

The paper describes an interactive pattern-discovery and expert-marking system for batch-annotating biomedical relation pairs from literature. No equations, parameters, or statistical predictions are defined; the core process is human-in-the-loop pattern compatibility marking that directly produces the claimed pairs. The abstract and described method contain no self-citation load-bearing steps, no fitted inputs renamed as predictions, and no ansatz or uniqueness claims that reduce to prior author work. The central claim (thousands of pairs discoverable in minutes from zero seed data) rests on the empirical effectiveness of the annotation interface rather than any closed mathematical loop. This is a standard non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that text patterns can be automatically discovered, ranked by salience, and reliably interpreted by experts to reflect specific relationship types across instances.

axioms (1)

domain assumption: Patterns in biomedical text can be discovered and ranked such that expert judgments on pattern compatibility transfer accurately to entity pairs.
Invoked as the mechanism enabling batch annotation without seed data.

pith-pipeline@v0.9.0 · 5801 in / 1236 out tokens · 49205 ms · 2026-05-25T11:05:12.763388+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "Our system discovers, ranks and presents the most salient patterns to domain experts in an interpretable form. By marking patterns as compatible with the desired relationship type, experts indirectly batch-annotate candidate pairs..."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We propose a number of methods for extracting patterns from a sentence... PATH: shortest path between the two entity mentions in the dependency graph..."
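
The PATH pattern quoted in the passage above (the shortest path between the two entity mentions in the dependency graph) can be illustrated with a breadth-first search. This is a minimal sketch: the hand-written parse, the placeholder entity names and the function `shortest_path` are assumptions for illustration, edges are treated as undirected, and tokens are assumed to be unique within the sentence. In practice the edges would come from a dependency parser.

```python
from collections import deque

# Hypothetical parse of "GENE_A is a therapeutic target for DISEASE_1",
# written as (head, dependent) edges.
EDGES = [
    ("target", "GENE_A"),       # subject
    ("target", "is"),           # copula
    ("target", "a"),            # determiner
    ("target", "therapeutic"),  # adjectival modifier
    ("target", "DISEASE_1"),    # nominal modifier
    ("DISEASE_1", "for"),       # case marker
]


def shortest_path(edges, source, target):
    """Breadth-first search over the undirected dependency graph."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two mentions are not connected


path = shortest_path(EDGES, "GENE_A", "DISEASE_1")
# Replace the entity mentions with placeholders to obtain a reusable pattern.
pattern = " ".join(["X"] + path[1:-1] + ["Y"])
```

Because breadth-first search explores the graph level by level, the first path that reaches the second mention is guaranteed to be a shortest one.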

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.