
arxiv: 1907.04358 · v1 · pith:73G3QUDYnew · submitted 2019-07-09 · 💻 cs.LO · q-bio.PE · stat.ML

Making Study Populations Visible through Knowledge Graphs

Pith reviewed 2026-05-24 23:56 UTC · model grok-4.3

classification 💻 cs.LO · q-bio.PE · stat.ML
keywords study cohorts · knowledge graphs · ontology · Table 1 · clinical practice guidelines · population analysis · semantic web · RDF

The pith

A Study Cohort Ontology represents Table 1 population data as knowledge graphs so practitioners can compare study cohorts to their own patients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a Study Cohort Ontology (SCO) to standardize the descriptions of research study populations that appear in the first table of published papers, commonly called Table 1. These standardized representations are stored as RDF knowledge graphs, with property associations drawn from the existing Semanticscience Integrated Ontology (SIO). The system then supports queries and visualizations that let users assess how closely a given clinical population matches the study cohort behind a treatment recommendation. The central goal is to make it easier to judge whether trial results apply to the patients actually being treated.

Core claim

By building the Study Cohort Ontology on top of SIO property associations, the authors encode the three main elements of Table 1s—collections of study subjects, subject characteristics, and statistical measures—directly in RDF. This declarative modeling turns opaque population descriptions into queryable graph data that supports population analysis scenarios and cohort similarity visualizations.

What carries the argument

The Study Cohort Ontology (SCO), which encodes vocabulary, subject collections, characteristics, and statistical measures from Table 1s using SIO property associations in RDF knowledge graphs.
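As a rough illustration of what such an encoding might look like, the sketch below represents one Table 1 row (age, with mean and standard deviation, for one study arm) as subject-predicate-object triples in plain Python. Every identifier in it (the `ex:` names, `sco:StudyArm`, the `sio:` property labels) and every number is a hypothetical placeholder chosen for readability, not an actual SCO or SIO IRI or a value from the paper.

```python
# Hypothetical sketch: one Table 1 row as RDF-style triples.
# All prefixed names and numbers are illustrative placeholders,
# not real SCO or SIO identifiers.

def encode_row(arm, characteristic, measures, unit):
    """Return triples linking a study arm to one characteristic and its statistics."""
    attr = f"{arm}_{characteristic}"
    triples = [
        (arm, "rdf:type", "sco:StudyArm"),
        (arm, "sio:hasAttribute", attr),
        (attr, "rdf:type", f"ex:{characteristic}"),
        (attr, "sio:hasUnit", unit),
    ]
    # Each statistical measure becomes its own node attached to the characteristic.
    for name, value in measures.items():
        node = f"{attr}_{name}"
        triples.append((attr, "sio:hasAttribute", node))
        triples.append((node, "rdf:type", f"ex:{name}"))
        triples.append((node, "sio:hasValue", value))
    return triples

triples = encode_row(
    "ex:arm1", "Age", {"Mean": 66.4, "StandardDeviation": 7.2}, "ex:Year"
)
for s, p, o in triples:
    print(s, p, o)
```

Giving each statistic its own node, rather than hanging the number directly on the study arm, is one way to keep the measure type, value, and unit separately queryable; whether SCO models it this way is not confirmed by the text above.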

If this is right

  • Practitioners can run declarative queries to compare their patient population against study cohorts.
  • Cohort similarity visualizations become possible from the standardized graph data.
  • Clinically relevant inferences about study population applicability can be derived without manual extraction of Table 1 details.
  • Treatment guideline users gain a structured way to evaluate generalizability of trial results.
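The first consequence above can be sketched as a small pattern query over triples. This is a minimal illustration in plain Python, not the paper's query mechanism: the flattened `ex:hasMeanAge` and `ex:hasSdAge` properties, the arm names, the values, and the "within k standard deviations" rule are all assumptions made for the example.

```python
# Hypothetical sketch: a declarative pattern query over cohort triples.
# Property names, arms, and values are illustrative placeholders.

TRIPLES = [
    ("ex:arm1", "ex:hasMeanAge", 66.4),
    ("ex:arm1", "ex:hasSdAge", 7.2),
    ("ex:arm2", "ex:hasMeanAge", 52.0),
    ("ex:arm2", "ex:hasSdAge", 5.0),
]

def match(pattern, triples):
    """Yield triples matching an (s, p, o) pattern; None acts as a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

def arms_covering(age, triples, k=2.0):
    """Return arms whose mean age +/- k standard deviations contains `age`."""
    hits = []
    for arm, _, mean in match((None, "ex:hasMeanAge", None), triples):
        sd = next(match((arm, "ex:hasSdAge", None), triples))[2]
        if abs(age - mean) <= k * sd:
            hits.append(arm)
    return hits

print(arms_covering(75, TRIPLES))
```

In a real deployment the same question would more likely be posed in a graph query language against the stored knowledge graph; the wildcard matcher here only stands in for that step.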

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph structure could be linked to electronic health record data for automated matching at the point of care.
  • Extending the ontology to capture inclusion/exclusion criteria beyond basic Table 1 statistics would increase its utility.
  • Publication of SCO-annotated Table 1s alongside papers would create a reusable public resource for population comparison.

Load-bearing premise

The vocabulary and statistical measures found in ordinary Table 1s can be fully and losslessly captured by the SCO and SIO associations without external data sources or further extensions.

What would settle it

A real Table 1 whose reported terms or statistical measures cannot be represented in the SCO without loss of information or the need for additional ontology terms would show the encoding is incomplete.
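A test of this kind could be approximated by a simple coverage check. The sketch below compares the measures reported in a Table 1 against a set of measures a vocabulary is assumed to support. The `SUPPORTED` set mirrors the measures named in the referee report further down (means, SDs, percentages, p-values); it is an assumption for illustration and not a statement of SCO's actual coverage.

```python
# Hypothetical sketch: flag Table 1 measures that a vocabulary cannot represent.
# SUPPORTED is an assumed set for illustration, not SCO's actual coverage.

SUPPORTED = {"mean", "standard deviation", "percentage", "p-value"}

def uncovered(table1_measures, supported=SUPPORTED):
    """Return the reported measures with no counterpart in `supported`, sorted."""
    return sorted({m.lower() for m in table1_measures} - supported)

reported = ["Mean", "Standard deviation", "Median", "Interquartile range", "Percentage"]
gaps = uncovered(reported)
print(gaps)
```

A non-empty result for a real Table 1 would be the kind of counterexample described above, provided the supported set accurately reflects the ontology.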

Figures

Figures reproduced from arXiv: 1907.04358 by Amar K. Das, Deborah L. McGuinness, James P. McCusker, Kristin P. Bennett, Miao Qi, Nkcheniyere N. Agu, Oshani Seneviratne, Shruthi Chari.

Figure 1: An overview of the cohort analytics workflow which 1) ingests terms from population descriptions of research studies, 2) standardizes their representations via KR techniques and 3) supports study applicability applications. The numbering is in-line with the figure and is indicative of data flow.

Figure 2: A) A high-level overview of SCO that captures the vocabulary and associations needed to model the descriptions of study populations. B) We depict associations that cannot be realized without actual instantiation of …

Figure 3: An annotated example of …

Figure 4: A snapshot of our faceted browser tool that provides medical practitioners with the ability to customize cohort analyses. Currently, the feature facets are limited to the patient features from NHANES, that overlap with, …
Original abstract

Treatment recommendations within Clinical Practice Guidelines (CPGs) are largely based on findings from clinical trials and case studies, referred to here as research studies, that are often based on highly selective clinical populations, referred to here as study cohorts. When medical practitioners apply CPG recommendations, they need to understand how well their patient population matches the characteristics of those in the study cohort, and thus are confronted with the challenges of locating the study cohort information and making an analytic comparison. To address these challenges, we develop an ontology-enabled prototype system, which exposes the population descriptions in research studies in a declarative manner, with the ultimate goal of allowing medical practitioners to better understand the applicability and generalizability of treatment recommendations. We build a Study Cohort Ontology (SCO) to encode the vocabulary of study population descriptions, that are often reported in the first table in the published work, thus they are often referred to as Table 1. We leverage the well-used Semanticscience Integrated Ontology (SIO) for defining property associations between classes. Further, we model the key components of Table 1s, i.e., collections of study subjects, subject characteristics, and statistical measures in RDF knowledge graphs. We design scenarios for medical practitioners to perform population analysis, and generate cohort similarity visualizations to determine the applicability of a study population to the clinical population of interest. Our semantic approach to make study populations visible, by standardized representations of Table 1s, allows users to quickly derive clinically relevant inferences about study populations.
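The abstract does not say how cohort similarity is computed. As one possible reading, the sketch below scores each shared characteristic with a standardized mean difference, a common way to compare two groups summarized by mean and standard deviation. The choice of this measure, the characteristic names, and all numbers are assumptions for illustration, not the authors' method or data.

```python
import math

# Hypothetical sketch: per-characteristic similarity between a study cohort
# and a clinical population, each summarized as (mean, standard deviation).
# The measure and the numbers are illustrative assumptions.

def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled

study = {"age": (66.4, 7.2), "bmi": (28.1, 4.5)}
clinic = {"age": (59.0, 9.6), "bmi": (28.1, 4.5)}

# Smaller scores mean the two populations are closer on that characteristic.
scores = {k: standardized_difference(*study[k], *clinic[k]) for k in study}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scores like these could feed a similarity plot with one bar or point per characteristic, so that a practitioner can see at a glance where the study cohort and the clinical population diverge.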

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the development of a Study Cohort Ontology (SCO) that encodes vocabulary from Table 1s in clinical research studies. It leverages the Semanticscience Integrated Ontology (SIO) to define property associations, models collections of study subjects, characteristics, and statistical measures as RDF knowledge graphs, and outlines prototype scenarios for population analysis and cohort similarity visualizations. The goal is to enable medical practitioners to assess how well their clinical populations match study cohorts, thereby supporting inferences about the applicability and generalizability of treatment recommendations in clinical practice guidelines.

Significance. If the SCO provides a complete representation and the visualizations support the claimed inferences, the work could address a real barrier in translating clinical trial results to practice by making cohort descriptions queryable and comparable. Credit is due for the constructive use of established standards (RDF and SIO) rather than ad-hoc definitions, which supports interoperability. The absence of any evaluation metrics, coverage analysis, or user studies means the practical significance remains prospective rather than demonstrated.

major comments (2)
  1. [Abstract] Abstract (central claim paragraph): The statement that the semantic approach 'allows users to quickly derive clinically relevant inferences about study populations' is not supported by any evaluation of the prototype scenarios, such as metrics on inference correctness, coverage of real Table 1s, or comparison to existing cohort-matching tools. This directly undermines the load-bearing utility claim.
  2. [SCO and SIO modeling description] Description of SCO construction and SIO usage: The modeling assumes that typical Table 1 content (subject collections, characteristics, means, SDs, percentages, p-values) can be represented losslessly via SCO classes plus SIO property associations (e.g., has measurement value) without external vocabularies or unstated extensions; no completeness argument, example triples, or coverage table is provided to substantiate this.
minor comments (2)
  1. The manuscript would benefit from including at least one concrete RDF example or ontology diagram illustrating how a sample Table 1 row (e.g., age mean and SD) is encoded.
  2. Clarify whether the prototype system is fully implemented or remains at the scenario-design stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important points about the scope of claims and the need for additional substantiation in a prototype paper. We address each major comment below.

  1. Referee: [Abstract] Abstract (central claim paragraph): The statement that the semantic approach 'allows users to quickly derive clinically relevant inferences about study populations' is not supported by any evaluation of the prototype scenarios, such as metrics on inference correctness, coverage of real Table 1s, or comparison to existing cohort-matching tools. This directly undermines the load-bearing utility claim.

    Authors: We agree that the abstract phrasing implies demonstrated capability rather than a prototype illustration. The manuscript focuses on ontology development and example scenarios to show how inferences could be derived, without performing quantitative evaluations or comparisons. We will revise the abstract to replace 'allows users to quickly derive' with 'is intended to enable users to derive' and add a sentence clarifying that the scenarios are illustrative. This is a partial revision that clarifies scope without adding new evaluation work.

    revision: partial

  2. Referee: [SCO and SIO modeling description] Description of SCO construction and SIO usage: The modeling assumes that typical Table 1 content (subject collections, characteristics, means, SDs, percentages, p-values) can be represented losslessly via SCO classes plus SIO property associations (e.g., has measurement value) without external vocabularies or unstated extensions; no completeness argument, example triples, or coverage table is provided to substantiate this.

    Authors: The modeling uses SCO classes combined with SIO properties to represent the core Table 1 elements described in the paper. We acknowledge the absence of explicit example triples or a coverage table. In revision we will add a new subsection with concrete RDF triples for representative Table 1 content (means, SDs, percentages, p-values) and a short table listing the covered statistical measures. A formal completeness argument across all possible Table 1 variations is outside the scope of this prototype-focused work, as it would require a separate corpus study; the revision will instead note the intended coverage based on the examples used.

    revision: partial

Circularity Check

0 steps flagged

No circularity: constructive ontology work on external standards


The paper constructs the Study Cohort Ontology (SCO) to represent Table 1 content and leverages the independent, pre-existing Semanticscience Integrated Ontology (SIO) for property associations in RDF. No equations, fitted parameters, predictions, or derivations appear. The central claim is an engineering demonstration of standardized representations using external vocabularies (RDF, SIO); it does not reduce to self-definition, self-citation chains, or renaming of its own inputs. The work is self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Relies on standard semantic web technologies and introduces one new ontology without external validation data in the abstract.

axioms (2)
  • [domain assumption] SIO provides sufficient property associations for linking study subject classes and characteristics
    The abstract states that SIO is leveraged to define property associations between classes.
  • [domain assumption] Table 1 descriptions contain the key components needed for population matching (subjects, characteristics, statistical measures)
    The abstract models these as the core of the knowledge graphs.
invented entities (1)
  • Study Cohort Ontology (SCO) · no independent evidence
    purpose: Encode the vocabulary of study population descriptions from Table 1s
    A new ontology developed specifically for this purpose.

pith-pipeline@v0.9.0 · 5836 in / 1277 out tokens · 20514 ms · 2026-05-24T23:56:26.819348+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We build a Study Cohort Ontology (SCO) to encode the vocabulary of study population descriptions... leverage the well-used Semanticscience Integrated Ontology (SIO) for defining property associations... model the key components of Table 1s, i.e., collections of study subjects, subject characteristics, and statistical measures in RDF knowledge graphs.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    We reuse classes and properties from existing biomedical ontologies... only define them ourselves when they do not exist... tested our ontology with the Hermit reasoner.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
