DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis

Daohan Su; Guoren Wang; Hongchao Qin; Jia Li; Qiangqiang Dai; Rong-Hua Li; Ru Zhang; Sicheng Liu; Xunkai Li; Yaxin Deng

arxiv: 2602.01839 · v2 · submitted 2026-02-02 · cs.LG · cs.AI · q-bio.GN

Pith reviewed 2026-05-16 08:44 UTC · model grok-4.3

keywords: single-cell transcriptomics, data-centric AI, cell ontology, phylogenetic structure, graph construction, zero-shot evaluation, cross-species alignment

The pith

DOGMA integrates Cell Ontology and phylogenetic structure into cell-graph construction for robust zero-shot single-cell analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current sequence-based and heuristic methods for single-cell transcriptomics either ignore intercellular relationships or lack biological grounding, leading to suboptimal data representations for machine learning. DOGMA addresses this by providing a prior-guided pipeline that statistically aligns cells with Cell Ontology and phylogenetic trees to build cell graphs, then uses Gene Ontology to add functional semantics at the feature level. If successful, this produces cell representations that support stronger zero-shot cell-type prediction across species and organs while lowering memory and time costs in downstream tasks.

Core claim

DOGMA is a data-centric framework that reshapes raw single-cell sequencing data through multi-level biological priors: statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction, plus Gene Ontology for semantic feature enhancement, yielding robust cross-species alignment and improved ML utility.

What carries the argument

The prior-guided graph construction pipeline that integrates statistical alignment with Cell Ontology and phylogenetic structure to produce cell graphs from raw transcriptomics data.
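
One ingredient of ontology-grounded graph construction can be sketched as a hop-count distance between cell-type labels in an ontology. The toy hierarchy and the names `TOY_ONTOLOGY`, `build_undirected`, and `label_distance` below are illustrative assumptions, not DOGMA's implementation, and the paper's own label distance may be defined differently.

```python
from collections import deque

# Hypothetical toy fragment of a cell-type hierarchy (child -> parents).
# Labels and edges are illustrative only, not copied from a Cell Ontology release.
TOY_ONTOLOGY = {
    "T cell": ["lymphocyte"],
    "B cell": ["lymphocyte"],
    "lymphocyte": ["leukocyte"],
    "monocyte": ["leukocyte"],
    "leukocyte": [],
}


def build_undirected(parents):
    """Turn a child -> parents mapping into an undirected adjacency map."""
    adj = {label: set() for label in parents}
    for child, parent_labels in parents.items():
        for parent in parent_labels:
            adj.setdefault(parent, set()).add(child)
            adj[child].add(parent)
    return adj


def label_distance(adj, a, b):
    """Shortest-path hop count between labels a and b (breadth-first search).

    Returns None when the labels are not connected.
    """
    if a == b:
        return 0
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


ADJ = build_undirected(TOY_ONTOLOGY)
```

A distance of this kind could be used to decide which cell pairs are eligible for an edge, so that expression similarity is only trusted between labels that the ontology treats as related.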

If this is right

Strong robustness in strict zero-shot cell-type evaluation on complex multi-species and multi-organ benchmarks.
Improved sample efficiency when training data is limited.
Substantially lower GPU memory and inference time during downstream model evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar prior-integration steps could be tested on other single-cell data types such as proteomics or spatial transcriptomics.
Performance may improve further if the underlying ontologies are periodically refreshed with new biological annotations.
The approach suggests a general pattern for embedding hierarchical priors into graph-based models for other biological sequence domains.

Load-bearing premise

That aligning sequencing data with Cell Ontology and phylogenetic structure yields biologically accurate cell graphs that boost downstream ML performance without injecting ontology biases.

What would settle it

A new multi-species, multi-organ dataset where a purely data-driven heuristic graph method achieves higher zero-shot cell-type accuracy than DOGMA.

Original abstract

Recently, data-centric AI methodology has been a dominant paradigm in single-cell transcriptomics analysis, which treats data representation rather than model complexity as the fundamental bottleneck. In the review of current studies, earlier sequence methods treat cells as independent entities and adapt prevalent ML models to analyze their directly inherited sequence data. Despite their simplicity and intuition, these methods overlook the latent intercellular relationships driven by the functional mechanisms of biological systems and the inherent quality issues of the raw sequencing data. Therefore, a series of structured methods has emerged. Although they employ various heuristic rules to capture intricate intercellular relationships and enhance the raw sequencing data, these methods often neglect biological prior knowledge. This omission incurs substantial overhead and yields suboptimal graph representations, hindering the utility of ML models. To address these issues, we propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data through multi-level biological prior knowledge. Transcending reliance on purely data-driven heuristics, DOGMA provides a prior-guided graph construction pipeline that integrates statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction and robust cross-species alignment. Furthermore, Gene Ontology is utilized to bridge the feature-level semantic gap by incorporating functional priors. In complex multi-species and multi-organ benchmarks, DOGMA exhibits strong robustness in strict zero-shot cell-type evaluation and sample efficiency while using substantially lower GPU memory and inference time in downstream evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DOGMA pulls Cell Ontology and phylogenetic structure into cell-graph construction for single-cell data, but the robustness and efficiency claims need actual numbers and ablations to be convincing.

DOGMA builds a pipeline that uses Cell Ontology to shape cell graphs, adds phylogenetic alignment for cross-species work, and layers in Gene Ontology to enrich features. The goal is to move past heuristic rules that ignore biological priors and instead ground the data representation in established knowledge. That combination is the main new piece here, even if it builds on earlier ontology-guided work in the field.

Referee Report

3 major / 2 minor

Summary. The paper proposes DOGMA, a data-centric framework for single-cell transcriptomics analysis. It integrates statistical alignment with Cell Ontology and phylogenetic structure for biologically grounded cell-graph construction, plus Gene Ontology for feature-level semantic enhancement. The central claim is that this yields strong robustness in strict zero-shot cell-type classification, improved sample efficiency, and substantially lower GPU memory and inference time on complex multi-species and multi-organ benchmarks, outperforming purely data-driven or heuristic baselines.

Significance. If the quantitative claims hold after proper validation, the work would advance data-centric single-cell ML by demonstrating how external biological ontologies can be woven into graph construction without introducing high circularity. This could improve cross-species generalization and efficiency in settings where raw sequencing data is noisy or limited, provided the structural priors add genuine signal rather than database artifacts.

major comments (3)

[Abstract and §3 (method)] The headline claims of 'strong robustness' and 'substantially lower GPU memory and inference time' are stated without any quantitative metrics, baseline tables, ablation results, or error analysis, preventing verification that the reported gains are attributable to the ontology integration rather than other implementation choices.
[§4 (experiments)] No ablation isolates the Cell Ontology + phylogenetic alignment component (e.g., statistical alignment alone versus alignment plus ontology) on the exact zero-shot multi-species splits; without this control, the robustness and sample-efficiency numbers cannot be causally linked to the structural priors.
[§4 and §5] The manuscript does not test or discuss performance degradation on target species with sparse phylogenetic or Cell Ontology coverage, leaving open the risk that reported gains reflect annotation frequency biases rather than true biological homology.

minor comments (2)

[§3] Notation for the multi-level prior integration pipeline is introduced without a clear diagram or pseudocode, making the exact flow from statistical alignment to final graph construction difficult to follow.
[Abstract] The abstract would be strengthened by naming the specific benchmarks and reporting at least the key delta values (e.g., accuracy lift and memory reduction percentages) rather than qualitative descriptors.
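
The delta values the referee asks for can be computed in a few lines. The sketch below is an assumption about how such numbers are usually reported (accuracy lift in percentage points, cost reduction as a percentage of the baseline); the function names are hypothetical and no values from the paper are used.

```python
def accuracy_lift_points(baseline_acc, method_acc):
    """Accuracy lift in percentage points, for accuracies given as fractions in [0, 1]."""
    return (method_acc - baseline_acc) * 100.0


def relative_reduction_pct(baseline_cost, method_cost):
    """Reduction of a cost (e.g. peak GPU memory or inference time) relative to the baseline, in percent."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - method_cost) / baseline_cost * 100.0
```

Reporting both the absolute lift and the relative reduction avoids the ambiguity of a bare "substantially lower", since the reader can see what the baseline was.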

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the empirical support and transparency of our claims without altering the core contributions.

Referee: [Abstract and §3 (method)] The headline claims of 'strong robustness' and 'substantially lower GPU memory and inference time' are stated without any quantitative metrics, baseline tables, ablation results, or error analysis, preventing verification that the reported gains are attributable to the ontology integration rather than other implementation choices.

Authors: We agree that the abstract and §3 would be clearer with explicit quantitative anchors. In the revised manuscript we will insert the key metrics (zero-shot accuracy deltas, GPU memory reduction percentages, and inference-time speedups) directly into the abstract and add a concise summary table plus error bars in §3 that reference the full results in §4. This will make the attribution to the ontology-driven graph construction explicit.
Revision: yes

Referee: [§4 (experiments)] No ablation isolates the Cell Ontology + phylogenetic alignment component (e.g., statistical alignment alone versus alignment plus ontology) on the exact zero-shot multi-species splits; without this control, the robustness and sample-efficiency numbers cannot be causally linked to the structural priors.

Authors: We accept that an explicit ablation isolating the ontology-augmented alignment is necessary. We will add this control experiment to the revised §4, reporting performance on the identical zero-shot multi-species splits for (i) statistical alignment alone and (ii) statistical alignment plus Cell Ontology and phylogenetic structure. The new results will be presented alongside the existing baselines to establish the incremental contribution of the structural priors.
Revision: yes

Referee: [§4 and §5] The manuscript does not test or discuss performance degradation on target species with sparse phylogenetic or Cell Ontology coverage, leaving open the risk that reported gains reflect annotation frequency biases rather than true biological homology.

Authors: We acknowledge this limitation in coverage. In the revision we will add a dedicated paragraph in §5 that discusses the risk of annotation-frequency bias and reports a post-hoc analysis of performance stratified by phylogenetic and Cell Ontology coverage depth on the existing benchmark species. If the available data prove insufficient for a conclusive stratification, we will note this as a boundary condition and outline a targeted follow-up experiment.
Revision: partial
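
The stratified post-hoc analysis promised in the last response could take roughly the following shape. This is a sketch under stated assumptions: "coverage depth" is treated as a single integer per evaluated cell, the bin thresholds are arbitrary, and `coverage_bin`, `stratified_accuracy`, and the example records are hypothetical rather than drawn from the paper.

```python
from collections import defaultdict


def coverage_bin(depth, thresholds=(2, 5)):
    """Map a coverage depth to a coarse stratum; thresholds are illustrative."""
    low, high = thresholds
    if depth < low:
        return "sparse"
    if depth < high:
        return "moderate"
    return "dense"


def stratified_accuracy(records, thresholds=(2, 5)):
    """Per-stratum accuracy from (coverage_depth, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for depth, is_correct in records:
        stratum = coverage_bin(depth, thresholds)
        totals[stratum] += 1
        hits[stratum] += int(is_correct)
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}


# Hypothetical evaluation records: (coverage depth, prediction was correct).
EXAMPLE_RECORDS = [
    (0, False), (1, False), (1, True),
    (3, True), (4, True),
    (6, True), (7, False), (8, True),
]
EXAMPLE_RESULT = stratified_accuracy(EXAMPLE_RECORDS)
```

If accuracy falls sharply in the sparse stratum, that would point toward the annotation-frequency bias the referee raises; a flat profile across strata would weaken that concern.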

Circularity Check

0 steps flagged

Low circularity: framework grounds graph construction in external independent ontologies rather than self-derived parameters

The derivation chain begins with raw sequencing data and applies statistical alignment followed by integration of Cell Ontology, phylogenetic structure, and Gene Ontology. These resources are independently curated external databases, not constructed from the target single-cell datasets or model outputs. No equations reduce the zero-shot robustness or sample-efficiency claims to a fitted parameter renamed as prediction, nor does any step rely on a self-citation chain whose validity depends on the present paper. The central pipeline therefore rests on external resources and is evaluated against external benchmarks; the only allowance is minor self-citation, which does not bear the load of the reported performance gains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework depends on the assumption that external biological ontologies provide accurate and sufficient priors for reshaping sequencing data; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Cell Ontology and phylogenetic structures accurately capture intercellular functional relationships for graph construction.
Invoked in the prior-guided graph construction pipeline described in the abstract.
domain assumption: Gene Ontology bridges feature-level semantic gaps in raw sequencing data.
Used for semantic enhancement of gene features.
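
The second axiom can be made concrete with one common way of adding functional semantics to gene features: pooling expression over the genes annotated to each term. The abstract does not say how DOGMA uses Gene Ontology, so this is a generic illustration and not the paper's method; the gene names, term identifiers, and `term_level_features` are placeholders.

```python
# Hypothetical gene -> term annotations; identifiers are placeholders, not real GO IDs.
GENE_TO_TERMS = {
    "geneA": ["GO:term1"],
    "geneB": ["GO:term1", "GO:term2"],
    "geneC": ["GO:term2"],
}


def term_level_features(expression, gene_to_terms):
    """Average the expression of all annotated genes per term, for a single cell.

    Genes without any annotation are skipped.
    """
    sums = {}
    counts = {}
    for gene, value in expression.items():
        for term in gene_to_terms.get(gene, []):
            sums[term] = sums.get(term, 0.0) + value
            counts[term] = counts.get(term, 0) + 1
    return {term: sums[term] / counts[term] for term in sums}


# One hypothetical cell; geneD has no annotation and does not contribute.
EXAMPLE_CELL = {"geneA": 2.0, "geneB": 4.0, "geneC": 0.0, "geneD": 9.0}
EXAMPLE_FEATURES = term_level_features(EXAMPLE_CELL, GENE_TO_TERMS)
```

Pooling like this trades per-gene resolution for features that are comparable across datasets with different gene panels, which is one reason the axiom matters for cross-species work.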

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: DOGMA redefines graph construction by integrating Statistical Anchors with Cell Ontology and Phylogenetic Trees... Gene Ontology is utilized to bridge the feature-level semantic gap

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Masked Top-K strategy... Semantic Mask $M_i$ that activates only for biologically plausible candidates (cells with label distance $d_{CO}(L_i, L_j) \le 1$)
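
The Masked Top-K passage quoted above can be read as: for each cell, rank candidate neighbors by similarity, but only among candidates whose label distance is at most 1. The sketch below follows that reading. The similarity values, the distance matrix, the tie-breaking, and the name `masked_top_k` are assumptions made for illustration; the paper's actual formulation is elided in the quote and may differ.

```python
def masked_top_k(similarity, label_dist, k, max_dist=1):
    """For each cell i, keep up to k most similar cells j that pass the semantic mask.

    The mask admits j only when j != i and label_dist[i][j] <= max_dist;
    a distance of None (unknown or disconnected labels) fails the mask.
    """
    n = len(similarity)
    neighbors = []
    for i in range(n):
        candidates = [
            j for j in range(n)
            if j != i
            and label_dist[i][j] is not None
            and label_dist[i][j] <= max_dist
        ]
        # Rank the admissible candidates by descending similarity to cell i.
        candidates.sort(key=lambda j: similarity[i][j], reverse=True)
        neighbors.append(candidates[:k])
    return neighbors


# Hypothetical pairwise similarities for four cells.
SIM = [
    [1.0, 0.9, 0.8, 0.7],
    [0.9, 1.0, 0.2, 0.6],
    [0.8, 0.2, 1.0, 0.5],
    [0.7, 0.6, 0.5, 1.0],
]
# Hypothetical label distances between the cells' annotated types.
DIST = [
    [0, 0, 3, 1],
    [0, 0, 3, 1],
    [3, 3, 0, 2],
    [1, 1, 2, 0],
]
NEIGHBORS = masked_top_k(SIM, DIST, k=2)
```

In this toy input, cell 2 is quite similar to cell 0 by raw similarity (0.8), yet it ends up with no neighbors because every other cell is more than one hop away in label space. That is the behavior the mask is meant to enforce: similarity alone cannot create an edge between biologically distant labels.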

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.