
arxiv: 2606.28856 · v1 · pith:U6OACFYInew · submitted 2026-06-27 · 🧬 q-bio.OT · cs.AI

Building AI-Ready Data Systems for Space Life Sciences, Aerospace Medicine, and Deep Space Exploration

Pith reviewed 2026-06-30 08:41 UTC · model grok-4.3

classification 🧬 q-bio.OT cs.AI
keywords AI-ready data · FAIR principles · space life sciences · data infrastructure · deep space exploration · aerospace medicine · machine-actionable data · international governance

The pith

Spaceflight biological data requires a three-tier progression from FAIR to AI-ready to space-ready forms to become usable by AI systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that open access standards like FAIR enable human reuse but fall short for AI systems, because the growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces. It proposes advancing through three tiers: FAIR data, then AI-ready data that is machine-actionable, and finally space-ready data tailored for deep space conditions. This restructuring would close the AI access gap for heterogeneous spaceflight datasets in life sciences and aerospace medicine. A neutral international coordinating body is put forward to provide governance and ensure trustworthy, agent-accessible infrastructure. Without these steps, the paper contends, AI cannot reliably support the biological research needed for deep space exploration.

Core claim

The authors state that a three-tier approach proceeding from FAIR to AI-ready to space-ready data, backed by a neutral international coordinating body, is required to systematically restructure heterogeneous spaceflight biological data into machine-actionable forms that close the AI access gap and enable trustworthy, agent-accessible infrastructure for deep space biological research.

What carries the argument

The three-tier data progression from FAIR to AI-ready to space-ready carries the argument by defining successive levels of machine accessibility and space-specific optimization for biological datasets.
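
As a minimal sketch of how a cumulative tier progression could be represented in code, the Python below models the three tiers as an ordered enum and assigns a dataset the highest tier whose checks all pass. The abstract does not define criteria for "AI-ready" or "space-ready", so every field in `DatasetProfile` and every check in `assess_tier` is a hypothetical placeholder, not the authors' definition.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Ordered tiers; reaching a tier requires passing every lower tier first."""
    NONE = 0
    FAIR = 1
    AI_READY = 2
    SPACE_READY = 3


@dataclass(frozen=True)
class DatasetProfile:
    # All fields are hypothetical stand-ins for criteria the paper leaves undefined.
    has_persistent_id: bool      # assumed proxy for FAIR
    has_open_license: bool       # assumed proxy for FAIR
    has_machine_schema: bool     # assumed proxy for AI-ready
    has_programmatic_api: bool   # assumed proxy for AI-ready
    usable_offline: bool         # assumed proxy for space-ready


def assess_tier(p: DatasetProfile) -> Tier:
    """Return the highest tier reached, stopping at the first failed check."""
    checks = [
        (Tier.FAIR, p.has_persistent_id and p.has_open_license),
        (Tier.AI_READY, p.has_machine_schema and p.has_programmatic_api),
        (Tier.SPACE_READY, p.usable_offline),
    ]
    reached = Tier.NONE
    for tier, passed in checks:
        if not passed:
            break
        reached = tier
    return reached
```

Under these assumed checks, a dataset that works offline but lacks a programmatic API would still be rated `Tier.FAIR`, because the tiers are treated as cumulative rather than independent.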

If this is right

  • Existing infrastructures can be improved to support AI access to diverse spaceflight datasets.
  • AI systems gain the capacity to access and analyze heterogeneous scientific data from space missions.
  • A trustworthy, agent-accessible infrastructure becomes available for deep space biological research.
  • Systematic restructuring of data into machine-actionable forms is needed beyond current open-access practices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Standardized data handling could emerge across international space agencies to support shared AI tools.
  • Integrated datasets might allow AI to model combined effects of space environment and biology in real time.
  • The approach could extend to other high-stakes domains like climate or medical research needing agent-accessible data.

Load-bearing premise

That existing open-access infrastructures cannot meet the distinct demands of growing AI approaches on data structure, metadata, and access interfaces.

What would settle it

Showing that multiple current FAIR-compliant space biology databases can be queried and analyzed accurately by diverse AI models with no added restructuring or new governance.
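
One way to make this test concrete is sketched below, under stated assumptions: each (database, model) pair receives an accuracy score from an evaluation the reader would have to supply, and the premise counts as challenged only if enough databases and models are covered and every pair clears a chosen threshold. The function name, the 0.9 threshold, and the minimum counts are illustrative choices, not values from the paper.

```python
from typing import Mapping, Tuple


def premise_challenged(
    scores: Mapping[Tuple[str, str], float],
    threshold: float = 0.9,   # assumed cutoff for "analyzed accurately"
    min_databases: int = 2,   # assumed reading of "multiple" databases
    min_models: int = 2,      # assumed reading of "diverse" models
) -> bool:
    """True if every (database, model) pair meets the threshold with enough coverage."""
    databases = {db for db, _ in scores}
    models = {m for _, m in scores}
    if len(databases) < min_databases or len(models) < min_models:
        return False
    # Require the full grid so that a missing pair cannot hide a failure.
    full_grid = all((db, m) in scores for db in databases for m in models)
    return full_grid and all(s >= threshold for s in scores.values())
```

Requiring the full grid is a deliberate design choice in this sketch: a single untested or failing pair is enough to leave the premise standing.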

Original abstract

While AI holds the potential to revolutionize space life sciences, realizing this promise is contingent upon the systematic restructuring of heterogeneous spaceflight biological data into machine-actionable, AI-ready forms. Even though open access principles support human reuse and scientific reproducibility, this does not necessarily enable AI systems to access and analyze such a diverse set of scientific datasets. In addition, the growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces. In order to respond to such growing changes we propose a three-tier approach, proceeding from FAIR to AI-ready to space-ready data. We discuss existing infrastructures and how they can be improved to close the AI access gap. We conclude by proposing a neutral international coordinating body as the governance backbone for the trustworthy, agent-accessible space biology infrastructure that deep space biological research will require.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that realizing the potential of AI in space life sciences requires restructuring heterogeneous spaceflight biological data into machine-actionable forms via a three-tier progression from FAIR to AI-ready to space-ready data. It asserts that open access and FAIR principles do not suffice for AI systems due to their distinct demands on data structure, metadata, and access interfaces, and proposes a neutral international coordinating body to provide governance for trustworthy, agent-accessible infrastructure.

Significance. Should the proposed three-tier approach and coordinating body be adopted and validated, this work would be significant in bridging the gap between current data infrastructures and the needs of AI for deep space exploration. It identifies a potentially critical limitation in existing open-access systems for supporting advanced AI applications in biology and medicine, offering a conceptual framework that could guide future infrastructure development in the field.

major comments (2)
  1. [Abstract] The assertion that open access 'does not necessarily enable AI systems to access and analyze such a diverse set of scientific datasets' and that 'the growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces' is load-bearing for justifying the three-tier proposal and new coordinating body, yet the text supplies no concrete examples of named AI methods, specific spaceflight datasets, or documented failure modes of existing systems such as NASA GeneLab or ESA archives.
  2. [Abstract, final paragraph] The proposal for a 'neutral international coordinating body' as governance backbone rests on the unshown premise that existing bodies cannot meet AI demands; without differentiation from or analysis of current international data-sharing mechanisms, the necessity of a new entity over incremental improvements to existing infrastructures is not secured.
minor comments (2)
  1. The terms 'AI-ready' and 'space-ready' are introduced without explicit definitions or criteria in the abstract, which would improve clarity if provided with examples in the main text.
  2. The discussion of how existing infrastructures can be improved would benefit from at least one illustrative case study or table comparing current metadata standards to AI-specific requirements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the justification for our proposed framework. We respond to each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The assertion that open access 'does not necessarily enable AI systems to access and analyze such a diverse set of scientific datasets' and that 'the growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces' is load-bearing for justifying the three-tier proposal and new coordinating body, yet the text supplies no concrete examples of named AI methods, specific spaceflight datasets, or documented failure modes of existing systems such as NASA GeneLab or ESA archives.

    Authors: We agree that the abstract would be strengthened by concrete examples to support the load-bearing claims. The full manuscript discusses limitations of current infrastructures, but does not provide named AI methods or specific failure modes in the abstract. We will revise the abstract to include brief, specific examples (e.g., transformer-based models requiring standardized metadata schemas and challenges with heterogeneous omics data in GeneLab) while maintaining length constraints.
    revision: yes

  2. Referee: [Abstract, final paragraph] The proposal for a 'neutral international coordinating body' as governance backbone rests on the unshown premise that existing bodies cannot meet AI demands; without differentiation from or analysis of current international data-sharing mechanisms, the necessity of a new entity over incremental improvements to existing infrastructures is not secured.

    Authors: The manuscript positions the new body as necessary for neutral, cross-agency coordination focused on agent-accessible infrastructure. We acknowledge that the abstract does not differentiate from existing mechanisms. In revision, we will expand the discussion section with a concise analysis of current bodies (e.g., NASA GeneLab governance, ESA data policies, and ISS international agreements) to highlight gaps in AI-specific requirements that incremental changes may not fully address.
    revision: yes

Circularity Check

0 steps flagged

No circularity: high-level conceptual proposal without derivations or reductions to inputs

Full rationale

The manuscript is a policy-style proposal advocating a three-tier data progression (FAIR to AI-ready to space-ready) plus a coordinating body. It contains no equations, no fitted parameters, no predictions derived from data, and no mathematical derivations. The central argument rests on the stated premise that existing open-access systems fall short for AI demands, but this premise is presented as an assumption rather than derived from or reduced to any self-referential construction within the paper. No self-citation chains, ansatzes, or renamings function as load-bearing steps that collapse the claim back onto its own inputs. The text is therefore self-contained as an advocacy document and exhibits none of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about data usability and AI requirements rather than new measurements or derivations; the coordinating body is an invented governance entity without independent evidence.

axioms (2)
  • domain assumption: Open access principles support human reuse and scientific reproducibility but do not necessarily enable AI systems to access and analyze diverse scientific datasets.
    Invoked in the abstract as the starting premise that motivates the need for AI-ready restructuring.
  • domain assumption: The growing array of AI approaches places distinct demands on data structure, metadata, and access interfaces.
    Used to justify why FAIR alone is insufficient and why a new tiered approach is required.
invented entities (1)
  • neutral international coordinating body (no independent evidence)
    purpose: governance backbone for the trustworthy, agent-accessible space biology infrastructure
    Proposed as the solution for coordination and trust but introduced without details on structure, authority, or evidence of feasibility.


