arXiv: 2512.17795 · v2 · submitted 2025-12-19 · 💻 cs.DL · cs.AI · cs.IR

Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation

Pith reviewed 2026-05-16 20:44 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.IR
keywords knowledge mining · AI analysis · data preservation · digital repositories · framework design · actionable intelligence

The pith

A dual-stream architecture bridges AI knowledge mining with trustworthy archiving to create living data ecosystems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Intelligent Knowledge Mining Framework as a conceptual model to solve the problem of data trapped in silos across digital systems. It describes a dual-stream setup where one process uses AI to turn raw data into semantically rich and actionable knowledge, while a parallel stream handles archiving to keep integrity, provenance, and reproducibility intact. If true, this would let organizations move beyond static storage to ecosystems that continuously supply usable intelligence to users. The work outlines the motivation, research questions, methodology, and design details for building such a system.

Core claim

The paper claims that a dual-stream architecture makes the Intelligent Knowledge Mining Framework a foundational model for converting static repositories into living ecosystems. One stream systematically transforms raw data into machine-actionable knowledge through AI mining; the other runs in parallel and archives that knowledge in a trustworthy way. Together they are meant to carry actionable intelligence from producers to consumers.

What carries the argument

The dual-stream architecture of the Intelligent Knowledge Mining Framework, consisting of a horizontal Mining Process that transforms raw data into semantically rich knowledge and a parallel Trustworthy Archiving Stream that maintains integrity, provenance, and reproducibility.
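The two streams can be pictured with a toy sketch. It is not the paper's design: the paper specifies no interfaces, so every name here (`mine`, `archive`, `ingest`, `KnowledgeAsset`, `ArchiveRecord`, the `miner_version` label) is a hypothetical stand-in, and the "mining" step is a trivial keyword picker used in place of real AI enrichment.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class KnowledgeAsset:
    """Output of the (hypothetical) mining stream for one raw document."""
    source_id: str
    keywords: list


@dataclass
class ArchiveRecord:
    """Entry kept by the (hypothetical) archiving stream."""
    source_id: str
    raw_sha256: str
    knowledge_sha256: str
    miner_version: str


def sha256_of(obj) -> str:
    # Canonical JSON (sorted keys) so equal content always hashes equally.
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def mine(source_id: str, raw_text: str) -> KnowledgeAsset:
    # Placeholder for AI-driven enrichment: keep distinct capitalized words.
    words = [w.strip(".,;:()") for w in raw_text.split()]
    keywords = sorted({w for w in words if w[:1].isupper()})
    return KnowledgeAsset(source_id, keywords)


def archive(raw_text: str, asset: KnowledgeAsset, miner_version: str) -> ArchiveRecord:
    # Integrity: hashes of input and output. Provenance: which miner produced it.
    return ArchiveRecord(
        source_id=asset.source_id,
        raw_sha256=sha256_of(raw_text),
        knowledge_sha256=sha256_of(
            {"source_id": asset.source_id, "keywords": asset.keywords}
        ),
        miner_version=miner_version,
    )


def ingest(source_id: str, raw_text: str, miner_version: str = "toy-0.1"):
    # Both streams run for every document: mine first, then archive the result.
    asset = mine(source_id, raw_text)
    record = archive(raw_text, asset, miner_version)
    return asset, record
```

Calling `ingest("doc-1", "OAIS and DSpace support preservation.")` returns the mined asset together with an archive record that fixes the hashes of both the input and the derived knowledge. The sketch runs the streams one after the other for simplicity; how they would coordinate under live updates is exactly what the paper leaves open.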

If this is right

  • Static repositories gain the ability to deliver ongoing actionable intelligence rather than remaining passive stores.
  • AI-driven analysis and preservation processes operate in parallel without one undermining the other.
  • Data producers and consumers interact through a shared flow of machine-actionable knowledge.
  • Computational reproducibility becomes a built-in property of all archived knowledge assets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The model could guide updates to existing digital libraries by adding parallel AI processing layers.
  • Real-world use would require defining exact interfaces and standards between the mining and archiving streams.
  • Fields handling large unstructured datasets, such as scientific publishing, could test the framework for improved knowledge reuse.

Load-bearing premise

That defining a dual-stream architecture alone will successfully bridge dynamic AI analysis and long-term preservation without further technical specifications or empirical demonstration.

What would settle it

A concrete implementation of the framework on heterogeneous data sources would settle the central claim. Showing both higher rates of actionable knowledge extraction and maintained long-term reproducibility would confirm it; failing on either count would refute it.
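A minimal sketch of how such a test could be scored is given below. The two metrics are assumptions made for illustration, not definitions from the paper: "extraction rate" is taken as the share of documents that yield at least one extracted item, and "reproducibility rate" as the share whose re-run output matches a previously archived fingerprint. The names `fingerprint` and `evaluate` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(items: List[str]) -> str:
    # Order-independent hash of the extracted items.
    joined = "\n".join(sorted(items))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def evaluate(
    docs: Dict[str, str],
    extractor: Callable[[str], List[str]],
    archived: Dict[str, str],
) -> Dict[str, float]:
    """Re-run the extractor and compare against archived fingerprints."""
    extracted = 0
    reproduced = 0
    for doc_id, text in docs.items():
        items = extractor(text)
        if items:
            extracted += 1  # assumed proxy for "actionable knowledge found"
        if archived.get(doc_id) == fingerprint(items):
            reproduced += 1  # re-run matches what was archived earlier
    n = len(docs)
    return {
        "extraction_rate": extracted / n if n else 0.0,
        "reproducibility_rate": reproduced / n if n else 0.0,
    }
```

Comparing these two numbers for a baseline repository and an IKMF-style pipeline over the same corpus, and again after a long interval, is one way the claim could be made testable.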

Figures

Figures reproduced from arXiv: 2512.17795 by Binh Vu.

Figure 1. The Nunamaker Research Framework for Information Systems […]
Figure 2. Decomposition of the Research Program into Targeted R&D Projects. The overall …
Figure 3. A Conceptual Schema for Planning and Synthesizing R&D Project Contributions.
Figure 4. The SECI Model of Knowledge Creation [33], illustrating the spiral process through which tacit and explicit knowledge are converted and amplified within an organization.
Figure 5. Layered Architecture of a Knowledge Management System. This model provides a …
Figure 6. The DIKW Pyramid [40]. This model illustrates the hierarchical process of transforming raw data into actionable wisdom. The IKMF aims to facilitate the transitions between each level.
Figure 7. A conceptual NLP processing pipeline [43], illustrating how raw text is transformed into a richly annotated document (‘Doc’) object through a series of modular components.
Figure 8. A conceptual illustration of the Latent Dirichlet Allocation (LDA) model. It models …
Figure 9. The Semantic Web Stack, illustrating the hierarchy of technologies from foundational …
Figure 10. An example of the SKOS (Simple Knowledge Organization System) data model […]
Figure 11. The parts hierarchy of the OWL 2 RDF-Based Semantics. Each node represents …
Figure 12. The Neuro-Symbolic AI Cycle. Sub-symbolic models (e.g., LLMs) learn from data …
Figure 13. …, OAIS provides a comprehensive conceptual model for a digital archive, defining its key functional entities and information packages. It establishes a common vocabulary and a set of mandatory responsibilities for any organization claiming to be a trustworthy digital repository. This model is often implemented using production-ready institutional repository software, with DSpace being a prominent open-sou…
Figure 14. The CERIF Data Model, illustrating how base entities (like Project and Person) are …
Figure 15. The IKMF Reference Model, illustrating the progression from Producer to Consumer …
Original abstract

The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Intelligent Knowledge Mining Framework (IKMF) as a conceptual dual-stream architecture: a horizontal Mining Process that transforms raw data into semantically rich, machine-actionable knowledge, paired with a parallel Trustworthy Archiving Stream that preserves integrity, provenance, and reproducibility. It positions this as a blueprint for converting static repositories into living ecosystems that enable flow of actionable intelligence, outlining motivation, research questions, methodology, and high-level design.

Significance. If the unspecified integration mechanisms between dynamic mining and static archiving can be rigorously defined and validated, the framework could supply a useful high-level blueprint for digital libraries and data-intensive domains seeking to combine AI-driven enrichment with long-term trustworthiness. As presented, however, the contribution remains a high-level proposal without derivations, protocols, or tests, limiting its immediate significance to stimulating discussion rather than providing an actionable model.

major comments (2)
  1. [Conceptual design and modeling] Conceptual design and modeling sections: the central claim that the Mining Process and Trustworthy Archiving Stream form a symbiotic relationship enabling simultaneous real-time semantic enrichment and long-term reproducibility is asserted without any protocol for provenance tracking during live updates, conflict resolution between mutable knowledge graphs and immutable archives, or formal invariants for computational reproducibility. This integration mechanism is load-bearing for the transformation of static repositories into living ecosystems but is left unspecified.
  2. [Scientific methodology] Scientific methodology and key research questions sections: no mathematical derivations, formal invariants, data, or empirical tests are supplied to evaluate whether the dual-stream architecture actually achieves its stated goals. The soundness assessment rests entirely on architectural description, which is insufficient to substantiate the foundational-model claim.
minor comments (1)
  1. [Abstract and introduction] The abstract and introduction repeat the high-level motivation without distinguishing the novel aspects of IKMF from prior work on knowledge graphs, digital preservation, or AI pipelines; adding targeted comparisons would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We appreciate the acknowledgment of the framework's potential as a high-level blueprint. Below we respond point-by-point to the major comments, clarifying the manuscript's conceptual scope while outlining targeted revisions to address the concerns about integration mechanisms and validation.

Point-by-point responses
  1. Referee: [Conceptual design and modeling] Conceptual design and modeling sections: the central claim that the Mining Process and Trustworthy Archiving Stream form a symbiotic relationship enabling simultaneous real-time semantic enrichment and long-term reproducibility is asserted without any protocol for provenance tracking during live updates, conflict resolution between mutable knowledge graphs and immutable archives, or formal invariants for computational reproducibility. This integration mechanism is load-bearing for the transformation of static repositories into living ecosystems but is left unspecified.

    Authors: We agree that the integration mechanisms between the dynamic Mining Process and the static Trustworthy Archiving Stream are described at a high conceptual level without detailed protocols. The manuscript positions IKMF as a foundational blueprint rather than a fully specified system, which is why concrete mechanisms for provenance tracking during live updates, conflict resolution (e.g., between mutable knowledge graphs and immutable archives), and formal invariants were not elaborated. In the revised manuscript we will expand the conceptual design section to include high-level architectural outlines for these aspects, such as using append-only ledgers for archiving provenance and version-control strategies for knowledge-graph updates, while explicitly noting that full protocol definitions and invariants remain topics for follow-on implementation work. This will better bound the current contribution without overclaiming. revision: partial

  2. Referee: [Scientific methodology] Scientific methodology and key research questions sections: no mathematical derivations, formal invariants, data, or empirical tests are supplied to evaluate whether the dual-stream architecture actually achieves its stated goals. The soundness assessment rests entirely on architectural description, which is insufficient to substantiate the foundational-model claim.

    Authors: The paper is framed throughout as a conceptual model that outlines motivation, research questions, and high-level design; it does not present empirical evaluation or formal proofs. We accept that the absence of mathematical derivations, invariants, data, or tests means the soundness argument rests on architectural coherence and alignment with existing principles in AI knowledge extraction and digital preservation. In revision we will augment the methodology section with an explicit discussion of the conceptual nature of the work and a roadmap for future validation (e.g., simulation-based case studies or prototype implementations). We do not claim the current manuscript provides a fully substantiated operational model, only a blueprint intended to guide subsequent rigorous development. revision: partial
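The first response mentions append-only ledgers for archiving provenance without saying how they would work. One common way to build such a ledger is a hash chain, sketched below. This is an illustrative assumption, not the authors' protocol: the class name `ProvenanceLedger`, the event fields, and the all-zero genesis hash are all hypothetical.

```python
import hashlib
import json


class ProvenanceLedger:
    """Toy append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64  # assumed placeholder hash for the first entry

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> str:
        # Chain the new entry to the hash of the previous one.
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = json.dumps(event, sort_keys=True)
        entry_hash = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        self._entries.append({"prev": prev_hash, "event": body, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        # Recompute every hash; any edit to an earlier entry breaks the chain.
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev_hash + entry["event"]).encode("utf-8")
            ).hexdigest()
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True

    def entries(self):
        # Return copies so callers cannot mutate the log through this view.
        return tuple(dict(entry) for entry in self._entries)
```

A knowledge-graph update would be logged as an event such as `{"action": "update", "node": "doc-1", "graph_version": 2}`. The graph itself stays mutable while the ledger records every version it passed through, which is one possible reading of the "version-control strategies" the response refers to.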

Circularity Check

0 steps flagged

No circularity: conceptual proposal without derivations or self-referential reductions

Full rationale

The paper presents a high-level conceptual framework (IKMF) defined by a dual-stream architecture whose purpose is stated as bridging analysis and preservation. No equations, fitted parameters, predictions, or load-bearing self-citations appear in the abstract or described structure. The central claim is an architectural assertion by definition rather than a result derived from prior inputs that reduces tautologically. No steps match the enumerated circularity patterns; the derivation chain is self-contained as a modeling exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper rests on standard domain assumptions about data silos and introduces the IKMF as a new organizing concept without independent evidence or fitted parameters.

axioms (2)
  • domain assumption: Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization.
    Stated directly in the abstract as the core problem motivating the framework.
  • ad hoc to paper: A dual-stream architecture can bridge AI-driven analysis and trustworthy long-term preservation.
    This is the central modeling choice of the IKMF.
invented entities (1)
  • Intelligent Knowledge Mining Framework (IKMF) (no independent evidence)
    purpose: To provide a blueprint for transforming static repositories into living knowledge ecosystems.
    New conceptual model introduced by the paper.

pith-pipeline@v0.9.0 · 5469 in / 1221 out tokens · 29918 ms · 2026-05-16T20:44:38.036867+00:00 · methodology

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages

  1. [1] T. Hey, S. Tansley, and K. Tolle, The fourth paradigm: data-intensive scientific discovery, vol. 1. Microsoft Research, 2009.

  2. [2] D. Laney, “3d data management: Controlling data volume, velocity and variety,” META group research note, vol. 6, 2001.

  3. [3] S. Abraham, D. S. Ewen, and B. Burnett, “Bridging data silos using big data integration,” International Journal of Database Management Systems, vol. 11, no. 2/3, pp. 1–17, 2019.

  4. [4] M. Baker, “1,500 scientists lift the lid on reproducibility,” Nature News, vol. 533, no. 7604, pp. 452–454, 2016.

  5. [5] W. H. Inmon, C. Imhoff, and R. Sousa, Corporate information factory. John Wiley & Sons, 2002.

  6. [6] C. C. Aggarwal and C. Zhai, Mining text data. Springer Science & Business Media, 2012.

  7. [7] M. Alavi and D. E. Leidner, “Review: Knowledge management and knowledge management systems: Conceptual foundations and research issues,” MIS Quarterly, pp. 107–136, 2001.

  8. [8] B. Mašić, S. Nešić, D. Nikolić, and M. Dželetović, “Evolution of knowledge management,” Industrija, vol. 45, no. 2, pp. 127–147, 2017.

  9. [9] G. D. Bhatt, “Knowledge management in organizations: examining the interaction between technologies, techniques, and people,” Journal of knowledge management, vol. 5, no. 1, pp. 68–75, 2001.

  10. [10] J. Sweller, “Cognitive load during problem solving: Effects on learning,” Cognitive science, vol. 12, no. 2, pp. 257–285, 1988.

  11. [11] G. Briscoe and P. De Wilde, “Digital ecosystems: Evolving service-oriented architectures,” in 2008 Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pp. 2997–3004, IEEE, 2008.

  12. [12] Z. Ghahramani, “Probabilistic machine learning and artificial intelligence,” Nature, vol. 521, no. 7553, pp. 452–459, 2015.

  13. [13] A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutierrez, J. E. L. Gayo, S. Kirrane, S. Neumaier, A. Polleres, et al., “Knowledge graphs,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–37, 2021.

  14. [14] B. Vu, M. de Velasco, P. Mc Kevitt, R. Bond, R. Turkington, F. Booth, M. Mulvenna, M. Fuchs, and M. Hemmje, “A content and knowledge management system supporting emotion detection from speech,” in Conversational Dialogue Systems for the Next Decade (L. F. D’Haro, Z. Callejas, and S. Nakamura, eds.), vol. 704 of Lecture Notes in Electrical Engineering, Sprin...

  15. [15] B. Vu, A Taxonomy Management System Supporting Crowd-based Taxonomy Generation, Evolution, and Management. PhD thesis, Hagen, 2020.

  16. [16] J. M. Martinez, “MPEG-7: The generic multimedia content description standard, part 1,” IEEE multimedia, vol. 9, no. 2, pp. 78–87, 2002.

  17. [17] B. Vu, S. Bruchhaus, A. Moorhead, H. Zheng, L. D’Arco, L. Lynch, L. S. Sica, M. Ponticorvo, F. Diano, H. Afli, P. Joshi, A. Molinari, and M. Hemmje, “Towards continuous professional monitoring of health status based on energetic balancing,” in 2022 IEEE International Workshop on Sport, Technology and Research (STAR), (Trento - Cavalese, Italy), pp. 72–77, 2022.

  18. [18] European Commission, CORDIS, “Supporting Mental Health in Young People: Integrated Methodology for cLinical dEcisions (SMILE).” EU CORDIS Project Page, 2023. Grant agreement ID: 101080923.

  19. [19] T. Krause, L. Glau, P. Newels, T. Reis, M. X. Bornschlegl, M. Kramer, and M. L. Hemmje, “Using Large Language Models for Microbiome Findings Reports in Laboratory Diagnostics,” BioMedInformatics, vol. 4, no. 3, pp. 1979–2001, 2024.

  20. [20] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., “The fair guiding principles for scientific data management and stewardship,” Scientific data, vol. 3, no. 1, pp. 1–9, 2016.

  21. [21] P. Jain and N. Mnjama, “Preservation of digital records: issues and challenges in the digital era,” Journal of the South African Society of Archivists, vol. 49, pp. 157–172, 2016.

  22. [22] PREMIS Editorial Committee, “Premis data dictionary for preservation metadata, version 3.0,” 2015.

  23. [23] F. d. J. Lavorato and R. d. C. Sant’Ana, “The oais reference model: a study from a technical point of view,” in 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–6, IEEE, 2014.

  24. [24] T. Lebo, S. Sahoo, and D. McGuinness, “Prov-o: The prov ontology,” 2013.

  25. [25] P. De Castro, M. Casado, and M. Legido, “Cris/oar portals: a roadmap for making research information publicly visible,” Program, vol. 45, no. 4, pp. 415–433, 2011.

  26. [26] A. Rajasekar, R. Moore, M. Wan, and W. Schroeder, “The irods data grid,” in Data Grids-The Next Generation of Data-Centric Collaborations, pp. 101–141, Springer, 2010.

  27. [27] J. F. Nunamaker, M. Chen, and T. D. M. Purdin, “Systems development in information systems research,” Journal of Management Information Systems, vol. 7, no. 3, pp. 89–106, 1991.

  28. [28] A. R. Hevner, S. T. March, J. Park, and S. Ram, “Design science in information systems research,” MIS quarterly, pp. 75–105, 2004.

  29. [29] J. F. Nunamaker, R. O. Briggs, N. W. Twyman, and J. S. Giboney, “Creating impact through systematic programs of research,” Journal of Management Information Systems, vol. 31, no. 3, pp. 13–41, 2014.

  30. [30] K. Peffers, T. Tuunanen, M. A. Rothenberger, and S. Chatterjee, “A design science research methodology for information systems research,” Journal of management information systems, vol. 24, no. 3, pp. 45–77, 2007.

  31. [31] M. X. Bornschlegl, “Towards trustworthiness in ai-based big data analysis,” 2024.

  32. [32] R. Lindgren and O. Henfridsson, “The evolution of knowledge management systems needs to be managed,” Knowledge Management Research & Practice, vol. 2, no. 1, pp. 56–64, 2004.

  33. [33] I. Nonaka and H. Takeuchi, The knowledge-creating company: How Japanese companies create the dynamics of innovation. Oxford university press, 1995.

  34. [34] E. Wenger, R. A. McDermott, and W. Snyder, Cultivating communities of practice: A guide to managing knowledge. Harvard Business School Press, 2002.

  35. [35] N. Gronau, E. Weber, and A. Kienle, “The knowledge café—a knowledge management system and its application to hospitality and tourism,” in Information and Communication Technologies in Tourism 2009, pp. 309–320, Springer, 2009.

  36. [36] A. Y. Chua and W.-Y. Lam, “Kms failure: a study of the contributing factors and cures,” Industrial Management & Data Systems, vol. 109, no. 1, pp. 64–79, 2009.

  37. [37] P. Akhavan and A. Pezeshkan, “Knowledge management critical failure factors: a multi-case study,” VINE, vol. 44, pp. 22–41, 02 2014.

  38. [38] M. H. Jarrahi and S. Sawyer, “Social media, social acts, and knowledge sharing in organizations: A case of a professional services firm,” Journal of the American Society for Information Science and Technology, vol. 63, no. 10, pp. 2028–2040, 2012.

  39. [39] W. M. Cohen and D. A. Levinthal, “Absorptive capacity: A new perspective on learning and innovation,” Administrative science quarterly, pp. 128–152, 1990.

  40. [40] J. Winter, “Dikw pyramid.” Jeff Winter Insights, 2023. Accessed: 2025-07-10.

  41. [41] S. Robertson and H. Zaragoza, “The probabilistic relevance framework: Bm25 and beyond,” Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.

  42. [42] V. Karpukhin, B. Oguz, S. Min, P. Lewis, W.-t. Yih, N. Goyal, and D. Chen, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020.

  43. [43] spaCy, “Processing pipelines.” spaCy Usage Documentation, 2024. Accessed: 2025-07-10.

  44. [44] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270, 2016.

  45. [45] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 1003–1011, 2009.

  46. [46] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, 2020.

  47. [47] J. Zhang, Y. Zhao, M. Saleh, and P. J. Liu, “Pegasus: Pre-training with extracted gap-sentences for abstractive summarization,” in International Conference on Machine Learning, pp. 11328–11339, PMLR, 2020.

  48. [48] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003.

  49. [49] AIML.com, “What is topic modeling? Discuss key algorithms, working, applications, and the pros and cons.” Web page, 2024. Accessed: 2025-07-11.

  50. [50] D. Angelov, “Top2vec: Distributed representations of topics,” arXiv preprint arXiv:2008.09470, 2020.

  51. [51] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading wikipedia to answer open-domain questions,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1870–1879, 2017.

  52. [52] J. Bleiholder and F. Naumann, “Data fusion,” ACM Computing Surveys (CSUR), vol. 41, no. 1, pp. 1–41, 2008.

  53. [53] T. Berners-Lee, J. Hendler, and O. Lassila, “The semantic web,” Scientific american, vol. 284, no. 5, pp. 34–43, 2001.

  54. [54] R. Cyganiak, D. Wood, and M. Lanthaler, “Rdf 1.1 concepts and abstract syntax,” W3C Recommendation, 2014.

  55. [55] D. L. McGuinness, F. Van Harmelen, et al., “Owl web ontology language overview,” W3C recommendation, vol. 10, no. 2004-02-10, p. 2004, 2004.

  56. [56] C. Hoyland, K. Adams, A. Tolk, and L. Xu, “The rq-tech methodology: A new paradigm for conceptualizing strategic enterprise architectures,” Journal of Management Analytics, vol. 1, pp. 55–77, 05 2014.

  57. [57] P. Lambe, Organizing knowledge: taxonomies, knowledge and organizational effectiveness. Chandos Publishing, 2007.

  58. [58] H. Hedden, The accidental taxonomist. Information Today, Inc., 2016.

  59. [59] J. Busse, “Skos (simple knowledge organization system).” Web page for Deutscher Terminologie-Tag e.V., 2023. Accessed: 2025-07-10.

  60. [60] A. Gilchrist, “Thesauri, taxonomies, and ontologies: An etymological note,” Journal of Documentation, vol. 59, no. 1, pp. 7–18, 2003.

  61. [61]

    Toward principles for the design of ontologies used for knowledge sharing?,

    T. R. Gruber, “Toward principles for the design of ontologies used for knowledge sharing?,” International journal of human-computer studies, vol. 43, no. 5-6, pp. 907–928, 1995

  62. [62]

    Owl 2 web ontology language rdf-based semantics (second edition)

    B. Motik, P. F. Patel-Schneider, and B. Cuenca Grau, “Owl 2 web ontology language rdf-based semantics (second edition).” W3C Recommendation, December 2012. Accessed: 2025-07-10

  63. [63]

    Knowledge engineering: principles and meth- ods,

    R. Studer, V. R. Benjamins, and D. Fensel, “Knowledge engineering: principles and meth- ods,”Data & knowledge engineering, vol. 25, no. 1-2, pp. 161–197, 1998

  64. [64]

    Fast algorithms for mining association rules,

    R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” inProc. 20th int. conf. very large data bases, VLDB, vol. 1215, pp. 487–499, 1994

  65. [65]

    SWRL: A semantic web rule language combining OWL and RuleML,

    I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean, “SWRL: A semantic web rule language combining OWL and RuleML,” vol. 21, pp. 2004–05, 2004

  66. [66]

    How neuro-symbolic AI helps understand scenes

    Bosch Global, “How neuro-symbolic AI helps understand scenes.” Bosch Stories, April 2022. Accessed: 2025-07-10

  67. [67]

    Neurosymbolic AI: The 3rd wave,

    A. S. d. Garcez and L. C. Lamb, “Neurosymbolic AI: The 3rd wave,” arXiv preprint arXiv:1908.06627, 2019

  68. [68]

    Survey of hallucination in natural language generation,

    Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023

  69. [69]

    DeepProbLog: Neural probabilistic logic programming,

    R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt, “DeepProbLog: Neural probabilistic logic programming,” in Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds.), pp. 3749–3759, 2018

  70. [70]

    Scallop: From probabilistic deductive databases to scalable differentiable reasoning,

    J. Huang, Z. Li, B. Chen, K. Samel, M. Naik, L. Song, and X. Si, “Scallop: From probabilistic deductive databases to scalable differentiable reasoning,” in Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 25134–25145, 2021

  71. [71]

    Reference Model for an Open Archival Information System (OAIS),

    Consultative Committee for Space Data Systems, “Reference Model for an Open Archival Information System (OAIS),” Tech. Rep. CCSDS 650.0-M-3, CCSDS, 2024. Magenta Book

  72. [72]

    DSpace: an open source dynamic digital repository,

    M. Smith, M. Barton, M. Bass, M. Branschofsky, G. McClellan, D. Stuve, R. Tansley, and J. H. Walker, “DSpace: an open source dynamic digital repository,” D-Lib Magazine, vol. 9, no. 1, 2003

  73. [73]

    CERIF: The common European research information format model,

    B. Jörg, K. G. Jeffery, and A. Asserson, “CERIF: The common European research information format model,” in The 6th International Conference on Theory and Practice of Electronic Governance, pp. 381–384, 2012.

  74. [74]

    Business semantics management supports government innovation information portal,

    G. Grootel, P. Spyns, S. Christiaens, and B. Jörg, “Business semantics management supports government innovation information portal,” vol. 5872, pp. 757–766, 11 2009

  75. [75]

    Emulation as a digital preservation strategy,

    S. Granger, “Emulation as a digital preservation strategy,” D-Lib Magazine, vol. 6, no. 10, pp. 1010–45, 2000

  76. [76]

    FAIR digital objects for science: From data pieces to actionable knowledge units,

    K. De Smedt, D. Koureas, and P. Wittenburg, “FAIR digital objects for science: From data pieces to actionable knowledge units,” Publications, vol. 8, no. 2, p. 21, 2020

  77. [77]

    RO-Crate 1.1: A lightweight approach to research data packaging

    P. Sefton, S. Soiland-Reyes, L. J. Castro, C. Goble, and RO-Crate Community, “RO-Crate 1.1: A lightweight approach to research data packaging.” Zenodo, Aug. 2021

  78. [78]

    The BagIt file packaging format (V1.0)

    J. Kunze, J. Littman, L. Madden, J. Scancella, and C. Adams, “The BagIt file packaging format (V1.0).” RFC 8493, Oct. 2018

  79. [79]

    Common Workflow Language, v1.0

    P. Amstutz, M. R. Crusoe, N. Tijanić, B. Chapman, J. Chilton, M. Heuer, A. Kartashov, D. Leehr, H. Ménager, M. Nedeljkovich, M. Scales, S. Soiland-Reyes, and L. Stojanovic, “Common Workflow Language, v1.0.” figshare, July 2016

  80. [80]

    Nextflow enables reproducible computational workflows,

    P. Di Tommaso, M. Chatzou, E. W. Floden, P. P. Barja, E. Palumbo, and C. Notredame, “Nextflow enables reproducible computational workflows,” Nature Biotechnology, vol. 35, no. 4, pp. 316–319, 2017.