pith. machine review for the scientific record.

arxiv: 2605.09505 · v2 · submitted 2026-05-10 · 💻 cs.AI


EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild


Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords epilepsy · knowledge graph · large language models · clinical reasoning · pharmacogenomics · Graph-RAG · benchmark · neurology

The pith

A new epilepsy knowledge graph boosts LLM performance on clinical reasoning tasks by up to 41 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EpiGraph, a knowledge graph that distills over 48,000 epilepsy papers into a structured form of entities and evidence-grounded triplets. The graph augments large language models through Graph-RAG to improve handling of complex epilepsy cases involving biosignal patterns, genetics, and treatments. Evaluations on five tasks show consistent accuracy gains when the graph is added, with the largest improvements in pharmacogenomic reasoning, where performance rises by 30 to 41 percent. Such results indicate that explicit knowledge structures can help models reason more reliably over medical evidence.

Core claim

EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. When this graph augments six different LLMs on the EpiBench tasks for clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning, performance improves consistently, with the largest gains in pharmacogenomic reasoning of 30 to 41 percent.
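
To make the graph's unit of content concrete, here is a minimal sketch, in Python, of what an evidence-grounded triplet could look like. The field names, layer labels, relation vocabulary, and identifiers are illustrative assumptions, not the authors' schema; only the idea of keeping a source paper and evidence sentence on every edge comes from the paper.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entity:
        entity_id: str   # normalized ontology identifier (placeholder values below)
        name: str        # surface form of the clinical concept
        layer: str       # one of the five clinical layers (labels assumed here)

    @dataclass(frozen=True)
    class EvidenceTriplet:
        head: Entity
        relation: str        # relation type; the vocabulary is assumed
        tail: Entity
        source_paper: str    # identifier of the supporting peer-reviewed paper
        evidence_span: str   # sentence from that paper grounding the triplet

    # Illustrative instance; identifiers and wording are placeholders.
    variant = Entity("GENE:HLA-B*15:02", "HLA-B*15:02", "genetics")
    drug = Entity("DRUG:carbamazepine", "carbamazepine", "pharmacogenomics")
    example = EvidenceTriplet(
        head=variant,
        relation="increases_adverse_reaction_risk_of",
        tail=drug,
        source_paper="PMID:placeholder",
        evidence_span="Carriers of HLA-B*15:02 show elevated risk of severe "
                      "cutaneous reactions to carbamazepine.",
    )

Keeping the evidence span on every edge is what would let Graph-RAG answers cite back to the literature rather than to the graph alone.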

What carries the argument

EpiGraph, the heterogeneous knowledge graph built from literature with five clinical layers and evidence-grounded triplets that supports Graph-RAG augmentation of LLMs.
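
As a rough illustration of that augmentation path, the sketch below linearizes triplets (reusing the EvidenceTriplet sketch above), retrieves the nearest ones for a clinical query with a sentence-embedding model, and packs them into a prompt. The paper cites Sentence-BERT for embeddings, but the model name, top_k, and prompt wording here are assumptions, not the authors' settings.

    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def linearize(t):
        """Render a triplet plus its evidence as retrievable text."""
        return (f"{t.head.name} {t.relation} {t.tail.name}. "
                f"Evidence: {t.evidence_span}")

    def retrieve(query, triplets, top_k=5):
        """Return the top_k triplets most similar to the clinical query."""
        corpus = encoder.encode([linearize(t) for t in triplets],
                                convert_to_tensor=True)
        q = encoder.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q, corpus, top_k=top_k)[0]
        return [triplets[h["corpus_id"]] for h in hits]

    def build_prompt(query, retrieved):
        """Assemble the graph-augmented prompt handed to the LLM."""
        evidence = "\n".join(f"- {linearize(t)} [{t.source_paper}]"
                             for t in retrieved)
        return ("Answer the question using the evidence below, citing "
                f"source ids.\n\nEvidence:\n{evidence}\n\nQuestion: {query}")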

If this is right

  • LLM performance improves across all five EpiBench tasks when EpiGraph is integrated via Graph-RAG.
  • The largest gains occur in pharmacogenomic reasoning, increasing by 30 to 41 percent.
  • Structured epilepsy knowledge from peer-reviewed sources enhances evidence-grounded clinical reasoning.
  • EpiBench serves as a framework for testing knowledge-augmented LLMs in neurological applications.
  • The approach demonstrates the value of knowledge graphs for complex medical decision-making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method of building domain-specific graphs from large paper collections could extend to other medical fields with evidence-intensive reasoning needs.
  • The observed improvements may stem from reduced factual errors in LLM outputs when guided by structured triplets.
  • Testing the system on live clinical data or with expert neurologists would provide stronger validation beyond the benchmark.
  • Community contributions to the open code could expand the graph with newer research findings.

Load-bearing premise

The automatically extracted triplets and five-layer structure from the source papers accurately capture clinically reliable knowledge without systematic errors or omissions.
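
One way to probe this premise is a manual audit: sample extracted triplets, have a domain expert judge each against its evidence span, and report precision with an uncertainty interval. A minimal sketch, where expert_label stands in for human review (no such protocol is specified in the paper):

    import math
    import random

    def spot_check_precision(triplets, expert_label, n=200, seed=0):
        """Estimate triplet precision from a random expert-reviewed sample."""
        rng = random.Random(seed)
        sample = rng.sample(triplets, min(n, len(triplets)))
        correct = sum(1 for t in sample if expert_label(t))  # True if supported
        p = correct / len(sample)
        half = 1.96 * math.sqrt(p * (1 - p) / len(sample))   # normal approx., 95%
        return p, (max(0.0, p - half), min(1.0, p + half))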

What would settle it

Re-running the EpiBench evaluations with and without EpiGraph and finding no improvement, or even degradation, in the Graph-RAG condition would indicate the claim does not hold; a minimal harness for that comparison is sketched below.
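
A minimal harness for that comparison, reusing retrieve and build_prompt from the Graph-RAG sketch above; ask_llm and the item fields are placeholders for whatever EpiBench actually exposes, and exact-match scoring is an assumption:

    def accuracy(items, ask_llm, graph_triplets=None):
        """Exact-match accuracy; pass graph_triplets to enable Graph-RAG."""
        correct = 0
        for item in items:
            if graph_triplets is not None:
                hits = retrieve(item["question"], graph_triplets)
                prompt = build_prompt(item["question"], hits)
            else:
                prompt = item["question"]
            correct += ask_llm(prompt).strip() == item["answer"]
        return correct / len(items)

    def ablation(items, ask_llm, graph_triplets):
        """Compare the baseline and graph-augmented conditions on one task."""
        base = accuracy(items, ask_llm)
        aug = accuracy(items, ask_llm, graph_triplets)
        return {"baseline": base, "graph_rag": aug, "delta": aug - base}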

Figures

Figures reproduced from arXiv: 2605.09505 by Jathurshan Pradeepkumar, Jimeng Sun, Yasuko Matsubara, Yasushi Sakurai, Yushun Dong, Yuyang Dai, Zheng Chen.

Figure 2: Pipeline overview of EpiKG, comprising two components. Left: the paper-derived evidence graph is processed through an extraction pipeline that identifies entities, relations, and supporting evidence, mapped into a relation graph. Right: an example of how EpiKG grounds the paper-derived evidence graph: given the query paper "Resistance to excitotoxin-induced seizures...", EpiKG retrieves the supporting reasoning …

Figure 3: Sensitivity analysis and ablation results of Graph-RAG.

Figure 4: Two impression generation examples on S0001; the three columns refer to the MedGemma model …

Figure 6: Overview of EpiBench running time. Blue denotes T1 Knowledge QA, green T2 Report Generation, red T3 Precision Medicine, purple T4 Treatment Recommendation, and olive T5 Deep Research. The red star highlights Graph-RAG (ours). Ablation and sensitivity …

Figure 8: Overview of EpiBench results. From the paper's Limitations and Future Work: EpiKG is constructed from English-language ontologies and literature, limiting its coverage of epilepsy syndromes and gene–disease associations that are primarily documented in non-English sources; rare syndromes with fewer than five supporting papers are likely underrepresented in the extracted relation set, as the …
read the original abstract

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present EpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, EpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating EpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30–41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EpiGraph, a heterogeneous knowledge graph constructed from 48,166 peer-reviewed papers and seven clinical resources, yielding 24,324 entities and 32,009 evidence-grounded triplets organized into five clinical layers. It defines EpiBench, a benchmark with five tasks covering clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. Evaluations of six LLMs under standard and Graph-RAG settings report consistent performance improvements from integrating EpiGraph, with the largest gains (+30–41%) in pharmacogenomic reasoning.

Significance. If the automatically extracted triplets prove clinically reliable, the work supplies a valuable, publicly released benchmark and code base for evaluating knowledge-augmented LLMs on evidence-intensive neurological reasoning. The reported gains illustrate the potential utility of structured domain graphs for clinical tasks that require retrieval across heterogeneous sources.

major comments (2)
  1. [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at a high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.
  2. [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.
minor comments (2)
  1. [Abstract] The abstract lists five tasks but does not name the six LLMs or the primary metrics (accuracy, F1, human preference, etc.) used for each task; adding these would improve clarity.
  2. [Results tables and figures] Figure captions and table headers should explicitly state whether reported numbers are means over multiple runs or single-run values.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our methods and results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses.

read point-by-point responses
  1. Referee: [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at a high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.

    Authors: We agree that additional validation details would strengthen the central claims. In the revised manuscript we will expand the Methods section with a full description of the extraction pipeline (including the specific LLMs, prompting strategies, and post-processing rules used to generate the 32,009 triplets). We will also add a validation subsection reporting precision and recall on a randomly sampled set of 500 triplets manually verified against source papers by a board-certified neurologist, together with inter-annotator agreement statistics on a 100-triplet overlap subset. These additions will directly address the reliability of the pharmacogenomic triplets that drive the largest reported gains. revision: yes

  2. Referee: [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.

    Authors: We acknowledge the need for greater statistical rigor and transparency. In the revision we will (1) report exact baseline definitions, including the retrieval method (dense passage retrieval with the same embedding model) and prompt templates used in the standard setting, (2) add paired statistical significance tests (Wilcoxon signed-rank) with p-values for each task and model, and (3) include 95% confidence intervals computed via bootstrap resampling for all accuracy and F1 scores. These changes will allow readers to assess the robustness of the +30–41% pharmacogenomics improvement; a minimal sketch of these tests appears below. revision: yes
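
A minimal sketch of the promised statistics, assuming paired per-item correctness scores (equal-length 0/1 arrays) for the baseline and Graph-RAG runs; the Wilcoxon test and bootstrap follow the response above, but every configuration detail here is assumed:

    import numpy as np
    from scipy.stats import wilcoxon

    def paired_stats(base_scores, rag_scores, n_boot=10_000, seed=0):
        """Wilcoxon signed-rank test plus bootstrap 95% CI on the accuracy delta."""
        base = np.asarray(base_scores, dtype=float)
        rag = np.asarray(rag_scores, dtype=float)
        # Paired test on per-item differences; zsplit handles the many ties.
        stat, p_value = wilcoxon(rag, base, zero_method="zsplit")
        # Bootstrap the mean accuracy difference over resampled items.
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(base), size=(n_boot, len(base)))
        deltas = (rag[idx] - base[idx]).mean(axis=1)
        lo, hi = np.percentile(deltas, [2.5, 97.5])
        return {"delta": float(rag.mean() - base.mean()),
                "p_value": float(p_value),
                "ci95": (float(lo), float(hi))}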

Circularity Check

0 steps flagged

No circularity: construction and evaluation remain independent

full rationale

The paper describes an external data pipeline (48k papers + 7 resources) that produces a fixed graph and benchmark tasks, then measures LLM performance on those tasks with and without the graph. No equations, fitted parameters, or predictions are defined inside the paper whose outputs are then re-used as inputs. Evaluation relies on external LLMs and newly introduced tasks rather than any internal derivation that reduces to the construction choices. Self-citations, if present, are not load-bearing for any claimed result. The central claims are therefore empirically falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions about the utility of knowledge graphs for LLM reasoning and the representativeness of the extracted clinical facts; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Retrieved graph triplets improve LLM reasoning on clinical tasks without introducing harmful noise
    Central to the Graph-RAG evaluation design.

pith-pipeline@v0.9.0 · 5542 in / 1124 out tokens · 39403 ms · 2026-05-14T21:20:48.007939+00:00 · methodology



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 11 internal anchors

  1. [1]

    Evobrain: Dynamic multi-channel EEG graph modeling for time-evolving brain networks

    Rikuto Kotoge, Zheng Chen, Tasuku Kimura, Yasuko Matsubara, Takufumi Yanagisawa, Haruhiko Kishima, and Yasushi Sakurai. Evobrain: Dynamic multi-channel EEG graph modeling for time-evolving brain networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  2. [2]

    Drug-resistant epilepsy

    Patrick Kwan, Steven C Schachter, and Martin J Brodie. Drug-resistant epilepsy. New England Journal of Medicine, 365(10):919–926, 2011

  3. [3]

    Long-term eeg partitioning for seizure onset detection

    Zheng Chen, Yasuko Matsubara, Yasushi Sakurai, and Jimeng Sun. Long-term eeg partitioning for seizure onset detection. In Proc. AAAI Conf. Artif. Intell., pages 14221–14229, 2025

  4. [4]

    Review of pharmacogenetics of antiseizure medications: focusing on genetic variants of mechanistic targets

    Chin-Wei Kuo, Ching-Yun Huang, Hsuan-Ming Chen, Jing-Jane Tsai, and Chin-Wei Huang. Review of pharmacogenetics of antiseizure medications: focusing on genetic variants of mechanistic targets. Frontiers in Pharmacology, 15:1411487, 2024

  5. [5]

    A review on knowledge graphs for healthcare: Resources, applications, and promises

    Hejie Cui, Jiaying Lu, Ran Xu, Shiyu Wang, Wenjing Ma, Yue Yu, Shaojun Yu, Xuan Kan, Chen Ling, Liang Zhao, et al. A review on knowledge graphs for healthcare: Resources, applications, and promises. Journal of biomedical informatics, page 104861, 2025

  6. [6]

    Building a knowledge graph to enable precision medicine

    Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific data, 10(1):67, 2023

  7. [7]

    Biomedical knowledge graph: A survey of domains, tasks, and real-world applications

    Yuxing Lu, Sin Yee Goi, Xukai Zhao, and Jinzhuo Wang. Biomedical knowledge graph: A survey of domains, tasks, and real-world applications. arXiv preprint arXiv:2501.11632, 2025

  8. [8]

    Autord: An automatic and end-to-end system for rare disease knowledge graph construction based on ontology-enhanced large language models

    Lang Cao, J. Sun, and Adam Cross. Autord: An automatic and end-to-end system for rare disease knowledge graph construction based on ontology-enhanced large language models (preprint). JMIR Medical Informatics, 12, 2024

  9. [9]

    A review of biomedical datasets relating to drug discovery: a knowledge graph perspective

    Stephen Bonner, Ian P Barrett, Cheng Ye, Rowan Swiers, Ola Engkvist, Andreas Bender, Charles Tapley Hoyt, and William L Hamilton. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Briefings in Bioinformatics, 23(6):bbac404, 2022

  10. [10]

    The unified medical language system (umls): integrating biomedical terminology

    Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004

  11. [11]

    Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care

    Satya S Sahoo, Samden D Lhatoo, Deepak K Gupta, Licong Cui, Meng Zhao, Catherine Jayapandian, Alireza Bozorgi, and Guo-Qiang Zhang. Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care. Journal of the American Medical Informatics Association, 21(1):82–89, 2014

  12. [12]

    The epilepsy ontology: a community-based ontology tailored for semantic interoperability and text mining

    Astghik Sargsyan, Philipp Wegner, Stephan Gebel, Abish Kaladharan, Priya Sethumadhavan, Vanessa Lage-Rupprecht, Johannes Darms, Bruce Schultz, Jürgen Klein, Marc Jacobs, et al. The epilepsy ontology: a community-based ontology tailored for semantic interoperability and text mining. Bioinformatics advances, 3(1):vbad033, 2023

  13. [13]

    Large language models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

  14. [14]

    Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models

    Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023

  15. [15]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023

  16. [16]

    Llm-empowered patient-provider communication: A data-centric survey from a clinical perspective

    Ruosi Shao, Md Shamim Seraj, Kangyi Zhao, Yingtao Luo, Lincan Li, Bolin Shen, Averi Bates, Yue Zhao, Chongle Pan, Lisa Hightow-Weidman, et al. Llm-empowered patient-provider communication: A data-centric survey from a clinical perspective. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of th...

  17. [17]

    Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study

    Yanjun Gao, Ruizhe Li, Emma Croxford, John Caskey, Brian W Patterson, Matthew Churpek, Timothy Miller, Dmitriy Dligach, and Majid Afshar. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. Jmir Ai, 4:e58670, 2025

  18. [18]

    Agentic medical knowledge graphs enhance medical question answering: Bridging the gap between llms and evolving medical knowledge

    Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L Parker, Rahul G Krishnan, and Milad Lankarany. Agentic medical knowledge graphs enhance medical question answering: Bridging the gap between llms and evolving medical knowledge. arXiv preprint arXiv:2502.13010, 2025

  19. [19]

    Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs

    Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025

  20. [20]

    LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

    Lincan Li, Zheng Chen, and Yushun Dong. Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis. arXiv preprint arXiv:2604.28178, 2026

  21. [21]

    Artificial intelligence in epilepsy: a systemic review

    Almuntasar Al-Breiki, Said Al-Sinani, Ahmed Elsharaawy, Mohamed Usama, and Tariq Al-Saadi. Artificial intelligence in epilepsy: a systemic review. Journal of Epilepsy Research, 15(1):2–22, 2025

  22. [22]

    Artificial intelligence in epilepsy—applications and pathways to the clinic

    Alfredo Lucas, Andrew Revell, and Kathryn A Davis. Artificial intelligence in epilepsy—applications and pathways to the clinic. Nature Reviews Neurology, 20(6):319–336, 2024

  23. [23]

    Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions

    Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, and Wei Wang. Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions. arXiv preprint arXiv, 2406, 2024

  24. [24]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  25. [25]

    Diagnosisarena: benchmarking diagnostic reasoning for large language models

    Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, and Xiaofan Zhang. Diagnosisarena: benchmarking diagnostic reasoning for large language models. arXiv preprint arXiv:2505.14107, 2025

  26. [26]

    Benchmarking retrieval-augmented generation for medicine

    Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6233–6251, 2024

  27. [27]

    Optimizing eeg graph structure for seizure detection: An information bottleneck and self-supervised learning approach

    Lincan Li, Rikuto Kotoge, Xihao Piao, Zheng Chen, and Yushun Dong. Optimizing eeg graph structure for seizure detection: An information bottleneck and self-supervised learning approach. arXiv preprint arXiv:2604.01595, 2026

  28. [28]

    Medical subject headings (MeSH), 2024

    National Library of Medicine. Medical subject headings (MeSH), 2024. Accessed: 2025

  29. [29]

    Ilae official report: a practical clinical definition of epilepsy

    Robert S Fisher, Carlos Acevedo, Alexis Arzimanoglou, Alicia Bogacz, J Helen Cross, Christian E Elger, Jerome Engel Jr, Lars Forsgren, Jacqueline A French, Mike Glynn, et al. Ilae official report: a practical clinical definition of epilepsy. Epilepsia, 55(4):475–482, 2014

  30. [30]

    Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders

    Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick. Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic acids research, 33(suppl_1):D514–D517, 2005

  31. [31]

    Chebi in 2016: Improved services and an expanding collection of metabolites

    Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. Chebi in 2016: Improved services and an expanding collection of metabolites. Nucleic acids research, 44(D1):D1214–D1219, 2016

  32. [32]

    The human phenotype ontology in 2021

    Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith, Nicole A Vasilevsky, Daniel Danis, Ganna Balagura, Gareth Baynam, Amy M Brower, et al. The human phenotype ontology in 2021. Nucleic acids research, 49(D1):D1207–D1217, 2021

  33. [33]

    AES clinical practice guidelines, 2024

    American Epilepsy Society. AES clinical practice guidelines, 2024. Accessed: 2025

  34. [34]

    Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016

    Ettore Beghi, Giorgia Giussani, Emma Nichols, Foad Abd-Allah, Jemal Abdela, Ahmed Abdelalim, Haftom Niguse Abraha, Mina G Adib, Sutapa Agrawal, Fares Alahdab, et al. Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016. The Lancet Neurology, 18(4):357–375, 2019

  35. [35]

    Minimax-01: Scaling foundation models with lightning attention

    Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025

  36. [36]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

  37. [37]

    American clinical neurophysiology society guideline 7: guidelines for eeg reporting

    William O Tatum IV, Olga Selioutski, Juan G Ochoa, Heidi Munger Clary, Janna Cheek, Frank W Drislane, and Tammy N Tsuchida. American clinical neurophysiology society guideline 7: guidelines for eeg reporting. The Neurodiagnostic Journal, 56(4):285–293, 2016

  38. [38]

    Neural signals generate clinical notes in the wild

    Jathurshan Pradeepkumar, Zheng Chen, and Jimeng Sun. Neural signals generate clinical notes in the wild. arXiv preprint arXiv:2601.22197, 2026

  39. [39]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature medicine, 31(3):943–950, 2025

  40. [40]

    Ilae treatment guidelines: evidence-based analysis of antiepileptic drug efficacy and effectiveness as initial monotherapy for epileptic seizures and syndromes

    Tracy Glauser, Elinor Ben-Menachem, Blaise Bourgeois, Avital Cnaan, David Chadwick, Carlos Guerreiro, Reetta Kälviäinen, Richard Mattson, Emilio Perucca, and Torbjorn Tomson. Ilae treatment guidelines: evidence-based analysis of antiepileptic drug efficacy and effectiveness as initial monotherapy for epileptic seizures and syndromes. Epilepsia, 47(7):1094...

  41. [41]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  42. [42]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  43. [43]

    A survey for large language models in biomedicine

    Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, et al. A survey for large language models in biomedicine. Artificial Intelligence in Medicine, page 103268, 2025

  44. [44]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  45. [45]

    Claude Sonnet 4

    Anthropic. Claude Sonnet 4. Technical report, Anthropic, 2024

  46. [46]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  47. [47]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  48. [48]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

  49. [49]

    Mistral small 3.1, 2025

    Mistral AI. Mistral small 3.1, 2025. Accessed: 2025

  50. [50]

    Gemma 3 technical report

    Google DeepMind. Gemma 3 technical report. Technical report, Google, 2024

  51. [51]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  52. [52]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  53. [53]

    Harvard electroencephalography database (version 4.1)

    S Zafar, T Loddenkemper, JW Lee, A Cole, D Goldenholz, J Peters, A Lam, E Amorim, C Chu, S Cash, et al. Harvard electroencephalography database (version 4.1). Brain Data Science Platform, 2025

  54. [54]

    Harvard electroencephalography database: A comprehensive clinical electroencephalographic resource from four boston hospitals

    Chenxi Sun, Jin Jing, Niels Turley, Callison Alcott, Wan-Yee Kang, Andrew J Cole, Daniel M Goldenholz, Alice Lam, Edilberto Amorim, Catherine Chu, et al. Harvard electroencephalography database: A comprehensive clinical electroencephalographic resource from four boston hospitals. Epilepsia, 2025

  55. [55]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  56. [56]

    Greaselm: Graph reasoning enhanced language models for question answering

    Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. Greaselm: Graph reasoning enhanced language models for question answering. arXiv preprint arXiv:2201.08860, 2022

  57. [57]

    Qa-gnn: Reasoning with language models and knowledge graphs for question answering

    Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. Qa-gnn: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 535–546, 2021

  58. [58]

    Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. arXiv preprint arXiv:2307.07697, 2023

  59. [59]

    Knowledge graph-augmented language models for complex question answering

    Priyanka Sen, Sandeep Mavadia, and Amir Saffari. Knowledge graph-augmented language models for complex question answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE), pages 1–8, 2023

  60. [60]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  61. [61]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022

  62. [62]

    Automated interpretation of clinical electroencephalograms using artificial intelligence

    Jesper Tveit, Harald Aurlien, Sergey Plis, Vince D Calhoun, William O Tatum, Donald L Schomer, Vibeke Arntsen, Fieke Cox, Firas Fahoum, William B Gallentine, et al. Automated interpretation of clinical electroencephalograms using artificial intelligence. JAMA neurology, 80(8):805–812, 2023

  63. [63]

    Benchmarking large language models for biomedical natural language processing applications and recommendations

    Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nature communications, 16(1):3280, 2025

  64. [64]

    Evaluation of retrieval-augmented generation: A survey

    Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of retrieval-augmented generation: A survey. In CCF Conference on Big Data, pages 102–120. Springer, 2024

  65. [65]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  66. [66]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019