pith. machine review for the scientific record.

arxiv: 2605.09505 · v2 · submitted 2026-05-10 · 💻 cs.AI


EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild


Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords epilepsy · knowledge graph · large language models · clinical reasoning · pharmacogenomics · Graph-RAG · benchmark · neurology

The pith

A new epilepsy knowledge graph boosts LLM performance on clinical reasoning tasks by up to 41 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EpiGraph, a knowledge graph that distills over 48,000 epilepsy papers into a structured form of entities and evidence-grounded triplets. The graph augments large language models through Graph-RAG to improve handling of complex epilepsy cases involving biosignal patterns, genetics, and treatments. Evaluations on five tasks show consistent accuracy gains when the graph is added, with the largest improvements in pharmacogenomic reasoning, where performance rises by 30 to 41 percent. Such results indicate that explicit knowledge structures can help models reason more reliably over medical evidence.

Core claim

EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. When this graph augments six different LLMs on the EpiBench tasks for clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning, performance improves consistently, with the largest gains in pharmacogenomic reasoning of 30 to 41 percent.
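
To make the graph's unit of content concrete, here is a minimal sketch, in Python, of what an evidence-grounded triplet could look like. The field names, layer labels, relation vocabulary, and identifiers are illustrative assumptions, not the authors' schema; only the idea of keeping a source paper and evidence sentence on every edge comes from the paper.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Entity:
        entity_id: str   # normalized ontology identifier (placeholder values below)
        name: str        # surface form of the clinical concept
        layer: str       # one of the five clinical layers (labels assumed here)

    @dataclass(frozen=True)
    class EvidenceTriplet:
        head: Entity
        relation: str        # relation type; the vocabulary is assumed
        tail: Entity
        source_paper: str    # identifier of the supporting peer-reviewed paper
        evidence_span: str   # sentence from that paper grounding the triplet

    # Illustrative instance; identifiers and wording are placeholders.
    variant = Entity("GENE:HLA-B*15:02", "HLA-B*15:02", "genetics")
    drug = Entity("DRUG:carbamazepine", "carbamazepine", "pharmacogenomics")
    example = EvidenceTriplet(
        head=variant,
        relation="increases_adverse_reaction_risk_of",
        tail=drug,
        source_paper="PMID:placeholder",
        evidence_span="Carriers of HLA-B*15:02 show elevated risk of severe "
                      "cutaneous reactions to carbamazepine.",
    )

Keeping the evidence span on every edge is what would let Graph-RAG answers cite back to the literature rather than to the graph alone.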

What carries the argument

EpiGraph, the heterogeneous knowledge graph built from literature with five clinical layers and evidence-grounded triplets that supports Graph-RAG augmentation of LLMs.
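
As a rough illustration of that augmentation path, the sketch below linearizes triplets (reusing the EvidenceTriplet sketch above), retrieves the nearest ones for a clinical query with a sentence-embedding model, and packs them into a prompt. The paper cites Sentence-BERT for embeddings, but the model name, top_k, and prompt wording here are assumptions, not the authors' settings.

    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

    def linearize(t):
        """Render a triplet plus its evidence as retrievable text."""
        return (f"{t.head.name} {t.relation} {t.tail.name}. "
                f"Evidence: {t.evidence_span}")

    def retrieve(query, triplets, top_k=5):
        """Return the top_k triplets most similar to the clinical query."""
        corpus = encoder.encode([linearize(t) for t in triplets],
                                convert_to_tensor=True)
        q = encoder.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q, corpus, top_k=top_k)[0]
        return [triplets[h["corpus_id"]] for h in hits]

    def build_prompt(query, retrieved):
        """Assemble the graph-augmented prompt handed to the LLM."""
        evidence = "\n".join(f"- {linearize(t)} [{t.source_paper}]"
                             for t in retrieved)
        return ("Answer the question using the evidence below, citing "
                f"source ids.\n\nEvidence:\n{evidence}\n\nQuestion: {query}")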

If this is right

  • LLM performance improves across all five EpiBench tasks when EpiGraph is integrated via Graph-RAG.
  • The largest gains occur in pharmacogenomic reasoning, increasing by 30 to 41 percent.
  • Structured epilepsy knowledge from peer-reviewed sources enhances evidence-grounded clinical reasoning.
  • EpiBench serves as a framework for testing knowledge-augmented LLMs in neurological applications.
  • The approach demonstrates the value of knowledge graphs for complex medical decision-making.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method of building domain-specific graphs from large paper collections could extend to other medical fields with evidence-intensive reasoning needs.
  • The observed improvements may stem from reduced factual errors in LLM outputs when guided by structured triplets.
  • Testing the system on live clinical data or with expert neurologists would provide stronger validation beyond the benchmark.
  • Community contributions to the open code could expand the graph with newer research findings.

Load-bearing premise

The automatically extracted triplets and five-layer structure from the source papers accurately capture clinically reliable knowledge without systematic errors or omissions.
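
One way to probe this premise is a manual audit: sample extracted triplets, have a domain expert judge each against its evidence span, and report precision with an uncertainty interval. A minimal sketch, where expert_label stands in for human review (no such protocol is specified in the paper):

    import math
    import random

    def spot_check_precision(triplets, expert_label, n=200, seed=0):
        """Estimate triplet precision from a random expert-reviewed sample."""
        rng = random.Random(seed)
        sample = rng.sample(triplets, min(n, len(triplets)))
        correct = sum(1 for t in sample if expert_label(t))  # True if supported
        p = correct / len(sample)
        half = 1.96 * math.sqrt(p * (1 - p) / len(sample))   # normal approx., 95%
        return p, (max(0.0, p - half), min(1.0, p + half))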

What would settle it

Re-running the EpiBench evaluations with and without EpiGraph and finding no improvement, or even degradation, in the Graph-RAG condition would indicate the claim does not hold; a minimal harness for that comparison is sketched below.
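
A minimal harness for that comparison, reusing retrieve and build_prompt from the Graph-RAG sketch above; ask_llm and the item fields are placeholders for whatever EpiBench actually exposes, and exact-match scoring is an assumption:

    def accuracy(items, ask_llm, graph_triplets=None):
        """Exact-match accuracy; pass graph_triplets to enable Graph-RAG."""
        correct = 0
        for item in items:
            if graph_triplets is not None:
                hits = retrieve(item["question"], graph_triplets)
                prompt = build_prompt(item["question"], hits)
            else:
                prompt = item["question"]
            correct += ask_llm(prompt).strip() == item["answer"]
        return correct / len(items)

    def ablation(items, ask_llm, graph_triplets):
        """Compare the baseline and graph-augmented conditions on one task."""
        base = accuracy(items, ask_llm)
        aug = accuracy(items, ask_llm, graph_triplets)
        return {"baseline": base, "graph_rag": aug, "delta": aug - base}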

Figures

Figures reproduced from arXiv: 2605.09505 by Jathurshan Pradeepkumar, Jimeng Sun, Yasuko Matsubara, Yasushi Sakurai, Yushun Dong, Yuyang Dai, Zheng Chen.

Figure 2: Pipeline overview of EpiKG, comprising two components. Left: the paper-derived evidence graph is processed through an extraction pipeline that identifies entities, relations, and supporting evidence, mapped into a relation graph. Right: an example of how EpiKG grounds the paper-derived evidence graph: given the query paper "Resistance to excitotoxin-induced seizures...", EpiKG retrieves the supporting reasoning …

Figure 3: Sensitivity analysis and ablation results of Graph-RAG.

Figure 4: Two impression generation examples on S0001; the three columns refer to the MedGemma model …

Figure 6: Overview of EpiBench running time. Blue denotes T1 Knowledge QA, green T2 Report Generation, red T3 Precision Medicine, purple T4 Treatment Recommendation, and olive T5 Deep Research. The red star highlights Graph-RAG (ours). Ablation and sensitivity …

Figure 8: Overview of EpiBench results. From the paper's Limitations and Future Work: EpiKG is constructed from English-language ontologies and literature, limiting its coverage of epilepsy syndromes and gene–disease associations that are primarily documented in non-English sources; rare syndromes with fewer than five supporting papers are likely underrepresented in the extracted relation set, as the …
read the original abstract

Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present EpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, EpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating EpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30–41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EpiGraph, a heterogeneous knowledge graph constructed from 48,166 peer-reviewed papers and seven clinical resources, yielding 24,324 entities and 32,009 evidence-grounded triplets organized into five clinical layers. It defines EpiBench, a benchmark with five tasks covering clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. Evaluations of six LLMs under standard and Graph-RAG settings report consistent performance improvements from integrating EpiGraph, with the largest gains (+30–41%) in pharmacogenomic reasoning.

Significance. If the automatically extracted triplets prove clinically reliable, the work supplies a valuable, publicly released benchmark and code base for evaluating knowledge-augmented LLMs on evidence-intensive neurological reasoning. The reported gains illustrate the potential utility of structured domain graphs for clinical tasks that require retrieval across heterogeneous sources.

major comments (2)
  1. [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at a high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.
  2. [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.
minor comments (2)
  1. [Abstract] The abstract lists five tasks but does not name the six LLMs or the primary metrics (accuracy, F1, human preference, etc.) used for each task; adding these would improve clarity.
  2. [Results tables and figures] Figure captions and table headers should explicitly state whether reported numbers are means over multiple runs or single-run values.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our methods and results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses.

read point-by-point responses
  1. Referee: [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at a high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.

    Authors: We agree that additional validation details would strengthen the central claims. In the revised manuscript we will expand the Methods section with a full description of the extraction pipeline (including the specific LLMs, prompting strategies, and post-processing rules used to generate the 32,009 triplets). We will also add a validation subsection reporting precision and recall on a randomly sampled set of 500 triplets manually verified against source papers by a board-certified neurologist, together with inter-annotator agreement statistics on a 100-triplet overlap subset. These additions will directly address the reliability of the pharmacogenomic triplets that drive the largest reported gains. revision: yes

  2. Referee: [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.

    Authors: We acknowledge the need for greater statistical rigor and transparency. In the revision we will (1) report exact baseline definitions, including the retrieval method (dense passage retrieval with the same embedding model) and prompt templates used in the standard setting, (2) add paired statistical significance tests (Wilcoxon signed-rank) with p-values for each task and model, and (3) include 95% confidence intervals computed via bootstrap resampling for all accuracy and F1 scores. These changes will allow readers to assess the robustness of the +30–41% pharmacogenomics improvement; a minimal sketch of these tests appears below. revision: yes
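
A minimal sketch of the promised statistics, assuming paired per-item correctness scores (equal-length 0/1 arrays) for the baseline and Graph-RAG runs; the Wilcoxon test and bootstrap follow the response above, but every configuration detail here is assumed:

    import numpy as np
    from scipy.stats import wilcoxon

    def paired_stats(base_scores, rag_scores, n_boot=10_000, seed=0):
        """Wilcoxon signed-rank test plus bootstrap 95% CI on the accuracy delta."""
        base = np.asarray(base_scores, dtype=float)
        rag = np.asarray(rag_scores, dtype=float)
        # Paired test on per-item differences; zsplit handles the many ties.
        stat, p_value = wilcoxon(rag, base, zero_method="zsplit")
        # Bootstrap the mean accuracy difference over resampled items.
        rng = np.random.default_rng(seed)
        idx = rng.integers(0, len(base), size=(n_boot, len(base)))
        deltas = (rag[idx] - base[idx]).mean(axis=1)
        lo, hi = np.percentile(deltas, [2.5, 97.5])
        return {"delta": float(rag.mean() - base.mean()),
                "p_value": float(p_value),
                "ci95": (float(lo), float(hi))}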

Circularity Check

0 steps flagged

No circularity: construction and evaluation remain independent

full rationale

The paper describes an external data pipeline (48k papers + 7 resources) that produces a fixed graph and benchmark tasks, then measures LLM performance on those tasks with and without the graph. No equations, fitted parameters, or predictions are defined inside the paper whose outputs are then re-used as inputs. Evaluation relies on external LLMs and newly introduced tasks rather than any internal derivation that reduces to the construction choices. Self-citations, if present, are not load-bearing for any claimed result. The central claims are therefore empirically falsifiable outside the paper's own fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions about the utility of knowledge graphs for LLM reasoning and the representativeness of the extracted clinical facts; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Retrieved graph triplets improve LLM reasoning on clinical tasks without introducing harmful noise
    Central to the Graph-RAG evaluation design.

pith-pipeline@v0.9.0 · 5542 in / 1124 out tokens · 39403 ms · 2026-05-14T21:20:48.007939+00:00 · methodology



Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 11 internal anchors

  1. [1]

    Evobrain: Dynamic multi-channel EEG graph modeling for time-evolving brain networks

    Rikuto Kotoge, Zheng Chen, Tasuku Kimura, Yasuko Matsubara, Takufumi Yanagisawa, Haruhiko Kishima, and Yasushi Sakurai. Evobrain: Dynamic multi-channel EEG graph modeling for time-evolving brain networks. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  2. [2]

    Drug-resistant epilepsy

    Patrick Kwan, Steven C Schachter, and Martin J Brodie. Drug-resistant epilepsy. New England Journal of Medicine, 365(10):919–926, 2011

  3. [3]

    Long-term eeg partitioning for seizure onset detection

    Zheng Chen, Yasuko Matsubara, Yasushi Sakurai, and Jimeng Sun. Long-term eeg partitioning for seizure onset detection. In Proc. AAAI Conf. Artif. Intell., pages 14221–14229, 2025

  4. [4]

    Review of pharmacogenetics of antiseizure medications: focusing on genetic variants of mechanistic targets

    Chin-Wei Kuo, Ching-Yun Huang, Hsuan-Ming Chen, Jing-Jane Tsai, and Chin-Wei Huang. Review of pharmacogenetics of antiseizure medications: focusing on genetic variants of mechanistic targets. Frontiers in Pharmacology, 15:1411487, 2024

  5. [5]

    A review on knowledge graphs for healthcare: Resources, applications, and promises

    Hejie Cui, Jiaying Lu, Ran Xu, Shiyu Wang, Wenjing Ma, Yue Yu, Shaojun Yu, Xuan Kan, Chen Ling, Liang Zhao, et al. A review on knowledge graphs for healthcare: Resources, applications, and promises. Journal of biomedical informatics, page 104861, 2025

  6. [6]

    Building a knowledge graph to enable precision medicine

    Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific data, 10(1):67, 2023

  7. [7]

    Biomedical knowledge graph: A survey of domains, tasks, and real-world applications

    Yuxing Lu, Sin Yee Goi, Xukai Zhao, and Jinzhuo Wang. Biomedical knowledge graph: A survey of domains, tasks, and real-world applications. arXiv preprint arXiv:2501.11632, 2025

  8. [8]

    Autord: An automatic and end-to-end system for rare disease knowledge graph construction based on ontology-enhanced large language models

    Lang Cao, J. Sun, and Adam Cross. Autord: An automatic and end-to-end system for rare disease knowledge graph construction based on ontology-enhanced large language models (preprint). JMIR Medical Informatics, 12, 2024

  9. [9]

    A review of biomedical datasets relating to drug discovery: a knowledge graph perspective

    Stephen Bonner, Ian P Barrett, Cheng Ye, Rowan Swiers, Ola Engkvist, Andreas Bender, Charles Tapley Hoyt, and William L Hamilton. A review of biomedical datasets relating to drug discovery: a knowledge graph perspective. Briefings in Bioinformatics, 23(6):bbac404, 2022

  10. [10]

    The unified medical language system (umls): integrating biomedical terminology

    Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004

  11. [11]

    Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care

    Satya S Sahoo, Samden D Lhatoo, Deepak K Gupta, Licong Cui, Meng Zhao, Catherine Jayapandian, Alireza Bozorgi, and Guo-Qiang Zhang. Epilepsy and seizure ontology: towards an epilepsy informatics infrastructure for clinical research and patient care. Journal of the American Medical Informatics Association, 21(1):82–89, 2014

  12. [12]

    The epilepsy ontology: a community-based ontology tailored for semantic interoperability and text mining

    Astghik Sargsyan, Philipp Wegner, Stephan Gebel, Abish Kaladharan, Priya Sethumadhavan, Vanessa Lage-Rupprecht, Johannes Darms, Bruce Schultz, Jürgen Klein, Marc Jacobs, et al. The epilepsy ontology: a community-based ontology tailored for semantic interoperability and text mining. Bioinformatics advances, 3(1):vbad033, 2023

  13. [13]

    Large language models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

  14. [14]

    Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models

    Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023

  15. [15]

    Capabilities of GPT-4 on Medical Challenge Problems

    Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023

  16. [16]

    Llm-empowered patient-provider communication: A data-centric survey from a clinical perspective

    Ruosi Shao, Md Shamim Seraj, Kangyi Zhao, Yingtao Luo, Lincan Li, Bolin Shen, Averi Bates, Yue Zhao, Chongle Pan, Lisa Hightow-Weidman, et al. Llm-empowered patient-provider communication: A data-centric survey from a clinical perspective. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of th...

  17. [17]

    Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study

    Yanjun Gao, Ruizhe Li, Emma Croxford, John Caskey, Brian W Patterson, Matthew Churpek, Timothy Miller, Dmitriy Dligach, and Majid Afshar. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. Jmir Ai, 4:e58670, 2025

  18. [18]

    Agentic medical knowledge graphs enhance medical question answering: Bridging the gap between llms and evolving medical knowledge

    Mohammad Reza Rezaei, Reza Saadati Fard, Jayson L Parker, Rahul G Krishnan, and Milad Lankarany. Agentic medical knowledge graphs enhance medical question answering: Bridging the gap between llms and evolving medical knowledge. arXiv preprint arXiv:2502.13010, 2025

  19. [19]

    Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs

    Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, et al. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs. arXiv preprint arXiv:2504.00993, 2025

  20. [20]

    LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

    Lincan Li, Zheng Chen, and Yushun Dong. Llm as clinical graph structure refiner: Enhancing representation learning in eeg seizure diagnosis. arXiv preprint arXiv:2604.28178, 2026

  21. [21]

    Artificial intelligence in epilepsy: a systemic review

    Almuntasar Al-Breiki, Said Al-Sinani, Ahmed Elsharaawy, Mohamed Usama, and Tariq Al-Saadi. Artificial intelligence in epilepsy: a systemic review. Journal of Epilepsy Research, 15(1):2–22, 2025

  22. [22]

    Artificial intelligence in epilepsy—applications and pathways to the clinic

    Alfredo Lucas, Andrew Revell, and Kathryn A Davis. Artificial intelligence in epilepsy—applications and pathways to the clinic. Nature Reviews Neurology, 20(6):319–336, 2024

  23. [23]

    Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions

    Mingyu Derek Ma, Chenchen Ye, Yu Yan, Xiaoxuan Wang, Peipei Ping, Timothy S Chang, and Wei Wang. Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions. arXiv preprint arXiv, 2406, 2024

  24. [24]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  25. [25]

    Diagnosisarena: benchmarking diagnostic reasoning for large language models

    Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, and Xiaofan Zhang. Diagnosisarena: benchmarking diagnostic reasoning for large language models. arXiv preprint arXiv:2505.14107, 2025

  26. [26]

    Benchmarking retrieval-augmented generation for medicine

    Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6233–6251, 2024

  27. [27]

    Optimizing eeg graph structure for seizure detection: An information bottleneck and self-supervised learning approach

    Lincan Li, Rikuto Kotoge, Xihao Piao, Zheng Chen, and Yushun Dong. Optimizing eeg graph structure for seizure detection: An information bottleneck and self-supervised learning approach. arXiv preprint arXiv:2604.01595, 2026

  28. [28]

    Medical subject headings (MeSH), 2024

    National Library of Medicine. Medical subject headings (MeSH), 2024. Accessed: 2025

  29. [29]

    Ilae official report: a practical clinical definition of epilepsy

    Robert S Fisher, Carlos Acevedo, Alexis Arzimanoglou, Alicia Bogacz, J Helen Cross, Christian E Elger, Jerome Engel Jr, Lars Forsgren, Jacqueline A French, Mike Glynn, et al. Ilae official report: a practical clinical definition of epilepsy. Epilepsia, 55(4):475–482, 2014

  30. [30]

    Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders

    Ada Hamosh, Alan F Scott, Joanna S Amberger, Carol A Bocchini, and Victor A McKusick. Online mendelian inheritance in man (omim), a knowledgebase of human genes and genetic disorders. Nucleic acids research, 33(suppl_1):D514–D517, 2005

  31. [31]

    Chebi in 2016: Improved services and an expanding collection of metabolites

    Janna Hastings, Gareth Owen, Adriano Dekker, Marcus Ennis, Namrata Kale, Venkatesh Muthukrishnan, Steve Turner, Neil Swainston, Pedro Mendes, and Christoph Steinbeck. Chebi in 2016: Improved services and an expanding collection of metabolites. Nucleic acids research, 44(D1):D1214–D1219, 2016

  32. [32]

    The human phenotype ontology in 2021

    Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith, Nicole A Vasilevsky, Daniel Danis, Ganna Balagura, Gareth Baynam, Amy M Brower, et al. The human phenotype ontology in 2021. Nucleic acids research, 49(D1):D1207–D1217, 2021

  33. [33]

    AES clinical practice guidelines, 2024

    American Epilepsy Society. AES clinical practice guidelines, 2024. Accessed: 2025

  34. [34]

    Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016

    Ettore Beghi, Giorgia Giussani, Emma Nichols, Foad Abd-Allah, Jemal Abdela, Ahmed Abdelalim, Haftom Niguse Abraha, Mina G Adib, Sutapa Agrawal, Fares Alahdab, et al. Global, regional, and national burden of epilepsy, 1990–2016: a systematic analysis for the global burden of disease study 2016. The Lancet Neurology, 18(4):357–375, 2019

  35. [35]

    Minimax-01: Scaling foundation models with lightning attention

    Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, et al. Minimax-01: Scaling foundation models with lightning attention. arXiv preprint arXiv:2501.08313, 2025

  36. [36]

    Sentence-bert: Sentence embeddings using siamese bert-networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 3982–3992, 2019

  37. [37]

    American clinical neurophysiology society guideline 7: guidelines for eeg reporting

    William O Tatum IV, Olga Selioutski, Juan G Ochoa, Heidi Munger Clary, Janna Cheek, Frank W Drislane, and Tammy N Tsuchida. American clinical neurophysiology society guideline 7: guidelines for eeg reporting. The Neurodiagnostic Journal, 56(4):285–293, 2016

  38. [38]

    Neural signals generate clinical notes in the wild

    Jathurshan Pradeepkumar, Zheng Chen, and Jimeng Sun. Neural signals generate clinical notes in the wild. arXiv preprint arXiv:2601.22197, 2026

  39. [39]

    Toward expert-level medical question answering with large language models

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Mohamed Amin, Le Hou, Kevin Clark, Stephen R Pfohl, Heather Cole-Lewis, et al. Toward expert-level medical question answering with large language models. Nature medicine, 31(3):943–950, 2025

  40. [40]

    Ilae treatment guidelines: evidence-based analysis of antiepileptic drug efficacy and effectiveness as initial monotherapy for epileptic seizures and syndromes

    Tracy Glauser, Elinor Ben-Menachem, Blaise Bourgeois, Avital Cnaan, David Chadwick, Carlos Guerreiro, Reetta Kälviäinen, Richard Mattson, Emilio Perucca, and Torbjorn Tomson. Ilae treatment guidelines: evidence-based analysis of antiepileptic drug efficacy and effectiveness as initial monotherapy for epileptic seizures and syndromes. Epilepsia, 47(7):1094...

  41. [41]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  42. [42]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  43. [43]

    A survey for large language models in biomedicine

    Chong Wang, Mengyao Li, Junjun He, Zhongruo Wang, Erfan Darzi, Zan Chen, Jin Ye, Tianbin Li, Yanzhou Su, Jing Ke, et al. A survey for large language models in biomedicine. Artificial Intelligence in Medicine, page 103268, 2025

  44. [44]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  45. [45]

    Claude Sonnet 4

    Anthropic. Claude Sonnet 4. Technical report, Anthropic, 2024

  46. [46]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  47. [47]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  48. [48]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

  49. [49]

    Mistral small 3.1, 2025

    Mistral AI. Mistral small 3.1, 2025. Accessed: 2025

  50. [50]

    Gemma 3 technical report

    Google DeepMind. Gemma 3 technical report. Technical report, Google, 2024

  51. [51]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025

  52. [52]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  53. [53]

    Harvard electroencephalography database (version 4.1)

    S Zafar, T Loddenkemper, JW Lee, A Cole, D Goldenholz, J Peters, A Lam, E Amorim, C Chu, S Cash, et al. Harvard electroencephalography database (version 4.1). Brain Data Science Platform, 2025

  54. [54]

    Harvard electroencephalography database: A comprehensive clinical electroencephalographic resource from four boston hospitals

    Chenxi Sun, Jin Jing, Niels Turley, Callison Alcott, Wan-Yee Kang, Andrew J Cole, Daniel M Goldenholz, Alice Lam, Edilberto Amorim, Catherine Chu, et al. Harvard electroencephalography database: A comprehensive clinical electroencephalographic resource from four boston hospitals. Epilepsia, 2025

  55. [55]

    Retrieval-augmented generation for knowledge-intensive nlp tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

  56. [56]

    Greaselm: Graph reasoning enhanced language models for question answering

    Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. Greaselm: Graph reasoning enhanced language models for question answering. arXiv preprint arXiv:2201.08860, 2022

  57. [57]

    Qa-gnn: Reasoning with language models and knowledge graphs for question answering

    Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. Qa-gnn: Reasoning with language models and knowledge graphs for question answering. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 535–546, 2021

  58. [58]

    Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M Ni, Heung-Yeung Shum, and Jian Guo. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. arXiv preprint arXiv:2307.07697, 2023

  59. [59]

    Knowledge graph-augmented language models for complex question answering

    Priyanka Sen, Sandeep Mavadia, and Amir Saffari. Knowledge graph-augmented language models for complex question answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE), pages 1–8, 2023

  60. [60]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  61. [61]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022

  62. [62]

    Automated interpretation of clinical electroencephalograms using artificial intelligence

    Jesper Tveit, Harald Aurlien, Sergey Plis, Vince D Calhoun, William O Tatum, Donald L Schomer, Vibeke Arntsen, Fieke Cox, Firas Fahoum, William B Gallentine, et al. Automated interpretation of clinical electroencephalograms using artificial intelligence. JAMA neurology, 80(8):805–812, 2023

  63. [63]

    Benchmarking large language models for biomedical natural language processing applications and recommendations

    Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nature communications, 16(1):3280, 2025

  64. [64]

    Evaluation of retrieval-augmented generation: A survey

    Hao Yu, Aoran Gan, Kai Zhang, Shiwei Tong, Qi Liu, and Zhaofeng Liu. Evaluation of retrieval-augmented generation: A survey. In CCF Conference on Big Data, pages 102–120. Springer, 2024

  65. [65]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

  66. [66]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019