Recognition: 2 theorem links · Lean Theorem
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
Pith reviewed 2026-05-14 21:20 UTC · model grok-4.3
The pith
A new epilepsy knowledge graph boosts LLM performance on clinical reasoning tasks by up to 41 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. When this graph augments six different LLMs on the five EpiBench tasks (clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning), performance improves consistently, with the largest gains, of 30 to 41 percent, in pharmacogenomic reasoning.
What carries the argument
EpiGraph, the heterogeneous knowledge graph built from literature with five clinical layers and evidence-grounded triplets that supports Graph-RAG augmentation of LLMs.
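The Graph-RAG pattern described here can be sketched minimally: retrieve evidence-grounded triplets relevant to a query and prepend them to the LLM prompt. Everything below (the `Triplet` type, the token-overlap scoring, and the example facts) is a hypothetical illustration, not the paper's implementation, which would presumably use learned retrieval over the full graph.

```python
# Hypothetical Graph-RAG sketch: rank triplets by crude token overlap with the
# query and prepend the top hits, with their evidence IDs, to the prompt.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str
    evidence: str  # e.g. a PubMed ID grounding the fact

def _tokens(text: str) -> set[str]:
    # Crude tokenizer; a real system would use dense embeddings instead.
    return set(re.findall(r"[a-z0-9*:\-]+", text.lower()))

def retrieve(query: str, graph: list[Triplet], k: int = 2) -> list[Triplet]:
    q = _tokens(query)
    def overlap(t: Triplet) -> int:
        return len(q & _tokens(f"{t.head} {t.relation} {t.tail}"))
    return sorted(graph, key=overlap, reverse=True)[:k]

def build_prompt(query: str, graph: list[Triplet]) -> str:
    facts = "\n".join(
        f"- {t.head} {t.relation} {t.tail} [{t.evidence}]"
        for t in retrieve(query, graph)
    )
    return f"Known facts:\n{facts}\n\nQuestion: {query}"

graph = [
    Triplet("carbamazepine", "contraindicated_with", "HLA-B*15:02", "PMID:1"),
    Triplet("valproate", "treats", "generalized seizures", "PMID:2"),
    Triplet("SCN1A variant", "associated_with", "Dravet syndrome", "PMID:3"),
]
prompt = build_prompt("Which drug is contraindicated with HLA-B*15:02?", graph)
```

The evidence IDs carried on each triplet are what make the retrieved context auditable, which is the property the review's "evidence-grounded" framing turns on.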
If this is right
- LLM performance improves across all five EpiBench tasks when EpiGraph is integrated via Graph-RAG.
- The largest gains occur in pharmacogenomic reasoning, increasing by 30 to 41 percent.
- Structured epilepsy knowledge from peer-reviewed sources enhances evidence-grounded clinical reasoning.
- EpiBench serves as a framework for testing knowledge-augmented LLMs in neurological applications.
- The approach demonstrates the value of knowledge graphs for complex medical decision-making.
Where Pith is reading between the lines
- This method of building domain-specific graphs from large paper collections could extend to other medical fields with evidence-intensive reasoning needs.
- The observed improvements may stem from reduced factual errors in LLM outputs when guided by structured triplets.
- Testing the system on live clinical data or with expert neurologists would provide stronger validation beyond the benchmark.
- Community contributions to the open code could expand the graph with newer research findings.
Load-bearing premise
The automatically extracted triplets and five-layer structure from the source papers accurately capture clinically reliable knowledge without systematic errors or omissions.
What would settle it
Running the same LLM evaluations on EpiBench without using EpiGraph and finding no performance improvement or even degradation would indicate the claim does not hold.
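The proposed ablation is simple arithmetic: score each EpiBench task with and without the graph and inspect the deltas. The numbers below are invented solely to illustrate the check, not results from the paper.

```python
# Hypothetical ablation check: the claim holds only if every task improves
# when EpiGraph is added, and fails if gains vanish or turn negative.
baseline   = {"decision": 0.61, "eeg_report": 0.55, "pharmacogenomics": 0.40}
with_graph = {"decision": 0.66, "eeg_report": 0.58, "pharmacogenomics": 0.54}

deltas = {task: with_graph[task] - baseline[task] for task in baseline}
claim_holds = all(d > 0 for d in deltas.values())

# Relative gain on the headline task (0.14 / 0.40 = 35%, inside the
# paper's reported 30-41% band for these toy numbers).
rel_gain = deltas["pharmacogenomics"] / baseline["pharmacogenomics"]
```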
Original abstract
Epilepsy diagnosis and treatment require evidence-intensive reasoning across heterogeneous clinical knowledge, including biosignal patterns, genetic mechanisms, pharmacogenomics, treatment strategies, and patient outcomes. In this work, we present EpiGraph, a large-scale epilepsy knowledge graph and benchmark for evaluating knowledge-augmented clinical reasoning. EpiGraph integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers. Built upon this graph, EpiBench defines five clinically motivated tasks spanning clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. We evaluate six LLMs under both standard and Graph-RAG settings. Results show that integrating EpiGraph consistently improves performance across all tasks, with the largest gains observed in pharmacogenomic reasoning (+30–41%). Our findings demonstrate that structured epilepsy knowledge substantially enhances evidence-grounded clinical reasoning and provides a practical benchmark framework for evaluating knowledge-augmented LLMs in real-world neurological settings. Our code is available at: https://github.com/LabRAI/EEG-KG.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces EpiGraph, a heterogeneous knowledge graph constructed from 48,166 peer-reviewed papers and seven clinical resources, yielding 24,324 entities and 32,009 evidence-grounded triplets organized into five clinical layers. It defines EpiBench, a benchmark with five tasks covering clinical decision-making, EEG report generation, pharmacogenomic precision medicine, treatment recommendation, and deep research planning. Evaluations of six LLMs under standard and Graph-RAG settings report consistent performance improvements from integrating EpiGraph, with the largest gains (+30–41%) in pharmacogenomic reasoning.
Significance. If the automatically extracted triplets prove clinically reliable, the work supplies a valuable, publicly released benchmark and code base for evaluating knowledge-augmented LLMs on evidence-intensive neurological reasoning. The reported gains illustrate the potential utility of structured domain graphs for clinical tasks that require retrieval across heterogeneous sources.
major comments (2)
- [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.
- [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.
minor comments (2)
- [Abstract] The abstract lists five tasks but does not name the six LLMs or the primary metrics (accuracy, F1, human preference, etc.) used for each task; adding these would improve clarity.
- [Results tables and figures] Figure captions and table headers should explicitly state whether reported numbers are means over multiple runs or single-run values.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our methods and results. We address each major comment below and will revise the manuscript to incorporate additional details and analyses.
Point-by-point responses
-
Referee: [EpiGraph construction and triplet extraction] The graph-construction pipeline (48k papers + 7 resources into five layers and 32k triplets) is presented only at high level. No precision, recall, inter-annotator agreement, or expert review of sampled triplets against source papers is reported. Because the headline result—largest gains in pharmacogenomic reasoning—depends directly on the clinical accuracy of these triplets, the absence of validation leaves the central claim only moderately supported.
Authors: We agree that additional validation details would strengthen the central claims. In the revised manuscript we will expand the Methods section with a full description of the extraction pipeline (including the specific LLMs, prompting strategies, and post-processing rules used to generate the 32,009 triplets). We will also add a validation subsection reporting precision and recall on a randomly sampled set of 500 triplets manually verified against source papers by a board-certified neurologist, together with inter-annotator agreement statistics on a 100-triplet overlap subset. These additions will directly address the reliability of the pharmacogenomic triplets that drive the largest reported gains. revision: yes
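The promised validation reduces to two quantities: precision over an expert-labeled triplet sample and inter-annotator agreement on the overlap subset. A minimal sketch with invented labels (not the paper's reported numbers), using Cohen's kappa as one plausible agreement statistic:

```python
# Hypothetical validation arithmetic for an expert triplet review.
def precision(labels: list[bool]) -> float:
    """Fraction of sampled triplets judged faithful to their source paper."""
    return sum(labels) / len(labels)

def cohen_kappa(a: list[bool], b: list[bool]) -> float:
    """Chance-corrected agreement between two annotators on binary labels."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n      # raw agreement
    pa, pb = sum(a) / n, sum(b) / n                     # per-annotator rates
    p_exp = pa * pb + (1 - pa) * (1 - pb)               # agreement by chance
    return (p_obs - p_exp) / (1 - p_exp)

sample = [True] * 9 + [False]       # 9 of 10 sampled triplets verified correct
ann_a = [True, True, True, False]   # overlap subset, annotator A
ann_b = [True, True, False, False]  # overlap subset, annotator B

p = precision(sample)               # 0.9
kappa = cohen_kappa(ann_a, ann_b)   # 0.5
```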
-
Referee: [Evaluation and results] Results are summarized as “consistent gains” without reported statistical significance tests, confidence intervals, or exact baseline definitions (e.g., which retrieval method or prompt template constitutes the non-Graph-RAG condition). This detail is required to substantiate the quantitative claims, especially the +30–41% pharmacogenomics improvement.
Authors: We acknowledge the need for greater statistical rigor and transparency. In the revision we will (1) report exact baseline definitions, including the retrieval method (dense passage retrieval with the same embedding model) and prompt templates used in the standard setting, (2) add paired statistical significance tests (Wilcoxon signed-rank) with p-values for each task and model, and (3) include 95% confidence intervals computed via bootstrap resampling for all accuracy and F1 scores. These changes will allow readers to assess the robustness of the +30–41% pharmacogenomics improvement. revision: yes
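The bootstrap confidence interval promised here can be sketched in a few lines; the per-item score differences below are toy values with a fixed seed, and a real analysis would pair this with the Wilcoxon signed-rank test the authors mention.

```python
# Hypothetical percentile-bootstrap 95% CI for a mean per-item score
# difference (Graph-RAG minus baseline). Data and seed are illustrative.
import random

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for a reproducible interval
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Per-question accuracy difference on 10 toy items.
diffs = [0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1, 0.2, 0.0, 0.1]
lo, hi = bootstrap_ci(diffs)
# An interval that excludes 0 would support a genuine improvement.
```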
Circularity Check
No circularity: construction and evaluation remain independent
Full rationale
The paper describes an external data pipeline (48k papers + 7 resources) that produces a fixed graph and benchmark tasks, then measures LLM performance on those tasks with and without the graph. No equations, fitted parameters, or predictions are defined inside the paper whose outputs are then re-used as inputs. Evaluation relies on external LLMs and newly introduced tasks rather than any internal derivation that reduces to the construction choices. Self-citations, if present, are not load-bearing for any claimed result. The central claims are therefore empirically falsifiable outside the paper's own fitted values.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: retrieved graph triplets improve LLM reasoning on clinical tasks without introducing harmful noise.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
EPIKG integrates 48,166 peer-reviewed papers and seven clinical resources into a heterogeneous graph containing 24,324 entities and 32,009 evidence-grounded triplets across five clinical layers.
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Graph-RAG yields consistent MCQ gains across all six models (avg. +11.3 pp). ... largest gains observed in pharmacogenomic reasoning (+30–41%).
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.