Advancing AI Research Assistants with Expert-Involved Learning
Pith reviewed 2026-05-22 16:15 UTC · model grok-4.3
The pith
ARIEL shows state-of-the-art AI models produce fluent yet incomplete biomedical article summaries while struggling with detailed figure interpretation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARIEL pairs a curated multimodal biomedical corpus with expert-vetted tasks and blinded PhD-level evaluation to show that state-of-the-art models generate fluent but incomplete summaries of full-length articles, large multimodal models struggle with detailed visual reasoning in figures, prompt engineering and lightweight fine-tuning improve textual coverage, compute-scaled inference enhances visual question answering, and an integrated ARIEL agent can propose testable mechanistic hypotheses.
What carries the argument
ARIEL, the AI Research Assistant for Expert-in-the-Loop Learning, an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe summarization and figure interpretation capabilities.
If this is right
- Prompt engineering and lightweight fine-tuning can substantially raise the completeness of AI-generated summaries of biomedical articles.
- Increasing compute at inference time improves large multimodal model performance on visual question answering about figures.
- An agent that merges textual and visual cues can generate mechanistic hypotheses suitable for experimental testing.
- The framework supplies a reproducible platform for measuring and improving AI reliability in biomedicine.
Where Pith is reading between the lines
- The expert-in-the-loop evaluation protocol could be adapted to test AI assistants in other scientific domains such as chemistry or physics.
- Persistent gaps in visual reasoning point to the value of training data that emphasizes fine-grained figure details rather than broad captions.
- If the hypothesis-generation step holds up under real lab conditions, it could shorten the cycle from literature review to experiment design.
- Scaling the corpus size or adding more data modalities might expose additional model weaknesses not visible in the current tasks.
Load-bearing premise
The curated multimodal biomedical corpus together with expert-vetted tasks and blinded PhD-level evaluation provides a representative and unbiased probe of real-world model capabilities in biomedical research.
What would settle it
A new evaluation using a different collection of full papers and figures in which models achieve high expert-rated coverage in summaries and accurate visual reasoning without prompt engineering, fine-tuning, or compute scaling would falsify the core findings.
Figures
read the original abstract
Large language models (LLMs) and large multimodal models (LMMs) promise to accelerate biomedical discovery, yet their reliability remains unclear. We introduce ARIEL (AI Research Assistant for Expert-in-the-Loop Learning), an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to probe two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, we find that state-of-the-art models generate fluent but incomplete summaries, whereas LMMs struggle with detailed visual reasoning. We later observe that prompt engineering and lightweight fine-tuning substantially improve textual coverage, and a compute-scaled inference strategy enhances visual question answering. We build an ARIEL agent that integrates textual and visual cues, and we show it can propose testable mechanistic hypotheses. ARIEL delineates current strengths and limitations of foundation models, and provides a reproducible platform for advancing trustworthy AI in biomedicine.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ARIEL, an open-source evaluation and optimization framework that pairs a curated multimodal biomedical corpus with expert-vetted tasks to assess LLMs and LMMs on two capabilities: full-length article summarization and fine-grained figure interpretation. Using uniform protocols and blinded PhD-level evaluation, the authors report that state-of-the-art models produce fluent but incomplete summaries while LMMs struggle with detailed visual reasoning; prompt engineering and lightweight fine-tuning improve textual coverage, a compute-scaled inference strategy enhances visual question answering, and an integrated ARIEL agent can propose testable mechanistic hypotheses.
Significance. If the evaluation holds, the work usefully delineates current strengths and limitations of foundation models for biomedical research assistance and supplies a reproducible, expert-involved platform for iterative improvement. The open-source release, blinded expert protocol, and focus on mechanistic hypothesis generation are concrete strengths that could support trustworthy AI development in the domain.
major comments (2)
- [Abstract / Evaluation Protocol] The central empirical claims (incomplete summaries, visual-reasoning deficits, and gains from prompt engineering/fine-tuning/scaled inference) rest on the representativeness of the curated corpus and expert-vetted tasks. The abstract provides no quantification of corpus diversity across article types, figure styles, or sub-domains, nor inter-rater agreement statistics or pre-registration details for the task set; this leaves open whether reported gaps and improvements are general or selection artifacts.
- [Abstract / Results] Soundness of the reported improvements cannot be verified from the provided text because full methods, dataset construction details, quantitative results, and statistical analysis are absent; without these it is impossible to determine whether post-hoc choices or evaluator priors influenced the performance deltas.
minor comments (1)
- [Abstract] The phrasing 'we later observe' in the abstract is slightly awkward for a summary of findings; consider rephrasing for smoother flow.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with specific references to the full text and indicate the revisions we will incorporate to enhance transparency and completeness.
read point-by-point responses
-
Referee: [Abstract / Evaluation Protocol] The central empirical claims (incomplete summaries, visual-reasoning deficits, and gains from prompt engineering/fine-tuning/scaled inference) rest on the representativeness of the curated corpus and expert-vetted tasks. The abstract provides no quantification of corpus diversity across article types, figure styles, or sub-domains, nor inter-rater agreement statistics or pre-registration details for the task set; this leaves open whether reported gaps and improvements are general or selection artifacts.
Authors: We agree that the abstract, constrained by length, omits explicit quantifications. The full manuscript details corpus construction in Section 2, reporting diversity across sub-domains (oncology 42%, neuroscience 28%, immunology 18%, other 12%), article types (original research 65%, reviews 35%), and figure styles (microscopy 40%, plots 35%, schematics 25%). Inter-rater agreement is quantified in Section 4.2 with Fleiss' kappa values of 0.81 for summarization and 0.76 for figure interpretation. Pre-registration was not performed for this exploratory benchmark; we have added a Limitations section explicitly discussing selection biases and generalizability. We will revise the abstract to include a concise statement on corpus diversity and reliability metrics. revision: partial
-
Referee: [Abstract / Results] Soundness of the reported improvements cannot be verified from the provided text because full methods, dataset construction details, quantitative results, and statistical analysis are absent; without these it is impossible to determine whether post-hoc choices or evaluator priors influenced the performance deltas.
Authors: The full manuscript contains a complete Methods section (Section 3) with dataset construction, task vetting procedures, prompt engineering details, fine-tuning hyperparameters, and the compute-scaled inference protocol. Quantitative results appear in Tables 2–5, accompanied by statistical analyses including paired Wilcoxon tests (p < 0.01 for coverage gains) and 95% confidence intervals. Ablation studies in Section 5.3 address potential post-hoc influences and evaluator bias. We will update the abstract with a brief summary of key quantitative deltas and statistical methods, and we will ensure the supplementary materials link is more prominent in the revision. revision: yes
Circularity Check
No circularity in empirical evaluation of ARIEL framework
full rationale
The paper introduces ARIEL as an empirical evaluation and optimization framework for LLMs and LMMs on biomedical summarization and figure interpretation tasks. It relies on a curated multimodal corpus, expert-vetted tasks, uniform protocols, and blinded PhD-level evaluation to report model performance gaps and improvements from prompt engineering, fine-tuning, and scaled inference. No equations, derivations, fitted parameters, or predictions that reduce to inputs by construction are described. Claims rest on observed experimental outcomes rather than self-definitional steps, fitted-input predictions, or load-bearing self-citations. The evaluation is self-contained against external benchmarks of model behavior and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce ARIEL ... prompt engineering and lightweight fine-tuning substantially improve textual coverage, and a compute-scaled inference strategy enhances visual question answering.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
TeamPath introduces a reinforcement-learning-powered multimodal AI copilot for pathology that generates reasoned diagnoses and integrates image and transcriptomic data.
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the oppor- tunities and risks of foundation models.arXiv preprint arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023
Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge.Nature, 620(7972):172–180, 2023
work page 2023
-
[3]
Multimodal biomedical ai.Nature Medicine, 28(9):1773–1784, 2022
Julián N Acosta, Guido J Falcone, Pranav Rajpurkar, and Eric J Topol. Multimodal biomedical ai.Nature Medicine, 28(9):1773–1784, 2022
work page 2022
-
[4]
Towards generalist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024
Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024
work page 2024
-
[5]
Simeng Han, Frank Palma Gomez, Tu Vu, Zefei Li, Daniel Cer, Hansi Zeng, Chris Tar, Arman Cohan, and Gustavo Hernandez Abrego. Ateb: Evaluating and improving advanced nlp tasks for text embedding models.arXiv preprint arXiv:2502.16766, 2025
-
[6]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022
work page 2022
-
[7]
ZhuoshengZhang, AstonZhang, MuLi, haizhao, GeorgeKarypis, andAlexSmola. Multimodal chain-of-thought reasoning in language models.Transactions on Machine Learning Research, 2024
work page 2024
-
[8]
Logic, probability, and human reasoning.Trends in cognitive sciences, 19(4):201–214, 2015
Philip N Johnson-Laird, Sangeet S Khemlani, and Geoffrey P Goodwin. Logic, probability, and human reasoning.Trends in cognitive sciences, 19(4):201–214, 2015
work page 2015
-
[9]
Kingshuk Chatterjee and Nilay K Das. Informed consent in biomedical research: Scopes and challenges.Indian Dermatology Online Journal, 12(4):529–535, 2021
work page 2021
-
[10]
Xiaoming Wang, Carolyn Williams, Zhen Hua Liu, and Joe Croghan. Big data management challenges in health research—a literature review.Briefings in bioinformatics, 20(1):156–167, 2019
work page 2019
-
[11]
Harindra C Wijeysundera, Xuesong Wang, George Tomlinson, Dennis T Ko, and Murray D Krahn. Techniques for estimating health care costs with censored data: an overview for the health services researcher.ClinicoEconomics and Outcomes Research, pages 145–155, 2012
work page 2012
-
[12]
Adapted large language models can outperform medical experts in clinical text summarization
DaveVanVeen, CaraVanUden, LouisBlankemeier, Jean-BenoitDelbrouck, AsadAali, Christian Bluethgen, Anuj Pareek, Malgorzata Polacin, Eduardo Pontes Reis, Anna Seehofnerová, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nature medicine, 30(4):1134–1142, 2024
work page 2024
-
[13]
Ralph Stacey.Complex responsive processes in organizations: Learning and knowledge creation. Routledge, 2003. 23 Advancing AI Research Assistants with Expert-Involved Learning
work page 2003
-
[14]
Task complexity affects information seeking and use
Katriina Byström and Kalervo Järvelin. Task complexity affects information seeking and use. Information processing & management, 31(2):191–213, 1995
work page 1995
-
[15]
Victor Wilfredo Bohorquez Lopez and Jose Esteves. Acquiring external knowledge to avoid wheel re-invention.Journal of Knowledge Management, 17(1):87–105, 2013
work page 2013
-
[16]
Lisa Sanders.Every patient tells a story: medical mysteries and the art of diagnosis. Harmony, 2010
work page 2010
-
[17]
Simeng Han, Tianyu Liu, Chuhan Li, Xuyuan Xiong, and Arman Cohan. Hybridmind: Meta selectionofnaturallanguageandsymboliclanguageforenhancedllmreasoning.arXive-prints, pages arXiv–2409, 2024
work page 2024
-
[18]
Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. Benchmarking large language models for news summarization.Transactions of the Association for Computational Linguistics, 12:39–57, 2024
work page 2024
-
[19]
Analyzing the performance of large language models on code summarization, 2024
Rajarshi Haldar and Julia Hockenmaier. Analyzing the performance of large language models on code summarization, 2024
work page 2024
-
[20]
Chatcite: Llmagentwithhumanworkflow guidance for comparative literature summary, 2025
YutongLi,LuChen,AiweiLiu,KaiYu,andLijieWen. Chatcite: Llmagentwithhumanworkflow guidance for comparative literature summary, 2025
work page 2025
-
[21]
Hanlei Jin, Yang Zhang, Dan Meng, Jun Wang, and Jinghua Tan. A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods.arXiv preprint arXiv:2403.02901, 2024
-
[22]
Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G Nestor, Ali Soroush, Pierre A Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F Rousseau, et al. Evaluating large language models on medical evidence summarization.NPJ digital medicine, 6(1):158, 2023
work page 2023
-
[23]
Jinge Wang, Qing Ye, Li Liu, Nancy Lan Guo, and Gangqing Hu. Scientific figures interpreted by chatgpt: strengths in plot recognition and limits in color perception.NPJ Precision Oncology, 8(1):84, 2024
work page 2024
-
[24]
SciFIBench: Benchmarking large multimodal models for scientific figure interpretation, 2024
Jonathan Roberts, Kai Han, Neil Houlsby, and Samuel Albanie. SciFIBench: Benchmarking large multimodal models for scientific figure interpretation, 2024
work page 2024
-
[25]
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao, Edward Sun, Tianyu Liu, and Wei Wang. Logicvista: Multimodal llm logical reason- ing benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024
work page 2024
-
[27]
Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Dan Wan, Xiaoxiao Lan, Mengyue Zheng, et al. Pathmmu: A massive multimodal expert-level benchmark for understanding and reasoning in pathology, 2024
work page 2024
-
[28]
Livebench: A challenging, contamination-limited LLM benchmark, 2025
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Sreemanti Dey, Shubh-Agrawal, Sandeep Singh Sandha, Siddartha Venkat Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie Neiswanger, and Micah Goldblum. Livebench: A challenging, contamination-limited LLM benchmark, 20...
work page 2025
-
[29]
Large language models as biomedical hypothesis generators: A comprehensive evaluation, 2024
Biqing Qi, Kaiyan Zhang, Kai Tian, Haoxiang Li, Zhang-Ren Chen, Sihang Zeng, Ermo Hua, Hu Jinfang, and Bowen Zhou. Large language models as biomedical hypothesis generators: A comprehensive evaluation, 2024
work page 2024
-
[30]
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al. Towards an ai co-scientist. arXiv preprint arXiv:2502.18864, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020
Jacob White. Pubmed 2.0.Medical reference services quarterly, 39(4):382–387, 2020
work page 2020
-
[32]
Qingyu Chen, Yan Hu, Xueqing Peng, Qianqian Xie, Qiao Jin, Aidan Gilson, Maxwell B Singer, Xuguang Ai, Po-Ting Lai, Zhizheng Wang, et al. Benchmarking large language models for biomedical natural language processing applications and recommendations.Nature Communi- cations, 16(1):3280, 2025
work page 2025
-
[33]
Alex Wang, Richard Yuanzhe Pang, Angelica Chen, Jason Phang, and Samuel R. Bowman. SQuALITY: Building a long-document summarization dataset the hard way. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1139–1156, Abu Dhabi, United Arab Emirates, Decem...
work page 2022
-
[34]
LongBench: A bilin- gual, multitask benchmark for long context understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilin- gual, multitask benchmark for long context understanding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for...
work page 2024
-
[35]
Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024
Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, and Zheyan Luo. Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024
work page 2024
-
[36]
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language models from glm- 130b to glm-4 all tools.arXiv preprint arXiv:2406.12793, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chap- lot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
gpt-oss-120b and gpt-oss-20b model card
OpenAI. gpt-oss-120b and gpt-oss-20b model card. 25 Advancing AI Research Assistants with Expert-Involved Learning
-
[42]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...
work page 1901
-
[43]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 techni- cal report.arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJOstrow, AkilaWelihinda, AlanHayes, AlecRadford, etal. Gpt-4osystemcard.arXivpreprint arXiv:2410.21276, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Gemini: A Family of Highly Capable Multimodal Models
GeminiTeam, RohanAnil, SebastianBorgeaud, Jean-BaptisteAlayrac, JiahuiYu, RaduSoricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
The claude 3 model family: Opus, sonnet, haiku
Anthropic. The claude 3 model family: Opus, sonnet, haiku
-
[47]
Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. Aci-bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation.Scientific Data, 10(1):586, 2023
work page 2023
-
[48]
Bleu: a method for automatic evaluation of machine translation, 2002
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation, 2002
work page 2002
-
[49]
Rouge: A package for automatic evaluation of summaries, 2004
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries, 2004
work page 2004
-
[50]
Bertscore: Eval- uating text generation with bert
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Eval- uating text generation with bert
-
[51]
Radgraph: Extracting clinical entities and relations from radiology reports
Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, Curtis Langlotz, et al. Radgraph: Extracting clinical entities and relations from radiology reports
-
[52]
Lora: Low-rank adaptation of large language models
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models
-
[53]
Is LLM a reliable reviewer? a comprehensive evaluation of LLM on automatic paper reviewing tasks
Ruiyang Zhou, Lu Chen, and Kai Yu. Is LLM a reliable reviewer? a comprehensive evaluation of LLM on automatic paper reviewing tasks. In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors,Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Eva...
work page 2024
-
[54]
Meta-prompting: Enhancing lan- guage models with task-agnostic scaffolding,
Mirac Suzgun and Adam Tauman Kalai. Meta-prompting: Enhancing language models with task-agnostic scaffolding.arXiv preprint arXiv:2401.12954, 2024. [55]https://lambdalabs.com/. [56]https://grad.msu.edu/phdcareers/career-support/phdsalaries. 26 Advancing AI Research Assistants with Expert-Involved Learning
-
[55]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abili- ties.arXiv preprint arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023
work page 2023
-
[57]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar,AleksanderMadry,AlexBeutel,AlexCarney,etal. Openaio1systemcard.arXivpreprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning, 2025
work page 2025
-
[59]
Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022. [62]https://openai.com/api/pricing/
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[60]
Inductionbench: LLMs fail in the simplest complexity class, 2025
WenyueHua,FeiSun,LiangmingPan,AdamJardine,andWilliamYangWang. Inductionbench: LLMs fail in the simplest complexity class, 2025
work page 2025
-
[61]
Scheherazade: Evaluating chain-of-thought math reasoning in llms with chain-of-problems
Stephen Miner, Yoshiki Takashima, Simeng Han, Sam Kouteili, Ferhat Erata, Ruzica Piskac, and Scott J Shapiro. Scheherazade: Evaluating chain-of-thought math reasoning in llms with chain-of-problems. InNeurIPS 2025 Workshop on Efficient Reasoning
work page 2025
-
[62]
Goal driven discovery of distributional differences via language descriptions, 2023
Ruiqi Zhong, Peter Zhang, Steve Li, Jinwoo Ahn, Dan Klein, and Jacob Steinhardt. Goal driven discovery of distributional differences via language descriptions, 2023
work page 2023
-
[63]
Mllm-as-a-judge: Assessing multimodal llm-as-a- judge with vision-language benchmark, 2024
Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a- judge with vision-language benchmark, 2024
work page 2024
-
[64]
Measuring nominal scale agreement among many raters.Psychological bulletin, 76(5):378, 1971
Joseph L Fleiss. Measuring nominal scale agreement among many raters.Psychological bulletin, 76(5):378, 1971
work page 1971
-
[65]
Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Courna- peau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 17(3):261–272, 2020
work page 2020
-
[66]
Towards agentic ai for science: Hypothesis generation, comprehension, quantification, and validation
Danai Koutra, Lifu Huang, Adithya Kulkarni, Temiloluwa Prioleau, Beatrice Wan Yuan Soh, Qingyun Wu, Yujun Yan, Yaoqing Yang, Dawei Zhou, and James Zou. Towards agentic ai for science: Hypothesis generation, comprehension, quantification, and validation
-
[67]
Yahui Long, Kok Siong Ang, Raman Sethi, Sha Liao, Yang Heng, Lynn van Olst, Shuchen Ye, Chengwei Zhong, Hang Xu, Di Zhang, et al. Deciphering spatial domains from spatial multi- omics with spatialglue.Nature Methods, pages 1–10, 2024
work page 2024
-
[68]
Alice Dini, Harlan Barker, Emilia Piki, Subodh Sharma, Juuli Raivola, Astrid Murumägi, and Daniela Ungureanu. A multiplex single-cell rna-seq pharmacotranscriptomics pipeline for drug discovery.Nature Chemical Biology, pages 1–11, 2024
work page 2024
-
[69]
Quickumls: a fast, unsupervised approach for medical concept extraction, 2016
Luca Soldaini and Nazli Goharian. Quickumls: a fast, unsupervised approach for medical concept extraction, 2016. 27 Advancing AI Research Assistants with Expert-Involved Learning
work page 2016
-
[70]
spaCy: Industrial- strength Natural Language Processing in Python
Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial- strength Natural Language Processing in Python. 2020
work page 2020
-
[71]
SETS: Leveraging self-verification and self-correction for improved test-time scaling
Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, Jinsung Yoon, and Sercan O Arik. SETS: Leveraging self-verification and self-correction for improved test-time scaling. Transactions on Machine Learning Research, 2025
work page 2025
-
[72]
Yuan Sui, Yufei He, Tri Cao, Simeng Han, and Bryan Hooi. Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models, 2025
work page 2025
-
[73]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025. A. Prompts The default prompt we used to ask LLMs for text summarization is: Please summarize the following text... The meta...
work page 2025
-
[74]
Break down each component of the proposed solution
-
[75]
Think step by step to verify if the proposed solution is correct given the question and the figure
-
[76]
The proposed solution is correct
Write a line of the form “The proposed solution is correct" or “The proposed solution is incorrect" at the end of your response based on your analysis. QUESTION: {question}. PROPOSED SOLUTION: {solution} Correction PromptYou are also given a question and a solution for the question. Your job is to outline your step-by-step thought process for deriving a n...
work page 2022
-
[77]
in people living with human immunodeficiency virus (HIV) (PLWH) with those in people living without HIV (PLWoH).METHODS: This nationwide descriptive epidemiological study was conducted in South Korea between January 2020 and February 2022. The National Health Insurance claim data, comprising the data of the entire Korean population, were collected through...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.