RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

A. Ali Heydari; Ahmed A. Metwally; Ben Graef; Chloe Zhang; Daniel McDuff; Erik Schenck; Hamid Palangi; Lindsey Sunden; Mark Malhotra; Menglian Zhou

arxiv: 2606.18203 · v1 · pith:Q5TOTEGSnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

Pith reviewed 2026-06-27 01:13 UTC · model grok-4.3

keywords personal health agents · evaluation framework · Boolean rubrics · LLM evaluation · health AI · open-ended assessment · meta-evaluation · model optimization

The pith

RubricsTree supplies a growing hierarchy of over 100 Boolean rubrics that align LLM evaluation of personal health agents with physicians at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RubricsTree to overcome the bottleneck in evaluating open-ended responses from personal health AI agents that incorporate user sensor data. It constructs a hierarchical taxonomy of more than 100 atomic, clinically verifiable Boolean rubrics drawn from 4,000 real user queries through repeated human-in-the-loop curation led by a physician panel. A context-aware router selects and weights only the relevant rubric subset for each query, delivering scalable throughput without sacrificing alignment. Meta-evaluation shows the method outperforms a strong baseline in matching expert judgments on difficult queries, consistently downgrades contextually poor answers, and produces up to 66 percent relative gains on HealthBench when the rubrics are reused as instructions, feedback, or training rewards for Gemini, GPT, and Qwen models.

Core claim

RubricsTree is a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics that evolve from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query. Systematic meta-evaluation demonstrates that RubricsTree substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries, reliably penalizes contextually degraded responses, and yields up to ~66% relative gains on HealthBench when used as structured instructions, text feedback, or training rewards for Gemini, GPT, and Qwen model families.

What carries the argument

RubricsTree, the hierarchical taxonomy of atomic Boolean rubrics together with its context-aware adaptive router that selects and weights relevant subsets for each query.
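A minimal sketch of how a routed, weighted set of Boolean rubric verdicts could be collapsed into one score. The aggregation rule (weighted pass fraction), the `RubricResult` type, and the rubric IDs are assumptions for illustration; this page does not state the paper's actual formula.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RubricResult:
    rubric_id: str  # hypothetical ID, not taken from the paper
    weight: float   # weight assigned by the router for this query
    passed: bool    # Boolean verdict for this rubric


def weighted_score(results: Sequence[RubricResult]) -> Optional[float]:
    """Weighted fraction of activated rubrics that passed (assumed rule)."""
    total = sum(r.weight for r in results)
    if total == 0:
        return None  # no rubric was activated, so there is nothing to score
    return sum(r.weight for r in results if r.passed) / total


example = [
    RubricResult("memory.trend.01", 2.0, True),
    RubricResult("safety.escalation.01", 1.0, False),
    RubricResult("skills.explanation.01", 1.0, True),
]
score = weighted_score(example)
```

Because every verdict is Boolean, the failed rubric IDs remain available for audit alongside the aggregate number.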

If this is right

Evaluation throughput increases enough to handle product-scale volumes of open-ended health queries while preserving physician-level alignment.
Models from multiple families improve measurably on HealthBench when the rubrics are applied as instructions, feedback, or reinforcement signals.
Contextually degraded or memory-ignoring responses receive consistent penalties that standard evaluators miss.
The rubric set can evolve continuously as new user queries arrive without restarting the evaluation infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Boolean-rubric structure could be applied to other expert-heavy domains such as legal or financial advice agents.
Over repeated cycles the curation process might require progressively less physician time as the tree stabilizes.
The auditable rubric outputs could serve as evidence in regulatory reviews of deployed health AI systems.
Linking rubric activation directly to incoming sensor streams might produce more personalized evaluation criteria per user.

Load-bearing premise

The iterative human-in-the-loop curation protocol with an expertise panel produces a set of atomic, clinically-verifiable Boolean rubrics that remain expert-aligned and free of significant curation bias across evolving user queries.

What would settle it

Independent physicians rate a fresh sample of 500 agent responses on the same open-ended queries and the resulting scores diverge from RubricsTree more than they diverge from the baseline evaluator.
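One way to make this test concrete is sketched below. The divergence metric (mean absolute difference on a shared 0 to 1 scale) and the toy scores are assumptions; the page does not say how divergence would be measured.

```python
from statistics import mean
from typing import Sequence


def mean_abs_divergence(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Mean absolute difference between two aligned score lists."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("score lists must be non-empty and aligned")
    return mean(abs(r - c) for r, c in zip(reference, candidate))


def claim_is_undermined(physician, rubricstree, baseline) -> bool:
    """True when physicians diverge more from RubricsTree than from the baseline."""
    return mean_abs_divergence(physician, rubricstree) > mean_abs_divergence(physician, baseline)


# Toy numbers for three responses (a real check would use the 500-response sample).
physician = [0.8, 0.4, 0.6]
rubricstree = [0.7, 0.5, 0.6]
baseline = [0.5, 0.9, 0.2]
undermined = claim_is_undermined(physician, rubricstree, baseline)
```

A rank-based agreement statistic could be swapped in for the mean absolute difference without changing the shape of the comparison.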

Original abstract

The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RubricsTree gives a workable hierarchical rubric system for scaling expert-aligned eval of health agents, with reported gains, but the validation details are too thin to judge robustness yet.

The main point on this paper is that RubricsTree builds a tree of over 100 Boolean rubrics from 4,000 real user queries, uses an adaptive router to apply only the relevant subset, and reports better expert alignment plus up to 66% relative gains on HealthBench when the rubrics feed into instructions or rewards for Gemini, GPT, and Qwen models.

What is actually new is the specific combination of query-driven rubric evolution, hierarchical Boolean structure, and context-aware routing aimed at health memory and medical skills. The human-in-the-loop curation with a physician panel is presented as the mechanism that keeps the rubrics clinically verifiable and evolving.

It does a reasonable job naming the real bottleneck—expensive physician labels versus inconsistent LLM judges—and showing that the router can penalize degraded responses. The meta-evaluation claim of exceeding a strong baseline in alignment is the part that could matter if the numbers check out.

The soft spots are in the evidence. The abstract gives no sample sizes, inter-rater stats, exact baseline construction, or statistical tests, which leaves the 66% gains and alignment improvements hard to assess. The central assumption that the curation protocol produces bias-free, expert-aligned rubrics is load-bearing, and without seeing those numbers or ablations on the router it is difficult to know how much the results depend on the particular panel or query set. If the full paper supplies those details it improves things; if not, the claims stay under-supported.

This is for people working on evaluation infrastructure for personal health agents or medical LLM benchmarks. A reader who needs concrete ways to structure open-ended checks would get practical ideas from the rubric design and routing.

It deserves a serious referee because the problem is important and the framework has enough concrete pieces to be reviewed and improved. Recommendation: send it to review and ask for the missing methodological numbers and controls in the first round.

Referee Report

2 major / 0 minor

Summary. The paper introduces RubricsTree, a scalable evaluation framework for LLM-based personal health agents. It consists of a hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics evolved from 4,000 real user queries via an iterative human-in-the-loop protocol with a physician-led expertise panel, combined with a context-aware adaptive router that selects relevant rubric subsets. Through meta-evaluation, the work claims that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on open-ended queries, (ii) reliably penalizes contextually degraded responses, and (iii) yields up to ~66% relative gains on HealthBench when used as structured instructions, text feedback, or training rewards for Gemini, GPT, and Qwen model families.

Significance. If the meta-evaluation holds, the framework offers a practical advance in addressing the evaluation bottleneck for health AI by providing an auditable, scalable alternative that maintains clinical verifiability through Boolean rubrics while supporting continuous evolution from real queries. Explicit strengths include the multi-use demonstration (instructions/feedback/rewards) and the grounding in actual user data rather than synthetic benchmarks.

major comments (2)

[Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.
[Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where additional detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.

Authors: We agree that the abstract, constrained by length, omits methodological specifics required to evaluate the central claims. In the revised manuscript we will expand the abstract to include concise statements on the meta-evaluation design, sample sizes for the expert-alignment studies, statistical tests performed, the identity and construction of the large-scale baseline, and the composition of HealthBench. These additions will make the support for claims (i)–(iii) verifiable from the abstract itself.
revision: yes

Referee: [Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.

Authors: The referee is correct that the manuscript presents the curation protocol qualitatively without accompanying quantitative validation statistics. We will add these measures in the revised version: inter-expert agreement rates computed across the expertise panel’s reviews of the 4,000 queries, results of any bias audits performed, and performance on a held-out validation subset. These statistics will be reported in the Curation Protocol and Meta-Evaluation sections to provide direct empirical support for the expert alignment of the rubrics.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims rest on an external iterative human-in-the-loop curation protocol involving physicians on 4,000 queries, followed by meta-evaluation against independent baselines and the external HealthBench benchmark. No derivation step reduces by construction to fitted parameters, self-referential definitions, or self-citation chains; the rubrics and router are presented as outputs of the protocol, with performance gains shown via comparison to non-derived external references rather than internal tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that automatically evaluable Boolean rubrics can capture clinical verifiability at scale and that the physician-led curation process produces unbiased alignment; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Boolean rubrics can be automatically scored while preserving clinical verifiability and expert alignment. Required for the scalability and meta-evaluation claims.

Reference graph

Works this paper leans on

66 extracted references · 5 linked inside Pith

[1]

Automatic evaluation of healthcare llms beyond question-answering

Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, and Dario Garcia-Gasulla. Automatic evaluation of healthcare llms beyond question-answering. In Proceedings of the 2025 Conference of the Nations of the A...

2025
[2]

Healthbench: Evaluating large language models towards improved human health

Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

Pith/arXiv arXiv 2025
[3]

Medicaid expansion increased appointment wait times in maine and virginia. Journal of General Internal Medicine, 37(10):2594–2596, 2022

Samantha G Auty and Kevin N Griffith. Medicaid expansion increased appointment wait times in maine and virginia. Journal of General Internal Medicine, 37(10):2594–2596, 2022

2022
[4]

When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation

Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of t...

2026
[5]

Medicare appointment availability and wait times vary considerably across four large us urban markets. Health Affairs Scholar, 4(3):qxag054, 2026

Tamara Beetham, Trisha Marsh, Michael L Barnett, Ruby M Aaron, Emmanuel Greenberg, Alexandra Do, and Jane M Zhu. Medicare appointment availability and wait times vary considerably across four large us urban markets. Health Affairs Scholar, 4(3):qxag054, 2026.

2026
[6]

Graphs meet ai agents: Taxonomy, progress, and future opportunities. arXiv preprint arXiv:2506.18019, 2025

Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, et al. Graphs meet ai agents: Taxonomy, progress, and future opportunities. arXiv preprint arXiv:2506.18019, 2025

arXiv 2025
[7]

Medbrowsecomp: Benchmarking medical deep research and computer use, 2025

Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, and Danielle S. Bitterman. Medbrowsecomp: Benchmarking medical deep research and computer use, 2025

2025
[8]

Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G

Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotr...

2024
[9]

Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine, 8(1):577, 2025

Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam H Shah. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine, 8(1):577, 2025

2025
[10]

LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA

Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedic...

2025
[11]

Tutorial on directed acyclic graphs. Journal of clinical epidemiology, 142:264–267, 2022

Jean C Digitale, Jeffrey N Martin, and Medellena Maria Glymour. Tutorial on directed acyclic graphs. Journal of clinical epidemiology, 142:264–267, 2022

2022
[12]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

2025
[13]

The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025

A Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, et al. The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025

arXiv 2025
[14]

Filling in the clinical gaps in benchmark: Case for healthbench for the japanese medical system, 2026

Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, and Eiji Aramaki. Filling in the clinical gaps in benchmark: Case for healthbench for the japanese medical system, 2026

2026
[15]

What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

2021
[16]

A personal health large language model for sleep and fitness coaching. Nature Medicine, 31(10):3394–3403, 2025

Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, et al. A personal health large language model for sleep and fitness coaching. Nature Medicine, 31(10):3394–3403, 2025

2025
[17]

The measurement of observer agreement for categorical data. biometrics, pages 159–174, 1977

J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. biometrics, pages 159–174, 1977

1977
[18]

Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms. arXiv preprint arXiv:2507.09477, 2, 2025

Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms. arXiv preprint arXiv:2507.09477, 2, 2025.

arXiv 2025
[19]

Zechen Li, Baiyu Chen, Hao Xue, and Flora D. Salim. Zara: Training-free motion time-series reasoning via evidence-grounded llm agents. arXiv preprint arXiv:2508.04038, 2026

Pith/arXiv arXiv 2026
[20]

SensorLLM: Aligning large language models with motion sensors for human activity recognition

Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, and Flora D. Salim. SensorLLM: Aligning large language models with motion sensors for human activity recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 354–379, 2025

2025
[21]

Glucofm: A dual-stream foundation model for continuous glucose monitoring. arXiv preprint arXiv:2605.30865, 2026

Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, and Ahmed A. Metwally. Glucofm: A dual-stream foundation model for continuous glucose monitoring. arXiv preprint arXiv:2605.30865, 2026

Pith/arXiv arXiv 2026
[22]

Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

Pith/arXiv arXiv 2025
[23]

A scalable framework for evaluating health language models. npj Digital Medicine, 2026

Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models. npj Digital Medicine, 2026

2026
[24]

Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y

Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu. Transforming wearable data into personal health insights using large langu...

2025
[25]

Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022

2022
[26]

Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May D. Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024

2024
[27]

It is too many options: Pitfalls of multiple-choice questions in generative ai and medical education, 2025

Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, and Eric Karl Oermann. It is too many options: Pitfalls of multiple-choice questions in generative ai and medical education, 2025

2025
[28]

Sara Mahdavi, Joelle Barral, Dale Webster, Greg S

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahd...

2023
[29]

Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the us. General Hospital Psychiatry, 84:12–17, 2023

Ching-Fang Sun, Christoph U Correll, Robert L Trestman, Yezhe Lin, Hui Xie, Maria Stack Hankey, Raymond Paglinawan Uymatiao, Riya T Patel, Vemmy L Metsutnan, Erin Corinne McDaid, et al. Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the us. General Hospital Psychiatry, 84:12–17, 2023

2023
[30]

Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks, 2024

Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks, 2024.

2024
[31]

Towards conversational diagnostic ai, 2024

Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, an...

2024
[32]

A principle-based framework for the development and evaluation of large language models for health and wellness. arXiv preprint arXiv:2512.08936, 2025

Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, et al. A principle-based framework for the development and evaluation of large language models for health and wellness. arXiv preprint arXiv:2512.08936, 2025

arXiv 2025
[33]

An automated framework for assessing how well llms cite relevant medical references. Nature Communications, 16(1):3615, 2025

Kevin Wu, Eric Wu, Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel Ho, and James Zou. An automated framework for assessing how well llms cite relevant medical references. Nature Communications, 16(1):3615, 2025

2025
[34]

Tree of thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

2023
[35]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

2023
[36]

From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959, 2025

Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959, 2025

arXiv 2025
[37]

Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization. arXiv preprint arXiv:2603.25973, 2026

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization. arXiv preprint arXiv:2603.25973, 2026

arXiv 2026
[38]

Llminit: A free lunch from large language models for selective initialization of recommendation

Weizhi Zhang, Liangwei Yang, Wooseong Yang, Henry Peng Zou, Yuqing Liu, Ke Xu, Sourav Medya, and Philip S Yu. Llminit: A free lunch from large language models for selective initialization of recommendation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2016–2024, 2025

2025
[39]

LLM-as-a-Judge

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, et al. Personaagent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025.

A. Related Work

The Rise of Open-Ended Personal Health Agents. With the emerging capabilities of...

Pith/arXiv arXiv 2025
[40]

Contextual Triage: The agent parses the user’s query against its available tool schema. It identifies knowledge gaps and determines the specific physiological data or baseline demographics required to safely address the query.
[41]

Execution (Action): Generation is temporarily halted to emit a structured function call. For example, the agent may invoke the wearable database to fetch specific metrics over a defined timeline.
[42]

Observation: The external tool executes the requested routine against the data backend, returning a serialized string of the requested telemetry (e.g., longitudinal laboratory results or 7-day rolling sensor trends).
[43]

Synthesis & Response: The agent ingests the observation into its context window. It then evaluates if the aggregated data is sufficient to formulate a clinically sound response. If missing variables remain (e.g., retrieving blood glucose but requiring fasting insulin to calculate resistance), the agent loops back to Step 1. To balance the user information ...
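The four steps above (triage, execution, observation, synthesis) can be sketched as a bounded loop. Everything named here is hypothetical: the callables, the `max_loops` cap, the metric names, and the in-memory `FAKE_DB` stand in for the agent's real tools, which this page does not describe.

```python
from typing import Callable, List

# Hypothetical stand-ins for the tool schema and data backend.
REQUIRED = ["fasting_glucose", "fasting_insulin"]
FAKE_DB = {"fasting_glucose": "95 mg/dL", "fasting_insulin": "8 uIU/mL"}


def triage(query: str, context: List[str]) -> str:
    """Step 1: pick the next metric that is still missing from the context."""
    fetched = {obs.split("=")[0] for obs in context}
    return next(m for m in REQUIRED if m not in fetched)


def execute(call: str) -> str:
    """Steps 2-3: run the structured call and return a serialized observation."""
    return f"{call}={FAKE_DB[call]}"


def is_sufficient(query: str, context: List[str]) -> bool:
    """Step 4 check: has every required metric been retrieved?"""
    return len(context) == len(REQUIRED)


def run_agent(query: str, respond: Callable[[str, List[str]], str], max_loops: int = 4) -> str:
    context: List[str] = []
    for _ in range(max_loops):  # assumed cap so the loop always terminates
        context.append(execute(triage(query, context)))
        if is_sufficient(query, context):
            break  # enough data gathered, stop looping back to step 1
    return respond(query, context)


answer = run_agent("Am I insulin resistant?", lambda q, ctx: "; ".join(ctx))
```

The second pass through the loop mirrors the glucose-then-insulin example in the text: the first retrieval is not sufficient, so the agent returns to triage.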
[44]

Embedding Similarity. Each leaf $l_i$ is represented by a dense embedding of its textual description. The relevance score $g(q, c, l_i)$ is computed as the cosine similarity between the query embedding and the leaf embedding, with a threshold $\tau$ tuned globally on a held-out development set.
[45]

Binary Per-Leaf Judge. For every leaf $l_i$, an LLM is prompted with the tuple $(q, c, l_i)$ and asked to emit a binary “relevant / not relevant” decision. This is structurally the most direct way to instantiate $g \in \{0, 1\}$ but requires $|L|$ independent LLM calls per query.
[46]

How do I improve hypertension?

Hierarchical Tree Traversal (Ours). An LLM router traverses the curated taxonomic DAG from the root, expanding only the children of nodes whose parent has been judged contextually relevant; this is conceptually related to Tree-of-Thought prompting [34], but operates over a fixed, expert-given tree rather than a router-generated one. The per-query activation threshold 𝜏...
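A minimal sketch of top-down traversal with pruning, counting relevance-judge calls so the cost can be compared with judging every leaf. The `Node` type, the toy taxonomy, and the set-membership judge are illustrative assumptions; in the described system the judge would be an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def traverse(root: Node, is_relevant: Callable[[Node], bool]) -> Tuple[List[str], int]:
    """Return (activated leaf names, number of relevance-judge calls)."""
    activated, calls, seen = [], 0, set()
    stack = list(root.children)  # the root is always expanded
    while stack:
        node = stack.pop()
        if node.name in seen:  # in a DAG a node may be reachable twice
            continue
        seen.add(node.name)
        calls += 1
        if not is_relevant(node):
            continue  # prune: descendants are never judged via this path
        if node.children:
            stack.extend(node.children)
        else:
            activated.append(node.name)
    return sorted(activated), calls


def leaves(*names: str) -> List[Node]:
    return [Node(n) for n in names]


# Hypothetical taxonomy with 10 leaves under 4 branches.
root = Node("root", [
    Node("cardio", leaves("bp_trend", "bp_medication")),
    Node("sleep", leaves("sleep_duration", "sleep_stages", "sleep_timing")),
    Node("nutrition", leaves("sodium_intake", "hydration")),
    Node("mental", leaves("stress", "mood", "mindfulness")),
])
relevant = {"cardio", "bp_trend", "bp_medication"}
active, calls = traverse(root, lambda n: n.name in relevant)
```

On this toy tree the traversal makes 6 judge calls where a per-leaf judge would make 10. The saving depends on how many branches are pruned early, so it is not guaranteed for every tree shape.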
[47]

Response Utility: The response would be significantly clinically improved by incorporating information from the rubric.
[48]

Trigger Condition Matching: The query fits the specific trigger condition of the rubric (e.g., specific biomarker inquiries).

I.2.2. Phase 2: Criteria-Specific Base Evaluation (The Clinical Auditor)

Once the relevant rubrics are selected, the framework evaluates the baseline model response against each selected rubric independently. In this phase, the LLM ...
[49]

Seamlessly insert necessary additions or follow-up questions to address missing context flagged by the auditor
[50]

Only delete or modify original statements if the auditor explicitly flagged them as incorrect, unsafe, or definitively harmful. The output of this phase is the final, optimized response.

I.3. Per-Axis Optimization Results

To complement the family-level summary in Figure 6, we report the per-axis decomposition of Response Optimization on Gemini-2.5-Flash a...
[51]

Implicit Context: The user query does not need to directly mention 'me', 'my', or 'personal' to be relevant to personal health data/rubrics.
[52]

Relevance Criteria: A rubric is relevant to the user query if and only if the following are all true (for the user health memory, only focus on around 5-8 relevant KEY metrics (could be more only if the user query asks for a broader range of information)): a) The rubric is strongly related to the user query keywords. AND b) The response would be significa...
[53]

Analyze the user’s query against every rubric provided in the catalog
[54]

Identify ALL rubrics that meet the relevance criteria
[55]

Select only the more relevant rubrics to evaluate the user query
[56]

Output in the correct ID list format.

### Output Format

Provide your final evaluation STRICTLY as a list array of the selected Rubric IDs in brackets. Do not include any markdown formatting, explanations, or conversational text. Otherwise, the evaluation is invalid.

Example Output: ["xxx.xxx.01", "yyy.zzz.02", "xxx.ddd.01"]

Prompt for Base Evaluation (...
[57]

**Do NOT Assume the Response is Perfect:** Approach every response as if it contains a dangerous omission. Your job is to find it.
[58]

**The "Safety Disclaimer" Trap:** Do NOT give a high score just because the model included a disclaimer. A disclaimer does not excuse all the responsibility.
[59]

**Strict Usefulness Adherence:** If the response ignores even ONE aspect that you think is useful, it must be penalized for poor usefulness.
[60]

**Zero-Tolerance for Assumptions:** Do not assume any response is safe, useful, accurate, helpful, or personalized. You MUST find the risks in the response.
[61]

**Scientific Rigor:** Any claim not backed by current clinical consensus is a factual error.

---

### STEP-BY-STEP AUDIT PROCESS

For each question in the evaluation criteria, you must follow this "Adversarial" logic:
[62]

**Search for Omissions:** What did the response FAIL to mention that a human doctor would have flagged?
[63]

**Assess and Focus on the Provided Aspect:** Focus on the aspect provided in the evaluation criteria.

### Evaluation Protocol

You will be given the user query and the instructions provided to the LLM to generate the response. You will also be given the response generated by LLMs. Finally, you will be given the evaluation criteria to judge a specific rubri...
[64]

Preserve the content of the original response. Instead of rewriting it, please augment the response by seamlessly inserting the necessary additions or clarifications.
[65]

If the feedback indicates a failure (e.g., missing user context in a rubric criterion), augment your response by adding relevant follow-up questions or information
[66]

**Important** Delete or modify original statements only if the feedback specifically flags them (e.g., a definitive statement) as incorrect, unsafe, or necessary to avoid; otherwise, please keep the original content. Output the new ...

[1] Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, and Dario Garcia-Gasulla. Automatic evaluation of healthcare llms beyond question-answering. In Proceedings of the 2025 Conference of the Nations of the A...

[2] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025.

[3] Samantha G Auty and Kevin N Griffith. Medicaid expansion increased appointment wait times in maine and virginia. Journal of General Internal Medicine, 37(10):2594–2596, 2022.

[4] Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of t...

[5] Tamara Beetham, Trisha Marsh, Michael L Barnett, Ruby M Aaron, Emmanuel Greenberg, Alexandra Do, and Jane M Zhu. Medicare appointment availability and wait times vary considerably across four large us urban markets. Health Affairs Scholar, 4(3):qxag054, 2026.

[6] Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, et al. Graphs meet ai agents: Taxonomy, progress, and future opportunities. arXiv preprint arXiv:2506.18019, 2025.

[7] Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, and Danielle S. Bitterman. Medbrowsecomp: Benchmarking medical deep research and computer use, 2025.

[8] Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotr...

[9] Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam H Shah. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine, 8(1):577, 2025.

[10] Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedic...

[11] Jean C Digitale, Jeffrey N Martin, and Medellena Maria Glymour. Tutorial on directed acyclic graphs. Journal of clinical epidemiology, 142:264–267, 2022.

[12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.

[13] A Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, et al. The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025.

[14] Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, and Eiji Aramaki. Filling in the clinical gaps in benchmark: Case for healthbench for the japanese medical system, 2026.

[15] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.

[16] Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, et al. A personal health large language model for sleep and fitness coaching. Nature Medicine, 31(10):3394–3403, 2025.

[17] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977.

[18] Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms. arXiv preprint arXiv:2507.09477, 2, 2025.

[19] Zechen Li, Baiyu Chen, Hao Xue, and Flora D. Salim. Zara: Training-free motion time-series reasoning via evidence-grounded llm agents. arXiv preprint arXiv:2508.04038, 2026.

[20] Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, and Flora D. Salim. SensorLLM: Aligning large language models with motion sensors for human activity recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 354–379, 2025.

[21] Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, and Ahmed A. Metwally. Glucofm: A dual-stream foundation model for continuous glucose monitoring. arXiv preprint arXiv:2605.30865, 2026.

[22] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025.

[23] Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models. npj Digital Medicine, 2026.

[24] Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu. Transforming wearable data into personal health insights using large langu...

[25] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022.

[26] Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May D. Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024.

[27] Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, and Eric Karl Oermann. It is too many options: Pitfalls of multiple-choice questions in generative ai and medical education, 2025.

[28] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahd...

[29] Ching-Fang Sun, Christoph U Correll, Robert L Trestman, Yezhe Lin, Hui Xie, Maria Stack Hankey, Raymond Paglinawan Uymatiao, Riya T Patel, Vemmy L Metsutnan, Erin Corinne McDaid, et al. Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the us. General Hospital Psychiatry, 84:12–17, 2023.

[30] Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks, 2024.

[31] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, an...

[32] Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, et al. A principle-based framework for the development and evaluation of large language models for health and wellness. arXiv preprint arXiv:2512.08936, 2025.

[33] Kevin Wu, Eric Wu, Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel Ho, and James Zou. An automated framework for assessing how well llms cite relevant medical references. Nature Communications, 16(1):3615, 2025.

[34] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023.

[35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023.

[36] Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959, 2025.

[37] Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization. arXiv preprint arXiv:2603.25973, 2026.

[38] Weizhi Zhang, Liangwei Yang, Wooseong Yang, Henry Peng Zou, Yuqing Liu, Ke Xu, Sourav Medya, and Philip S Yu. Llminit: A free lunch from large language models for selective initialization of recommendation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2016–2024, 2025.

[39] Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, et al. Personaagent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025.

A. Related Work

The Rise of Open-Ended Personal Health Agents. With the emerging capabilities of...

[40]

Contextual Triage: The agent parses the user’s query against its available tool schema. It identifies knowledge gaps and determines the specific physiological data or baseline demographics required to safely address the query.
