arxiv: 2606.18203 · v1 · pith:Q5TOTEGSnew · submitted 2026-06-16 · 💻 cs.CL · cs.AI

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

Pith reviewed 2026-06-27 01:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords personal health agents · evaluation framework · Boolean rubrics · LLM evaluation · health AI · open-ended assessment · meta-evaluation · model optimization

The pith

RubricsTree supplies a growing hierarchy of over 100 Boolean rubrics that align LLM evaluation of personal health agents with physicians at scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RubricsTree to overcome the bottleneck in evaluating open-ended responses from personal health AI agents that incorporate user sensor data. It constructs a hierarchical taxonomy of more than 100 atomic, clinically verifiable Boolean rubrics drawn from 4,000 real user queries through repeated human-in-the-loop curation led by a physician panel. A context-aware router selects and weights only the relevant rubric subset for each query, delivering scalable throughput without sacrificing alignment. Meta-evaluation shows the method outperforms a strong baseline in matching expert judgments on difficult queries, consistently downgrades contextually poor answers, and produces up to 66 percent relative gains on HealthBench when the rubrics are reused as instructions, feedback, or training rewards for Gemini, GPT, and Qwen models.
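For readers unfamiliar with the term, a "relative gain" is conventionally the improvement over a baseline score divided by that baseline score. The symbols below are illustrative assumptions; the paper's exact definition is not reproduced on this page.

```latex
\text{relative gain} = \frac{s_{\text{optimized}} - s_{\text{baseline}}}{s_{\text{baseline}}}
```

Under this reading, a purely hypothetical baseline score of 0.30 rising to 0.50 would be a relative gain of about 67%, while the absolute gain would be 0.20.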

Core claim

RubricsTree is a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics that evolve from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query. Systematic meta-evaluation demonstrates that RubricsTree substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries, reliably penalizes contextually degraded responses, and yields up to ~66% relative gains on HealthBench when used as stru

What carries the argument

RubricsTree, the hierarchical taxonomy of atomic Boolean rubrics together with its context-aware adaptive router that selects and weights relevant subsets for each query.
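As a rough illustration of this machinery, the sketch below models a tiny rubric tree and scores a response as the weighted fraction of activated Boolean rubrics that pass. The rubric IDs, descriptions, weights, and the weighted-average aggregation are all assumptions made for illustration; the abstract does not say how verdicts are aggregated.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """One node of a hypothetical rubric taxonomy; leaves are Boolean checks."""
    node_id: str
    description: str
    children: list["RubricNode"] = field(default_factory=list)

    def leaves(self) -> list["RubricNode"]:
        """Collect all leaf rubrics under this node, left to right."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def weighted_score(verdicts: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted fraction of activated rubrics that passed (assumed aggregation)."""
    total = sum(weights[rubric_id] for rubric_id in verdicts)
    if total == 0:
        return 0.0
    passed = sum(weights[rubric_id] for rubric_id, ok in verdicts.items() if ok)
    return passed / total


# Hypothetical three-leaf taxonomy.
tree = RubricNode("root", "all rubrics", [
    RubricNode("memory", "health memory", [
        RubricNode("memory.sleep.01", "Cites the user's recent sleep data"),
    ]),
    RubricNode("skills", "medical skills", [
        RubricNode("skills.safety.01", "Flags red-flag symptoms"),
        RubricNode("skills.accuracy.01", "States guideline-consistent facts"),
    ]),
])

# Suppose the router activated two rubrics for a query and a judge returned
# one Boolean verdict per activated rubric.
verdicts = {"memory.sleep.01": False, "skills.safety.01": True}
weights = {"memory.sleep.01": 2.0, "skills.safety.01": 1.0}
score = weighted_score(verdicts, weights)
```

Because each leaf is a yes/no check, a low score can be traced back to the specific rubrics that failed, which is what makes this style of evaluation auditable.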

If this is right

  • Evaluation throughput increases enough to handle product-scale volumes of open-ended health queries while preserving physician-level alignment.
  • Models from multiple families improve measurably on HealthBench when the rubrics are applied as instructions, feedback, or reinforcement signals.
  • Contextually degraded or memory-ignoring responses receive consistent penalties that standard evaluators miss.
  • The rubric set can evolve continuously as new user queries arrive without restarting the evaluation infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Boolean-rubric structure could be applied to other expert-heavy domains such as legal or financial advice agents.
  • Over repeated cycles the curation process might require progressively less physician time as the tree stabilizes.
  • The auditable rubric outputs could serve as evidence in regulatory reviews of deployed health AI systems.
  • Linking rubric activation directly to incoming sensor streams might produce more personalized evaluation criteria per user.

Load-bearing premise

The iterative human-in-the-loop curation protocol with an expertise panel produces a set of atomic, clinically-verifiable Boolean rubrics that remain expert-aligned and free of significant curation bias across evolving user queries.

What would settle it

Independent physicians rate a fresh sample of 500 agent responses on the same open-ended queries and the resulting scores diverge from RubricsTree more than they diverge from the baseline evaluator.
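The comparison described above can be sketched as follows. This is a minimal illustration that assumes every evaluator produces one numeric score per response and that divergence is measured as mean absolute difference; both choices, the function names, and the toy numbers are assumptions, not details from the paper.

```python
from statistics import mean


def mean_abs_divergence(reference: list[float], candidate: list[float]) -> float:
    """Average absolute gap between two aligned score lists."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("score lists must be non-empty and the same length")
    return mean(abs(r - c) for r, c in zip(reference, candidate))


def claim_is_undermined(physician: list[float],
                        rubricstree: list[float],
                        baseline: list[float]) -> bool:
    """True if physicians diverge more from RubricsTree than from the baseline."""
    return (mean_abs_divergence(physician, rubricstree)
            > mean_abs_divergence(physician, baseline))


# Toy scores for three responses (hypothetical values).
physician = [0.8, 0.4, 0.6]
rubricstree = [0.7, 0.5, 0.6]
baseline = [0.5, 0.9, 0.2]
undermined = claim_is_undermined(physician, rubricstree, baseline)
```

A real test would use the fresh sample of 500 responses and report uncertainty on the difference, not just its sign.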

read the original abstract

The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RubricsTree, a scalable evaluation framework for LLM-based personal health agents. It consists of a hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics evolved from 4,000 real user queries via an iterative human-in-the-loop protocol with a physician-led expertise panel, combined with a context-aware adaptive router that selects relevant rubric subsets. Through meta-evaluation, the work claims that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on open-ended queries, (ii) reliably penalizes contextually degraded responses, and (iii) yields up to ~66% relative gains on HealthBench when used as structured instructions, text feedback, or training rewards for Gemini, GPT, and Qwen model families.

Significance. If the meta-evaluation holds, the framework offers a practical advance in addressing the evaluation bottleneck for health AI by providing an auditable, scalable alternative that maintains clinical verifiability through Boolean rubrics while supporting continuous evolution from real queries. Explicit strengths include the multi-use demonstration (instructions/feedback/rewards) and the grounding in actual user data rather than synthetic benchmarks.

major comments (2)
  1. [Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.
  2. [Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.
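One common statistic for the inter-expert agreement requested in comment 2 is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below computes it for two raters giving Boolean verdicts on the same rubric items; the choice of kappa, the function name, and the toy verdicts are assumptions, since the page does not say which statistic the authors would report.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters labelling the same items True/False."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("rater lists must be non-empty and the same length")
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal rate of True labels.
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy verdicts from two hypothetical physicians on eight rubric items.
physician_1 = [True, True, False, False, True, False, True, True]
physician_2 = [True, False, False, False, True, False, True, True]
kappa = cohens_kappa(physician_1, physician_2)  # 0.75 for this toy data
```

With more than two panel members, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.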

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for identifying areas where additional detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.

    Authors: We agree that the abstract, constrained by length, omits methodological specifics required to evaluate the central claims. In the revised manuscript we will expand the abstract to include concise statements on the meta-evaluation design, sample sizes for the expert-alignment studies, statistical tests performed, the identity and construction of the large-scale baseline, and the composition of HealthBench. These additions will make the support for claims (i)–(iii) verifiable from the abstract itself. revision: yes

  2. Referee: [Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.

    Authors: The referee is correct that the manuscript presents the curation protocol qualitatively without accompanying quantitative validation statistics. We will add these measures in the revised version: inter-expert agreement rates computed across the expertise panel’s reviews of the 4,000 queries, results of any bias audits performed, and performance on a held-out validation subset. These statistics will be reported in the Curation Protocol and Meta-Evaluation sections to provide direct empirical support for the expert alignment of the rubrics. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims rest on an external iterative human-in-the-loop curation protocol involving physicians on 4,000 queries, followed by meta-evaluation against independent baselines and the external HealthBench benchmark. No derivation step reduces by construction to fitted parameters, self-referential definitions, or self-citation chains; the rubrics and router are presented as outputs of the protocol, with performance gains shown via comparison to non-derived external references rather than internal tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that automatically evaluable Boolean rubrics can capture clinical verifiability at scale and that the physician-led curation process produces unbiased alignment; no explicit free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Boolean rubrics can be automatically scored while preserving clinical verifiability and expert alignment.
    Required for the scalability and meta-evaluation claims.

pith-pipeline@v0.9.1-grok · 5861 in / 1225 out tokens · 35877 ms · 2026-06-27T01:13:58.274368+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 5 linked inside Pith

  1. [1]

    Automatic evaluation of healthcare llms beyond question-answering

    Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, and Dario Garcia-Gasulla. Automatic evaluation of healthcare llms beyond question-answering. In Proceedings of the 2025 Conference of the Nations of the A...

  2. [2]

    Healthbench: Evaluating large language models towards improved human health

    Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  3. [3]

    Medicaid expansion increased appointment wait times in maine and virginia

    Samantha G Auty and Kevin N Griffith. Medicaid expansion increased appointment wait times in maine and virginia. Journal of General Internal Medicine, 37(10):2594–2596, 2022

  4. [4]

    When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation

    Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust LLMs in mental health? large-scale benchmarks for reliable LLM evaluation. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of t...

  5. [5]

    Medicare appointment availability and wait times vary considerably across four large us urban markets

    Tamara Beetham, Trisha Marsh, Michael L Barnett, Ruby M Aaron, Emmanuel Greenberg, Alexandra Do, and Jane M Zhu. Medicare appointment availability and wait times vary considerably across four large us urban markets. Health Affairs Scholar, 4(3):qxag054, 2026.

  6. [6]

    Graphs meet ai agents: Taxonomy, progress, and future opportunities

    Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, et al. Graphs meet ai agents: Taxonomy, progress, and future opportunities. arXiv preprint arXiv:2506.18019, 2025

  7. [7]

    Medbrowsecomp: Benchmarking medical deep research and computer use

    Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, and Danielle S. Bitterman. Medbrowsecomp: Benchmarking medical deep research and computer use, 2025

  8. [8]

    Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G

    Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotr...

  9. [9]

    Timer: Temporal instruction modeling and evaluation for longitudinal clinical records

    Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam H Shah. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine, 8(1):577, 2025

  10. [10]

    LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA

    Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedic...

  11. [11]

    Tutorial on directed acyclic graphs

    Jean C Digitale, Jeffrey N Martin, and Medellena Maria Glymour. Tutorial on directed acyclic graphs. Journal of clinical epidemiology, 142:264–267, 2022

  12. [12]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025

  13. [13]

    The anatomy of a personal health agent

    A Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, et al. The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025

  14. [14]

    Filling in the clinical gaps in benchmark: Case for healthbench for the japanese medical system

    Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, and Eiji Aramaki. Filling in the clinical gaps in benchmark: Case for healthbench for the japanese medical system, 2026

  15. [15]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021

  16. [16]

    A personal health large language model for sleep and fitness coaching

    Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, et al. A personal health large language model for sleep and fitness coaching. Nature Medicine, 31(10):3394–3403, 2025

  17. [17]

    The measurement of observer agreement for categorical data

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977

  18. [18]

    Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms

    Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic rag with deep reasoning: A survey of rag-reasoning systems in llms. arXiv preprint arXiv:2507.09477, 2, 2025.

  19. [19]

    Zechen Li, Baiyu Chen, Hao Xue, and Flora D. Salim. Zara: Training-free motion time-series reasoning via evidence-grounded llm agents. arXiv preprint arXiv:2508.04038, 2026

  20. [20]

    SensorLLM: Aligning large language models with motion sensors for human activity recognition

    Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, and Flora D. Salim. SensorLLM: Aligning large language models with motion sensors for human activity recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 354–379, 2025

  21. [21]

    Glucofm: A dual-stream foundation model for continuous glucose monitoring

    Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, and Ahmed A. Metwally. Glucofm: A dual-stream foundation model for continuous glucose monitoring. arXiv preprint arXiv:2605.30865, 2026

  22. [22]

    Large language model agent: A survey on methodology, applications and challenges

    Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025

  23. [23]

    A scalable framework for evaluating health language models

    Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models. npj Digital Medicine, 2026

  24. [24]

    Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y

    Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu. Transforming wearable data into personal health insights using large langu...

  25. [25]

    Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–260. PMLR, 2022

  26. [26]

    Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May D. Wang. Ehragent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024

  27. [27]

    It is too many options: Pitfalls of multiple-choice questions in generative ai and medical education

    Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, and Eric Karl Oermann. It is too many options: Pitfalls of multiple-choice questions in generative ai and medical education, 2025

  28. [28]

    Sara Mahdavi, Joelle Barral, Dale Webster, Greg S

    Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahd...

  29. [29]

    Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the us

    Ching-Fang Sun, Christoph U Correll, Robert L Trestman, Yezhe Lin, Hui Xie, Maria Stack Hankey, Raymond Paglinawan Uymatiao, Riya T Patel, Vemmy L Metsutnan, Erin Corinne McDaid, et al. Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the us. General Hospital Psychiatry, 84:12–17, 2023

  30. [30]

    Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks

    Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the llm-as-a-judge approach for evaluating llm outputs in expert knowledge tasks, 2024.

  31. [31]

    Towards conversational diagnostic ai

    Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, an...

  32. [32]

    A principle-based framework for the development and evaluation of large language models for health and wellness

    Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, et al. A principle-based framework for the development and evaluation of large language models for health and wellness. arXiv preprint arXiv:2512.08936, 2025

  33. [33]

    An automated framework for assessing how well llms cite relevant medical references

    Kevin Wu, Eric Wu, Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel Ho, and James Zou. An automated framework for assessing how well llms cite relevant medical references. Nature Communications, 16(1):3615, 2025

  34. [34]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

  35. [35]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023

  36. [36]

    From web search towards agentic deep research: Incentivizing search with reasoning agents

    Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959, 2025

  37. [37]

    Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization

    Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of llm agents for lifelong cross-domain personalization. arXiv preprint arXiv:2603.25973, 2026

  38. [38]

    Llminit: A free lunch from large language models for selective initialization of recommendation

    Weizhi Zhang, Liangwei Yang, Wooseong Yang, Henry Peng Zou, Yuqing Liu, Ke Xu, Sourav Medya, and Philip S Yu. Llminit: A free lunch from large language models for selective initialization of recommendation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2016–2024, 2025

  39. [39]

    Personaagent: When large language model agents meet personalization at test time

    Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, et al. Personaagent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025. A. Related Work. The Rise of Open-Ended Personal Health Agents. With the emerging capabilities of...

  40. [40]

    It identifies knowledge gaps and determines the specific physiological data or baseline demographics required to safely address the query

    Contextual Triage: The agent parses the user’s query against its available tool schema. It identifies knowledge gaps and determines the specific physiological data or baseline demographics required to safely address the query

  41. [41]

    For example, the agent may invoke the wearable database to fetch specific metrics over a defined timeline

    Execution (Action): Generation is temporarily halted to emit a structured function call. For example, the agent may invoke the wearable database to fetch specific metrics over a defined timeline

  42. [42]

    Observation: The external tool executes the requested routine against the data backend, returning a serialized string of the requested telemetry (e.g., longitudinal laboratory results or 7-day rolling sensor trends)

  43. [43]

    It then evaluates if the aggregated data is sufficient to formulate a clinically sound response

    Synthesis & Response: The agent ingests the observation into its context window. It then evaluates if the aggregated data is sufficient to formulate a clinically sound response. If missing variables remain (e.g., retrieving blood glucose but requiring fasting insulin to calculate resistance), the agent loops back to Step 1. To balance the user information ...

  44. [44]

    Embedding Similarity. Each leaf $l_i$ is represented by a dense embedding of its textual description. The relevance score $g(q, c, l_i)$ is computed as the cosine similarity between the query embedding and the leaf embedding, with a threshold $\tau$ tuned globally on a held-out development set

  45. [45]

    relevant / not relevant

    Binary Per-Leaf Judge. For every leaf $l_i$, an LLM is prompted with the tuple $(q, c, l_i)$ and asked to emit a binary “relevant / not relevant” decision. This is structurally the most direct way to instantiate $g \in \{0, 1\}$ but requires $|L|$ independent LLM calls per query

  46. [46]

    How do I improve hypertension?

    Hierarchical Tree Traversal (Ours). An LLM router traverses the curated taxonomic DAG from the root, expanding only the children of nodes whose parent has been judged contextually relevant; this is conceptually related to Tree-of-Thought prompting [34], but operates over a fixed, expert-given tree rather than a router-generated one. The per-query activation threshold $\tau$...

  47. [47]

    Response Utility: The response would be significantly clinically improved by incorporating information from the rubric

  48. [48]

    adversarial

    Trigger Condition Matching: The query fits the specific trigger condition of the rubric (e.g., specific biomarker inquiries). I.2.2. Phase 2: Criteria-Specific Base Evaluation (The Clinical Auditor) Once the relevant rubrics are selected, the framework evaluates the baseline model response against each selected rubric independently. In this phase, the LLM ...

  49. [49]

    Seamlessly insert necessary additions or follow-up questions to address missing context flagged by the auditor

  50. [50]

    The output of this phase is the final, optimized response

    Only delete or modify original statements if the auditor explicitly flagged them as incorrect, unsafe, or definitively harmful. The output of this phase is the final, optimized response. I.3. Per-Axis Optimization Results To complement the family-level summary in Figure 6, we report the per-axis decomposition of Response Optimization on Gemini-2.5-Flash a...

  51. [51]

    Implicit Context: The user query does not need to directly mention ’me’, ’my’, or ’personal’ to be relevant to personal health data/rubrics

  52. [52]

    Trigger Condition

    Relevance Criteria: A rubric is relevant to the user query if and only if the following are all true (for the user health memory, only focus on around 5-8 relevant KEY metrics (could be more only if the user query asks for a broader range of information): a) The rubrics is strongly related to the user query keywords. AND b) The response would be significa...

  53. [53]

    Analyze the user’s query against every rubric provided in the catalog

  54. [54]

    Identify ALL rubrics that meet the relevance criteria

  55. [55]

    Only select more relevant rubrics to evaluate the user query

  56. [56]

    xxx.xxx.01

    Output in the correct ID list format. ### Output Format Provide your final evaluation STRICTLY as a list array of the selected Rubric IDs in the bracket. Do not include any markdown formatting, explanations, or conversational text. Ohterwise, the evaluation is invalid. Example Output: ["xxx.xxx.01", "yyy.zzz.02", "xxx.ddd.01"] Prompt for Base Evaluation (...

  57. [57]

    Your job is to find it

    **Do NOT Assume the Response is Perfect:** Approach every response as if it contains a dangerous omission. Your job is to find it

  58. [58]

    Safety Disclaimer

    **The "Safety Disclaimer" Trap:** Do NOT give a high score just because the model included a disclaimer. A disclaimer does not excuse all the responsibility

  59. [59]

    **Strict Usefulness Adherence:** If the response ignores even ONE aspects that your think is useful, it must be penalized for poor usefulness

  60. [60]

    You MUST find the risks in the response

    **Zero-Tolerance for Assumptions:** Do not assume any reponse is safe, useful, accurate, helful, and personalized. You MUST find the risks in the response

  61. [61]

    Adversarial

    **Scientific Rigor:** Any claim not backed by current clinical consensus is a factual error. –- ### STEP-BY-STEP AUDIT PROCESS For each question in the evaluation criteria, you must follow this "Adversarial" logic:

  62. [62]

    **Search for Omissions:** What did the response FAIL to mention that a human doctor would have flagged?

  63. [63]

    [1]", otherwise

    **Assess and Focus on the Provided Aspect:** Focus on the aspect provided in the evaluation criteria. ### Evaluation Protocol You will be given the user query and the instructions provided to the LLM to generate the response. You will also be given the response generated by LLMs. Finally, you will be given the evaluation criteria to judge a specific rubri...

  64. [64]

    Instead of rewrite, please augment the response by seamlessly inserting the necessary additions or clarifications

    Preserve the content of original response. Instead of rewrite, please augment the response by seamlessly inserting the necessary additions or clarifications

  65. [65]

    If the feedback indicates a failure (e.g., missing user context in a rubric criterion), augment your response by adding relevant follow-up questions or information

  66. [66]

    Please response with only the number for the rating you choose

    **Important** Delete or modify original statements part if the feedback specifically flags (e.g., definitive statement) them as incorrect, unsafe, or necessary to avoid, other wise please keep the original contents. Output the new ...
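Two of the routing strategies described in entries 44 and 46 above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the tree layout, rubric IDs, toy vectors, and the `is_relevant` callback (a stand-in for the LLM relevance judgment) are all assumptions.

```python
import math
from typing import Callable

Tree = dict[str, list[str]]  # node id -> child ids; leaves map to []


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_route(query_vec: list[float],
                    leaf_vecs: dict[str, list[float]],
                    tau: float) -> list[str]:
    """Embedding-similarity routing: keep leaves whose similarity reaches tau."""
    return [leaf for leaf, vec in leaf_vecs.items()
            if cosine(query_vec, vec) >= tau]


def tree_route(tree: Tree, root: str,
               is_relevant: Callable[[str], bool]) -> list[str]:
    """Hierarchical routing: expand a node's children only if it is relevant."""
    selected: list[str] = []
    seen: set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:  # a DAG node may be reachable from several parents
            continue
        seen.add(node)
        if not is_relevant(node):
            continue  # prune the whole subtree without judging its leaves
        children = tree.get(node, [])
        if children:
            stack.extend(reversed(children))  # keep left-to-right order
        else:
            selected.append(node)
    return selected


# Hypothetical taxonomy and a stubbed relevance judgment.
tree = {
    "root": ["memory", "skills"],
    "memory": ["memory.bp.01", "memory.sleep.01"],
    "skills": ["skills.safety.01"],
    "memory.bp.01": [], "memory.sleep.01": [], "skills.safety.01": [],
}
relevant = {"root", "memory", "memory.bp.01", "skills", "skills.safety.01"}
routed = tree_route(tree, "root", lambda node: node in relevant)

leaf_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
similar = embedding_route([1.0, 0.0], leaf_vecs, tau=0.7)
```

The traversal only spends a relevance judgment on nodes whose ancestors were accepted, which is why it can need fewer calls than judging every leaf independently when whole branches are pruned early.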