RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
Pith reviewed 2026-06-27 01:13 UTC · model grok-4.3
The pith
RubricsTree supplies a growing hierarchy of over 100 Boolean rubrics that align LLM evaluation of personal health agents with physicians at scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RubricsTree is a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics that evolve from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query. Systematic meta-evaluation demonstrates that RubricsTree substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries, reliably penalizes contextually degraded responses, and yields up to ~66% relative gains on HealthBench when used as stru
What carries the argument
RubricsTree, the hierarchical taxonomy of atomic Boolean rubrics together with its context-aware adaptive router that selects and weights relevant subsets for each query.
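To make the structure concrete, here is a minimal Python sketch of a rubric hierarchy with Boolean leaves and a weighted score over an activated subset. The class name `RubricNode`, the rubric IDs, the weights, and the weighted-pass-fraction aggregation are all illustrative assumptions; the summary does not say how the paper stores rubrics or aggregates verdicts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RubricNode:
    """Hypothetical tree node; a node without children is an atomic Boolean rubric."""
    rubric_id: str
    description: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaves(node: RubricNode) -> List[RubricNode]:
    """Collect every atomic rubric below (or at) the given node."""
    if node.is_leaf():
        return [node]
    found: List[RubricNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def weighted_score(active: List[RubricNode], verdicts: Dict[str, bool]) -> float:
    """Weighted fraction of activated rubrics that passed (assumed aggregation)."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    passed = sum(r.weight for r in active if verdicts[r.rubric_id])
    return passed / total


# Toy taxonomy with made-up IDs and weights.
root = RubricNode("root", "all rubrics", children=[
    RubricNode("memory", "health memory", children=[
        RubricNode("memory.01", "cites the user's recent sensor trend", weight=2.0),
        RubricNode("memory.02", "asks for missing baseline data", weight=1.0),
    ]),
    RubricNode("skills", "medical skills", children=[
        RubricNode("skills.01", "recommends escalation when warranted", weight=1.0),
    ]),
])

active = leaves(root)  # pretend the router activated every leaf
verdicts = {"memory.01": True, "memory.02": False, "skills.01": True}
score = weighted_score(active, verdicts)  # (2.0 + 1.0) / 4.0
```

Because every leaf is a yes/no check, each verdict can be audited on its own, which is the property the review attributes to the Boolean design.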
If this is right
- Evaluation throughput increases enough to handle product-scale volumes of open-ended health queries while preserving physician-level alignment.
- Models from multiple families improve measurably on HealthBench when the rubrics are applied as instructions, feedback, or reinforcement signals.
- Contextually degraded or memory-ignoring responses receive consistent penalties that standard evaluators miss.
- The rubric set can evolve continuously as new user queries arrive without restarting the evaluation infrastructure.
Where Pith is reading between the lines
- The same Boolean-rubric structure could be applied to other expert-heavy domains such as legal or financial advice agents.
- Over repeated cycles the curation process might require progressively less physician time as the tree stabilizes.
- The auditable rubric outputs could serve as evidence in regulatory reviews of deployed health AI systems.
- Linking rubric activation directly to incoming sensor streams might produce more personalized evaluation criteria per user.
Load-bearing premise
The iterative human-in-the-loop curation protocol with an expertise panel produces a set of atomic, clinically-verifiable Boolean rubrics that remain expert-aligned and free of significant curation bias across evolving user queries.
What would settle it
Independent physicians rate a fresh sample of 500 agent responses on the same open-ended queries and the resulting scores diverge from RubricsTree more than they diverge from the baseline evaluator.
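The test above can be phrased as a comparison of two divergences. The sketch below assumes scores are on a common 0 to 1 scale and uses mean absolute difference as the divergence measure; both choices, the function names, and the toy numbers are assumptions made for illustration, since the review does not name a metric.

```python
from typing import Sequence


def mean_abs_divergence(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Average absolute gap between two aligned score lists."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("score lists must be non-empty and equally long")
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


def claim_is_undermined(physician: Sequence[float],
                        rubricstree: Sequence[float],
                        baseline: Sequence[float]) -> bool:
    """True when physicians sit farther from RubricsTree than from the baseline."""
    return (mean_abs_divergence(physician, rubricstree)
            > mean_abs_divergence(physician, baseline))


# Toy scores for four responses (invented, not from the paper).
physician = [0.75, 0.5, 0.25, 1.0]
rubricstree = [0.75, 0.5, 0.5, 1.0]
baseline = [0.5, 0.5, 0.75, 0.5]

tree_gap = mean_abs_divergence(physician, rubricstree)
base_gap = mean_abs_divergence(physician, baseline)
undermined = claim_is_undermined(physician, rubricstree, baseline)
```

On a real sample of 500 responses one would also want a confidence interval on the difference between the two gaps rather than a bare comparison.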
Original abstract
The LLM-empowered personal health agents with user health (sensor) metrics have offered a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics, evolving from the insights of 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expertise panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides a scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RubricsTree, a scalable evaluation framework for LLM-based personal health agents. It consists of a hierarchical taxonomy of over 100 atomic, clinically-verifiable Boolean rubrics evolved from 4,000 real user queries via an iterative human-in-the-loop protocol with a physician-led expertise panel, combined with a context-aware adaptive router that selects relevant rubric subsets. Through meta-evaluation, the work claims that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on open-ended queries, (ii) reliably penalizes contextually degraded responses, and (iii) yields up to ~66% relative gains on HealthBench when used as structured instructions, text feedback, or training rewards for Gemini, GPT, and Qwen model families.
Significance. If the meta-evaluation holds, the framework offers a practical advance in addressing the evaluation bottleneck for health AI by providing an auditable, scalable alternative that maintains clinical verifiability through Boolean rubrics while supporting continuous evolution from real queries. Explicit strengths include the multi-use demonstration (instructions/feedback/rewards) and the grounding in actual user data rather than synthetic benchmarks.
major comments (2)
- [Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.
- [Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.
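One common way to report the inter-expert agreement the referee asks for is Cohen's kappa between two raters on the same Boolean rubric verdicts. The sketch below is a plain-Python illustration under that assumption; the review does not say which agreement statistic the authors would use, and the labels are invented.

```python
from typing import Sequence


def cohen_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters assigning Boolean labels to the same items."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("label lists must be non-empty and equally long")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of True labels from rater A
    p_b = sum(rater_b) / n  # share of True labels from rater B
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        # Kappa is undefined when both raters give one identical constant label;
        # this sketch returns 1.0 because observed agreement is also 1.0 here.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verdicts from two hypothetical panel members on four rubric checks.
expert_1 = [True, True, False, False]
expert_2 = [True, False, False, False]
kappa = cohen_kappa(expert_1, expert_2)  # observed 0.75, chance 0.5
```

Panels with more than two raters would need a multi-rater statistic such as Fleiss' kappa instead.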
Simulated Author's Rebuttal
We thank the referee for their constructive review and for identifying areas where additional detail would strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
- Referee: [Abstract] Abstract: The abstract reports positive meta-evaluation outcomes including 66% gains and superior expert alignment but supplies no information on evaluation methodology, sample sizes, statistical tests, baseline details, or HealthBench construction. This prevents verification that the data support the central claims (i)-(iii) and is load-bearing for the paper's primary contribution.
  Authors: We agree that the abstract, constrained by length, omits methodological specifics required to evaluate the central claims. In the revised manuscript we will expand the abstract to include concise statements on the meta-evaluation design, sample sizes for the expert-alignment studies, statistical tests performed, the identity and construction of the large-scale baseline, and the composition of HealthBench. These additions will make the support for claims (i)–(iii) verifiable from the abstract itself.
  Revision: yes
- Referee: [Curation Protocol and Meta-Evaluation] Curation and meta-evaluation sections: The iterative human-in-the-loop protocol with the expertise panel is asserted to produce expert-aligned, bias-free Boolean rubrics, yet no quantitative measures (e.g., inter-expert agreement rates, bias audits, or held-out validation results) are provided to substantiate this across the 4,000 queries. This assumption underpins all three meta-evaluation claims.
  Authors: The referee is correct that the manuscript presents the curation protocol qualitatively without accompanying quantitative validation statistics. We will add these measures in the revised version: inter-expert agreement rates computed across the expertise panel’s reviews of the 4,000 queries, results of any bias audits performed, and performance on a held-out validation subset. These statistics will be reported in the Curation Protocol and Meta-Evaluation sections to provide direct empirical support for the expert alignment of the rubrics.
  Revision: yes
Circularity Check
No significant circularity detected
The paper's central claims rest on an external iterative human-in-the-loop curation protocol involving physicians on 4,000 queries, followed by meta-evaluation against independent baselines and the external HealthBench benchmark. No derivation step reduces by construction to fitted parameters, self-referential definitions, or self-citation chains; the rubrics and router are presented as outputs of the protocol, with performance gains shown via comparison to non-derived external references rather than internal tautologies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Boolean rubrics can be automatically scored while preserving clinical verifiability and expert alignment.
Reference graph
Works this paper leans on
- [1] Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lucia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, and Dario Garcia-Gasulla. Automatic evaluation of healthcare LLMs beyond question-answering. In Proceedings of the 2025 Conference of the Nations of the A...
- [2] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. HealthBench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025
- [3] Samantha G Auty and Kevin N Griffith. Medicaid expansion increased appointment wait times in Maine and Virginia. Journal of General Internal Medicine, 37(10):2594–2596, 2022
- [4] Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Prathiba Dhanesh, Jimmy Huang, Frank Rudzicz, and Elham Dolatabadi. When can we trust LLMs in mental health? Large-scale benchmarks for reliable LLM evaluation. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Proceedings of the 19th Conference of t... (2026)
- [5] Tamara Beetham, Trisha Marsh, Michael L Barnett, Ruby M Aaron, Emmanuel Greenberg, Alexandra Do, and Jane M Zhu. Medicare appointment availability and wait times vary considerably across four large US urban markets. Health Affairs Scholar, 4(3):qxag054, 2026
- [6] Yuanchen Bei, Weizhi Zhang, Siwen Wang, Weizhi Chen, Sheng Zhou, Hao Chen, Yong Li, Jiajun Bu, Shirui Pan, Yizhou Yu, et al. Graphs meet AI agents: Taxonomy, progress, and future opportunities. arXiv preprint arXiv:2506.18019, 2025
- [7] Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, and Danielle S. Bitterman. MedBrowseComp: Benchmarking medical deep research and computer use, 2025
- [8] Justin Cosentino, Anastasiya Belyaeva, Xin Liu, Nicholas A. Furlotte, Zhun Yang, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, Robby Bryant, Ryan G. Gomes, Allen Jiang, Roy Lee, Yun Liu, Javier Perez, Jameson K. Rogers, Cathy Speed, Shyam Tailor, Megan Walker, Jeffrey Yu, Tim Althoff, Conor Heneghan, John Hernandez, Mark Malhotr... (2024)
- [9] Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, and Nigam H Shah. TIMER: Temporal instruction modeling and evaluation for longitudinal clinical records. npj Digital Medicine, 8(1):577, 2025
- [10] Yella Diekmann, Chase Fensore, Rodrigo Carrillo-Larco, Eduard Castejon Rosales, Sakshi Shiromani, Rima Pai, Megha Shah, and Joyce Ho. LLMs as medical safety judges: Evaluating alignment with human annotation in patient-facing QA. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedic... (2025)
- [11] Jean C Digitale, Jeffrey N Martin, and Medellena Maria Glymour. Tutorial on directed acyclic graphs. Journal of Clinical Epidemiology, 142:264–267, 2022
- [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature, 645(8081):633–638, 2025
- [13] A Ali Heydari, Ken Gu, Vidya Srinivas, Hong Yu, Zhihan Zhang, Yuwei Zhang, Akshay Paruchuri, Qian He, Hamid Palangi, Nova Hammerquist, et al. The anatomy of a personal health agent. arXiv preprint arXiv:2508.20148, 2025
- [14] Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, and Eiji Aramaki. Filling in the clinical gaps in benchmark: Case for HealthBench for the Japanese medical system, 2026
- [15] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021
- [16] Justin Khasentino, Anastasiya Belyaeva, Xin Liu, Zhun Yang, Nicholas A Furlotte, Chace Lee, Erik Schenck, Yojan Patel, Jian Cui, Logan Douglas Schneider, et al. A personal health large language model for sleep and fitness coaching. Nature Medicine, 31(10):3394–3403, 2025
- [17] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977
- [18] Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, et al. Towards agentic RAG with deep reasoning: A survey of RAG-reasoning systems in LLMs. arXiv preprint arXiv:2507.09477, 2, 2025
- [19] Zechen Li, Baiyu Chen, Hao Xue, and Flora D. Salim. Zara: Training-free motion time-series reasoning via evidence-grounded LLM agents. arXiv preprint arXiv:2508.04038, 2026
- [20] Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, and Flora D. Salim. SensorLLM: Aligning large language models with motion sensors for human activity recognition. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 354–379, 2025
- [21] Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, and Ahmed A. Metwally. Glucofm: A dual-stream foundation model for continuous glucose monitoring. arXiv preprint arXiv:2605.30865, 2026
- [22] Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, et al. Large language model agent: A survey on methodology, applications and challenges. arXiv preprint arXiv:2503.21460, 2025
- [23] Neil Mallinar, A Ali Heydari, Xin Liu, Anthony Z Faranesh, Brent Winslow, Nova Hammerquist, Benjamin Graef, Cathy Speed, Mark Malhotra, Shwetak Patel, et al. A scalable framework for evaluating health language models. npj Digital Medicine, 2026
- [24] Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, and Xin Liu. Transforming wearable data into personal health insights using large langu... (2025)
- [25] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on Health, Inference, and Learning, pages 248–260. PMLR, 2022
- [26] Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, and May D. Wang. EHRAgent: Code empowers large language models for few-shot complex tabular reasoning on electronic health records, 2024
- [27] Shrutika Singh, Anton Alyakin, Daniel Alexander Alber, Jaden Stryker, Ai Phuong S Tong, Karl Sangwon, Nicolas Goff, Mathew de la Paz, Miguel Hernandez-Rovira, Ki Yun Park, Eric Claude Leuthardt, and Eric Karl Oermann. It is too many options: Pitfalls of multiple-choice questions in generative AI and medical education, 2025
- [28] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, Mike Schaekermann, Amy Wang, Mohamed Amin, Sami Lachgar, Philip Mansfield, Sushant Prakash, Bradley Green, Ewa Dominowska, Blaise Aguera y Arcas, Nenad Tomasev, Yun Liu, Renee Wong, Christopher Semturs, S. Sara Mahd... (2023)
- [29] Ching-Fang Sun, Christoph U Correll, Robert L Trestman, Yezhe Lin, Hui Xie, Maria Stack Hankey, Raymond Paglinawan Uymatiao, Riya T Patel, Vemmy L Metsutnan, Erin Corinne McDaid, et al. Low availability, long wait times, and high geographic disparity of psychiatric outpatient care in the US. General Hospital Psychiatry, 84:12–17, 2023
- [30] Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A. Metoyer. Limitations of the LLM-as-a-judge approach for evaluating LLM outputs in expert knowledge tasks, 2024
- [31] Tao Tu, Anil Palepu, Mike Schaekermann, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Nenad Tomasev, Shekoofeh Azizi, Karan Singhal, Yong Cheng, Le Hou, Albert Webson, Kavita Kulkarni, S Sara Mahdavi, Christopher Semturs, Juraj Gottweis, Joelle Barral, Katherine Chou, Greg S Corrado, Yossi Matias, Alan Karthikesalingam, an... Towards conversational diagnostic AI, 2024
- [32] Brent Winslow, Jacqueline Shreibati, Javier Perez, Hao-Wei Su, Nichole Young-Lin, Nova Hammerquist, Daniel McDuff, Jason Guss, Jenny Vafeiadou, Nick Cain, et al. A principle-based framework for the development and evaluation of large language models for health and wellness. arXiv preprint arXiv:2512.08936, 2025
- [33] Kevin Wu, Eric Wu, Kevin Wei, Angela Zhang, Allison Casasola, Teresa Nguyen, Sith Riantawan, Patricia Shi, Daniel Ho, and James Zou. An automated framework for assessing how well LLMs cite relevant medical references. Nature Communications, 16(1):3615, 2025
- [34] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36:11809–11822, 2023
- [35] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023
- [36] Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, et al. From web search towards agentic deep research: Incentivizing search with reasoning agents. arXiv preprint arXiv:2506.18959, 2025
- [37] Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, and Philip S Yu. Memorycd: Benchmarking long-context user memory of LLM agents for lifelong cross-domain personalization. arXiv preprint arXiv:2603.25973, 2026
- [38] Weizhi Zhang, Liangwei Yang, Wooseong Yang, Henry Peng Zou, Yuqing Liu, Ke Xu, Sourav Medya, and Philip S Yu. LLMInit: A free lunch from large language models for selective initialization of recommendation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 2016–2024, 2025
- [39] Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, et al. PersonaAgent: When large language model agents meet personalization at test time. arXiv preprint arXiv:2506.06254, 2025
A. Related Work
The Rise of Open-Ended Personal Health Agents. With the emerging capabilities of...
- [40] Contextual Triage: The agent parses the user’s query against its available tool schema. It identifies knowledge gaps and determines the specific physiological data or baseline demographics required to safely address the query
- [41] Execution (Action): Generation is temporarily halted to emit a structured function call. For example, the agent may invoke the wearable database to fetch specific metrics over a defined timeline
- [42] Observation: The external tool executes the requested routine against the data backend, returning a serialized string of the requested telemetry (e.g., longitudinal laboratory results or 7-day rolling sensor trends)
- [43] Synthesis & Response: The agent ingests the observation into its context window. It then evaluates if the aggregated data is sufficient to formulate a clinically sound response. If missing variables remain (e.g., retrieving blood glucose but requiring fasting insulin to calculate resistance), the agent loops back to Step 1. To balance the user information ...
- [44] Embedding Similarity. Each leaf l_i is represented by a dense embedding of its textual description. The relevance score g(q, c, l_i) is computed as the cosine similarity between the query embedding and the leaf embedding, with a threshold τ tuned globally on a held-out development set
- [45] Binary Per-Leaf Judge. For every leaf l_i, an LLM is prompted with the tuple (q, c, l_i) and asked to emit a binary “relevant / not relevant” decision. This is structurally the most direct way to instantiate g ∈ {0, 1} but requires |L| independent LLM calls per query
- [46] Hierarchical Tree Traversal (Ours). An LLM router traverses the curated taxonomic DAG from the root, expanding only the children of nodes whose parent has been judged contextually relevant; this is conceptually related to Tree-of-Thought prompting [34], but operates over a fixed, expert-given tree rather than a router-generated one. The per-query activation threshold τ...
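The traversal idea in the last routing option can be sketched in a few lines of Python. This is an illustration only: the taxonomy, the node IDs, and `toy_judge` (a set lookup standing in for the LLM relevance call) are hypothetical, and the paper's actual router, prompts, and thresholding are not reproduced here.

```python
from typing import Callable, Dict, List

# Hypothetical taxonomy: node id -> child ids; leaves map to an empty list.
TAXONOMY: Dict[str, List[str]] = {
    "root": ["sleep", "cardio"],
    "sleep": ["sleep.duration", "sleep.apnea"],
    "cardio": ["cardio.blood_pressure", "cardio.heart_rate"],
    "sleep.duration": [],
    "sleep.apnea": [],
    "cardio.blood_pressure": [],
    "cardio.heart_rate": [],
}


def route(children: Dict[str, List[str]], root: str,
          is_relevant: Callable[[str], bool]) -> List[str]:
    """Return activated leaves, expanding only nodes judged relevant."""
    activated: List[str] = []
    visited = set()  # a DAG node may be reachable from several parents
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        if not is_relevant(node):
            continue  # prune: nothing below this node is ever judged
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            activated.append(node)
    return sorted(activated)


# Stand-in for the LLM judge on a blood-pressure question (assumed verdicts).
RELEVANT = {"root", "cardio", "cardio.blood_pressure"}
judged: List[str] = []


def toy_judge(node_id: str) -> bool:
    judged.append(node_id)
    return node_id in RELEVANT


activated = route(TAXONOMY, "root", toy_judge)
```

In this toy run the two sleep leaves are never sent to the judge because their parent was pruned; how much that saves against a per-leaf judge depends on the shape of the real tree.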
- [47] Response Utility: The response would be significantly clinically improved by incorporating information from the rubric
- [48] Trigger Condition Matching: The query fits the specific trigger condition of the rubric (e.g., specific biomarker inquiries).
I.2.2. Phase 2: Criteria-Specific Base Evaluation (The Clinical Auditor)
Once the relevant rubrics are selected, the framework evaluates the baseline model response against each selected rubric independently. In this phase, the LLM ...
- [49] Seamlessly insert necessary additions or follow-up questions to address missing context flagged by the auditor
- [50] Only delete or modify original statements if the auditor explicitly flagged them as incorrect, unsafe, or definitively harmful. The output of this phase is the final, optimized response.
I.3. Per-Axis Optimization Results
To complement the family-level summary in Figure 6, we report the per-axis decomposition of Response Optimization on Gemini-2.5-Flash a...
- [51] Implicit Context: The user query does not need to directly mention 'me', 'my', or 'personal' to be relevant to personal health data/rubrics
- [52] Relevance Criteria: A rubric is relevant to the user query if and only if the following are all true (for the user health memory, only focus on around 5-8 relevant KEY metrics (could be more only if the user query asks for a broader range of information)): a) The rubric is strongly related to the user query keywords. AND b) The response would be significa...
- [53] Analyze the user’s query against every rubric provided in the catalog
- [54] Identify ALL rubrics that meet the relevance criteria
- [55] Only select more relevant rubrics to evaluate the user query
- [56] Output in the correct ID list format. ### Output Format Provide your final evaluation STRICTLY as a list array of the selected Rubric IDs in the bracket. Do not include any markdown formatting, explanations, or conversational text. Otherwise, the evaluation is invalid. Example Output: ["xxx.xxx.01", "yyy.zzz.02", "xxx.ddd.01"] Prompt for Base Evaluation (...
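The strict output format requested from the router (a bare list of rubric IDs with no markdown or commentary) lends itself to a small validating parser. The sketch below assumes the list is valid JSON, which holds for the example output quoted in the prompt; the function name `parse_rubric_ids` and the catalog contents are hypothetical, and the source does not describe how the authors parse router output.

```python
import json
from typing import Iterable, List


def parse_rubric_ids(raw: str, catalog: Iterable[str]) -> List[str]:
    """Parse router output as a JSON list of known rubric IDs, or raise ValueError."""
    known = set(catalog)
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        # Markdown fences or conversational text end up here.
        raise ValueError("router output is not a bare ID list") from exc
    if not isinstance(parsed, list) or not all(isinstance(x, str) for x in parsed):
        raise ValueError("router output must be a list of strings")
    unknown = [x for x in parsed if x not in known]
    if unknown:
        raise ValueError(f"unknown rubric IDs: {unknown}")
    # Drop repeated IDs while keeping the router's order.
    return list(dict.fromkeys(parsed))


# Placeholder IDs copied from the prompt's example output.
CATALOG = ["xxx.xxx.01", "yyy.zzz.02", "xxx.ddd.01"]
selected = parse_rubric_ids('["xxx.xxx.01", "yyy.zzz.02", "xxx.xxx.01"]', CATALOG)
```

Rejecting malformed output outright, rather than trying to salvage IDs from free text, mirrors the prompt's statement that any other format makes the evaluation invalid.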
- [57] **Do NOT Assume the Response is Perfect:** Approach every response as if it contains a dangerous omission. Your job is to find it
- [58] **The "Safety Disclaimer" Trap:** Do NOT give a high score just because the model included a disclaimer. A disclaimer does not excuse all the responsibility
- [59] **Strict Usefulness Adherence:** If the response ignores even ONE aspect that you think is useful, it must be penalized for poor usefulness
- [60] **Zero-Tolerance for Assumptions:** Do not assume any response is safe, useful, accurate, helpful, and personalized. You MUST find the risks in the response
- [61] **Scientific Rigor:** Any claim not backed by current clinical consensus is a factual error. --- ### STEP-BY-STEP AUDIT PROCESS For each question in the evaluation criteria, you must follow this "Adversarial" logic:
- [62] **Search for Omissions:** What did the response FAIL to mention that a human doctor would have flagged?
- [63] **Assess and Focus on the Provided Aspect:** Focus on the aspect provided in the evaluation criteria. ### Evaluation Protocol You will be given the user query and the instructions provided to the LLM to generate the response. You will also be given the response generated by LLMs. Finally, you will be given the evaluation criteria to judge a specific rubri...
- [64] Preserve the content of the original response. Instead of rewriting, please augment the response by seamlessly inserting the necessary additions or clarifications
- [65] If the feedback indicates a failure (e.g., missing user context in a rubric criterion), augment your response by adding relevant follow-up questions or information
- [66] **Important** Delete or modify original statements only if the feedback specifically flags them (e.g., a definitive statement) as incorrect, unsafe, or necessary to avoid; otherwise please keep the original contents. Output the new ...