TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning

Junkai Li; Tianyi Zhu; Weizhi Ma; Yang Liu; Yunghwei Lai; Zheng Long Lee

arxiv: 2605.05963 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-08 10:46 UTC · model grok-4.3

keywords: therapeutic agents · treatment planning · LLM agents · iterative refinement · clinical AI · HealthBench · AI safety in medicine

The pith

An iterative generate-judge-refine pipeline turns coarse treatment plans into precise and safer regimens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large language models produce rough, incomplete, and unsafe treatment plans when they generate them in a single pass. TheraAgent counters this by replacing one-shot output with a repeating cycle that generates a draft, runs it through TheraJudge for clinical evaluation, and refines the plan until it meets accuracy and safety criteria. This mirrors how physicians revise their own work and leads to measurable gains on the HealthBench benchmark. Expert reviewers preferred the resulting plans over those written by physicians in 86 percent of cases, citing stronger targeting and harm control. High agreement between the internal TheraJudge scores and the external benchmark supports the claim that the loop is reliable.

Core claim

TheraAgent replaces one-shot generation with an iterative generate-judge-refine pipeline that progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens by integrating TheraJudge, a treatment-specific evaluation module, into the inference loop to enforce clinical standards. Experiments show state-of-the-art results on HealthBench with leading accuracy and completeness scores, an 86 percent win rate in expert evaluations against physicians with superior targeting and harm control, and strong agreement between TheraJudge and HealthBench that confirms the reliability of the framework.

What carries the argument

The iterative generate-judge-refine pipeline with TheraJudge, a treatment-specific evaluation module embedded in the inference loop that scores drafts against clinical standards and drives refinement.
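The loop described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `refine_loop`, `Memory`, `toy_planner`, `toy_judge`, the score threshold, and the iteration cap are all hypothetical names and values, and the toy judge simply checks for three required keywords instead of applying clinical standards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Accumulates (plan, score, feedback) triples across iterations."""
    history: List[Tuple[str, float, str]] = field(default_factory=list)


def refine_loop(
    case: str,
    planner: Callable[[str, Memory], str],
    judge: Callable[[str, str], Tuple[float, str]],
    threshold: float = 90.0,  # assumed stopping score, not from the paper
    max_iters: int = 5,       # assumed iteration cap, not from the paper
) -> Tuple[str, float, Memory]:
    """Generate a plan, judge it, and refine until the score is high enough."""
    memory = Memory()
    best_plan, best_score = "", float("-inf")
    for _ in range(max_iters):
        plan = planner(case, memory)          # generate (or revise) a draft
        score, feedback = judge(case, plan)   # judge the draft
        memory.history.append((plan, score, feedback))
        if score > best_score:
            best_plan, best_score = plan, score
        if score >= threshold:                # good enough: stop refining
            break
    return best_plan, best_score, memory


# Toy stand-ins for the LLM planner and the judge (purely illustrative).
REQUIRED = ("dosage", "monitoring", "contraindications")


def toy_judge(case: str, plan: str) -> Tuple[float, str]:
    """Score = share of required components present; feedback = first gap."""
    missing = [c for c in REQUIRED if c not in plan]
    score = 100.0 * (len(REQUIRED) - len(missing)) / len(REQUIRED)
    return score, (missing[0] if missing else "")


def toy_planner(case: str, memory: Memory) -> str:
    """First call drafts a bare plan; later calls add what the judge asked for."""
    if not memory.history:
        return f"plan for {case}:"
    prev_plan, _, feedback = memory.history[-1]
    return f"{prev_plan} {feedback}".strip()


plan, score, memory = refine_loop("hypertension", toy_planner, toy_judge)
print(plan, score, len(memory.history))
```

Keeping the best-scoring plan rather than the last one is a design choice of this sketch: it guards against a late revision that the judge scores lower than an earlier draft.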

If this is right

Treatment plans reach higher accuracy and completeness than one-shot LLM methods on HealthBench.
Expert evaluators select the AI-generated plans over physician plans 86 percent of the time with better targeting and harm control.
TheraJudge evaluations align closely with external benchmark scores, allowing the system to self-monitor without constant external checks.
Initial coarse drafts can be turned into regimens that better satisfy clinical standards through repeated internal refinement.
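One common way to quantify the agreement mentioned in this list is a rank correlation between the internal judge scores and the external benchmark scores on the same cases. The sketch below assumes that reading; the source does not say which agreement statistic was used, and the score lists are made-up numbers for illustration only.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-case scores (not data from the paper).
judge_scores = [62, 70, 75, 81, 90]
benchmark_scores = [0.40, 0.48, 0.47, 0.60, 0.71]
print(round(spearman(judge_scores, benchmark_scores), 3))
```

A high correlation shows that the two scorers rank cases similarly; it does not by itself show that either one tracks clinical correctness.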

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generate-judge-refine structure could extend to other domains that require iterative revision of high-stakes outputs, such as legal or engineering documents.
If the judge model carries systematic blind spots, repeated refinement might lock in those biases rather than correct them.
Real-world deployment would need tests on live patient data to check whether benchmark gains translate when case details are incomplete or noisy.
The 86 percent preference rate raises the possibility of using such agents to review or challenge human plans, provided oversight mechanisms remain in place.
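The bias lock-in concern in this list can be made concrete with a toy simulation. Everything here is an invented illustration, not a result from the paper: the three quality dimensions, the edit effects, and the assumption that the judge ignores `rare_contraindications` are all hypothetical.

```python
from typing import Callable, Dict, List

Plan = Dict[str, float]


def judge_score(plan: Plan) -> float:
    """Hypothetical judge with a blind spot: it never looks at
    'rare_contraindications'."""
    return (plan["completeness"] + plan["targeting"]) / 2


def true_quality(plan: Plan) -> float:
    """Hypothetical ground truth: the mean over all dimensions."""
    return sum(plan.values()) / len(plan)


def greedy_refine(plan: Plan, edits: List[Plan],
                  judge: Callable[[Plan], float]) -> Plan:
    """Apply each candidate edit only if the judge score strictly improves."""
    current = dict(plan)
    for edit in edits:
        candidate = {k: current[k] + edit.get(k, 0.0) for k in current}
        if judge(candidate) > judge(current):
            current = candidate
    return current


START: Plan = {"completeness": 0.5, "targeting": 0.5,
               "rare_contraindications": 0.8}
EDITS: List[Plan] = [
    # Adds content the judge rewards, at a cost the judge cannot see.
    {"completeness": 0.3, "rare_contraindications": -0.4},
    {"targeting": 0.2, "rare_contraindications": -0.2},
    # Repairs the hidden dimension, but the judge sees only a small loss.
    {"completeness": -0.05, "rare_contraindications": 0.2},
]

final = greedy_refine(START, EDITS, judge_score)
print(f"judge: {judge_score(START):.2f} -> {judge_score(final):.2f}")
print(f"true:  {true_quality(START):.2f} -> {true_quality(final):.2f}")
```

In this toy setup the judge score rises while the hypothetical true quality falls, which is the failure pattern an independent evaluation would need to rule out.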

Load-bearing premise

TheraJudge provides an accurate and unbiased proxy for clinical safety, and the iterative loop improves plans without introducing new errors or overfitting to the judge.

What would settle it

An independent set of physician ratings on the same cases where TheraJudge scores and expert win rates diverge from the reported 86 percent preference or from HealthBench results.

Figures

Figures reproduced from arXiv: 2605.05963 by Junkai Li, Tianyi Zhu, Weizhi Ma, Yang Liu, Yunghwei Lai, Zheng Long Lee.

**Figure 1.** Comparison of treatment plan generation sce…

**Figure 2.** Overview of the TheraAgent framework. TheraAgent performs treatment planning through a self-improving inference pipeline. Given a patient case P, the Planner generates a therapeutic regimen T_k at iteration k, which is then assessed by TheraJudge, producing multi-dimensional scores using RAG and few-shot examples. The generated schedule and its evaluation are incorporated into the Memorizer to form M_k, whi…

**Figure 3.** Generalization analysis of TheraAgent across four medical departments. The plot compares the Health…

**Figure 4.** Expert evaluation on Real Medical Cases. Top: Three-way preference rankings (left) and 5-point rating distributions (right), with numbers indicating the absolute count for each score. Bottom: Pairwise comparisons across seven clinical dimensions against human physicians (left) and DeepSeek-R1 (right).

Adjacent text fragment: …ing from +8.6% to +14.6%. Notably, every model in every department exhibits a positive trajectory from its…

**Figure 5.** Inference-time scaling in TheraAgent: performance progressively improves across inference steps. Each point denotes the mean HealthBench score over cases, and the red dashed line (- -) shows an overall positive performance trend across iterations.

Adjacent cost table fragment (Method: Calls, Tokens, Time (s), Relative Cost): DeepSeek-R1: 1, 1,358, 30.6, 1.0×; Kimi-K2: 1, 1,764, 16.2, 2.1×; Claude-4-Sonnet: 1, 1,295, 23.6, 6.2×; Gemini-2.5-Pro: 1, 3,925, 50.…

**Figure 6.** Department distribution of the HealthBench…

**Figure 7.** Theme distribution of the HealthBench Dataset.

**Figure 8.** Disease distribution of the Real-World Case Dataset.

**Figure 9.** Demographic information of the Real-World Case Dataset.

**Figure 10.** Medical Judgement Dimensions.

Adjacent text fragment: …ment, where it scores 1.5 points lower than the best-performing model. Furthermore, TheraAgent also shows strong performance in multiple dimensions, especially on Completeness, surpassing every model in every department. These results highlight the outstanding capability of TheraAgent in ensuring completeness and avoiding critical omissions in its treatment plans, which c…

**Figure 11.** Comparison questions of the annotation interface.

**Figure 12.** Rating questions of the annotation interface.

**Figure 13.** Comparison of high-quality rating proportions across clinical dimensions. Data represents the percentage of expert ratings ≥ 4 (on a 5-point scale) for all real-world medical cases.

Adjacent text fragment: …multi-dimensional judging are optional and only included when the respective functions are enabled

**Figure 14.** All results on HealthBench across four departments.

Original abstract

Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the highly agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

TheraAgent's generate-judge-refine loop with a domain-specific TheraJudge is a reasonable way to add verification to LLM treatment planning, but the abstract supplies almost no methods or data to back the reported gains.

TheraAgent applies an agentic generate-judge-refine pipeline to therapeutic treatment planning, using a custom TheraJudge to iteratively improve LLM outputs. This is the main takeaway: it's a direct attempt to move beyond one-shot generation in a domain where errors matter. The framework does well by explicitly modeling the revision process that human experts use, and by building a treatment-specific judge into the loop rather than relying on generic evaluation. That setup could help catch incompleteness or safety issues that a single pass misses. The abstract also highlights strong results on HealthBench and an 86% expert win rate, which if backed up would be notable for this area.

The soft spots are mostly around the missing pieces. There are no descriptions of the experimental setup, datasets, baselines, or how the expert study was run. The high agreement between TheraJudge and HealthBench is presented as confirmation of reliability, but since both are internal to the system it leaves open the possibility that they share the same blind spots on things like rare contraindications. The iterative loop could also amplify any weaknesses in the judge over multiple steps. The stress-test concern about overfitting or error introduction seems valid given what's shown.

This paper is for researchers working on LLM agents for medical applications. Someone exploring ways to add verification to generation tasks would get value from the high-level design, though they'd need to implement and test the details themselves.

It deserves a serious referee because the problem is important and the proposed solution is coherent, even if the current evidence is preliminary. I'd recommend sending it for peer review with the expectation that the authors add full methods, ablations, and more rigorous validation to make the claims convincing.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TheraAgent, an agentic framework that replaces one-shot LLM generation of treatment plans with an iterative generate-judge-refine pipeline. A custom TheraJudge module is inserted into the loop to enforce clinical standards, with the goal of producing progressively more precise, complete, and safer plans. The central empirical claims are state-of-the-art results on HealthBench (leading in Accuracy and Completeness), an 86% win rate against physicians in expert evaluations (with superior Targeting and Harm Control), and high agreement between TheraJudge and HealthBench that purportedly validates the framework.

Significance. If the reported gains are reproducible and the iterative loop demonstrably improves clinical safety without introducing new errors, the work would represent a useful step toward reliable agentic systems for therapeutic planning. The explicit mirroring of human iterative revision and the embedding of a domain-specific judge are conceptually sound strengths that could generalize beyond the current benchmark.

major comments (3)

[Abstract and Experiments] Abstract and Experiments section: The SOTA claims on HealthBench (leading Accuracy and Completeness) and the 86% expert win rate are stated without any description of the experimental protocol, baselines, statistical tests, error bars, or ablation studies that isolate the contribution of the generate-judge-refine loop versus one-shot generation.
[Expert Evaluations] Expert evaluation paragraph: The load-bearing claim of superior Targeting and Harm Control (and overall 86% win rate) cannot be assessed because no information is supplied on blinding, number of experts, sample size, inter-rater reliability, or whether the criteria were outcome-linked rather than subjective preference.
[TheraJudge and Evaluation] TheraJudge validation: The reported high agreement between TheraJudge and HealthBench is used to confirm reliability of the framework, yet no analysis of shared blind spots, failure modes (e.g., rare contraindications), or resistance to exploitation across iterations is provided; this leaves the self-improvement loop vulnerable to the circularity concern raised in the stress-test note.
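The request for error bars in the first major comment can be illustrated with a Wilson score interval for a win rate. This is a sketch under stated assumptions: the source does not report how many comparisons underlie the 86% figure, so the sample sizes 50 and 500 are hypothetical and chosen only to show how the interval width depends on sample size.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same 86% observed win rate at two hypothetical sample sizes.
for wins, n in [(43, 50), (430, 500)]:
    low, high = wilson_interval(wins, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The same headline percentage supports a much weaker claim at the smaller sample size, which is why the missing sample size matters for assessing the win rate.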

minor comments (2)

[Abstract] Abstract: 'the highly agreement' is grammatically incorrect and should read 'the high agreement'.
[Method] The manuscript would benefit from an explicit diagram or pseudocode of the generate-judge-refine loop and the precise scoring rubric used by TheraJudge.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the acknowledgment of the conceptual strengths of the iterative generate-judge-refine pipeline and its potential to generalize. We address each major comment below, agreeing where additional detail or analysis is needed, and describe the planned revisions.

Referee: [Abstract and Experiments] Abstract and Experiments section: The SOTA claims on HealthBench (leading Accuracy and Completeness) and the 86% expert win rate are stated without any description of the experimental protocol, baselines, statistical tests, error bars, or ablation studies that isolate the contribution of the generate-judge-refine loop versus one-shot generation.

Authors: We agree that the current manuscript lacks sufficient detail on these elements. In the revised version, we will expand the Experiments section to describe the full experimental protocol, specify all baselines (including one-shot LLM variants), report the statistical tests used along with p-values and effect sizes, include error bars or confidence intervals, and present ablation studies that isolate the contribution of the iterative loop and TheraJudge versus one-shot generation.

revision: yes

Referee: [Expert Evaluations] Expert evaluation paragraph: The load-bearing claim of superior Targeting and Harm Control (and overall 86% win rate) cannot be assessed because no information is supplied on blinding, number of experts, sample size, inter-rater reliability, or whether the criteria were outcome-linked rather than subjective preference.

Authors: We acknowledge the omission of these methodological details. We will revise the expert evaluation section to report the number and qualifications of the experts, the blinding procedure, the sample size of evaluated cases, inter-rater reliability metrics (such as Cohen's or Fleiss' kappa), and clarification that Targeting and Harm Control criteria were derived from predefined clinical outcome standards rather than subjective preference alone. Additional expert review will be conducted if required to complete the reporting.

revision: yes

Referee: [TheraJudge and Evaluation] TheraJudge validation: The reported high agreement between TheraJudge and HealthBench is used to confirm reliability of the framework, yet no analysis of shared blind spots, failure modes (e.g., rare contraindications), or resistance to exploitation across iterations is provided; this leaves the self-improvement loop vulnerable to the circularity concern raised in the stress-test note.

Authors: We agree that the validation requires strengthening to address potential circularity and shared limitations. In the revision, we will add analysis of shared blind spots between TheraJudge and HealthBench, case studies on failure modes including rare contraindications, and tests of the iterative loop's resistance to exploitation (e.g., via adversarial inputs and tracking of genuine improvement across iterations). This will be grounded in independent clinical guidelines to mitigate circularity concerns.

revision: yes
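The second response names Cohen's kappa as a candidate inter-rater reliability metric. A minimal sketch of that statistic for two raters follows; the preference labels are invented for illustration and do not come from the expert study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical preferences from two experts over six cases.
rater_a = ["AI", "AI", "MD", "AI", "MD", "AI"]
rater_b = ["AI", "MD", "MD", "AI", "MD", "AI"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Cohen's kappa covers exactly two raters; with more experts per case, Fleiss' kappa (also named in the response) would be the usual alternative.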

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper describes an agentic generate-judge-refine framework (TheraAgent) and an integrated evaluator (TheraJudge) but advances no mathematical derivations, first-principles results, or fitted-parameter predictions. All load-bearing claims—SOTA performance on HealthBench, 86% expert win rate, superior Targeting/Harm Control, and TheraJudge-HealthBench agreement—are presented as direct empirical outcomes from external benchmarks and human evaluations. No equations, self-definitional loops, or self-citation chains reduce the reported improvements to the framework's own inputs by construction. The internal use of TheraJudge for refinement is a design choice whose validity is checked against independent HealthBench and expert scores rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; all technical content is absent.

pith-pipeline@v0.9.0 · 5496 in / 1146 out tokens · 41542 ms · 2026-05-08T10:46:34.008186+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Med42-v2: A suite of clinical llms (arXiv 2025)
Towards medical complex reasoning with LLMs through medical verifiable problems. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14552–14573, Vienna, Austria. Association for Computational Linguistics. Clément Christophe, Praveen K Kanithi, Tathagata Raha, Shadab Khan, and Marco AF Pimentel. 2024. Med42-v2: A suite of clin...

[2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (internal anchor · arXiv 2025)
DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Preprint, arXiv:2501.12948. Lingxiao Diao, Xinyue Xu, Wanxuan Sun, Cheng Yang, and Zhuosheng Zhang. 2025. GuideBench: Benchmarking domain-oriented guideline following for LLM agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguis...

[3] Frontiers in Oncology, 15 (2025)
Exploring the role of artificial intelligence in chemotherapy development, cancer diagnosis, and treatment: present achievements and future outlook. Frontiers in Oncology, 15. Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, and Xiaobo Zhang. 2025. FineMedLM-o1: Enhancing medical knowledge reasoning abilit...

[4] In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5014–5035, Bangkok, Thailand (2025)
QueryAgent: A reliable and efficient reasoning framework with environmental feedback based self-correction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5014–5035, Bangkok, Thailand. Association for Computational Linguistics. Hrishikesh Khude and Pravin Shende. 2025. AI-driven...

[5]
Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study. Clinical and Experimental Medicine, 25(1). Xiaohong Liu, Hao Liu, Guoxing Yang, Zeyu Jiang, Shuguang Cui, Zhaoze Zhang, Huan Wang, Liyuan Tao, Yongchang Sun, Zhu Song, Tianpei Hong, Jin Yang, Tianrun Gao, Jiangj...

[6] (arXiv 2025)
Towards accurate differential diagnosis with large language models. Nature, 642(8067). Abdul M. Mohammed, Iqtidar Mansoor, Sarah Blythe, and Dennis Trujillo. 2025. Developing an artificial intelligence tool for personalized breast cancer treatment plans based on the NCCN guidelines. Preprint, arXiv:2502.15698. OpenAI. 2025a. Introducing GPT-4.1 in the API...

[7] Qwen3 Technical Report (internal anchor · arXiv 2025)
Meta-Reflection: A feedback-free reflection learning framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3958–3976, Vienna, Austria. Association for Computational Linguistics. xAI. 2025. Grok 3 Beta — the age of reasoning agents. xAI news release. Accessed 2025-12-30. Cil...

[8] Traditional lexical metrics (BLEU and ROUGE) are computed by directly comparing each model-generated output against the corresponding ideal_completion (2018)
These models are selected to represent a mixture of medical-specialized models and general-purpose large language models from different model families, ensuring diversity in generation style and reasoning behavior. Traditional lexical metrics (BLEU and ROUGE) are computed by directly comparing each model-generated output against the corresponding ide...

[9] Scientific Consensus Compliance (To what extent is the treatment plan consistent with established scientific and clinical consensus?)

[10] Plan Completeness (To what extent does the plan comprehensively address all necessary components without omission?)

[11] Situation Targeting (To what extent does the plan accurately reflect and address the patient's specific condition?)

[12] Rationale-Measure Coherence (To what extent is the reasoning behind the treatment plan logically connected to the proposed measures?)

[13] Harm Potential (What is the extent and likelihood of potential harm to the patient?)

[14] Information Accuracy & Relevance (To what extent does the plan contain inaccurate or irrelevant information?)

[15] Bias in Medical Content (To what extent does the plan exhibit bias or inapplicability to specific patient demographics?) ###Patient Case Details: {query} ###Treatment Plan to Evaluate: {treatment_plan} Please answer using the following format: <reason>[detailed explanation]</reason> <dimension_scores>[all dimension scores from 0 to 100]</...
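Entries [9] to [15] appear to be fragments of a judge prompt rather than references: seven scoring dimensions and a tagged reply format. A reply in that format could be parsed roughly as sketched here. Two things are assumptions, because the source truncates the template: that the closing tag is `</dimension_scores>`, and that the scores arrive as a bracketed list in the same order as the dimensions. `parse_judge_reply` and the sample reply are hypothetical.

```python
import re
from typing import Dict, Tuple

# Dimension names as they appear in entries [9] to [15].
DIMENSIONS = (
    "Scientific Consensus Compliance",
    "Plan Completeness",
    "Situation Targeting",
    "Rationale-Measure Coherence",
    "Harm Potential",
    "Information Accuracy & Relevance",
    "Bias in Medical Content",
)


def parse_judge_reply(reply: str) -> Tuple[str, Dict[str, float]]:
    """Extract the free-text reason and one 0-100 score per dimension.

    Assumes a closing </dimension_scores> tag and scores listed in the
    order of DIMENSIONS; neither detail is confirmed by the source.
    """
    reason = re.search(r"<reason>(.*?)</reason>", reply, re.DOTALL)
    scores = re.search(
        r"<dimension_scores>(.*?)</dimension_scores>", reply, re.DOTALL
    )
    if reason is None or scores is None:
        raise ValueError("reply lacks a <reason> or <dimension_scores> block")
    values = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", scores.group(1))]
    if len(values) != len(DIMENSIONS):
        raise ValueError(f"expected {len(DIMENSIONS)} scores, got {len(values)}")
    if any(not 0 <= v <= 100 for v in values):
        raise ValueError("scores must lie between 0 and 100")
    return reason.group(1).strip(), dict(zip(DIMENSIONS, values))


# Hypothetical judge reply, for illustration only.
sample = (
    "<reason>Covers dosing.</reason>"
    "<dimension_scores>[90, 85, 80, 88, 70, 92, 95]</dimension_scores>"
)
reason_text, dimension_scores = parse_judge_reply(sample)
print(dimension_scores["Plan Completeness"])
```

Rejecting malformed replies instead of guessing missing scores keeps a parsing failure from silently feeding a wrong score back into the refinement loop.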
