Recurrent Deep Reinforcement Learning for Chemotherapy Control under Partial Observability
Recognition: 3 theorem links · Lean theorem
Pith reviewed 2026-05-08 18:45 UTC · model grok-4.3
The pith
Recurrent reinforcement learning policies achieve steadier tumor control and better healthy-cell preservation when chemotherapy dosing decisions must be made from incomplete patient observations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that recurrent TD3 agents equipped with separate LSTM actor and critic networks outperform both feed-forward TD3 and Soft Actor-Critic baselines under partial observability on the AhnChemoEnv benchmark. Across ten random seeds, recurrence produces only modest improvement when the full state is available, yet yields markedly stronger and more stable results when observations contain noise and hidden-state uncertainty: tumor suppression is more consistent and normal cells are better preserved. Pharmacokinetic and pharmacodynamic variability are held fixed throughout, so the gains are attributable to observation uncertainty rather than inter-patient variability.
What carries the argument
Recurrent TD3 with separate LSTM actor and critic networks that maintain hidden state across time steps to compensate for incomplete observations.
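The mechanism can be sketched minimally as follows. This is an illustrative stand-in, not the paper's implementation: a plain tanh-RNN cell replaces the LSTM, and dimensions and weights are arbitrary. The point is the interface difference from a feed-forward policy: the hidden state persists across time steps, so each dosing decision conditions on the noisy observation history.

```python
import numpy as np

rng = np.random.default_rng(0)

class RecurrentActor:
    """Minimal recurrent actor sketch (tanh-RNN cell in place of the
    paper's LSTM actor); hidden state carries information across steps."""
    def __init__(self, obs_dim, hidden_dim, act_dim):
        self.W_in = rng.standard_normal((hidden_dim, obs_dim)) * 0.1
        self.W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
        self.W_out = rng.standard_normal((act_dim, hidden_dim)) * 0.1
        self.h = np.zeros(hidden_dim)

    def reset(self):
        # Called at the start of each treatment episode.
        self.h = np.zeros_like(self.h)

    def act(self, obs):
        # The hidden state integrates the noisy observation history,
        # compensating for unobserved components of the patient state.
        self.h = np.tanh(self.W_in @ obs + self.W_h @ self.h)
        return np.tanh(self.W_out @ self.h)  # bounded dose signal in [-1, 1]

actor = RecurrentActor(obs_dim=3, hidden_dim=8, act_dim=1)
obs = np.array([0.5, -0.2, 0.1])
a1 = actor.act(obs)
h1 = actor.h.copy()
a2 = actor.act(obs)  # same observation, different internal state
```

Because the hidden state evolves, the same observation can map to different actions at different points in a trajectory, which is exactly what a memoryless feed-forward policy cannot do.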
If this is right
- Recurrent policies deliver only modest gains when the full patient state is observable.
- Under partial observability the same recurrent policies produce substantially stronger and more stable performance across random seeds.
- Memory augmentation leads to more consistent tumor suppression across treatment episodes.
- Recurrent agents preserve normal cells better than feed-forward counterparts when observations are noisy.
- The performance difference is isolated to observation uncertainty because pharmacokinetic and pharmacodynamic parameters are held fixed.
Where Pith is reading between the lines
- Similar memory mechanisms could help reinforcement learning in other medical domains where key state variables are unobserved or delayed.
- Real clinical systems might improve by feeding patient history directly into policies rather than relying on instantaneous measurements alone.
- Testing the same recurrent architecture on environments that also vary patient-specific parameters would clarify whether the benefit persists beyond controlled benchmarks.
Load-bearing premise
The AhnChemoEnv benchmark with fixed pharmacokinetic and pharmacodynamic variability plus added observation noise adequately captures the partial observability and uncertainty found in actual clinical chemotherapy practice.
What would settle it
A follow-up experiment on real patient monitoring data or a simulation that includes realistic inter-patient variability showing no advantage in tumor control or toxicity for recurrent over non-recurrent policies.
Figures
read the original abstract
Chemotherapy dose optimization can be formulated as a dynamic treatment regime, requiring sequential decisions under uncertainty that must balance tumor suppression against toxicity. However, most reinforcement learning approaches assume full observability of the patient state, a condition rarely met in clinical practice. We investigate whether memory-augmented policies can improve chemotherapy control under partial observability. To this end, we employ a recurrent TD3-based approach with separate LSTM actor-critic networks and evaluate it on the AhnChemoEnv benchmark from DTR-Bench, considering both off-policy and on-policy recurrent architectures against feed-forward TD3 and Soft Actor-Critic. Pharmacokinetic and pharmacodynamic variability are held fixed to isolate hidden-state uncertainty and observation noise and to avoid confounding effects from inter-patient variability. Across ten random seeds, recurrence yields modest benefit under full observability but substantially stronger and more stable performance under partial observability, with more consistent tumor suppression and improved normal-cell preservation. These findings indicate that memory-based policies are particularly beneficial when clinically relevant state information is incomplete or noisy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates chemotherapy dose optimization as a dynamic treatment regime under uncertainty and examines whether memory-augmented policies improve performance when patient state is only partially observable. It implements a recurrent TD3 variant using separate LSTM actor and critic networks, compares it to feed-forward TD3 and SAC (both off- and on-policy) on the AhnChemoEnv benchmark, and deliberately fixes pharmacokinetic/pharmacodynamic parameters while adding observation noise to isolate hidden-state effects. Across ten random seeds the results indicate modest gains from recurrence under full observability but substantially stronger and more stable tumor suppression together with better normal-cell preservation under partial observability.
Significance. If the empirical findings are robust, the work supplies concrete evidence that recurrent architectures can mitigate the performance degradation caused by incomplete or noisy state information in a clinically motivated control task. The explicit isolation of observation noise from inter-patient variability is a methodological strength that permits clear attribution of benefits to memory. The study therefore contributes to the growing literature on partial-observability handling in medical RL and offers a reproducible benchmark comparison that future work can extend.
major comments (2)
- [Abstract] The headline claim that recurrence produces 'substantially stronger and more stable performance under partial observability' is presented without accompanying effect sizes, confidence intervals, or statistical tests across the ten seeds; this absence makes it impossible to judge whether the reported stability is statistically distinguishable from noise or from the feed-forward baselines.
- [Abstract] By holding PK/PD parameters fixed while adding only observation noise, the experimental design deliberately excludes inter-patient variability; the interpretation that the observed gains address 'clinically relevant state information' therefore rests on an assumption that real partial observability is dominated by additive noise rather than by the need to infer patient-specific parameters from noisy trajectories, an assumption whose validity is not tested within the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below, indicating where revisions will be made to strengthen the presentation.
read point-by-point responses
-
Referee: [Abstract] The headline claim that recurrence produces 'substantially stronger and more stable performance under partial observability' is presented without accompanying effect sizes, confidence intervals, or statistical tests across the ten seeds; this absence makes it impossible to judge whether the reported stability is statistically distinguishable from noise or from the feed-forward baselines.
Authors: We agree that quantitative support is needed to substantiate the headline claim. In the revised manuscript we will expand the abstract to report effect sizes (mean differences in tumor volume reduction and normal-cell preservation), standard deviations across the ten seeds, and p-values from paired statistical tests (t-tests or Wilcoxon signed-rank) comparing recurrent versus feed-forward agents under partial observability. These additions will allow readers to assess whether the observed improvements exceed what could be attributed to random variation. revision: yes
-
Referee: [Abstract] By holding PK/PD parameters fixed while adding only observation noise, the experimental design deliberately excludes inter-patient variability; the interpretation that the observed gains address 'clinically relevant state information' therefore rests on an assumption that real partial observability is dominated by additive noise rather than by the need to infer patient-specific parameters from noisy trajectories, an assumption whose validity is not tested within the manuscript.
Authors: The referee correctly notes that our design fixes PK/PD parameters to isolate observation noise and hidden-state effects. This choice was intentional, as stated in the manuscript, to prevent confounding from inter-patient variability and to attribute performance differences specifically to the recurrent architecture's memory. We do not claim the setup captures every clinical source of partial observability. In the revision we will rephrase the abstract to avoid over-generalization, explicitly state that parameters are held fixed, and add a short limitations paragraph in the discussion acknowledging that future work should examine joint inference of patient-specific parameters from noisy trajectories. revision: partial
Circularity Check
Empirical RL benchmark evaluation with no derivation chain or self-referential reductions
full rationale
The paper reports an empirical comparison of recurrent TD3 and other RL agents on the external AhnChemoEnv benchmark from DTR-Bench. It evaluates performance under full vs. partial observability by holding PK/PD parameters fixed and adding observation noise. No mathematical derivations, predictions, or first-principles results are claimed that could reduce to fitted inputs or self-citations. Training follows standard off-policy RL optimization; results are reported across random seeds without reuse of target metrics in the objective. The fixed-variability design is an explicit experimental control, not a circular definition. This is a self-contained empirical study with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The chemotherapy environment can be modeled as a partially observable Markov decision process with fixed pharmacokinetic and pharmacodynamic parameters.
- standard math Standard TD3 and Soft Actor-Critic training procedures remain valid when actor and critic are replaced by LSTM networks.
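The first axiom, that the environment is a POMDP with fixed PK/PD parameters, can be illustrated with a small observation wrapper. This is an assumption-laden sketch, not the actual AhnChemoEnv interface: the state layout, the observed indices, and the noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(state, noise_std=0.05, observed=(0, 1)):
    """Illustrative partial-observability wrapper: expose only a subset
    of the state components, corrupted by additive Gaussian noise.
    PK/PD parameters are held fixed outside this function, so the only
    uncertainty introduced is in the observation channel."""
    partial = np.asarray(state, dtype=float)[list(observed)]
    return partial + rng.normal(0.0, noise_std, size=partial.shape)

# Hypothetical 4-component patient state, e.g. (normal cells, tumor cells,
# drug concentration, immune response); only the first two are observed.
state = np.array([0.8, 0.6, 0.3, 0.1])
obs = observe(state)
```

Under this construction the agent never sees components 2 and 3, which is why a memoryless policy degrades and a recurrent one can partially recover them from the observation history.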
Lean theorems connected to this paper
-
IndisputableMonolith.Cost (Jcost) · washburn_uniqueness_aczel — tagged unclear
unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
the step reward is given by r_t = N_t/N_0 − T_t/T_0 − u_t, which rewards preservation of normal cells, penalizes tumor burden, and discourages excessive drug administration
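The quoted step reward is a one-line function. The sketch below uses made-up values to show the sign conventions; the baselines N_0 and T_0 normalize the two cell populations so their terms are comparable.

```python
def step_reward(N_t, T_t, u_t, N_0, T_0):
    """Per-step reward as quoted above: r_t = N_t/N_0 - T_t/T_0 - u_t.
    Rewards normal-cell preservation, penalizes tumor burden,
    and discourages excessive drug administration u_t."""
    return N_t / N_0 - T_t / T_0 - u_t

# Illustrative values (not from the paper): 90% of normal cells preserved,
# tumor halved from baseline, moderate dose.
r = step_reward(N_t=0.9, T_t=0.25, u_t=0.1, N_0=1.0, T_0=0.5)  # 0.9 - 0.5 - 0.1 = 0.3
```

Note that a dose u_t > 0 always costs reward immediately; it pays off only through the future reduction of the T_t/T_0 term, which is what makes this a sequential decision problem rather than a per-step optimization.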
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] R. L. Siegel, K. D. Miller, N. S. Wagle, and A. Jemal, "Cancer statistics, 2026," CA: A Cancer Journal for Clinicians, vol. 76, no. 1, pp. 17–48, 2026.
- [2] T. Chen, N. F. Kirkby, and R. Jena, "Optimal dosing of cancer chemotherapy using model predictive control and moving horizon state/parameter estimation," vol. 108, no. 3, pp. 973–983, 2012.
- [3] I. S. Chan and G. S. Ginsburg, "Personalized medicine: Progress and promise," Annual Review of Genomics and Human Genetics, vol. 12, pp. 217–244, 2011.
- [4] S. A. Murphy, "Optimal dynamic treatment regimes," Journal of the Royal Statistical Society Series B: Statistical Methodology, vol. 65, no. 2, pp. 331–355, 2003.
- [5] B. Chakraborty and S. A. Murphy, "Dynamic treatment regimes," Annual Review of Statistics and Its Application, vol. 1, pp. 447–464, 2014.
- [6] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., Cambridge, MA, USA, 2018.
- [7] N. Deliu, "Reinforcement learning for sequential decision making in population research," Quality & Quantity, vol. 58, pp. 5057–5080. [Online]. Available: https://doi.org/10.1007/s11135-023-01755-z
- [9] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
- [10] H.-E. Tseng, C.-Y. Liao, C.-H. Hsu, and L.-C. Fu, "Deep reinforcement learning for personalized chemotherapy treatment," in 2017 IEEE Healthcare Innovations and Point of Care Technologies (HI-POCT), 2017, pp. 176–179.
- [11] R. Padmanabhan, N. Meskin, and W. M. Haddad, "Reinforcement learning-based control of drug dosing for cancer chemotherapy treatment," Mathematical Biosciences, vol. 293, pp. 11–20, 2017.
- [12] J. D. Martín-Guerrero, F. Gomez, E. Soria-Olivas, J. Schmidhuber, M. Climente-Martí, and N. V. Jiménez-Torres, "A reinforcement learning approach for individualizing erythropoietin dosages in hemodialysis patients," Expert Systems with Applications, no. 6, pp. 9737–9742.
- [13] Z. Luo, M. Zhu, F. Liu, J. Li, Y. Pan, J. Zhou, and T. Zhu, "DTR-Bench: An in silico environment and benchmark platform for reinforcement learning based dynamic treatment regime," 2024.
- [14] Y. Liang et al., "A systematic review of dynamic treatment regime methods in healthcare," Computer Methods and Programs in Biomedicine, 2025. [Online]. Available: https://dspace.library.uu.nl/server/api/core/bitstreams/e5067da0-a232-465a-b58a-729fa0890aa7/content
- [15] H. Mashayekhi, M. Nazari, F. Jafarinejad, and N. Meskin, "Deep reinforcement learning-based control of chemo-drug dose in cancer treatment," Computer Methods and Programs in Biomedicine, vol. 243, p. 107884, 2024.
- [16] H. R. Chinaei and B. Chaib-Draa, "An inverse reinforcement learning algorithm for partially observable domains with application on healthcare dialogue management," in 2012 11th International Conference on Machine Learning and Applications, vol. 1, 2012, pp. 144–149.
- [17] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, pp. 99–134, 1998.
- [18] A. Talkington, C. Dantoin, and R. Durrett, "Ordinary differential equation models for adoptive immunotherapy," Bulletin of Mathematical Biology, vol. 80, pp. 1059–1083, 2018.
- [19] L. G. De Pillis and A. Radunskaya, "A mathematical tumor model with immune resistance and drug therapy: An optimal control approach," Computational and Mathematical Methods in Medicine, vol. 3, no. 2, p. 318436, 2001.
- [20] T. Ni, B. Eysenbach, and R. Salakhutdinov, "Recurrent model-free RL can be a strong baseline for many POMDPs," 2022. [Online]. Available: https://arxiv.org/abs/2110.05038
- [22] "Addressing function approximation error in actor-critic methods." [Online]. Available: http://arxiv.org/abs/1802.09477
- [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
- [24] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-Baselines3: Reliable reinforcement learning implementations," Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021. [Online]. Available: http://jmlr.org/papers/v22/20-1364.html
- [25] S. Kapturowski, G. Ostrovski, J. Quan, R. Munos, and W. Dabney, "Recurrent experience replay in distributed reinforcement learning," in International Conference on Learning Representations (ICLR), 2019. [Online]. Available: https://openreview.net/forum?id=r1lyTjAqYX
- [26] M. Komorowski, L. A. Celi, O. Badawi, A. C. Gordon, and A. A. Faisal, "The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care," Nature Medicine, vol. 24, no. 11, pp. 1716–1720, 2018.
- [27] S. Levine, A. Kumar, G. Tucker, and J. Fu, "Offline reinforcement learning: Tutorial, review, and perspectives on open problems," arXiv preprint arXiv:2005.01643, 2020. [Online]. Available: https://arxiv.org/abs/2005.01643
- [28] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-learning for offline reinforcement learning," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 1179–1191.