Compositional Q-learning for electrolyte repletion with imbalanced patient sub-populations

Aishwarya Mandyam; Andrew Jones; Barbara Engelhardt; Jiayu Yao; Krzysztof Laudanski

arxiv: 2110.02879 · v2 · submitted 2021-10-06 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-24 12:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learning · compositional learning · Q-learning · medical decision making · electrolyte repletion · class imbalance · patient heterogeneity · renal disease


The pith

Compositional fitted Q-iteration learns distinct policies for patient subgroups while sharing knowledge across variants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Compositional Fitted Q-iteration to solve sequential decision-making problems in medicine where patients respond differently to treatments. It structures tasks as compositional variants of increasing difficulty that correspond to different patient subpopulations, such as those with and without renal disease. By using a Q-value function with separate modules for each variant, the method shares knowledge while learning distinct policies. This makes CFQI robust to class imbalance, allowing better use of data from all groups. If correct, it supports more effective personalized electrolyte repletion recommendations in clinical settings with known task structures.

Core claim

CFQI uses a compositional task structure to represent heterogeneous treatment responses in medical care settings. A compositional task consists of several variations of the same task, each progressing in difficulty; solving simpler variants enables efficient solving of harder variants. CFQI uses a compositional Q-value function with separate modules for each task variant, allowing it to take advantage of shared knowledge while learning distinct policies for each variant. Validation on Cartpole and electrolyte repletion data for patients with and without renal disease shows robustness to class imbalance.
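As a rough illustration of the fitted Q-iteration step that CFQI builds on, the sketch below computes the bootstrapped regression targets for one iteration, with each transition carrying a group label `z`. The function name `fqi_targets`, the tuple layout, the toy Q-function, and the discount `gamma` are assumptions made for this example; none of them are taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# (state, action, reward, next_state, done, z): this layout is an assumption.
Transition = Tuple[float, int, float, float, bool, int]


def fqi_targets(
    batch: Sequence[Transition],
    q: Callable[[float, int, int], float],
    actions: Sequence[int],
    gamma: float = 0.99,
) -> List[float]:
    """Return r + gamma * max_a' q(s', a', z) for each transition in the batch."""
    targets = []
    for _state, _action, reward, next_state, done, z in batch:
        if done:
            # Terminal transitions do not bootstrap from the next state.
            targets.append(reward)
        else:
            best_next = max(q(next_state, a, z) for a in actions)
            targets.append(reward + gamma * best_next)
    return targets


# Toy Q-function and batch, purely illustrative.
toy_q = lambda s, a, z: s + a + 10 * z
toy_batch = [
    (0.0, 0, 1.0, 2.0, False, 0),
    (0.0, 1, 1.0, 2.0, False, 1),
    (0.0, 0, 5.0, 0.0, True, 1),
]
toy_targets = fqi_targets(toy_batch, toy_q, actions=[0, 1], gamma=0.5)
```

In a full loop, a regressor would be refit on these targets and the step repeated; because `z` is passed through to `q`, a group-aware Q-function can produce different targets for the same state and action.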

What carries the argument

Compositional Q-value function with separate modules for each task variant
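The decomposition quoted from the paper further down this page, f(s,a,z) = g_s(s,a) + 1{z=1} g_f(s,a), can be sketched with two small modules. This is a minimal illustration that assumes linear modules over a hand-built feature vector; the names `features`, `compositional_q`, and `greedy_action`, and all weights, are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def features(state: Sequence[float], action: int) -> List[float]:
    """Concatenate state, action, and a bias term (an assumed featurization)."""
    return [*state, float(action), 1.0]


def linear(weights: Sequence[float], x: Sequence[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x))


def compositional_q(state, action, z, w_shared, w_fg) -> float:
    """f(s, a, z) = g_s(s, a) + 1{z = 1} * g_f(s, a) with linear g_s and g_f."""
    x = features(state, action)
    q = linear(w_shared, x)      # shared module, used for every sample
    if z == 1:
        q += linear(w_fg, x)     # extra module, used only when z = 1
    return q


def greedy_action(state, z, actions, w_shared, w_fg) -> int:
    return max(actions, key=lambda a: compositional_q(state, a, z, w_shared, w_fg))


# Hypothetical weights: the z = 1 module penalizes action 1.
W_SHARED = [0.5, 1.0, 0.0]
W_FG = [0.0, -2.0, 0.0]
```

With these toy weights the greedy action for the same state differs by group, which is the behavior the separate module is meant to allow while the shared module is reused by both groups.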

If this is right

Robust performance in medical RL even when patient subpopulations are imbalanced.
Effective information usage across patient sub-populations with different treatment needs.
Distinct policies learned for variants corresponding to patients with chronic conditions like renal disease.
Applicability to clinical scenarios characterized by known compositional task structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to other sequential medical decisions such as medication dosing if similar compositional structures are identified.
It may reduce the volume of data needed from rare patient groups to train effective policies.
Further experiments could test performance under varying imbalance ratios or on different chronic conditions.

Load-bearing premise

The medical decision problem possesses a known compositional task structure in which simpler variants can be solved to enable efficient solving of harder variants that correspond to distinct patient sub-populations.

What would settle it

Running CFQI on the electrolyte repletion data split by renal disease status and finding no performance advantage over standard fitted Q-iteration on the minority subgroup would challenge the robustness claim.

Figures

Figures reproduced from arXiv: 2110.02879 by Aishwarya Mandyam, Andrew Jones, Barbara Engelhardt, Jiayu Yao, Krzysztof Laudanski.

**Figure 2.** NFQI outperforms related algorithms in a nested Cartpole environment. Increasing the ...

**Figure 3.** SHAP plots for background (Panel a) and foreground (Panel b) samples from the Cartpole ...

**Figure 4.** NFQI is robust to imbalance in foreground and background sample sizes. We fix the total ...

**Figure 5.** NFQI does not estimate practically different policies for two groups when there is no ...

**Figure 6.** Visualizing FQI and NFQI policies for non-renal and renal patients. Heatmaps indicate ...

**Figure 7.** NFQI mean SHAP values for renal (blue) and non-renal (red) patients. Y-axis shows ...

**Figure 8.** Performance of FQI in the Cartpole environment using two different approximation func...

**Figure 9.** A neural network-based version of NFQI outperforms related algorithms and a linear ...

Original abstract

Reinforcement learning (RL) is an effective framework for solving sequential decision-making tasks. However, applying RL methods in medical care settings is challenging in part due to heterogeneity in treatment response among patients. Some patients can be treated with standard protocols whereas others, such as those with chronic diseases, need personalized treatment planning. Traditional RL methods often fail to account for this heterogeneity, because they assume that all patients respond to the treatment in the same way (i.e., transition dynamics are shared). We introduce Compositional Fitted $Q$-iteration (CFQI), which uses a compositional task structure to represent heterogeneous treatment responses in medical care settings. A compositional task consists of several variations of the same task, each progressing in difficulty; solving simpler variants of the task can enable efficient solving of harder variants. CFQI uses a compositional $Q$-value function with separate modules for each task variant, allowing it to take advantage of shared knowledge while learning distinct policies for each variant. We validate CFQI's performance using a Cartpole environment and use CFQI to recommend electrolyte repletion for patients with and without renal disease. Our results demonstrate that CFQI is robust even in the presence of class imbalance, enabling effective information usage across patient sub-populations. CFQI exhibits great promise for clinical applications in scenarios characterized by known compositional structures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CFQI adds per-variant modules to fitted Q-iteration for electrolyte repletion and claims robustness under imbalance, but the compositional transfer between renal and non-renal groups is asserted rather than shown.


The paper introduces CFQI as fitted Q-iteration with a compositional Q-function that keeps separate modules for each task variant. The authors first check it on Cartpole, then apply it to electrolyte repletion by splitting patients into renal-disease and non-renal groups, arguing that the split lets the method share knowledge while handling the smaller subgroup better than standard approaches. That is the concrete new piece: a modular extension of FQI tied to a known patient split in a clinical sequential decision task.

The motivation is sound; heterogeneity and imbalance are real obstacles when RL moves from simulation to hospital data, and the modular structure is a direct response. The Cartpole results at least confirm the algorithm runs and can solve the base task. The medical application is a reasonable next step for anyone thinking about structured RL in chronic care.

The soft spot is the load-bearing assumption that renal status creates a compositional progression where the non-renal variant is simpler and its solution transfers through the shared modules. The abstract states the structure is known but supplies no derivation, ablation, or transfer metric showing that the modules actually move information from one group to the other rather than simply learning two policies in parallel. If that premise does not hold, the robustness claim reduces to ordinary multi-task learning and the compositionality label does not explain the result. No performance tables or error bars appear in the abstract, so the strength of the medical evidence cannot be judged from what is given.

The work is aimed at researchers already using RL for treatment planning who want to incorporate patient subgroups explicitly. A reader working on modular or multi-task RL in healthcare would find the formulation useful to see, even if the empirical support for the compositional benefit stays thin. It is coherent enough on its own terms to deserve referee time; the central idea is stated clearly and the clinical setting is relevant.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Compositional Fitted Q-iteration (CFQI), an extension of fitted Q-iteration that represents heterogeneous patient responses via a compositional Q-value function with separate modules for each task variant. It applies CFQI to electrolyte repletion recommendations, distinguishing patients with and without renal disease, and claims that the compositional structure confers robustness to class imbalance by enabling effective information sharing across sub-populations. Validation is reported on a Cartpole environment and on patient data.

Significance. If the compositional task premise is substantiated, the approach could offer a structured way to improve sample efficiency and policy quality for RL in medical domains with known task variants and imbalanced subpopulations, extending standard multi-task RL methods.

major comments (2)

[Abstract] The headline robustness claim requires that the electrolyte-repletion MDP possesses a known compositional structure in which the no-renal-disease variant is a simpler task whose solution transfers to the renal-disease variant via shared modules. The manuscript supplies no derivation or empirical check that the two patient groups actually stand in this difficulty-ordered, transferable relationship rather than being two independent MDPs; without such evidence the reported robustness cannot be attributed to compositionality and CFQI collapses to ordinary multi-task FQI.
[Validation and medical application sections] No ablation results, error bars, or quantitative comparison to non-compositional baselines (e.g., standard FQI or multi-task FQI) are described under controlled imbalance ratios, leaving the central claim that compositionality drives the robustness unverified.
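One way to set up the controlled imbalance ratios mentioned in the second comment is to hold the total sample count fixed and vary only the foreground share, in the spirit of the Figure 4 caption ("We fix the total ..."). The helper below is a hedged sketch: the name `subsample_with_imbalance`, its parameters, and the seeding scheme are assumptions for illustration and are not the paper's protocol.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def subsample_with_imbalance(
    background: Sequence[T],
    foreground: Sequence[T],
    total: int,
    fg_fraction: float,
    seed: int = 0,
) -> Tuple[List[T], List[T]]:
    """Draw `total` samples without replacement, with about `fg_fraction` from the foreground."""
    if not 0.0 <= fg_fraction <= 1.0:
        raise ValueError("fg_fraction must lie in [0, 1]")
    n_fg = round(total * fg_fraction)
    n_bg = total - n_fg
    if n_fg > len(foreground) or n_bg > len(background):
        raise ValueError("not enough samples for the requested split")
    rng = random.Random(seed)  # local generator so repeated runs are reproducible
    return rng.sample(list(background), n_bg), rng.sample(list(foreground), n_fg)


# Toy pools standing in for background and foreground transitions.
bg_pool = list(range(100))
fg_pool = list(range(100, 150))
bg_sub, fg_sub = subsample_with_imbalance(bg_pool, fg_pool, total=40, fg_fraction=0.1)
```

Sweeping `fg_fraction` and `seed` while keeping `total` constant would give the repeated runs needed to report means with error bars for each method under comparison.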

minor comments (2)

[Abstract] The abstract states that 'solving simpler variants of the task can enable efficient solving of harder variants' but does not specify how the difficulty ordering or module sharing is identified or validated for a new clinical domain.
Training details, network architectures for the compositional modules, and the precise definition of the Q-function decomposition are not summarized, which hinders reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to strengthen the claims.


Referee: [Abstract] The headline robustness claim requires that the electrolyte-repletion MDP possesses a known compositional structure in which the no-renal-disease variant is a simpler task whose solution transfers to the renal-disease variant via shared modules. The manuscript supplies no derivation or empirical check that the two patient groups actually stand in this difficulty-ordered, transferable relationship rather than being two independent MDPs; without such evidence the reported robustness cannot be attributed to compositionality and CFQI collapses to ordinary multi-task FQI.

Authors: The compositional premise is motivated by established clinical knowledge: patients without renal disease exhibit simpler electrolyte dynamics that can be managed with standard protocols, while renal disease introduces complications (e.g., altered clearance and higher risk of imbalances) that make the task variant harder; solutions for the simpler variant are expected to transfer via shared modules for common physiological responses. We agree that the original submission lacks an explicit derivation or empirical verification of this ordering and transfer. In revision we will add a dedicated subsection justifying the compositional structure with supporting medical references and, where data permits, a small empirical check (e.g., policy transfer experiment) to substantiate the claim. revision: yes

Referee: [Validation and medical application sections] No ablation results, error bars, or quantitative comparison to non-compositional baselines (e.g., standard FQI or multi-task FQI) are described under controlled imbalance ratios, leaving the central claim that compositionality drives the robustness unverified.

Authors: We acknowledge the absence of these controls in the submitted version. The Cartpole and clinical results demonstrate overall robustness, but do not isolate the contribution of compositionality versus multi-task learning. In the revision we will add controlled experiments that vary imbalance ratios, report mean performance with error bars across multiple runs, and include direct quantitative comparisons against standard FQI and multi-task FQI baselines to verify that the compositional modules are responsible for the observed robustness. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external assumption of compositional structure rather than self-referential reduction


The paper introduces CFQI by defining a compositional Q-value function with separate modules for task variants under the premise that the electrolyte-repletion MDP has a known compositional task structure (simpler no-renal-disease variant enabling solution of harder renal-disease variant). No equations, fitted parameters, or self-citations are presented that reduce the claimed robustness to class imbalance to a construction equivalent to the inputs. The method is a structural modification to standard FQI, and the final claim is explicitly conditioned on scenarios with known compositional structures, making the derivation self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that patient heterogeneity can be captured by a compositional task decomposition with shared structure across variants; no free parameters or invented physical entities are mentioned in the abstract.

axioms (1)

Domain assumption: Treatment responses differ systematically across identifiable patient sub-populations that can be ordered by task difficulty.
Invoked to justify the compositional structure for electrolyte repletion in patients with versus without renal disease.

invented entities (1)

Compositional Q-value function with separate modules per task variant (no independent evidence)
purpose: To represent and learn distinct policies while sharing knowledge across patient sub-populations
New modeling construct introduced to address heterogeneity; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5777 in / 1265 out tokens · 33141 ms · 2026-05-24T12:47:04.884894+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We define the approximating function f as $f(s,a,z) = g_s(s,a) + \mathbb{1}\{z=1\}\, g_f(s,a)$"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "a compositional task consists of several variations of the same task, each progressing in difficulty"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1]

ISBN 1581138385.DOI: 10.1145/1015330.1015430

doi: 10.1145/1015330.1015430. URL http://portal.acm.org/citation.cfm?doid=1015330.1015430. Greg M. Allenby, Peter E. Rossi, and Robert E. McCulloch. Hierarchical Bayes Models: A Practitioners Guide. Social Science Research Network, Jan

work page doi:10.1145/1015330.1015430
[2]

com/abstract=655541

URL https://papers.ssrn.com/abstract=655541. Jordan T Ash and Ryan P Adams. On warm-starting neural network training. arXiv preprint arXiv:1910.08475,

work page arXiv 1910
[3]

doi: 10.1073/pnas.38.8.716

ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.38.8.716. Stevo Bozinovski and A Fulgosi. The influence of pattern similarity and transfer of learning upon training of a base perceptron b2. Proc. Symp. Informatica 3-121-5,

work page doi:10.1073/pnas.38.8.716
[4]

(original in Croatian: Utjecaj slicnosti likova i transfera ucenja na obucavanje baznog perceptrona B2), Proc. Symp. Informatica 3-121-5, Bled. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Damien Ernst, Pierre Geurts, and Louis Wehenkel

doi: 10.1023/A:1007379606734. Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. J. Mach. Learn. Res., 6:503–556, 12 2005a. ISSN 1532-4435. Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005b. Ary L Goldberger, Luis...

work page doi:10.1023/a:1007379606734
[6]

doi: 10.1109/TVT.2020.3034800. Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. Mimic-iv (version 0.4),

work page doi:10.1109/tvt.2020 2020
[7]

doi: 10.1038/sdata.2016.35. Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.,

work page doi:10.1038/sdata 2016
[8]

ISBN 9781450384506

Association for Computing Machinery. ISBN 9781450384506. doi: 10.1145/3459930.3469536. URL https://doi.org/10.1145/3459930.3469536. MayoClinic. Low potassium (hypokalemia): Symptom - overview covers what can cause this blood test result.,

work page doi:10.1145/3459930.3469536
[9]

Accessed: 2021-05-27

URL https://www.mayoclinic.org/symptoms/low-potassium/basics/definition/sym-20050632. Accessed: 2021-05-27. MayoClinic. Creatinine tests,

work page 2021
[10]

Accessed: 2021-05-27

URL https://www.mayoclinic.org/tests-procedures/creatinine-test/about/pac-20384646. Accessed: 2021-05-27. Robert A. McLean, William L. Sanders, and Walter W. Stroup. A unified approach to mixed linear models. The American Statistician, 45(1):54, Feb

work page 2021
[11]

doi: 10.2307/2685241

ISSN 00031305. doi: 10.2307/2685241. Eric Mitchell, Rafael Rafailov, Xue Bin Peng, Sergey Levine, and Chelsea Finn. Offline meta-reinforcement learning with advantage weighting. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 77...

work page doi:10.2307/2685241
[12]

URL https://proceedings.mlr.press/v139/mitchell21a.html. G.B. Moody and R.G. Mark. A database to support development and evaluation of intelligent intensive care monitoring. Computers in Cardiology 1996, pages 657–660,

work page 1996
[13]

doi: 10.1109/cic.1996.542622

ISSN 0276-6547. doi: 10.1109/cic.1996.542622. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703,

work page doi:10.1109/cic.1996.542622 1996
[14]

Shagun Sodhani, Amy Zhang, and Joelle Pineau

doi: 10.1109/cic.2002.1166854. Shagun Sodhani, Amy Zhang, and Joelle Pineau. Multi-task reinforcement learning with context-based representations,

work page doi:10.1109/cic.2002.1166854 2002
[15]

doi: 10.1186/s40560-016-0154-3

ISSN 2052-0492. doi: 10.1186/s40560-016-0154-3. URL https://doi.org/10.1186/s40560-016-0154-3. Marco Wiering and Martijn Van Otterlo. Reinforcement learning. Adaptation, learning, and optimization, 12(3),

work page doi:10.1186/s40560-016-0154-3 2052
[16]

push left

and use an SGD-based optimizer (Saad, 1998) (footnote 1: https://github.com/seungjaeryanlee/implementations-nfq). For all experiments, we use 80% of our data to train and 20% of our data to test. We use a default learning rate of $10^{-3}$. We use the same hyperparameters for nested- and group-label agnostic methods. 6.4 Cartpole Environment. The Cartpole environment co...

work page 1998
[17]

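We assume a finite action space throughout this study

We represent each sample by a vector containing its state and action, $[s_t^\top, a_t^\top]^\top$, where $s_t$ is the state vector and $a_t$ is the action vector. We assume a finite action space throughout this study. Then, we have the following model: $g_s(s_t, a_t) = \begin{bmatrix} s_t \\ a_t \end{bmatrix}^\top \beta_s + \mathbf{1}^\top \beta_{0s} + \epsilon$ and $g_f(s_t, a_t) = \begin{bmatrix} s_t \\ a_t \end{bmatrix}^\top \beta_f + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^\top \beta_s + \mathbf{1}^\top \beta_{0f} + \epsilon$, where $\mathbf{1}$ represents a column v...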

work page 1991
[18]

We first use the background training samples to train the shared layers in our network

We also do not consider the group label when training a transfer learning algorithm. We first use the background training samples to train the shared layers in our network. Then, we freeze the shared layers and use the foreground training samples to train the foreground-specific layers. When performing inference, this network does not consider group label; ...

work page 2008
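The two-stage transfer-learning baseline described in entry [18] (train the shared part on background samples, freeze it, then train the foreground-specific part on foreground samples) can be illustrated with a one-dimensional linear stand-in. This is only a sketch under stated assumptions: closed-form least squares replaces network layers, and the names `ols_1d` and `two_stage_fit` and the toy data are hypothetical, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def ols_1d(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def two_stage_fit(bg_x, bg_y, fg_x, fg_y):
    """Stage 1: fit the shared part on background data.
    Stage 2: keep it frozen and fit a foreground-specific part on the residuals."""
    shared_slope, shared_icpt = ols_1d(bg_x, bg_y)
    residuals: List[float] = [
        y - (shared_slope * x + shared_icpt) for x, y in zip(fg_x, fg_y)
    ]
    fg_slope, fg_icpt = ols_1d(fg_x, residuals)
    return (shared_slope, shared_icpt), (fg_slope, fg_icpt)


# Toy data: background follows 2x + 1; foreground adds an extra -3x + 0.5.
bg_x = [0.0, 1.0, 2.0, 3.0, 4.0]
bg_y = [2.0 * x + 1.0 for x in bg_x]
fg_x = [0.0, 1.0, 2.0, 3.0]
fg_y = [(2.0 * x + 1.0) + (-3.0 * x + 0.5) for x in fg_x]
shared, fg_specific = two_stage_fit(bg_x, bg_y, fg_x, fg_y)
```

Because the shared part never sees foreground data in this scheme, its quality depends entirely on the background sample, which is one reason to compare such a baseline against a jointly trained model under varying imbalance.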
