Towards Empathic Deep Q-Learning

Bart Bussmann; Jacqueline Heinerman; Joel Lehman

arxiv: 1906.10918 · v1 · pith:C6LBBW4Mnew · submitted 2019-06-26 · 💻 cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-25 15:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords: empathic deep q-learning, deep q-networks, reinforcement learning, ai safety, machine ethics, gridworld environments, collateral harms

The pith

Empathic DQN decreases collateral harms to other agents by combining its value estimate with an empathy term from swapped-position states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Empathic DQN as an extension to standard Deep Q-Networks aimed at reducing unintended negative effects on other agents in shared environments. The method adds an empathy component by having the learning agent evaluate constructed states in which it and the other agents have exchanged positions, then combines that estimate with its own value. The approach assumes some reward signals like harm penalties apply across agents and draws from ethical ideas such as the golden rule. A reader would care because scalable reinforcement learning needs mechanisms to limit myopic damage without requiring fully specified rewards for every interaction. Early gridworld tests show the extension can lower collateral harms while keeping the core learning process intact.

Core claim

Empathic DQN combines the typical self-centered value with the estimated value of other agents by imagining the value of it being in the other's situation through constructed states where both agents are swapped, with the goal of mitigating negative side effects from myopic goal-directed behavior where some rewards generalize across agents.
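One way to write this combination down, as a sketch rather than the paper's own equation: let $s$ be the current state, $\sigma(s)$ the constructed state in which the two agents are swapped, and $Q$ the learning agent's own action-value network. The weighting $\beta \in [0, 1]$ and the symbols $\sigma$, $V_{\text{self}}$, $V_{\text{emp}}$ and $V_{\text{combined}}$ are assumptions introduced here for illustration.

```latex
V_{\text{self}}(s) = \max_{a} Q(s, a), \qquad
V_{\text{emp}}(s) = \max_{a} Q(\sigma(s), a), \qquad
V_{\text{combined}}(s) = (1 - \beta)\, V_{\text{self}}(s) + \beta\, V_{\text{emp}}(s)
```

Under this form, $\beta = 0$ recovers the self-centered value of a standard DQN, and larger $\beta$ gives more weight to the imagined value of the other agent's situation.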

What carries the argument

Empathy term obtained by evaluating constructed states in which the learning agent and other agents have swapped positions, added to the agent's own value estimate.
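The swap-and-evaluate step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the state is a hypothetical pair of grid positions, `toy_q` stands in for a trained Q-network, and the weighting `beta` is an assumed parameter.

```python
from typing import Callable, List, Tuple

# Hypothetical state encoding: (learner position, other agent position).
State = Tuple[Tuple[int, int], Tuple[int, int]]


def swap_agents(state: State) -> State:
    """Construct the imagined state in which the two agents trade places."""
    self_pos, other_pos = state
    return (other_pos, self_pos)


def combined_value(
    q_values: Callable[[State], List[float]], state: State, beta: float = 0.5
) -> float:
    """Blend the agent's own value with its value estimate for the swapped state.

    `beta` is an assumed weighting: 0 ignores the other agent, 1 ignores oneself.
    """
    v_self = max(q_values(state))
    v_emp = max(q_values(swap_agents(state)))
    return (1.0 - beta) * v_self + beta * v_emp


def toy_q(state: State) -> List[float]:
    """Toy stand-in for a trained Q-network: standing on the hazard cell is bad."""
    self_pos, _ = state
    hazard = (0, 0)
    base = -1.0 if self_pos == hazard else 1.0
    return [base, base - 0.5]  # one value per (hypothetical) action


# The learner is safe at (2, 2); the other agent stands on the hazard at (0, 0).
state = ((2, 2), (0, 0))
blended = combined_value(toy_q, state, beta=0.5)
```

In this toy setup the learner's own value is positive, but the swapped state places it on the hazard, so the blended value drops. That is the intended effect: harm to the other agent lowers the value the learner assigns to the situation.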

If this is right

Collateral harms to other agents decrease in the two tested gridworld environments.
The method supplies a prior for agents that abide by norms without explicit per-interaction reward terms.
Extending Empathic DQN to complex environments remains non-trivial but follows the same combination of self-value and swapped-state empathy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The swapped-state construction could be tested for robustness when the number of coexisting agents grows beyond the simple cases shown.
Hybrid versions might combine this empathy term with other safety techniques that operate on different assumptions about reward sharing.
Environments where dynamics break under position swaps would expose whether the method requires additional state-construction safeguards.
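The first and third extensions above can be made concrete with a small sketch. It assumes a layout given as a list of grid positions, one per agent, and averages the empathy estimate over pairwise swaps while skipping swaps that a hypothetical validity check rejects. The names `swap_with`, `mean_empathy`, `toy_value` and `toy_is_valid` are illustrative and do not come from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pos = Tuple[int, int]
Layout = List[Pos]  # Layout[i] is the grid cell occupied by agent i


def swap_with(layout: Sequence[Pos], learner: int, other: int) -> Layout:
    """Return a copy of the layout with the learner and one other agent exchanged."""
    swapped = list(layout)
    swapped[learner], swapped[other] = swapped[other], swapped[learner]
    return swapped


def mean_empathy(
    layout: Sequence[Pos],
    learner: int,
    value: Callable[[Sequence[Pos], int], float],
    is_valid: Callable[[Sequence[Pos], int], bool],
) -> Optional[float]:
    """Average the learner's value over every valid pairwise swap.

    Returns None when no swap yields a state the learner could occupy.
    """
    estimates = []
    for other in range(len(layout)):
        if other == learner:
            continue
        swapped = swap_with(layout, learner, other)
        if is_valid(swapped, learner):  # safeguard: skip states that break the dynamics
            estimates.append(value(swapped, learner))
    if not estimates:
        return None
    return sum(estimates) / len(estimates)


def toy_value(layout: Sequence[Pos], learner: int) -> float:
    """Toy value function (assumption): cell (1, 1) is harmful to whoever stands on it."""
    return -1.0 if layout[learner] == (1, 1) else 0.0


def toy_is_valid(layout: Sequence[Pos], learner: int) -> bool:
    """Toy constraint (assumption): the learner can never occupy cell (2, 2)."""
    return layout[learner] != (2, 2)


layout = [(0, 0), (1, 1), (2, 2)]
result = mean_empathy(layout, 0, toy_value, toy_is_valid)
```

Here the swap with agent 2 is discarded because it would put the learner in a cell it cannot occupy, so only the swap with agent 1 contributes. Whether skipping, penalizing, or approximating such states is the right safeguard is left open.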

Load-bearing premise

Reward signals such as negative rewards from physical harm generalize across agents, and the learning agent can accurately construct and evaluate swapped states that preserve the relevant dynamics.

What would settle it

Run the method in an environment where other agents receive distinct rewards that do not match the learner's harm penalties, and check whether collateral harms fail to decrease or increase.

Figures

Figures reproduced from arXiv: 1906.10918 by Bart Bussmann, Jacqueline Heinerman, Joel Lehman.

**Figure 2.** Average steps survived by the robot in the coexistence en…

**Figure 4.** The sharing environment. The environment consists of the…

**Figure 5.** Average number of batteries collected (per episode) in the…

**Figure 6.** Equality scores (per episode) in the sharing environment,…

Original abstract

As reinforcement learning (RL) scales to solve increasingly complex tasks, interest continues to grow in the fields of AI safety and machine ethics. As a contribution to these fields, this paper introduces an extension to Deep Q-Networks (DQNs), called Empathic DQN, that is loosely inspired both by empathy and the golden rule ("Do unto others as you would have them do unto you"). Empathic DQN aims to help mitigate negative side effects to other agents resulting from myopic goal-directed behavior. We assume a setting where a learning agent coexists with other independent agents (who receive unknown rewards), where some types of reward (e.g. negative rewards from physical harm) may generalize across agents. Empathic DQN combines the typical (self-centered) value with the estimated value of other agents, by imagining (by its own standards) the value of it being in the other's situation (by considering constructed states where both agents are swapped). Proof-of-concept results in two gridworld environments highlight the approach's potential to decrease collateral harms. While extending Empathic DQN to complex environments is non-trivial, we believe that this first step highlights the potential of bridge-work between machine ethics and RL to contribute useful priors for norm-abiding RL agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Empathic DQN adds a position-swap term to standard DQN to reduce harms in symmetric gridworlds, but the results stay qualitative and rest on untested assumptions about reward sharing and dynamics.

The main takeaway is a straightforward extension to DQN that adds an empathy term by constructing swapped states and combining the agent's own Q-values with the values it estimates for the other agents. This produces lower collateral damage in the two gridworld examples shown. The swap construction itself is the concrete new piece; earlier empathic RL papers did not combine it with DQN in this explicit way.

The paper is clear about its assumptions, including that certain rewards, such as penalties for physical harm, generalize across agents and that the swapped states preserve the relevant dynamics. That explicitness is useful. The implementation looks simple enough that someone could reproduce the gridworld runs without much trouble.

The soft spot is the evaluation. The abstract and results section give only proof-of-concept demonstrations with no quantitative metrics, error bars, ablations, or comparisons against baselines that isolate the empathy term. The environments appear deliberately symmetric, so it is hard to tell whether the harm reduction comes from the added term or from the fact that the swap works cleanly in those specific setups. The stress-test note is right on this: without cases where the symmetry breaks, the central claim stays untested. The paper does not claim more than a first step, which is fair, but that also means the evidence is thin.

This is for readers already working on multi-agent RL or AI safety who want a minimal mechanism to inject other-agent consideration into value-based methods. Someone looking for a ready-to-use prior or a bridge to machine ethics could extract the basic idea and try it themselves. It is coherent on its own terms and shows honest engagement with the limitations, so it deserves a serious referee even if the experiments need substantial strengthening.

Referee Report

2 major / 1 minor

Summary. The paper introduces Empathic DQN as an extension to standard Deep Q-Networks. It augments the agent's self-value estimate with an empathy term obtained by evaluating the same Q-network on constructed states in which the learning agent and other agents have swapped positions. The goal is to reduce collateral harms to other agents whose rewards are unknown, under the assumption that certain reward signals (e.g., physical harm) generalize across agents. Proof-of-concept demonstrations are provided in two gridworld environments.

Significance. If the core mechanism proves robust, the work supplies a concrete, self-contained way to inject a golden-rule-style prior into RL without requiring external reward models or additional parameters. The explicit construction of the empathy term from the agent's own network and the same reward function is a clear design choice that avoids hidden circularity. The bridge between machine ethics and RL is a positive contribution even at the proof-of-concept stage.

major comments (2)

[Abstract] Abstract (paragraph beginning 'We assume a setting...'): The central claim that Empathic DQN decreases collateral harms rests on the untested assumptions that (a) negative rewards from physical harm generalize across agents and (b) swapped-state construction preserves transition dynamics, action effects, and observability. No counter-examples, sensitivity analysis, or asymmetric environments are supplied, so the gridworld results cannot distinguish the contribution of the empathy term from environmental symmetry.
[Abstract] Abstract: The statement that the approach 'highlights the potential to decrease collateral harms' is supported solely by qualitative demonstrations. No quantitative metrics, error bars, ablation studies (e.g., Empathic DQN vs. standard DQN), or statistical comparisons are reported, leaving the load-bearing empirical claim without measurable evidence.

minor comments (1)

[Abstract] The abstract introduces the empathy term via natural-language description but does not supply an explicit equation or pseudocode for how the swapped-state value is combined with the self-value; adding this would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the core idea. We address each major comment below.

Referee: [Abstract] Abstract (paragraph beginning 'We assume a setting...'): The central claim that Empathic DQN decreases collateral harms rests on the untested assumptions that (a) negative rewards from physical harm generalize across agents and (b) swapped-state construction preserves transition dynamics, action effects, and observability. No counter-examples, sensitivity analysis, or asymmetric environments are supplied, so the gridworld results cannot distinguish the contribution of the empathy term from environmental symmetry.

Authors: The manuscript is framed as a proof-of-concept and states the assumptions explicitly in the abstract. The symmetric gridworlds were chosen deliberately to isolate the empathy mechanism. We agree the current experiments do not fully separate the empathy term from symmetry effects and will revise the abstract to qualify the claims more carefully while adding a limitations discussion on the assumptions and the need for asymmetric test cases.
revision: partial

Referee: [Abstract] Abstract: The statement that the approach 'highlights the potential to decrease collateral harms' is supported solely by qualitative demonstrations. No quantitative metrics, error bars, ablation studies (e.g., Empathic DQN vs. standard DQN), or statistical comparisons are reported, leaving the load-bearing empirical claim without measurable evidence.

Authors: The demonstrations are qualitative because the work is positioned as an initial proof-of-concept. We accept that quantitative support would strengthen the empirical claim and will add ablation comparisons (Empathic DQN vs. standard DQN), quantitative collateral-harm metrics, and error bars across runs in the revised manuscript.
revision: yes
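As an illustration of what "error bars across runs" could look like in practice, the sketch below computes a mean and standard error of the mean for a per-run harm metric. The helper name `mean_and_sem` is an assumption, and the two lists are placeholder numbers made up for this example; they are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_sem(samples: Sequence[float]) -> Tuple[float, float]:
    """Return the sample mean and the standard error of the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.fmean(samples)
    # statistics.stdev uses the n - 1 (sample) denominator.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


# PLACEHOLDER per-run harm counts, invented for illustration only.
baseline_runs = [4.0, 6.0, 5.0, 5.0]   # hypothetical standard DQN
empathic_runs = [1.0, 3.0, 2.0, 2.0]   # hypothetical Empathic DQN

for label, runs in (("standard DQN", baseline_runs), ("Empathic DQN", empathic_runs)):
    mean, sem = mean_and_sem(runs)
    print(f"{label}: {mean:.2f} +/- {sem:.2f} (n={len(runs)})")
```

Reporting both arms with the same metric and the same number of seeds is what would let a reader attribute a difference to the empathy term and not to run-to-run noise.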

Circularity Check

0 steps flagged

No circularity; Empathic DQN is an explicit design choice with stated assumptions

The paper proposes Empathic DQN as a method that augments standard DQN value estimates with an empathy term computed by evaluating the agent's own Q-network on explicitly constructed swapped-position states. This construction is presented as a deliberate architectural extension inspired by the golden rule, not as a derivation or prediction that reduces to its own inputs. The abstract explicitly states the assumptions (reward generalization across agents and accurate swapped-state construction) rather than deriving them. No equations, fitted parameters, or self-citations are shown to create load-bearing circularity; the proof-of-concept results in gridworlds follow directly from the defined procedure without renaming or smuggling prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on two domain assumptions and one invented modeling choice with no independent evidence supplied.

axioms (2)

domain assumption: Some reward signals (e.g., negative rewards from physical harm) generalize across agents.
Stated explicitly in the abstract as the setting assumption required for the empathy term to be meaningful.
domain assumption: The learning agent can construct swapped states that preserve the relevant transition dynamics for other agents.
Implicit in the description of how the empathy term is computed via state swapping.

invented entities (1)

Empathy term computed via agent swapping (no independent evidence)
purpose: To estimate the value another agent would receive according to the learning agent's own reward function.
New modeling device introduced in the paper; no external falsifiable handle is provided beyond the gridworld demonstrations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "Empathic DQN combines the typical (self-centered) value with the estimated value of other agents, by imagining (by its own standards) the value of it being in the other's situation (by considering constructed states where both agents are swapped)."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

Passage: "We assume a setting where a learning agent coexists with other independent agents (who receive unknown rewards), where some types of reward (e.g. negative rewards from physical harm) may generalize across agents."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 6 internal anchors

[1] Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

[2] Low Impact Artificial Intelligences
Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720.

[3] Embedded Agency
Abram Demski and Scott Garrabrant. Embedded agency. arXiv preprint arXiv:1902.09469.

[4] AGI Safety Literature Review
Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. arXiv preprint arXiv:1805.01109.

[5] Penalizing side effects using stepwise relative reachability
Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. arXiv preprint arXiv:1806.01186.

[6] The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities
Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453.

[7] Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.

[8] Modeling Others using Oneself in Multi-Agent Reinforcement Learning
Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640.

[9] Trial without error: Towards safe reinforcement learning via human intervention
William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067–2069. International Foundation for Autonomous Agents and Multiagent Systems.

[10] Third-person imitation learning
Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv:1703.01703.

[11] Conservative agency via attainable utility preservation
Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency via attainable utility preservation. arXiv preprint arXiv:1902.09725.

[12] Towards an ethical robot: internal models, consequences and ethical action selection
Alan FT Winfield, Christian Blum, and Wenguo Liu. Towards an ethical robot: internal models, consequences and ethical action selection. In Conference towards autonomous robotic systems, pages 85–96. Springer, 2014.
