Inverse reinforcement learning conditioned on brain scan

Tofara Moyo

arxiv: 1906.09770 · v1 · pith:Y6GKYH6Znew · submitted 2019-06-24 · 💻 cs.OH

Pith reviewed 2026-05-25 17:05 UTC · model grok-4.3

classification 💻 cs.OH

keywords: inverse reinforcement learning, fMRI, brain state, humanoid robot, policy network, generative model, dispositions, state space

The pith

An agent learns a particular person's dispositions by running inverse reinforcement learning on a state space that includes their fMRI brain scans at each time step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a method for training an agent to capture one individual's specific thoughts and internal processes by folding fMRI images into the state representation used for inverse reinforcement learning. A human expert wears a sensor suit for a fixed period so that a policy network can be trained on the resulting data, while a separate generative model learns to produce the next fMRI image from the current image and the environment state. During use the humanoid robot selects actions conditioned on the continuously updated fMRI representation together with its external observations. The approach therefore treats brain activity as a direct window onto long-term and short-term memory as well as other unobserved dynamics inside the person's mind.

Core claim

By augmenting the state space of inverse reinforcement learning with an fMRI scan that represents the individual's brain state at time t, an agent can recover that person's dispositions, because the scan information is assumed to be conditioned on their thoughts and thought processes; a generative model then produces the next scan image from the current one and the environment, allowing the robot's policy to remain conditioned on the evolving brain state.

What carries the argument

Inverse reinforcement learning whose state at each time step contains an fMRI image of the target individual's brain, paired with a generative model that predicts the subsequent fMRI image.
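A minimal sketch of what such an augmented state could look like in code, assuming the scan is a dense array that is flattened and concatenated with the environment observation. The function name, the array shapes, and the concatenation scheme are illustrative assumptions; the paper does not specify a representation.

```python
import numpy as np

def augmented_state(env_obs: np.ndarray, fmri_scan: np.ndarray) -> np.ndarray:
    """Concatenate the environment observation with the flattened fMRI volume."""
    return np.concatenate([env_obs.ravel(), fmri_scan.ravel()])

# Hypothetical shapes: a 6-dimensional observation and a tiny 4x4x4 scan volume.
e_t = np.zeros(6)
f_t = np.ones((4, 4, 4))
s_t = augmented_state(e_t, f_t)
```

Real fMRI volumes hold far more voxels than this toy array, so a practical version would likely encode the scan to a lower-dimensional vector before concatenation.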

If this is right

The policy network is trained directly on sensor data collected while a human expert wears the suit.
A generative model is trained to output the next fMRI scan conditioned on the present scan and the environment state.
Robot actions during operation are produced by conditioning on the evolving sequence of fMRI images together with external observations.
Both long-term and short-term memory plus any other internal brain dynamics are captured inside the learned policy.
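The operation loop described in these points can be sketched as follows. This is a toy under stated assumptions, not the paper's implementation: the policy and generative model are random linear maps standing in for trained networks, the environment dynamics are invented for illustration, and every name and dimension is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
ENV_DIM, SCAN_DIM, N_ACTIONS = 3, 8, 4  # hypothetical sizes

# Placeholder linear maps standing in for the trained networks (assumption).
W_policy = rng.normal(size=(N_ACTIONS, ENV_DIM + SCAN_DIM))
W_gen = rng.normal(size=(SCAN_DIM, ENV_DIM + SCAN_DIM)) * 0.1

def policy(env_obs, scan):
    """Pick the highest-scoring action given environment and (encoded) scan."""
    return int(np.argmax(W_policy @ np.concatenate([env_obs, scan])))

def next_scan(env_obs, scan):
    """Predict the next (encoded) scan from the present scan and environment."""
    return np.tanh(W_gen @ np.concatenate([env_obs, scan]))

def env_step(env_obs, action):
    """Toy environment dynamics, invented purely for illustration."""
    return 0.9 * env_obs + 0.1 * (action / (N_ACTIONS - 1))

def rollout(env_obs, scan, steps):
    """Alternate between acting and advancing the simulated brain state."""
    actions = []
    for _ in range(steps):
        action = policy(env_obs, scan)
        actions.append(action)
        scan = next_scan(env_obs, scan)       # brain state advanced by the generative model
        env_obs = env_step(env_obs, action)   # environment advanced by the action
    return actions, scan

actions, final_scan = rollout(np.zeros(ENV_DIM), np.zeros(SCAN_DIM), steps=5)
```

The point of the sketch is the data flow: the action at each step depends on both channels, and the scan channel is updated by a learned model rather than by the environment alone.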

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning approach could be tested with other real-time brain-imaging modalities if they supply comparable state information.
Deployment would require explicit handling of consent and data privacy for the brain scans used in training and runtime.
One could measure success by checking whether the robot's behavior matches the individual's observed choices more closely than a standard IRL baseline that lacks the fMRI channel.
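The comparison suggested in the last point could be scored with a simple action-agreement rate, sketched below. The metric, the function name, and the action sequences are assumptions made up for illustration; they are not data or results from the paper.

```python
import numpy as np

def agreement_rate(agent_actions, person_actions):
    """Fraction of time steps on which the agent's action equals the person's."""
    agent = np.asarray(agent_actions)
    person = np.asarray(person_actions)
    if agent.shape != person.shape:
        raise ValueError("action sequences must have the same length")
    return float(np.mean(agent == person))

# Invented example sequences (not measurements).
person = [0, 1, 1, 2, 0, 3]
with_fmri = [0, 1, 1, 2, 1, 3]   # hypothetical fMRI-conditioned agent
baseline = [0, 2, 1, 0, 1, 3]    # hypothetical IRL baseline without the fMRI channel

gain = agreement_rate(with_fmri, person) - agreement_rate(baseline, person)
```

A positive `gain` on held-out episodes would be evidence that the fMRI channel adds information; a gain near zero would suggest the scan contributes little beyond the external observations.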

Load-bearing premise

The information visible in an fMRI scan is conditioned on the individual's thoughts and thought processes.

What would settle it

Run the trained robot in a controlled setting where the person's actual choices or self-reported preferences are recorded; if the robot's actions systematically diverge from those choices even when the fMRI input is supplied, the claim is falsified.

Figures

Figures reproduced from arXiv: 1906.09770 by Tofara Moyo.

Original abstract

We outline a way for an agent to learn the dispositions of a particular individual through inverse reinforcement learning where the state space at time t includes an fMRI scan of the individual, to represent his brain state at that time. The fundamental assumption being that the information shown on an fMRI scan of an individual is conditioned on his thoughts and thought processes. The system models both long and short term memory as well any internal dynamics we may not be aware of that are in the human brain. The human expert will put on a suit for a set duration with sensors whose information will be used to train a policy network, while a generative model will be trained to produce the next fMRI scan image conditioned on the present one and the state of the environment. During operation the humanoid robots actions will be conditioned on this evolving fMRI and the environment it is in.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a bare conceptual sketch of IRL with fMRI conditioning that offers no technical substance or validation.

The paper is basically a short proposal for using fMRI scans as part of the state in inverse reinforcement learning to personalize an agent's behavior to a specific person. That's the one thing it does: it names the idea and lists some components like a policy network and a generative model for fMRI images. It does a decent job of laying out the high-level flow. The human wears a suit with sensors, the data trains the policy, a generative model predicts the next brain scan from the current one and the environment, and then the robot uses that to act. The assumption about fMRI reflecting thoughts is stated plainly.

Beyond that, there's not much. No derivations, no algorithms specified, no mention of how to handle the high-dimensional fMRI data in practice, and no references to prior work on IRL or brain-computer interfaces. The claim that this models long and short term memory and internal dynamics is asserted without any mechanism described. The central weakness is the lack of any evidence or even a plan for validation. If the fMRI doesn't reliably capture the relevant mental states, the whole thing falls apart, but the paper doesn't engage with that issue or with known challenges in interpreting fMRI.

This kind of outline might interest someone brainstorming new directions in human-robot interaction, but it doesn't give a technical reader anything concrete to work with or build on. I wouldn't bring it to a reading group or cite it. It shouldn't go to peer review in its current form because there's no result to referee.

Referee Report

3 major / 1 minor

Summary. The paper outlines a conceptual approach for an agent to learn a specific individual's dispositions via inverse reinforcement learning (IRL), by augmenting the state space at each time t with an fMRI scan representing the person's brain state. It rests on the assumption that fMRI data is conditioned on thoughts and thought processes, and proposes modeling long/short-term memory and other internal dynamics. Training uses a sensor suit on a human expert to train a policy network, paired with a generative model for next fMRI scans conditioned on the current scan and environment; at runtime, humanoid robot actions are conditioned on the evolving fMRI and environment.

Significance. If the outlined system could be realized with validated components, it would offer a novel route to highly individualized reward modeling in IRL, with potential impact on personalized robotics and cognitive agents. The manuscript, however, advances no derivations, algorithms, experiments, or benchmarks, so any significance assessment remains entirely prospective and dependent on untested assumptions about fMRI interpretability.

major comments (3)

[Abstract] The entire proposal is load-bearing on the untested assumption that 'the information shown on an fMRI scan of an individual is conditioned on his thoughts and thought processes,' yet the manuscript supplies no supporting references, proposed validation experiments, or discussion of known limitations of fMRI (e.g., indirect hemodynamic response, low temporal resolution).
[Full text] Description of system components: No mathematical formulation, state-space definition, or IRL objective is provided for the augmented state that includes fMRI; without these, it is impossible to determine whether standard IRL algorithms can be applied or what modifications would be required.
[Full text] Training and operation: The generative model for next fMRI scan and the policy network are described only at the level of component names, with no architecture, loss functions, training data requirements, or handling of high-dimensional image data, rendering the outline non-actionable.

minor comments (1)

[Abstract] Minor grammatical issues ('as well any internal dynamics we may not be aware of that are in the human brain' should read 'as well as any...').

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our conceptual outline. The manuscript is a high-level proposal rather than a fully implemented system, and we will revise to address the identified gaps in support, formalism, and detail while preserving its prospective nature.

Point-by-point responses

Referee: [Abstract] The entire proposal is load-bearing on the untested assumption that 'the information shown on an fMRI scan of an individual is conditioned on his thoughts and thought processes,' yet the manuscript supplies no supporting references, proposed validation experiments, or discussion of known limitations of fMRI (e.g., indirect hemodynamic response, low temporal resolution).

Authors: We agree the assumption requires explicit support. The revision will add citations to fMRI studies on BOLD signal correlations with cognitive states, a discussion of limitations including hemodynamic lag and a temporal resolution of roughly 1-2 s, and proposed validation via simultaneous fMRI and behavioral experiments in controlled tasks. Revision: yes.

Referee: [Full text] Description of system components: No mathematical formulation, state-space definition, or IRL objective is provided for the augmented state that includes fMRI; without these, it is impossible to determine whether standard IRL algorithms can be applied or what modifications would be required.

Authors: We will add a formal section defining the augmented state s_t = (e_t, f_t) with e_t the environment and f_t the fMRI scan at time t. The IRL objective will be stated as recovering a reward R(s,a) explaining demonstrated actions under the augmented state, noting that standard max-ent IRL applies directly with the expanded state space. Revision: yes.

Referee: [Full text] Training and operation: The generative model for next fMRI scan and the policy network are described only at the level of component names, with no architecture, loss functions, training data requirements, or handling of high-dimensional image data, rendering the outline non-actionable.

Authors: The revision will specify the generative model as an LSTM conditioned on current fMRI (via CNN encoder) and environment, trained with pixel-wise MSE plus adversarial loss. The policy will be a CNN-LSTM taking encoded fMRI and environment inputs. Training data requirements (synchronized fMRI, suit sensors, actions) and dimensionality reduction via autoencoders will be detailed. Revision: yes.
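
One way the formalization promised in the second response could be written is sketched below. It is not the authors' stated objective: the trajectory distribution is the standard maximum-entropy IRL form under deterministic dynamics, and the symbols \theta (reward parameters), Z(\theta) (partition function), \tau (a trajectory) and \mathcal{D} (the set of demonstrations) are assumptions introduced here.

```latex
s_t = (e_t, f_t), \qquad
P_\theta(\tau) = \frac{1}{Z(\theta)} \exp\!\Big( \sum_{t} R_\theta(s_t, a_t) \Big), \qquad
\theta^{\ast} = \arg\max_{\theta} \sum_{\tau \in \mathcal{D}} \log P_\theta(\tau)
```

Under this reading the fMRI scan changes only what counts as a state. The practical difficulty moves into Z(\theta), which sums over trajectories in a much larger state space.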

Circularity Check

0 steps flagged

No circularity: conceptual outline with no derivations or self-referential reductions

full rationale

The manuscript is a high-level conceptual proposal for augmenting IRL state spaces with fMRI data. It states a foundational assumption explicitly but advances no equations, parameter fits, predictions, uniqueness theorems, or derivations that could reduce to inputs by construction. No self-citations appear as load-bearing steps. The text sketches components (policy network, generative model) without claiming any result that is forced by its own definitions or prior author work. This is a standard non-finding for an outline paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about fMRI reflecting thoughts and introduces a generative model for evolving brain states without any independent evidence of its validity or performance.

axioms (1)

Domain assumption: The information shown on an fMRI scan of an individual is conditioned on his thoughts and thought processes.
Explicitly stated as the fundamental assumption in the abstract.

invented entities (1)

Generative model for next fMRI scan (no independent evidence)
purpose: To produce the next fMRI scan image conditioned on the present one and the state of the environment, modeling memory and internal dynamics.
Introduced in the abstract to handle evolving brain states during operation.
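
To make the role of this entity concrete, here is a hedged sketch of a next-scan predictor. It deliberately swaps in ordinary linear least squares for the generative network, uses synthetic vectors in place of real fMRI data, and scores predictions with mean squared error. All function names, dimensions, and the data-generating rule are assumptions for illustration.

```python
import numpy as np

def fit_next_scan_model(scans, env_states):
    """Least-squares fit of f_{t+1} ~ [f_t, e_t, 1] @ W.

    scans: array of shape (T, D); env_states: array of shape (T, K).
    """
    ones = np.ones((len(scans) - 1, 1))
    X = np.hstack([scans[:-1], env_states[:-1], ones])
    Y = scans[1:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict_next_scan(W, scan, env_state):
    """Predict the next scan vector from the present scan and environment."""
    x = np.concatenate([scan, env_state, [1.0]])
    return x @ W

def mse(pred, target):
    """Mean squared error between predicted and observed scan vectors."""
    return float(np.mean((pred - target) ** 2))

# Synthetic data from a known linear rule (not real fMRI).
rng = np.random.default_rng(0)
T, D, K = 200, 5, 2
A = rng.normal(scale=0.1, size=(D, D))
B = rng.normal(scale=0.3, size=(K, D))
env_states = rng.normal(size=(T, K))
scans = np.zeros((T, D))
for t in range(T - 1):
    scans[t + 1] = scans[t] @ A + env_states[t] @ B

W = fit_next_scan_model(scans, env_states)
error = mse(predict_next_scan(W, scans[10], env_states[10]), scans[11])
```

Because the synthetic data are exactly linear and noise-free, the fit error is near zero here. Real scans would need a nonlinear model and held-out evaluation, which is the independent evidence the ledger notes is missing.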

pith-pipeline@v0.9.0 · 5661 in / 1484 out tokens · 49329 ms · 2026-05-25T17:05:46.088643+00:00 · methodology

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages

[1] Russell, Andrew Y. Ng. Algorithms for inverse reinforcement learning [2000]

[2] Levine et al. Nonlinear Inverse Reinforcement Learning with Gaussian Processes [2011]

[3] Grubb and Bagnell, Bradley. Boosted backpropagation learning for training deep modular networks [2010]

[4] Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu. Pixel recurrent neural networks [2016]

[5] Bandettini PA, Wong EC, Hinks RS, Tikofsky RS, Hyde JS. Magn Reson Med. 1992 Jun; 25(2):390-7
