Contextual Control without Memory Growth in a Context-Switching Task

Song-Ju Kim

arXiv: 2604.03479 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.IT · cs.LG · math.IT

Pith reviewed 2026-05-13 19:34 UTC · model grok-4.3

classification 💻 cs.AI · cs.IT · cs.LG · math.IT

keywords: contextual control · recurrent architecture · intervention model · context-switching · conditional mutual information · partial observability · sequential decision making · memory efficiency

The pith

Contextual dependence can be realized by intervening on a shared recurrent latent state without memory growth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes handling context in sequential decisions by intervening on a shared recurrent state rather than expanding memory or feeding context directly to the recurrent core. The core first builds a pre-intervention state, and an additive operator then applies context-specific changes to it. On a context-switching task with partial observability, this model performs competitively with baselines that use more memory or direct context labels. It also shows positive conditional mutual information I(C;O|S) between context and outcome at fixed latent state for task-relevant phase-1 outcomes. This approach could support efficient contextual control in resource-limited settings.

Core claim

The authors introduce an intervention-based recurrent architecture where the recurrent core first constructs a shared pre-intervention latent state and context then acts through an additive, context-indexed operator. On the main benchmark for a context-switching sequential decision task under partial observability, the intervention model performs strongly without additional recurrent dimensions. Using conditional mutual information as a probe, it exhibits positive conditional contextual information for task-relevant phase-1 outcomes, indicating viable contextual control without memory growth.

What carries the argument

An additive, context-indexed operator applied to the shared pre-intervention latent state produced by the recurrent core.
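
As a rough illustration of this mechanism, the sketch below builds a pre-intervention state with a small tanh recurrent step and then shifts it by a per-context offset. Everything in it is an assumption made for illustration: the function names (`rnn_step`, `pre_intervention_state`, `intervene`), the tanh cell standing in for the paper's recurrent core, the random untrained weights, the context labels 0 and 1, and the choice to realize the additive operator as one offset vector per context applied after the final step. The abstract does not specify these details.

```python
import math
import random


def rnn_step(h, x, w_h, w_x):
    """One tanh recurrent step; a stand-in for the context-free recurrent core."""
    return [
        math.tanh(
            sum(w_h[i][j] * h[j] for j in range(len(h)))
            + sum(w_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(len(h))
    ]


def pre_intervention_state(observations, w_h, w_x, hidden_size):
    """Run the core over observations only; context is never an input here."""
    h = [0.0] * hidden_size
    for x in observations:
        h = rnn_step(h, x, w_h, w_x)
    return h


def intervene(s_pre, context, offsets):
    """Hypothetical additive operator: s_post = s_pre + offsets[context]."""
    return [s + d for s, d in zip(s_pre, offsets[context])]


# Illustrative, untrained parameters (all values are assumptions).
rng = random.Random(0)
HIDDEN, OBS = 4, 2
w_h = [[rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
w_x = [[rng.uniform(-0.5, 0.5) for _ in range(OBS)] for _ in range(HIDDEN)]
offsets = {0: [0.3, 0.0, -0.2, 0.0], 1: [-0.3, 0.1, 0.0, 0.2]}

observations = [[rng.uniform(-1, 1) for _ in range(OBS)] for _ in range(5)]
s_pre = pre_intervention_state(observations, w_h, w_x, HIDDEN)
s_post = {c: intervene(s_pre, c, offsets) for c in offsets}

# Both contexts share s_pre and keep its dimensionality; only the offset differs.
print(len(s_pre), len(s_post[0]), len(s_post[1]))
```

The point of the sketch is structural: the recurrent state has the same size in every context, and the only context-dependent parameters live in `offsets`, outside the recurrence.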

If this is right

The intervention model performs strongly on the benchmark without enlarging recurrent dimensions.
It shows positive conditional mutual information I(C;O | S) for task-relevant outcomes.
This provides an alternative to direct context input or memory enlargement for contextual dependence.
Contextual control is realized without direct context input to the recurrent core.
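
For discrete variables, the quantity in the second point is I(C;O|S) = Σ p(c,o,s) · log2[ p(s) p(c,o,s) / (p(c,s) p(o,s)) ], summed over all (c,o,s). The sketch below is a plain plug-in estimate of that sum from samples. It is an assumption-laden illustration, not the paper's estimator: the function name `conditional_mutual_information` is hypothetical, S is assumed to be discrete (a continuous latent state would first need binning or clustering), and plug-in estimates of this kind are biased upward on small samples.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(c, o, s):
    """Plug-in estimate of I(C;O|S) in bits from aligned discrete samples."""
    n = len(c)
    n_cos = Counter(zip(c, o, s))
    n_cs = Counter(zip(c, s))
    n_os = Counter(zip(o, s))
    n_s = Counter(s)
    total = 0.0
    for (ci, oi, si), k in n_cos.items():
        # p(c,o,s) * log2[ p(s) p(c,o,s) / (p(c,s) p(o,s)) ] written with counts.
        total += (k / n) * log2(k * n_s[si] / (n_cs[(ci, si)] * n_os[(oi, si)]))
    return total


# Toy data: context and state are balanced and independent of each other.
context = [0, 0, 1, 1]
state = [0, 1, 0, 1]

# Outcome copies the context: context is fully informative at fixed state.
print(conditional_mutual_information(context, context, state))  # 1.0
# Outcome copies the state: context adds nothing once the state is known.
print(conditional_mutual_information(context, state, state))    # 0.0
```

A value near zero says the outcome is explained by the state alone; a clearly positive value says context still shapes the outcome at fixed state, which is the reading the paper attaches to the probe.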

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might generalize to other sequential tasks where context changes infrequently but must influence behavior.
Reducing memory size could lower computational demands in long-horizon planning problems.
Testing the approach in environments with more complex or continuous contexts would clarify its limits.
The conditional information metric offers a way to diagnose whether models internally represent context without explicit access.

Load-bearing premise

An additive context-indexed operator on a shared latent state suffices to encode contextual dependence without context reaching the recurrent core directly.

What would settle it

Observing that the intervention model fails to match baseline performance on the context-switching task or shows zero conditional mutual information for relevant outcomes would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.03479 by Song-Ju Kim.

**Figure 2.** Overview of the benchmark conditions and model families. The upper row provides…

**Figure 3.** Main performance on AB25 and BA30. Bars show the fraction of seeds (out of 10)…

Original abstract

Context-dependent sequential decision making is commonly addressed either by providing context explicitly as an input or by increasing recurrent memory so that contextual information can be represented internally. We study a third alternative: realizing contextual dependence by intervening on a shared recurrent latent state, without enlarging recurrent dimensionality. To this end, we introduce an intervention-based recurrent architecture in which a recurrent core first constructs a shared pre-intervention latent state, and context then acts through an additive, context-indexed operator. We evaluate this idea on a context-switching sequential decision task under partial observability. We compare three model families: a label-assisted baseline with direct context access, a memory baseline with enlarged recurrent state, and the proposed intervention model, which uses no direct context input to the recurrent core and no memory growth. On the main benchmark, the intervention model performs strongly without additional recurrent dimensions. We also evaluate the models using the conditional mutual information I(C;O | S) as a theorem-motivated operational probe of contextual dependence at fixed latent state. For task-relevant phase-1 outcomes, the intervention model exhibits positive conditional contextual information. Together, these results suggest that intervention on a shared recurrent state provides a viable alternative to recurrent memory growth for contextual control in this setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Intervention on a shared recurrent state gives contextual control without memory growth or direct context input, but the abstract leaves the evidence thin.

The main takeaway is that the paper shows a recurrent core can build a fixed shared latent state first, then an additive context-indexed operator modulates it to handle context switching without feeding context into the core or enlarging the hidden size. On their partial-observability benchmark the intervention model performs competitively against both the direct-label baseline and the enlarged-memory baseline, and the positive I(C;O|S) for phase-1 outcomes gives an independent check that context is still shaping the output at fixed state. That combination is the actual novelty: a third route that sits between the two standard families. The architecture and the conditional-mutual-information probe are the parts worth paying attention to.

The soft spots are straightforward. The abstract reports no error bars, no statistical tests, and no ablation or variance numbers, so it is difficult to judge how reliable the performance edge really is or how sensitive the shared state is to the task details. The stress-test concern also lands: if the pre-intervention representation does not already contain the information the operator needs to act on, the construction will not deliver full contextual dependence. The paper would be stronger with explicit checks that the core state is rich enough before intervention.

This is aimed at people working on recurrent models for context-dependent or partially observable sequential tasks. A reader in that niche would get a concrete alternative architecture and a useful probe to think with. It deserves peer review so the experimental gaps can be filled and the assumption about the shared state can be tested directly.

Referee Report

3 major / 1 minor

Summary. The paper claims that contextual dependence in sequential decision making under partial observability can be achieved without direct context input to the recurrent core or growth in recurrent dimensionality. A context-free recurrent core first produces a shared pre-intervention latent state S; context then modulates behavior via an additive, context-indexed operator applied to S. On a context-switching benchmark, the resulting intervention model matches or exceeds a label-assisted baseline and a memory-enlarged baseline. The authors further report positive conditional mutual information I(C;O|S) for task-relevant phase-1 outcomes, interpreting this as evidence that contextual information is realized at fixed latent state.

Significance. If the central construction holds, the work demonstrates a memory-efficient route to contextual control that avoids both explicit context channels and recurrent-state scaling. The conditional-mutual-information probe supplies an independent, theorem-motivated operational check that strengthens the empirical case. The approach could be relevant for resource-constrained sequential decision systems where recurrent memory growth is costly.

major comments (3)

[Abstract / Results] Abstract and main-benchmark results: the claim that the intervention model 'performs strongly' is presented without error bars, statistical tests, or explicit exclusion criteria. This quantitative gap is load-bearing for the headline performance comparison and must be addressed to substantiate superiority over the memory baseline.
[Methods / Architecture] Architecture description (shared pre-intervention state S): the additive context-indexed operator can realize contextual dependence only if S already encodes all information that context can usefully modulate. Under partial observability the context-free recurrent core may not produce a sufficiently rich S; no ablation or diagnostic is reported that verifies this assumption.
[Evaluation / Information Probe] Conditional mutual information probe (§ on I(C;O|S)): positive I(C;O|S) for phase-1 outcomes is offered as evidence that the operator supplies contextual information at fixed S. This interpretation requires that S itself is context-agnostic; the manuscript does not demonstrate that context does not leak into the recurrent core during training.
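
One diagnostic that would speak to the third comment is a decoding probe: fit a simple classifier to recover the context label from the pre-intervention state and check whether held-out accuracy stays near chance. The sketch below uses a nearest-centroid classifier on synthetic data. It is only an illustration under stated assumptions: the function name `centroid_probe_accuracy`, the Gaussian "states", and the artificially injected leak are all hypothetical, and a nearest-centroid probe can only detect leakage that shifts the class means, so a stronger classifier might reveal more.

```python
import random


def centroid_probe_accuracy(states, contexts, train_fraction=0.5):
    """Fit one centroid per context on a training split; return held-out accuracy."""
    n_train = int(len(states) * train_fraction)
    sums, counts = {}, {}
    for s, c in zip(states[:n_train], contexts[:n_train]):
        acc = sums.setdefault(c, [0.0] * len(s))
        for i, v in enumerate(s):
            acc[i] += v
        counts[c] = counts.get(c, 0) + 1
    centroids = {c: [v / counts[c] for v in sums[c]] for c in sums}

    def predict(s):
        # Nearest centroid in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centroids[c])),
        )

    held_out = list(zip(states[n_train:], contexts[n_train:]))
    return sum(predict(s) == c for s, c in held_out) / len(held_out)


# Synthetic stand-ins for recorded pre-intervention states (assumed data).
rng = random.Random(0)
contexts = [rng.randrange(2) for _ in range(400)]
agnostic = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in contexts]
# Inject an artificial leak: shift the first coordinate by the context label.
leaky = [[s[0] + 6.0 * c] + s[1:] for s, c in zip(agnostic, contexts)]

agnostic_acc = centroid_probe_accuracy(agnostic, contexts)
leaky_acc = centroid_probe_accuracy(leaky, contexts)
print(f"agnostic: {agnostic_acc:.2f}, leaky: {leaky_acc:.2f}")
```

With two balanced contexts, accuracy near 0.5 on the real states would be consistent with a context-agnostic S, while accuracy well above chance would suggest that context has leaked into the recurrent core.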

minor comments (1)

[Notation] Notation for the pre-intervention state should be introduced consistently (e.g., distinguish S_pre from any post-intervention quantity) to avoid ambiguity when discussing the fixed-S probe.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below, and indicate where revisions will be made to strengthen the paper.

Referee: [Abstract / Results] Abstract and main-benchmark results: the claim that the intervention model 'performs strongly' is presented without error bars, statistical tests, or explicit exclusion criteria. This quantitative gap is load-bearing for the headline performance comparison and must be addressed to substantiate superiority over the memory baseline.

Authors: We agree that the absence of error bars, statistical tests, and clear exclusion criteria limits the strength of the performance claims. In the revised version, we will rerun the experiments with multiple random seeds, report means with standard error bars, include statistical significance tests (such as Welch's t-test) for comparisons between the intervention model and the memory baseline, and explicitly state the run exclusion criteria (e.g., divergence or failure to train). This will substantiate the comparisons.

revision: yes

Referee: [Methods / Architecture] Architecture description (shared pre-intervention state S): the additive context-indexed operator can realize contextual dependence only if S already encodes all information that context can usefully modulate. Under partial observability the context-free recurrent core may not produce a sufficiently rich S; no ablation or diagnostic is reported that verifies this assumption.

Authors: The referee raises a valid point regarding the richness of the pre-intervention state S. Our results show that the intervention model achieves performance comparable to or better than the memory-enlarged baseline, which indirectly supports that S is sufficiently informative for context to modulate effectively. However, to directly address the concern, we will add a diagnostic in the revision: we will compute the mutual information between S and relevant task variables (e.g., phase-1 outcomes) to verify the information content in S, and potentially include an ablation with a more expressive recurrent core.

revision: yes

Referee: [Evaluation / Information Probe] Conditional mutual information probe (§ on I(C;O|S)): positive I(C;O|S) for phase-1 outcomes is offered as evidence that the operator supplies contextual information at fixed S. This interpretation requires that S itself is context-agnostic; the manuscript does not demonstrate that context does not leak into the recurrent core during training.

Authors: We acknowledge that demonstrating S is context-agnostic is crucial for the interpretation of the conditional mutual information results. Although the architecture prevents direct context input, indirect leakage through optimization is possible in principle. In the revised manuscript, we will add an analysis to check for leakage, such as training a separate predictor to recover context from S and reporting its accuracy (expected to be near chance if no leakage), or comparing the I(C;O|S) values under different training regimes. This will provide stronger evidence that the contextual information is indeed realized via the intervention operator at a fixed S.

revision: yes
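
The first response above mentions Welch's t-test for comparing the intervention model with the memory baseline across seeds. A minimal sketch of that statistic follows, using only the standard library. The function name `welch_t` is hypothetical, and the per-seed scores are made-up placeholders, not results from the paper. The sketch returns the t statistic and the Welch-Satterthwaite degrees of freedom; turning these into a p-value needs the t distribution, for which a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` could be used instead.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of the two sample means (unequal variances allowed).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder per-seed scores, invented purely to exercise the function.
intervention_scores = [0.91, 0.88, 0.93, 0.90, 0.89]
memory_scores = [0.85, 0.80, 0.92, 0.78, 0.88]

t_stat, dof = welch_t(intervention_scores, memory_scores)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

Welch's variant is a reasonable default here because the two model families need not have equal variance across seeds. If the reported quantity is a success fraction out of a fixed number of seeds, a test for proportions may be more appropriate than a t-test.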

Circularity Check

0 steps flagged

No circularity: empirical evaluation of intervention architecture relies on independent benchmarks and probes

The paper defines an intervention-based recurrent model explicitly (recurrent core produces shared latent state S, followed by additive context-indexed operator) and evaluates it via direct performance comparison on a context-switching task plus the operational probe I(C;O|S). These measurements are computed from model outputs on held-out data and do not reduce by construction to the architectural definition or any fitted parameter; the positive conditional information result is reported as an empirical finding rather than a definitional consequence. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central construction, and no predictions are obtained by fitting to subsets of the same quantities later reported. The derivation chain is therefore self-contained and externally falsifiable through the benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The architecture rests on standard recurrent-network assumptions plus the novel postulate that additive intervention suffices for context dependence; no free parameters or invented entities beyond the operator itself are stated in the abstract.

axioms (1)

domain assumption: A recurrent core can construct a shared pre-intervention latent state from partial observations that is sufficient for subsequent contextual modulation.
Invoked in the description of the intervention architecture for the context-switching task.

invented entities (1)

additive context-indexed operator (no independent evidence)
purpose: To realize contextual dependence by intervening on the shared latent state without memory growth or direct context input.
Introduced as the core mechanism of the proposed architecture; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5514 in / 1219 out tokens · 31339 ms · 2026-05-13T19:34:04.702267+00:00

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Contextual Chain: Single-State Ledger Design for Mobile/IoT Networks with Frequent Partitions
cs.DC 2026-04 unverdicted novelty 6.0

Simulation at N=20 across 500 seeds finds that adaptive synchronization, not quarantine, primarily drives final agreement and recovery-time improvement after partitions in noisy regimes.
