arxiv: 2602.06550 · v2 · submitted 2026-02-06 · 💻 cs.LG · cs.AI

Dynamics-Aligned Shared Hypernetworks for Contextual RL under Discontinuous Shifts

Pith reviewed 2026-05-16 07:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords contextual reinforcement learning · zero-shot generalization · hypernetworks · dynamics prediction · discontinuous shifts · adapter weights · context inference · reinforcement learning

The pith

A hypernetwork trained solely on dynamics prediction generates shared adapter weights that enable zero-shot policy generalization under discontinuous context shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes DMA*-SH, a framework in which a single hypernetwork is trained only to predict environment dynamics and then produces a small set of adapter weights. These weights are shared across the dynamics model, the policy network, and the action-value function. The shared modulation is designed to align the policy's responses with the underlying dynamics, so that the agent can infer latent contexts and switch control strategies when the same action produces incompatible effects across contexts. Theoretical support comes from expressivity separation results for hypernetwork modulation and from policy-gradient variance bounds that formalize how within-mode compression improves learning when contexts are non-overlapping. Empirically, on held-out tasks of the new Actuator Inversion Benchmark (AIB), the approach achieves zero-shot generalization, outperforming domain randomization by 58.1% and a standard context-aware baseline by 11.5% on average.
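The wiring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `SharedHypernetwork` and `AdaptedHead`, the layer sizes, and the scale-and-shift form of the adapter are all hypothetical choices made for this sketch (the paper's Figure 6 refers to a bottleneck adapter, whose exact form is not given here). The sketch only shows that one set of generated weights is reused by the dynamics model, the policy, and the Q-function.

```python
import random

def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def make_dense(n_out, n_in, rng):
    """Random dense-layer parameters (hypothetical initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class SharedHypernetwork:
    """Maps an inferred context z to one set of adapter weights omega."""
    def __init__(self, z_dim, feat_dim, rng):
        # Assumption: omega is a per-feature scale and shift (2 * feat_dim numbers).
        self.w, self.b = make_dense(2 * feat_dim, z_dim, rng)
        self.feat_dim = feat_dim

    def __call__(self, z):
        out = linear(self.w, self.b, z)
        scale = [1.0 + s for s in out[:self.feat_dim]]
        shift = out[self.feat_dim:]
        return scale, shift

class AdaptedHead:
    """Base network whose hidden features are modulated by omega."""
    def __init__(self, in_dim, feat_dim, out_dim, rng):
        self.w1, self.b1 = make_dense(feat_dim, in_dim, rng)
        self.w2, self.b2 = make_dense(out_dim, feat_dim, rng)

    def __call__(self, x, omega):
        scale, shift = omega
        hidden = [max(0.0, h) for h in linear(self.w1, self.b1, x)]
        adapted = [s * h + t for s, h, t in zip(scale, hidden, shift)]
        return linear(self.w2, self.b2, adapted)

rng = random.Random(0)
STATE_DIM, ACTION_DIM, Z_DIM, FEAT_DIM = 3, 1, 2, 4
hyper = SharedHypernetwork(Z_DIM, FEAT_DIM, rng)
dynamics = AdaptedHead(STATE_DIM + ACTION_DIM, FEAT_DIM, STATE_DIM, rng)
policy = AdaptedHead(STATE_DIM, FEAT_DIM, ACTION_DIM, rng)
q_function = AdaptedHead(STATE_DIM + ACTION_DIM, FEAT_DIM, 1, rng)

def forward_all(state, z):
    """Generate omega once and reuse it in all three heads."""
    omega = hyper(z)
    action = policy(state, omega)
    next_state = dynamics(state + action, omega)
    q_value = q_function(state + action, omega)
    return action, next_state, q_value
```

The sketch has no automatic differentiation, so it cannot show the training rule. In the described method only the dynamics-prediction loss would update the hypernetwork, while the RL losses would treat `omega` as a constant.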

Core claim

By training a hypernetwork exclusively through dynamics prediction and using its generated weights as shared adapters for the policy and action-value function, the framework imparts an inductive bias matched to discontinuous context-to-dynamics shifts. Input/output normalization and random input masking further stabilize context inference and promote concentrated representations. This construction is supported by expressivity separation results for hypernetwork modulation and variance decomposition bounds that quantify improved learning under non-overlapping contexts.
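A toy case may help show why multiplicative modulation matters for sign-flipping contexts. The sketch below is an illustration under simplifying assumptions, not the paper's separation result: it compares a one-parameter "hypernetwork" whose generated weight equals the context against a purely linear model on the concatenated input, whereas the paper's comparison concerns concatenation MLPs. The target `y = s * x` and all function names are hypothetical.

```python
from itertools import product

# Toy target with a sign-flip context: y = s * x, with s in {-1, +1}.
data = [((x, s), s * x) for x, s in product([-1.0, 1.0], repeat=2)]

def hyper_model(x, s):
    """Hypernetwork view: the context generates the weight applied to x."""
    generated_weight = s  # h(s) = s, a linear hypernetwork
    return generated_weight * x

def fit_linear_concat(samples):
    """Least-squares fit of y ~ a*x + b*s + c on the concatenated input.

    On this balanced grid the features x, s and the constant are mutually
    orthogonal with unit mean square, so each coefficient is a plain average.
    """
    n = len(samples)
    a = sum(x * y for (x, _), y in samples) / n
    b = sum(s * y for (_, s), y in samples) / n
    c = sum(y for _, y in samples) / n
    return a, b, c

def mse(predict, samples):
    """Mean squared error of `predict(x, s)` over the samples."""
    return sum((predict(x, s) - y) ** 2 for (x, s), y in samples) / len(samples)

a, b, c = fit_linear_concat(data)
hyper_error = mse(hyper_model, data)
concat_error = mse(lambda x, s: a * x + b * s + c, data)
print(hyper_error, concat_error)
```

The generated-weight model fits the product exactly, while the best linear model on the concatenated input has all coefficients equal to zero. A nonlinear concatenation network can approximate the product, but it has to build the interaction from piecewise-linear pieces rather than represent it directly.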

What carries the argument

The dynamics-aligned shared hypernetwork that generates adapter weights from inferred context to modulate policy and value functions simultaneously.

If this is right

  • Zero-shot generalization succeeds on held-out tasks in the Actuator Inversion Benchmark.
  • The method outperforms domain randomization by 58.1% and a standard context-aware baseline by 11.5% on average on held-out AIB tasks.
  • Theoretical results establish expressivity separation and reduced policy-gradient variance through within-mode compression.
  • Normalization and masking stabilize context inference for directionally concentrated representations.
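The stabilizers named in the last bullet can be sketched as follows. The figure captions on this page name AvgL1Norm for input normalization and SimNorm for output normalization; the formulas below are the commonly used definitions of those two operations and are an assumption here, since this page does not state them. The independent per-entry masking rule and the function names are likewise hypothetical.

```python
import math
import random

def avg_l1_norm(x, eps=1e-8):
    """Divide by the mean absolute value, so the output has mean |x_i| == 1."""
    scale = max(sum(abs(v) for v in x) / len(x), eps)
    return [v / scale for v in x]

def sim_norm(x, group_size):
    """Softmax within consecutive groups; each group lands on a simplex."""
    assert len(x) % group_size == 0
    out = []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        peak = max(group)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

def random_mask(x, ratio, rng):
    """Zero each entry independently with probability `ratio`."""
    return [0.0 if rng.random() < ratio else v for v in x]

rng = random.Random(0)
encoder_input = random_mask(avg_l1_norm([1.0, -3.0, 2.0, 2.0]), 0.2, rng)
embedding = sim_norm([0.0, 0.0, 1.0, 1.0], group_size=2)
```

In this sketch the input normalization and masking would be applied to the context encoder's inputs and the simplicial normalization to its output embedding, which is one plausible reading of "input/output normalization".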

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This shared dynamics alignment could extend to other RL domains involving latent variables or non-stationary environments.
  • Similar hypernetwork modulation might improve transfer in multi-task reinforcement learning without requiring explicit task identifiers.
  • Evaluating the approach on real robotic systems with variable actuator behaviors would test its practical applicability beyond simulation.

Load-bearing premise

A hypernetwork trained solely via dynamics prediction will impart an inductive bias sufficient for the policy and action-value function to handle discontinuous context-to-dynamics shifts on unseen contexts.

What would settle it

The claim would be falsified if, on a newly constructed benchmark with more extreme or previously unseen discontinuous shifts, DMA*-SH showed no advantage over non-shared baselines or failed to achieve zero-shot transfer.

Figures

Figures reproduced from arXiv: 2602.06550 by Frank Röder, Jan Benad, Manfred Eppe, Martin V. Butz, Nihat Ay, Pradeep Kr. Banerjee.

Figure 1
Figure 1: (a) In vanilla DMA, the inferred context zt is concatenated to the RL inputs. (b) In DMA*-SH, a hypernetwork hη conditioned on zt generates adapter weights ω that are used by the forward dynamics model and the RL networks. The context encoder and hypernetwork are trained via the reconstruction objective Lϕ,θ,η in (2), while during RL updates gradients through ω = hη(zt) are stopped so that the policy and c…
Figure 2
Figure 2: Interquartile mean (IQM) with 95% confidence intervals computed from AER scores, aggregated across environments. Results are reported for the three context sets Ctrain, Ceval-in, and Ceval-out, comparing DMA*, DMA*-SH, and all baselines. 6.4. Context-Embedding Diagnostics and Geometry To analyze how RL task performance relates to zt, we introduce three evaluation criteria: Informativeness, Variability, a…
Figure 3
Figure 3: Informativeness I(zt; c), Variability, Representation-Overlap (RO), and episodic returns for the context set Ceval-out. parameters ηf, ηπ, ηQ, producing ωf, ωπ, ωQ), and the actor/critic hypernetworks can adapt directly to their respective RL objectives. To quantify how strongly the policy objective depends on the inferred context through the adapter pathway, we report the mean norm of a shadow z-s…
Figure 4
Figure 4: DI (non-overlapping): t-SNE of inferred embeddings zt (top) and cosine similarity heatmaps of per-context mean embeddings (bottom; (20)) for DMA, DMA*, and DMA*-SH. DMA*-SH shows stronger within-mode alignment across the continuous mass dimension while maintaining separation between actuator-inversion modes, consistent with the compression/separation terms in Theorem A.11. Mass clusters overlap more for DM…
Figure 5
Figure 5: Implicit regularization of RL via a dynamics-trained shared hypernetwork. Top: Mean policy context sensitivity in shared embedding space E∥∇zLπ∥. Bottom: Episodic returns. 7. Conclusions We introduced the Actuator Inversion Benchmark (AIB) as a stress test for zero-shot generalization under actuator inversion. We proposed DMA*-SH, a context-inferred RL framework that uses a shared hypernetwork to couple la…
Figure 6
Figure 6: Bottleneck adapter. A hypernetwork (a) predicts parameters that are used within the dynamic model and RL networks (b). The separation in Theorem A.1 supports a plausible mechanism for improved generalization. Hypernetworks can represent multiplicative interactions (including bilinear terms) exactly in simple cases, while concatenation MLPs typically approximate them via increasingly fine CPWL partitions. …
Figure 7
Figure 7: Held-out dynamics of mode sufficiency vs. within-mode refinement. Evolution of I(Z; S, U), I(Z; S), I(Z; U | S), Variability, and returns for DI, Cartpole, and Reacher (Hard). Across tasks, mode information I(Z; S) rises fast and approaches the binary ceiling H(S) = ln 2, while within-mode information I(Z; U | S) accumulates more slowly; kinks near mode saturation coincide with regime changes in Variability,…
Figure 8
Figure 8: Gradient analysis comparing shared hypernetworks (DMA*-SH, red) vs. separate hypernetworks (DMA*-H, black) across DI, DI-Friction, ODE, and Cartpole environments. Dashed lines indicate quantities computed via a shadow graph obtained by temporarily removing the stop-gradient through the adapter pathway (e.g., ∇ηLπ or ∇zLπ in DMA*-SH, where neither η nor z receives policy gradients during training due to det…
Figure 9
Figure 9: Architecture of the context encoder.
Figure 10
Figure 10: Probability of improvement (POI) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments and over contexts in the three context sets Ctrain, Ceval-in and Ceval-out. For the proposed DMA* and DMA*-SH, we ablate separately the random masking, input and output normalization, or everything at once (DMA, DMA-SH).
Figure 11
Figure 11: Interquartile mean (IQM) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments. We distinguish results for contexts in the three context sets Ctrain, Ceval-in and Ceval-out. We compare different ratios for the random input masking. When averaging over the three context sets, best performance is achieved using a ratio of 20% for DMA* and 40% for DMA*-SH.
Figure 12
Figure 12: Interquartile mean (IQM) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments. We distinguish results for contexts in the three context sets Ctrain, Ceval-in and Ceval-out. We compare different types of input normalization. When averaging over the three context sets, best performance is achieved using AvgL1Norm in both DMA* and DMA*-SH.
Figure 13
Figure 13: Interquartile mean (IQM) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments. We distinguish results for contexts in the three context sets Ctrain, Ceval-in and Ceval-out. We compare different types of output normalization. When averaging over the three context sets, best performance is achieved using SimNorm in both DMA* and DMA*-SH.
Figure 14
Figure 14: Interquartile mean (IQM) comparing different context window sizes justifying the choice of 24 for DI and ODE environments and 128 for DMC- and Gymnasium-based environments.
Figure 15
Figure 15: Interquartile mean (IQM) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments. We distinguish results for contexts in the three context sets Ctrain, Ceval-in, and Ceval-out. We compare DMA*-SH to a variant without normalization and masking (DMA-SH) and to an architecture that does not share the hypernetwork (DMA*-H). Instead, DMA*-H uses separate hyperne…
Figure 16
Figure 16: Interquartile mean (IQM) (Agarwal et al., 2021) based on AER scores (Section 6.1) aggregated over six contextualized environments. We distinguish results for contexts in the three context sets Ctrain, Ceval-in and Ceval-out. In a) we compare the original Pearl approach aligned with the Q-function to the dynamic model-aligned variant that we are using as a baseline, DMA-Pearl. Additionally, we incorporate …
Figure 17
Figure 17: Returns over training steps, averaged over the contexts used for training. Comparison to the meta-RL approach VariBAD (Zintgraf et al., 2020). See Remark A.9. VariBAD is based on the on-policy PPO, hence we allow for more environment steps. We do not observe any improvement after 200K steps. Two KL-weights (β) are tested.
Figure 18
Figure 18: Returns over training steps, averaged over 20 contexts used for training. Comparison to baselines where context information is not explicitly available (Section 6.2).
Figure 19
Figure 19: Bar plot for DI to visualize AER for individual context instances and different methods. Bold labels refer to contexts used during training.
Figure 20
Figure 20: Bar plot for Cartpole to visualize AER for individual context instances and different methods. Bold labels refer to contexts used during training.
Figure 21
Figure 21: Heatmaps for DI to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 22
Figure 22: Heatmaps for DI-Friction to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 23
Figure 23: Heatmaps for ODE to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 24
Figure 24: Heatmaps for Cartpole to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 25
Figure 25: Heatmaps for BallInCup to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 26
Figure 26: Heatmaps for Walker to visualize AER for individual context instances. Bold labels refer to contexts used during training.
Figure 27
Figure 27: Interquartile mean (IQM) aggregated over six ODE variants, ODE-1, ODE-2, ... , ODE-6 to test for scalability with respect to context dimensionality.
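Several of the figures above report the interquartile mean (IQM). As a reminder of what that statistic computes, here is a minimal sketch that treats it as a truncated mean dropping the lowest and highest 25% of scores. The function name and the rounding rule for sample sizes not divisible by four are assumptions of this sketch, and the confidence intervals reported in the figures are not reproduced.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of the sorted scores."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # entries dropped from each tail
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)

# One outlying run barely moves the IQM, unlike the plain mean.
runs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
iqm = interquartile_mean(runs)
plain_mean = sum(runs) / len(runs)
```

The statistic is less sensitive to outlying runs than the mean while using more of the data than the median, which is the usual motivation for reporting it.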
Original abstract

Zero-shot generalization in contextual reinforcement learning remains a core challenge, particularly when the context is latent and must be inferred from data. A canonical failure mode arises when latent context discontinuously changes how actions affect the environment, requiring incompatible control responses across contexts. We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights shared across the dynamics model, policy, and action-value function. This shared modulation imparts an inductive bias matched to discontinuous context-to-dynamics shifts, while input/output normalization and random input masking stabilize context inference, promoting directionally concentrated representations. We provide theoretical support via expressivity separation results for hypernetwork modulation, and a variance decomposition with policy-gradient variance bounds that formalize how within-mode compression improves learning under non-overlapping contexts. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate challenging context-to-dynamics interactions, including actuator inversion, actuator permutations, and weakly non-overlapping continuous dynamics. On AIB's held-out tasks, DMA*-SH achieves zero-shot generalization, outperforming domain randomization by 58.1% and surpassing a standard context-aware baseline by 11.5% on average.
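The failure mode the abstract describes can be reproduced in a toy setting. The sketch below is a hypothetical one-dimensional point mass, not one of the AIB environments: a binary context `sign` flips the effect of the action, and a fixed proportional-derivative controller that stabilizes one mode destabilizes the other. The gains, step size, and function names are arbitrary choices for illustration.

```python
def simulate(sign, controller, steps=200, dt=0.05, mass=1.0):
    """Roll out a point mass whose actuator effect is multiplied by `sign`."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        u = controller(x, v, sign)
        v += sign * u / mass * dt  # actuator inversion when sign == -1
        x += v * dt
    return x

def fixed_controller(x, v, sign):
    """Context-blind PD controller tuned for the non-inverted mode."""
    return -2.0 * x - 2.0 * v

def context_aware_controller(x, v, sign):
    """Same PD law, but the command is flipped to match the actuator mode."""
    return sign * (-2.0 * x - 2.0 * v)

final_normal = simulate(+1.0, fixed_controller)
final_inverted = simulate(-1.0, fixed_controller)
final_aware = simulate(-1.0, context_aware_controller)
print(final_normal, final_inverted, final_aware)
```

The context-blind controller drives the state to the origin in the normal mode and away from it in the inverted mode, while the context-aware controller recovers stability. In the paper's setting the context is latent, so the agent must infer the mode from data rather than receive `sign` directly.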

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DMA*-SH, a shared hypernetwork framework for contextual RL under discontinuous shifts. A single hypernetwork is trained exclusively via dynamics prediction to generate a small set of adapter weights that are shared across the dynamics model, policy, and action-value function. Input/output normalization and random input masking are used to stabilize context inference. Theoretical support is provided via expressivity separation results for hypernetwork modulation and a variance decomposition with policy-gradient variance bounds. The authors introduce the Actuator Inversion Benchmark (AIB) and report that DMA*-SH achieves zero-shot generalization on held-out AIB tasks, outperforming domain randomization by 58.1% and a standard context-aware baseline by 11.5% on average.

Significance. If the central claim holds, the work would be significant for contextual RL because it offers a dynamics-aligned inductive bias that enables zero-shot transfer across non-overlapping context-to-dynamics mappings without requiring policy-specific supervision. The introduction of AIB as a targeted benchmark for actuator inversion and permutation shifts is a useful contribution, and the combination of expressivity separation with variance bounds provides a formal lens on why within-mode compression can reduce policy-gradient variance under discontinuous shifts.
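The intuition behind within-mode compression can be illustrated with the law of total variance on scalar embeddings. This is a sketch under the assumption that the paper's decomposition follows the same within-mode and between-mode split; the data, the mode labels, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Population variance."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def decompose(groups):
    """Split total variance into within-mode and between-mode parts.

    `groups` maps a mode label to the scalar embeddings observed in that mode.
    """
    all_values = [v for vals in groups.values() for v in vals]
    n = len(all_values)
    within = sum(len(vals) / n * variance(vals) for vals in groups.values())
    grand = mean(all_values)
    between = sum(len(vals) / n * (mean(vals) - grand) ** 2
                  for vals in groups.values())
    return variance(all_values), within, between

spread = {"normal": [0.5, 1.5], "inverted": [-1.5, -0.5]}
compressed = {"normal": [0.9, 1.1], "inverted": [-1.1, -0.9]}
total_s, within_s, between_s = decompose(spread)
total_c, within_c, between_c = decompose(compressed)
```

Both embeddings keep the two modes equally far apart, but the compressed one has a smaller within-mode term and therefore a smaller total variance. Under a bound of the form "gradient variance is at most a constant plus a multiple of the embedding variance", a smaller total variance gives a tighter bound without sacrificing mode separation.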

major comments (2)
  1. [theoretical support section (expressivity separation and variance decomposition)] The central claim that dynamics-only hypernetwork training suffices to produce policy and Q-function adapters for zero-shot control on unseen discontinuous shifts is not fully supported by the provided theoretical results. The expressivity separation establishes representational capacity but does not address whether gradient descent on the dynamics loss recovers the specific modulation required for optimal control; this gap directly affects the zero-shot generalization claim on AIB held-out tasks.
  2. [variance decomposition with policy-gradient variance bounds] The variance decomposition and policy-gradient variance bounds assume that within-mode compression improves learning under non-overlapping contexts, yet the manuscript does not demonstrate that the hypernetwork-generated adapters remain aligned with the optimal policy when context must be inferred from partial observations on held-out AIB tasks. This assumption is load-bearing for the reported 58.1% and 11.5% gains.
minor comments (2)
  1. [method description] The description of random input masking and normalization could be clarified with a precise algorithmic statement or pseudocode to allow reproduction of the context-inference stabilization.
  2. [evaluation section] AIB environment details (state/action dimensions, exact definitions of actuator inversion and permutation modes, and how held-out tasks are sampled) should be expanded in the main text or appendix to facilitate independent verification of the discontinuous-shift isolation.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our work. We provide point-by-point responses to the major comments below. Our responses focus on clarifying the scope of the theoretical results and highlighting the supporting empirical evidence from the Actuator Inversion Benchmark.

  1. Referee: [theoretical support section (expressivity separation and variance decomposition)] The central claim that dynamics-only hypernetwork training suffices to produce policy and Q-function adapters for zero-shot control on unseen discontinuous shifts is not fully supported by the provided theoretical results. The expressivity separation establishes representational capacity but does not address whether gradient descent on the dynamics loss recovers the specific modulation required for optimal control; this gap directly affects the zero-shot generalization claim on AIB held-out tasks.

    Authors: The referee correctly notes that our expressivity separation result establishes that hypernetworks can represent the required modulations but does not guarantee that training on dynamics prediction via gradient descent will find the specific parameters needed for optimal control. We do not claim such a guarantee in the manuscript; the theoretical section focuses on separation of expressivity and variance reduction under within-mode compression. The zero-shot generalization claim is substantiated by the experimental results on held-out AIB tasks, where the method achieves substantial improvements without policy-specific supervision. We believe this empirical validation is sufficient for the contribution, and no changes are required to the theoretical claims as they are accurately stated. revision: no

  2. Referee: [variance decomposition with policy-gradient variance bounds] The variance decomposition and policy-gradient variance bounds assume that within-mode compression improves learning under non-overlapping contexts, yet the manuscript does not demonstrate that the hypernetwork-generated adapters remain aligned with the optimal policy when context must be inferred from partial observations on held-out AIB tasks. This assumption is load-bearing for the reported 58.1% and 11.5% gains.

    Authors: Regarding the variance decomposition, the bounds are derived to show how within-mode compression can reduce policy-gradient variance in non-overlapping contexts. While the manuscript does not include a separate analysis proving adapter alignment under partial observations, the AIB benchmark is specifically designed to test this scenario, and the performance gains (58.1% over domain randomization and 11.5% over the context-aware baseline) provide evidence that the adapters are aligned sufficiently for effective zero-shot control. The context inference is stabilized by the proposed normalization and masking techniques. We can revise the manuscript to include a short paragraph discussing the reliance on empirical validation for this aspect. revision: partial

standing simulated objections not resolved
  • The theoretical analysis does not include a proof that gradient descent on the dynamics loss recovers the optimal control adapters for unseen shifts.

Circularity Check

0 steps flagged

No load-bearing circularity in derivation chain

The central proposal trains a hypernetwork solely on dynamics prediction to produce shared adapters for policy and value functions; this is an explicit design choice whose generalization to held-out AIB tasks is evaluated empirically rather than derived tautologically from the training objective. Expressivity separation results and variance decomposition bounds are invoked as independent theoretical support without reduction to self-citations or fitted parameters renamed as predictions. No self-definitional loops, ansatz smuggling via prior work, or uniqueness theorems imported from the same authors appear in the provided text. The performance claims (58.1% and 11.5% gains) are presented as measured outcomes on external benchmarks, not forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; no explicit fitted values or new postulated objects are named.

pith-pipeline@v0.9.0 · 5539 in / 1049 out tokens · 38281 ms · 2026-05-16T07:06:07.327933+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    The binary mode S is decision-critical (wrong sign can catastrophically flip the control law)

  2. [2]

    Intuitively, one might expect that embeddings capturing more information about the full context (S, U) would yield better RL performance

    The continuous coordinate U (e.g., mass) may be weakly relevant or even largely irrelevant within each mode over the benchmark range, especially if robust control can handle that variation. Intuitively, one might expect that embeddings capturing more information about the full context (S, U) would yield better RL performance. Proposition A.14 shows this intu...

  3. [3]

    compression

    The policy-gradient variance upper bound improves despite lower Informativeness: If the policy-gradient estimator computed with $\tilde{Z}$ satisfies the same assumptions of Theorem A.13 with the same constants $V_0$ and $L_m$, then $\operatorname{tr}(\operatorname{Cov}(G_{\tilde{Z}})) \le V_0 + L_m^2 \operatorname{tr}(\operatorname{Cov}(\tilde{Z})) < V_0 + L_m^2 \operatorname{tr}(\operatorname{Cov}(Z))$, whenever the inequality in Part 2 is strict. Thus, a representation can be less i...

  4. [4]

    If I(Z;U|S) = 0, then Term 2 = 0

  5. [5]

    The converse does not hold: Term 2 = 0 does not imply $I(Z;U \mid S) = 0$. Proof. Part 1. By definition, $I(Z;U \mid S) = 0$ if and only if $Z \perp U \mid S$, meaning $p(Z \mid S=s, U=u) = p(Z \mid S=s)$ for all $s, u$. This implies $\bar{z}(s,u) = \mathbb{E}[Z \mid S=s, U=u] = \mathbb{E}[Z \mid S=s] = \mu_s$ for all $u$. Since $\bar{z}(s,u) = \mu_s$ is constant in $u$, we have $\operatorname{Cov}_{U \mid S=s}(\bar{z}(s,U)) = 0$ for each $s$. Averaging over $S$ and taking $\frac{1}{d_z}\operatorname{tr}(\cdot)$ yie...

  6. [6]

    I(Z;S) increases quickly and enters a near-saturated regime (relative to H(S) ) early in optimization, indicating rapid acquisition of mode information

    Rapid mode acquisition. I(Z;S) increases quickly and enters a near-saturated regime (relative to H(S)) early in optimization, indicating rapid acquisition of mode information. In Cartpole and Reacher, I(Z;S) approaches values close to ln 2, whereas in DI the attained level is lower but still substantial

  7. [7]

    I(Z;U|S) increases more gradually and typically continues to rise after I(Z;S) has effectively plateaued

    Slower within-mode refinement. I(Z;U|S) increases more gradually and typically continues to rise after I(Z;S) has effectively plateaued. The eventual level of I(Z;U|S) is task-dependent, with Reacher exhibiting markedly larger within-mode information than DI or Cartpole, consistent with arm-length scaling being more consequential for Reacher’s kinematics/...

  8. [8]

    The accompanying behavior of Variability(M) and returns is environment-specific

    Regime changes aligned with mode saturation. Visible kinks (regime changes) occur around the time I(Z;S) enters its near-saturated regime. The accompanying behavior of Variability(M) and returns is environment-specific.

  9. [9]

    Mean policy hypernetwork sensitivity (η-space): E∥∇ηLπ∥ (DMA*-SH) and E∥∇ηπLπ∥ (DMA*-H). Across environments, the DMA*-SH curve remains substantially larger than DMA*-H, indicating a persistent hypothetical tendency of the policy objective to reshape the shared hypernetwork mapping if such updates were enabled. This is consistent with the actor and cri...

  10. [10]

    DMA*-SH exhibits consistently larger values than DMA*-H, indicating stronger dependence of the policy objective on the inferred context through the adapter pathway

    Mean policy context sensitivity (z-space): E∥∇zLπ∥. DMA*-SH exhibits consistently larger values than DMA*-H, indicating stronger dependence of the policy objective on the inferred context through the adapter pathway. In DMA*-H, the smaller values are consistent with weaker effective context utilization along this route

  11. [11]

    context (z-space): Var∥∇zLπ∥

    Variance of policy gradients w.r.t. context (z-space): Var∥∇zLπ∥. Unlike the mean sensitivity, which captures persistent dependence on the adapter pathway, the relative ordering varies across environments, suggesting that the variance is influenced by environment-specific nonstationarity and mode-switching

  12. [12]

    DMA*-SH typically shows slightly lower variance than DMA*-H, consistent with a modestly more stable optimization signal for the policy base parameters under shared adapters

    Variance of policy base-parameter gradient norms (ξ-space): Var∥∇ξLπ∥. DMA*-SH typically shows slightly lower variance than DMA*-H, consistent with a modestly more stable optimization signal for the policy base parameters under shared adapters

  13. [13]

    This measures variance of the dynamics-driven learning signal entering the context encoder

    Variance of context-encoder gradients under dynamics (ϕ-space): Var∥∇ϕLd∥. This measures variance of the dynamics-driven learning signal entering the context encoder. Values are small in magnitude, but DMA*-SH often shows slightly lower variance than DMA*-H

  14. [14]

    embedding (z-space): Var∥∇zLd∥

    Variance of dynamics gradients w.r.t. embedding (z-space): Var∥∇zLd∥. This measures variance of the dynamics objective’s sensitivity to the inferred embedding in R^{d_z}. Magnitudes are small, so this metric serves mainly as a weak supporting diagnostic

  15. [15]

    In DMA*-SH, this cosine tracks whether the (hypothetical) reward-driven direction in the shared η coordinates tends to align with or oppose the dynamics-driven direction

    Dynamics–policy alignment (η-space): cos(∇ηLd, ∇ηLπ) (DMA*-SH) and a heuristic analogue in DMA*-H comparing ηf and ηπ. In DMA*-SH, this cosine tracks whether the (hypothetical) reward-driven direction in the shared η coordinates tends to align with or oppose the dynamics-driven direction. In DMA*-H, the corresponding cosine is a heuristic as it compares di...

  16. [16]

    In DMA*-SH, this cosine summarizes whether actor and critic objectives would push the shared hypernetwork parameters in similar directions under the hypothetical update

    Policy–critic alignment (η-space): cos(∇ηLπ, ∇ηLQ) (DMA*-SH) and a heuristic analogue in DMA*-H comparing ηπ and ηQ. In DMA*-SH, this cosine summarizes whether actor and critic objectives would push the shared hypernetwork parameters in similar directions under the hypothetical update. In DMA*-H, the corresponding cosine is a heuristic as it compares diff...

  17. [17]

    DMA*-SH attains faster learning and higher returns than DMA*-H across the considered environments

    Returns. DMA*-SH attains faster learning and higher returns than DMA*-H across the considered environments. This performance gap co-occurs with the separation in E∥∇ηLπ∥ and E∥∇zLπ∥, consistent with stronger and more persistent context dependence through the shared adapter pathway. Overall, the most consistent separation is in the mean context-sensitivity...
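The diagnostics listed above rely on two simple quantities: the mean norm of a gradient over a batch and the cosine between two gradient vectors. A minimal sketch of both follows, assuming gradients are already available as flat lists of floats. The function names and the zero-vector convention are choices made for this sketch and are not taken from the paper.

```python
import math

def norm(vector):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vector))

def mean_gradient_norm(gradients):
    """Average norm over a batch of gradient vectors, as in E||grad L||."""
    return sum(norm(g) for g in gradients) / len(gradients)

def cosine_alignment(grad_a, grad_b):
    """Cosine between two gradients taken w.r.t. the same parameters."""
    norm_a, norm_b = norm(grad_a), norm(grad_b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for a zero gradient
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    return dot / (norm_a * norm_b)

# Hypothetical dynamics and policy gradients in the same coordinates.
aligned = cosine_alignment([1.0, 0.0], [3.0, 0.0])
opposed = cosine_alignment([1.0, 0.0], [-1.0, 0.0])
orthogonal = cosine_alignment([1.0, 0.0], [0.0, 1.0])
batch_norm = mean_gradient_norm([[3.0, 4.0], [0.0, 0.0]])
```

A cosine near +1 would indicate that the two objectives push the shared parameters in the same direction, a value near -1 that they conflict, and a value near 0 that they are largely unrelated. The comparison is only meaningful when both gradients are taken with respect to the same parameters, which is why the separate-hypernetwork case is described as a heuristic.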