Scalable Policy Maximization Under Network Interference
Pith reviewed 2026-05-19 13:11 UTC · model grok-4.3
The pith
Under common interference assumptions rewards linearize, enabling scalable Thompson sampling with sublinear regret on fresh networks each round.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new n-node network is observed each round. We prove a Bayesian regret bound that is sublinear in n and the number of rounds.
What carries the argument
Linearity of rewards under standard interference-structure assumptions, which reduces the sequential policy problem to a linear bandit solvable by Thompson sampling.
If this is right
- Policy optimization becomes feasible for large-scale networked systems such as online marketplaces or clinical trials on dynamic networks.
- The algorithm learns quickly in simulations and outperforms existing methods limited to fixed small networks.
- Bayesian regret grows sublinearly in both network size n and number of rounds.
- The approach closes the scalability gap between causal inference methods for interference and practical bandit algorithms.
Where Pith is reading between the lines
- If linearity holds in practice the method could extend to interference settings outside networks, such as spatial or temporal dependence.
- Real-world validation on marketplace or health-intervention data would test whether observed regret matches the sublinear bound.
- Relaxing the linearity premise might motivate scalable approximations based on nonlinear bandits or kernel methods.
Load-bearing premise
The premise that common assumptions on interference structure render rewards linear in the relevant features.
What would settle it
An experiment or simulation on networks where interference violates the linearity assumption and produces regret that scales linearly or worse with n or the number of rounds.
read the original abstract
Many interventions, such as vaccines in clinical trials or coupons in online marketplaces, must be assigned sequentially without full knowledge of their effects. Multi-armed bandit algorithms have proven successful in such settings. However, standard independence assumptions fail when the treatment status of one individual impacts the outcomes of others, a phenomenon known as interference. We study optimal-policy learning under interference on a dynamic network. Existing approaches to this problem require repeated observations of the same fixed network and struggle to scale in sample size beyond as few as fifteen connected units -- both limit applications. We show that under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new $n$-node network is observed each round. We prove a Bayesian regret bound that is sublinear in $n$ and the number of rounds. Simulation experiments show that our algorithm learns quickly and outperforms existing methods. The results close a key scalability gap between causal inference methods for interference and practical bandit algorithms, enabling policy optimization in large-scale networked systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that under common assumptions on the structure of interference, rewards become linear in observable features. This reduces the dynamic-network policy maximization problem (new n-node network each round) to a linear bandit, enabling a scalable Thompson sampling algorithm. The authors prove a Bayesian regret bound sublinear in both n and the number of rounds T, and report that simulations show faster learning and better performance than existing methods.
Significance. If the exact linearity in conditional expectation holds for fresh random networks each round and the regret derivation is rigorous, the result would meaningfully close the scalability gap between interference-aware causal inference and practical bandit algorithms for large networked systems.
major comments (2)
- [Abstract / model assumptions] Abstract and model section: the central claim that 'common assumptions on the structure of interference' render rewards exactly linear in the relevant features must be shown to hold in the conditional expectation E[Y | treatment vector, new random network] for each round. If linearity is obtained only after averaging over network realizations or holds only approximately, the posterior updates in Thompson sampling and the sublinear Bayesian regret bound would require additional n-dependent concentration arguments that are not yet visible.
- [Regret analysis] Regret analysis section: the Bayesian regret bound is derived from the resulting linear bandit. The manuscript should explicitly state the precise interference assumptions (e.g., exposure mappings or linear-in-neighbors) and verify that they produce an exact linear representation without residual network-dependent terms that would invalidate the standard linear-bandit regret analysis when networks are redrawn each round.
minor comments (2)
- [Experiments] Simulations: report the exact network generation process, the specific interference models tested, and the values of n and T used, to support the claim of outperformance and quick learning.
- [Notation] Notation: ensure that the feature vector used in the linear representation is defined consistently between the model section and the algorithm section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our model assumptions and regret analysis. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition.
read point-by-point responses
-
Referee: [Abstract / model assumptions] Abstract and model section: the central claim that 'common assumptions on the structure of interference' render rewards exactly linear in the relevant features must be shown to hold in the conditional expectation E[Y | treatment vector, new random network] for each round. If linearity is obtained only after averaging over network realizations or holds only approximately, the posterior updates in Thompson sampling and the sublinear Bayesian regret bound would require additional n-dependent concentration arguments that are not yet visible.
Authors: Under the interference assumptions in Section 2, the conditional expectation of each unit's outcome is exactly linear in a vector of observable features (own treatment plus network-derived exposure statistics such as the fraction of treated neighbors) for any fixed network realization. Because the linearity is with respect to these features computed from the realized network, it holds exactly in E[Y | treatment vector, new random network] without averaging or approximation. The features are observed each round, so the linear model for Thompson sampling is exact conditionally on the network and the standard Bayesian regret analysis applies directly. We will add a formal proposition in the revised manuscript verifying this conditional linearity. revision: yes
-
Referee: [Regret analysis] Regret analysis section: the Bayesian regret bound is derived from the resulting linear bandit. The manuscript should explicitly state the precise interference assumptions (e.g., exposure mappings or linear-in-neighbors) and verify that they produce an exact linear representation without residual network-dependent terms that would invalidate the standard linear-bandit regret analysis when networks are redrawn each round.
Authors: Section 2 already specifies the assumptions via linear exposure mappings (outcomes linear in own treatment and neighbor exposure). We will expand this into a dedicated subsection that explicitly lists the assumptions, defines the feature map, and proves that the representation is exactly linear with no residual network-dependent terms beyond the observed features. Because the network is redrawn independently each round and features are recomputed from the current realization, the linear bandit model holds exactly per round and the Bayesian regret bound follows from standard linear Thompson sampling results without additional terms or invalidation. revision: yes
Circularity Check
No circularity; linearity derived from external assumptions, regret bound follows independently
full rationale
The paper claims to show that under common assumptions on interference structure, rewards become linear (abstract), enabling a linear bandit reduction, Thompson sampling, and sublinear Bayesian regret in n and T. This is presented as a derivation from domain assumptions rather than a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps in the provided text reduce the central result to its own inputs by construction. The derivation chain is self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Common assumptions on the structure of interference make rewards linear.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
under common assumptions on the structure of interference, rewards become linear... rt = H(Zt; At)θ + ϵt
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We prove a Bayesian regret bound that is sublinear in n and the number of rounds
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Dynamic Treatment on Networks
Q-Ising integrates Bayesian dynamic Ising modeling with offline RL to enable adaptive network treatment policies that outperform static centrality benchmarks under spillovers.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.