Scalable Policy Maximization Under Network Interference

Aidan Gleich; Alexander Volfovsky; Eric Laber

arxiv: 2505.18118 · v2 · submitted 2025-05-23 · 📊 stat.ML · cs.LG

Scalable Policy Maximization Under Network Interference

Aidan Gleich , Eric Laber , Alexander Volfovsky This is my paper

Pith reviewed 2026-05-19 13:11 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords bandit algorithmsnetwork interferenceThompson samplingpolicy optimizationcausal inferencedynamic networksregret bounds

0 comments

The pith

Under common interference assumptions rewards linearize, enabling scalable Thompson sampling with sublinear regret on fresh networks each round.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that standard assumptions about interference structure turn expected rewards into linear functions of observable features. This reduction converts policy maximization into a tractable linear bandit problem solvable by Thompson sampling even when an entirely new n-node network arrives each round. A sympathetic reader cares because prior methods for interference demand repeated observations of one small fixed network and fail to scale past roughly fifteen units, while real interventions such as vaccine assignments or coupon distributions routinely involve large, changing contact networks. The work supplies a Bayesian regret bound sublinear in both network size and horizon, plus simulation evidence that the algorithm learns faster than existing approaches.

Core claim

Under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new n-node network is observed each round. We prove a Bayesian regret bound that is sublinear in n and the number of rounds.

What carries the argument

Linearity of rewards under standard interference-structure assumptions, which reduces the sequential policy problem to a linear bandit solvable by Thompson sampling.

If this is right

Policy optimization becomes feasible for large-scale networked systems such as online marketplaces or clinical trials on dynamic networks.
The algorithm learns quickly in simulations and outperforms existing methods limited to fixed small networks.
Bayesian regret grows sublinearly in both network size n and number of rounds.
The approach closes the scalability gap between causal inference methods for interference and practical bandit algorithms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If linearity holds in practice the method could extend to interference settings outside networks, such as spatial or temporal dependence.
Real-world validation on marketplace or health-intervention data would test whether observed regret matches the sublinear bound.
Relaxing the linearity premise might motivate scalable approximations based on nonlinear bandits or kernel methods.

Load-bearing premise

The premise that common assumptions on interference structure render rewards linear in the relevant features.

What would settle it

An experiment or simulation on networks where interference violates the linearity assumption and produces regret that scales linearly or worse with n or the number of rounds.

read the original abstract

Many interventions, such as vaccines in clinical trials or coupons in online marketplaces, must be assigned sequentially without full knowledge of their effects. Multi-armed bandit algorithms have proven successful in such settings. However, standard independence assumptions fail when the treatment status of one individual impacts the outcomes of others, a phenomenon known as interference. We study optimal-policy learning under interference on a dynamic network. Existing approaches to this problem require repeated observations of the same fixed network and struggle to scale in sample size beyond as few as fifteen connected units -- both limit applications. We show that under common assumptions on the structure of interference, rewards become linear. This enables us to develop a scalable Thompson sampling algorithm that maximizes policy impact when a new $n$-node network is observed each round. We prove a Bayesian regret bound that is sublinear in $n$ and the number of rounds. Simulation experiments show that our algorithm learns quickly and outperforms existing methods. The results close a key scalability gap between causal inference methods for interference and practical bandit algorithms, enabling policy optimization in large-scale networked systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper reduces interference on fresh networks each round to a linear bandit via structural assumptions, enabling scalable Thompson sampling with sublinear regret, but the exact conditional linearity needs checking against the stress-test point.

read the letter

The core contribution here is showing that common interference assumptions make rewards linear in observable features. That reduction lets them run Thompson sampling on a new n-node network every round and claim a Bayesian regret bound that is sublinear in both n and the number of rounds. It directly targets the fact that earlier interference-aware methods were stuck on tiny fixed networks of around fifteen nodes.

Referee Report

2 major / 2 minor

Summary. The paper claims that under common assumptions on the structure of interference, rewards become linear in observable features. This reduces the dynamic-network policy maximization problem (new n-node network each round) to a linear bandit, enabling a scalable Thompson sampling algorithm. The authors prove a Bayesian regret bound sublinear in both n and the number of rounds T, and report that simulations show faster learning and better performance than existing methods.

Significance. If the exact linearity in conditional expectation holds for fresh random networks each round and the regret derivation is rigorous, the result would meaningfully close the scalability gap between interference-aware causal inference and practical bandit algorithms for large networked systems.

major comments (2)

[Abstract / model assumptions] Abstract and model section: the central claim that 'common assumptions on the structure of interference' render rewards exactly linear in the relevant features must be shown to hold in the conditional expectation E[Y | treatment vector, new random network] for each round. If linearity is obtained only after averaging over network realizations or holds only approximately, the posterior updates in Thompson sampling and the sublinear Bayesian regret bound would require additional n-dependent concentration arguments that are not yet visible.
[Regret analysis] Regret analysis section: the Bayesian regret bound is derived from the resulting linear bandit. The manuscript should explicitly state the precise interference assumptions (e.g., exposure mappings or linear-in-neighbors) and verify that they produce an exact linear representation without residual network-dependent terms that would invalidate the standard linear-bandit regret analysis when networks are redrawn each round.

minor comments (2)

[Experiments] Simulations: report the exact network generation process, the specific interference models tested, and the values of n and T used, to support the claim of outperformance and quick learning.
[Notation] Notation: ensure that the feature vector used in the linear representation is defined consistently between the model section and the algorithm section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our model assumptions and regret analysis. We address each major comment below and will revise the manuscript accordingly to strengthen the exposition.

read point-by-point responses

Referee: [Abstract / model assumptions] Abstract and model section: the central claim that 'common assumptions on the structure of interference' render rewards exactly linear in the relevant features must be shown to hold in the conditional expectation E[Y | treatment vector, new random network] for each round. If linearity is obtained only after averaging over network realizations or holds only approximately, the posterior updates in Thompson sampling and the sublinear Bayesian regret bound would require additional n-dependent concentration arguments that are not yet visible.

Authors: Under the interference assumptions in Section 2, the conditional expectation of each unit's outcome is exactly linear in a vector of observable features (own treatment plus network-derived exposure statistics such as the fraction of treated neighbors) for any fixed network realization. Because the linearity is with respect to these features computed from the realized network, it holds exactly in E[Y | treatment vector, new random network] without averaging or approximation. The features are observed each round, so the linear model for Thompson sampling is exact conditionally on the network and the standard Bayesian regret analysis applies directly. We will add a formal proposition in the revised manuscript verifying this conditional linearity. revision: yes
Referee: [Regret analysis] Regret analysis section: the Bayesian regret bound is derived from the resulting linear bandit. The manuscript should explicitly state the precise interference assumptions (e.g., exposure mappings or linear-in-neighbors) and verify that they produce an exact linear representation without residual network-dependent terms that would invalidate the standard linear-bandit regret analysis when networks are redrawn each round.

Authors: Section 2 already specifies the assumptions via linear exposure mappings (outcomes linear in own treatment and neighbor exposure). We will expand this into a dedicated subsection that explicitly lists the assumptions, defines the feature map, and proves that the representation is exactly linear with no residual network-dependent terms beyond the observed features. Because the network is redrawn independently each round and features are recomputed from the current realization, the linear bandit model holds exactly per round and the Bayesian regret bound follows from standard linear Thompson sampling results without additional terms or invalidation. revision: yes

Circularity Check

0 steps flagged

No circularity; linearity derived from external assumptions, regret bound follows independently

full rationale

The paper claims to show that under common assumptions on interference structure, rewards become linear (abstract), enabling a linear bandit reduction, Thompson sampling, and sublinear Bayesian regret in n and T. This is presented as a derivation from domain assumptions rather than a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or steps in the provided text reduce the central result to its own inputs by construction. The derivation chain is self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on a domain assumption about interference structure producing linear rewards; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption Common assumptions on the structure of interference make rewards linear.
Invoked directly in the abstract to enable the linear bandit reduction and scalable algorithm.

pith-pipeline@v0.9.0 · 5709 in / 1220 out tokens · 66306 ms · 2026-05-19T13:11:05.921846+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

under common assumptions on the structure of interference, rewards become linear... rt = H(Zt; At)θ + ϵt
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove a Bayesian regret bound that is sublinear in n and the number of rounds

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dynamic Treatment on Networks
stat.ML 2026-05 unverdicted novelty 7.0

Q-Ising integrates Bayesian dynamic Ising modeling with offline RL to enable adaptive network treatment policies that outperform static centrality benchmarks under spillovers.