Policy-Aware Design of Large-Scale Factorial Experiments
Pith reviewed 2026-05-10 16:39 UTC · model grok-4.3
The pith
A two-stage design models factorial-experiment outcomes as a low-rank tensor to identify high-performing policies, with sample complexity scaling with the tensor's degrees of freedom and the separations between factor levels rather than the full combinatorial size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling intervention outcomes as a low-rank tensor, the two-stage procedure first uses tensor completion on a sampled subset to estimate marginal contributions and eliminate weak factor levels, then runs sequential halving on the reduced set to select a policy. This produces gap-independent simple-regret bounds and gap-dependent identification guarantees whose leading terms scale with the tensor's degrees of freedom and the separation structure across factor levels, rather than the full factorial size.
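The gap between the two complexity scales is easy to make concrete. A minimal sketch with hypothetical numbers (the paper's actual factor counts and rank are not given here): a rank-r CP model of the outcome tensor has on the order of r·K·L parameters, versus L^K distinct intervention combinations.

```python
# Hypothetical sizes; the paper does not specify K, L, or r in the abstract.
K = 8   # number of factors (e.g., interface elements, messages, incentives)
L = 5   # levels per factor
r = 3   # assumed CP rank of the outcome tensor

full_factorial = L ** K   # distinct intervention combinations
low_rank_dof = r * K * L  # parameters of a rank-r CP model (up to scaling)

print(full_factorial)  # 390625
print(low_rank_dof)    # 120
```

Even for these modest sizes, the low-rank parameter count is smaller than the factorial size by more than three orders of magnitude.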
What carries the argument
Two-stage policy-aware design that applies low-rank tensor completion for marginal elimination followed by sequential halving on surviving combinations.
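The two stages can be sketched end to end. This is a simplified stand-in, not the paper's estimator: stage 1 uses per-level sample means of observed rewards as the marginal-contribution estimate in place of low-rank tensor completion, and the outcome model is additive with hypothetical sizes.

```python
import itertools
import random
import statistics

random.seed(0)

K, L = 3, 4  # hypothetical: 3 factors, 4 levels each
true_effect = [[random.gauss(0, 1) for _ in range(L)] for _ in range(K)]

def reward(combo):
    """Noisy additive outcome for a combination of factor levels."""
    return sum(true_effect[k][combo[k]] for k in range(K)) + random.gauss(0, 0.5)

# Stage 1: sample a subset of combinations, estimate per-level marginals,
# and eliminate the weak half of the levels of each factor.
all_combos = list(itertools.product(range(L), repeat=K))
sampled = random.sample(all_combos, len(all_combos) // 2)
obs = [(c, reward(c)) for c in sampled for _ in range(5)]

surviving = []
for k in range(K):
    marginals = {
        lvl: statistics.mean(r for c, r in obs if c[k] == lvl) for lvl in range(L)
    }
    surviving.append(sorted(marginals, key=marginals.get, reverse=True)[: L // 2])

# Stage 2: sequential halving on the reduced set of combinations.
arms = list(itertools.product(*surviving))
budget_per_round = 20
while len(arms) > 1:
    means = {a: statistics.mean(reward(a) for _ in range(budget_per_round)) for a in arms}
    arms = sorted(arms, key=means.get, reverse=True)[: max(1, len(arms) // 2)]

print("selected policy:", arms[0])
```

Note how elimination shrinks the stage-2 arm set from L^K = 64 combinations to (L/2)^K = 8 before any halving rounds begin, which is where the budget savings come from.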
If this is right
- Platforms can run overlapping experiments on shared traffic without the interaction effects that decentralized A/B tests miss.
- Required samples grow with the tensor's degrees of freedom and level separations instead of exponentially with the number of factors.
- Advantages appear most clearly in low-budget and high-noise regimes typical of production traffic.
- Centralized design makes combinatorial product decisions operationally feasible at platform scale.
Where Pith is reading between the lines
- The same low-rank elimination step could be inserted into other sequential decision procedures that face combinatorial action spaces.
- Periodic re-estimation of the tensor rank from accumulating data might allow the design to adapt mid-experiment.
- Similar tensor modeling could reduce sample needs in non-experimental combinatorial optimization tasks that share the low-rank structure.
Load-bearing premise
Expected outcomes of intervention combinations can be modeled as a low-rank tensor that permits accurate tensor completion and reliable estimation of marginal contributions for eliminating weak factor levels.
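One way to see why this premise has bite: a rank-r tensor has every mode-n unfolding of matrix rank at most r, which is exactly the structure completion methods exploit. A minimal sketch with illustrative sizes (the paper's data-generating process is not given here):

```python
import numpy as np

rng = np.random.default_rng(0)
K, L, r = 3, 4, 2  # hypothetical: 3 factors, 4 levels, CP rank 2

# Build a rank-r CP tensor: sum of r outer products of per-factor loadings.
factors = [rng.normal(size=(L, r)) for _ in range(K)]
T = np.zeros((L,) * K)
for j in range(r):
    T += np.einsum("i,j,k->ijk", factors[0][:, j], factors[1][:, j], factors[2][:, j])

# Every mode-n unfolding has matrix rank at most r.
for n in range(K):
    unfolding = np.moveaxis(T, n, 0).reshape(L, -1)
    print(n, np.linalg.matrix_rank(unfolding))  # rank <= r = 2 for each mode
```

If real outcomes violate this structure, the unfoldings are full rank and completion from a sampled subset has no reason to recover the untested entries.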
What would settle it
A controlled deployment or simulation in which real outcomes violate the low-rank assumption enough that the two-stage method's final policy regret or identification error exceeds that of one-shot tensor completion or unstructured best-arm selection.
Original abstract
Digital firms routinely run many online experiments on shared user populations. When product decisions are compositional, such as combinations of interface elements, flows, messages, or incentives, the number of feasible interventions grows combinatorially, while available traffic remains limited. Overlapping experiments can therefore generate interaction effects that are poorly handled by decentralized A/B testing. We study how to design large-scale factorial experiments when the objective is not to estimate every treatment effect, but to identify a high-performing policy under a fixed experimentation budget. We propose a two-stage design that centralizes overlapping experiments into a single factorial problem and models expected outcomes as a low-rank tensor. In the first stage, the platform samples a subset of intervention combinations, uses tensor completion to infer performance on untested combinations, and eliminates weak factor levels using estimated marginal contributions. In the second stage, it applies sequential halving to the surviving combinations to select a final policy. We establish gap-independent simple-regret bounds and gap-dependent identification guarantees showing that the relevant complexity scales with the degrees of freedom of the low-rank tensor and the separation structure across factor levels, rather than the full factorial size. In an offline evaluation based on a product-bundling problem constructed from 100 million Taobao interactions, the proposed method substantially outperforms one-shot tensor completion and unstructured best-arm benchmarks, especially in low-budget and high-noise settings. These results show how centralized, policy-aware experimentation can make combinatorial product design operationally feasible at platform scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage policy-aware design for large-scale factorial experiments under fixed budgets. Expected outcomes are modeled as a low-rank tensor; stage 1 samples a subset of combinations, applies tensor completion to infer untested outcomes, and eliminates weak factor levels via estimated marginal contributions; stage 2 runs sequential halving on the surviving reduced set to identify a high-performing policy. It claims gap-independent simple-regret bounds and gap-dependent identification guarantees whose complexity scales with the low-rank tensor degrees of freedom and separation structure rather than the full factorial size. An offline evaluation on a product-bundling task derived from 100 million Taobao interactions shows substantial outperformance versus one-shot tensor completion and unstructured best-arm baselines, especially in low-budget noisy regimes.
Significance. If the central claims hold, the work offers a practical route to combinatorial policy search in platform-scale experimentation by exploiting low-rank structure to shrink the effective search space. The combination of theoretical regret and identification bounds with an offline evaluation on real interaction data is a clear strength; the results suggest centralized, policy-aware designs can make factorial product design feasible where decentralized A/B testing fails.
Major comments (2)
- [§4] §4 (Theoretical Analysis), gap-dependent identification result: the claimed scaling with low-rank degrees of freedom requires that stage-1 tensor completion error is smaller than the minimum separation gap to weak factor levels with high probability, so that elimination correctly reduces the search space. No explicit condition or bound relating completion error (controlled by sampling fraction and noise variance) to the gap is provided; without it the second-stage complexity can revert to the full factorial size, undermining the central complexity claim.
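One way to state the condition the report asks for, in hedged form (the notation here is assumed, not the paper's): if the completion error in every estimated marginal is below half the minimum separation, elimination cannot discard a strong level.

```latex
% Hypothetical statement of the missing condition.
% \hat{m}_k(\ell): stage-1 estimated marginal of level \ell of factor k;
% m_k(\ell): true marginal; \Delta_{\min}: smallest true-marginal gap between
% any surviving level and any eliminated level.
\[
  \max_{k,\ell}\,\bigl|\hat{m}_k(\ell) - m_k(\ell)\bigr| \;<\; \frac{\Delta_{\min}}{2}
  \quad\Longrightarrow\quad
  \hat{m}_k(\ell') > \hat{m}_k(\ell)
  \ \text{ whenever }\ m_k(\ell') \ge m_k(\ell) + \Delta_{\min},
\]
```

so every eliminated level is truly weak with the same probability with which the error bound holds; what remains is to bound the left-hand side in terms of sampling fraction, noise variance, and rank.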
- [§5] §5 (Empirical Evaluation), Taobao offline results: the reported outperformance in low-budget regimes is consistent with the method but does not isolate whether elimination succeeded (e.g., via reported fraction of correctly discarded levels or marginal-estimate error versus true gaps). This leaves open whether gains arise from accurate low-rank marginals or from other algorithmic features, weakening support for the policy-aware reduction mechanism.
Minor comments (2)
- [§2] Notation for the low-rank tensor and its degrees of freedom is introduced without a compact summary table; adding one would clarify how the complexity term is derived from the assumed rank.
- [Abstract] The abstract states that bounds are 'gap-independent' and 'gap-dependent' but does not list the precise assumptions on noise or sampling; a one-sentence qualifier would improve readability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which help clarify the conditions for our theoretical claims and strengthen the empirical support for the elimination mechanism. We address each major comment below and will incorporate the suggested revisions.
Point-by-point responses
Referee: [§4] §4 (Theoretical Analysis), gap-dependent identification result: the claimed scaling with low-rank degrees of freedom requires that stage-1 tensor completion error is smaller than the minimum separation gap to weak factor levels with high probability, so that elimination correctly reduces the search space. No explicit condition or bound relating completion error (controlled by sampling fraction and noise variance) to the gap is provided; without it the second-stage complexity can revert to the full factorial size, undermining the central complexity claim.
Authors: We agree that an explicit condition is required to ensure the tensor completion error is smaller than the minimum gap with high probability, so that the elimination step reliably reduces the search space and the complexity scales with the low-rank degrees of freedom. The current manuscript presents the gap-dependent bound under the assumption of correct elimination but does not derive the necessary sampling condition. In the revision we will add a supporting lemma in §4 that bounds the completion error in terms of sampling fraction, noise variance, and tensor rank, followed by an explicit assumption on the minimum gap (relative to this error) under which elimination succeeds w.h.p. This will make the claimed scaling rigorous without altering the core results. revision: yes
Referee: [§5] §5 (Empirical Evaluation), Taobao offline results: the reported outperformance in low-budget regimes is consistent with the method but does not isolate whether elimination succeeded (e.g., via reported fraction of correctly discarded levels or marginal-estimate error versus true gaps). This leaves open whether gains arise from accurate low-rank marginals or from other algorithmic features, weakening support for the policy-aware reduction mechanism.
Authors: We agree that additional diagnostics would better isolate the contribution of the stage-1 elimination step. The current evaluation demonstrates overall gains, especially in low-budget noisy regimes, but does not report the fraction of levels correctly discarded or the marginal estimation error versus true gaps. In the revised §5 we will add these metrics computed from the Taobao dataset (using the full interaction data as ground truth), including tables or figures showing elimination accuracy and marginal error as functions of budget. This will provide direct evidence for the policy-aware reduction mechanism. revision: yes
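The promised diagnostics are simple to compute once ground-truth marginals are available. A minimal sketch for one factor, with illustrative numbers standing in for quantities the full Taobao interaction data would provide:

```python
import statistics

# Illustrative stand-ins: per-level true contributions and stage-1 estimates
# for a single factor with 4 levels, of which 2 are kept.
true_marginals = [0.10, 0.40, 0.05, 0.35]
est_marginals = [0.12, 0.33, 0.08, 0.30]
keep_count = 2

truly_weak = set(sorted(range(4), key=lambda i: true_marginals[i])[: 4 - keep_count])
discarded = set(sorted(range(4), key=lambda i: est_marginals[i])[: 4 - keep_count])

# Fraction of correctly discarded levels, and marginal-estimate error.
elimination_accuracy = len(discarded & truly_weak) / len(truly_weak)
marginal_mae = statistics.mean(
    abs(e - t) for e, t in zip(est_marginals, true_marginals)
)

print(elimination_accuracy)      # 1.0: both weak levels (indices 0 and 2) discarded
print(round(marginal_mae, 4))    # 0.0425
```

Reporting these two numbers as functions of budget would directly tie the observed gains to the elimination mechanism.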
Circularity Check
No circularity; bounds derived independently from low-rank model and algorithm
Full rationale
The paper derives gap-independent simple-regret bounds and gap-dependent identification guarantees from the two-stage procedure (tensor completion followed by sequential halving) under the explicit low-rank tensor assumption on expected outcomes. The claimed scaling with tensor degrees of freedom and separation structure follows directly from standard analysis of the elimination step and halving algorithm, without reducing to fitted parameters or self-referential definitions. The offline Taobao evaluation uses external interaction data for validation rather than recycling the same fitted quantities. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the provided abstract or description; the central claims remain self-contained against the model assumptions.
Axiom & Free-Parameter Ledger
free parameters (1)
- tensor rank
axioms (1)
- Domain assumption: expected outcomes of intervention combinations form a low-rank tensor.
Reference graph
Works this paper leans on
- [1] Anindya Ghose, Xitong Guo, Beibei Li, and Yuanyuan Dang. “Empowering Patients Using Smart Mobile Health Platforms: Evidence From a Randomized Field Experiment”. In: Management Information Systems Quarterly 46.1 (Mar. 2022), pp. 151–192.