Transformers as Bayesian In-Context Experimenters: Smoothness-Adaptive Efficient ATE Estimation

David Simchi-Levi; Jiachun Li

arxiv: 2606.31184 · v1 · pith:JQCDHXVGnew · submitted 2026-06-30 · 💻 cs.LG · cs.AI

Pith reviewed 2026-07-01 06:41 UTC · model grok-4.3

keywords: transformers · in-context learning · adaptive experiments · average treatment effects · Neyman allocation · Bayesian updating · mixture of experts · amortized inference

The pith

Transformers trained to imitate a Bayesian posterior Neyman teacher learn adaptive treatment allocations that converge to the oracle rule for efficient ATE estimation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that transformers can be trained via imitation learning to act as amortized policies for sequential randomized experiments. The training target is a Bayesian teacher that maintains nonparametric beliefs over potential outcomes and assigns posterior Neyman treatment probabilities based on observed history. When the smoothness of outcome functions is unknown, a mixture-of-experts architecture indexes separate experimenters by smoothness class and uses a gating network that concentrates on the appropriate expert. The authors prove that the resulting policy class has bounded complexity, so it can be learned by empirical risk minimization from supervised pretraining on teacher trajectories.

Core claim

Transformers constructively implement the mapping from experimental history to posterior Neyman treatment probabilities through attention-based sufficient statistics and projected gradient descent steps that imitate Bayesian updating for Gaussian-series priors. The resulting amortized policy converges to the oracle covariate-dependent Neyman allocation and supports efficient ATE inference. When smoothness is unknown, the mixture-of-experts transformer with a hierarchical-posterior gate concentrates on near-oracle experts and still delivers the efficiency gains.
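
For orientation, the oracle rule referred to here can be written out. The block below is a sketch of the standard Neyman allocation from the adaptive-experiment literature, not a formula quoted from the paper. The symbols \sigma_a^2(x) (arm-conditional outcome variance), \tau(x) (conditional treatment effect), \pi (treatment probability), and the plug-in estimate \hat\sigma_{a,t} are notation assumed for this illustration.

```latex
% Allocation-dependent asymptotic variance of an efficient ATE estimator
% under a treatment probability \pi(x) (notation assumed, not from the paper):
V(\pi) = \mathbb{E}\!\left[ \frac{\sigma_1^2(X)}{\pi(X)}
       + \frac{\sigma_0^2(X)}{1-\pi(X)}
       + \bigl(\tau(X)-\tau\bigr)^2 \right]

% Minimizing pointwise in \pi(x) gives the covariate-dependent Neyman rule:
\pi^\star(x) = \frac{\sigma_1(x)}{\sigma_1(x)+\sigma_0(x)},
\qquad
V(\pi^\star) = \mathbb{E}\!\left[ \bigl(\sigma_1(X)+\sigma_0(X)\bigr)^2
             + \bigl(\tau(X)-\tau\bigr)^2 \right]

% A posterior plug-in version replaces the unknown \sigma_a(x) by an estimate
% \hat\sigma_{a,t}(x) computed from the history up to round t:
\pi_t(x) = \frac{\hat\sigma_{1,t}(x)}{\hat\sigma_{1,t}(x)+\hat\sigma_{0,t}(x)}
```

How the paper's teacher forms its posterior estimate is not stated in the abstract, so the last line should be read as one plausible form rather than the authors' definition.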

What carries the argument

Bayesian in-context experimenter: a transformer policy trained to imitate the Bayesian posterior Neyman teacher by using attention to maintain sufficient statistics and projected gradient descent to perform the Bayesian update.
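
To make the mechanism concrete, here is a minimal sketch of why gradient steps on pooled history statistics can reproduce a Bayesian update. It assumes a cosine basis on [0, 1], a Gaussian-series prior with coefficient variances j^-(2s+1), known noise variance, and a Euclidean-ball projection; none of these specifics come from the paper, and every function name is hypothetical. The sums in `sufficient_stats` stand in for what an attention layer with uniform weights could pool from the history.

```python
import math
import random

def cosine_basis(x, num_terms):
    """Orthonormal cosine basis on [0, 1]: 1, sqrt(2)cos(pi x), sqrt(2)cos(2 pi x), ..."""
    return [1.0] + [math.sqrt(2.0) * math.cos(j * math.pi * x)
                    for j in range(1, num_terms)]

def sufficient_stats(xs, ys, num_terms):
    """Sums of phi*phi^T and phi*y over the history (uniform-attention style pooling)."""
    gram = [[0.0] * num_terms for _ in range(num_terms)]
    moment = [0.0] * num_terms
    for x, y in zip(xs, ys):
        phi = cosine_basis(x, num_terms)
        for j in range(num_terms):
            moment[j] += phi[j] * y
            for k in range(num_terms):
                gram[j][k] += phi[j] * phi[k]
    return gram, moment

def prior_variances(num_terms, smoothness):
    """Assumed Gaussian-series prior: theta_j ~ N(0, j^-(2s+1)) for j = 1..J."""
    return [float(j) ** -(2.0 * smoothness + 1.0) for j in range(1, num_terms + 1)]

def pgd_posterior_mean(gram, moment, prior_var, noise_var, steps=300, radius=10.0):
    """Projected gradient descent on the negative log posterior of the coefficients."""
    dim = len(moment)
    # Upper bound on the largest Hessian eigenvalue; its inverse is the step size.
    lipschitz = (sum(gram[j][j] for j in range(dim)) / noise_var
                 + max(1.0 / v for v in prior_var))
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [(sum(gram[j][k] * theta[k] for k in range(dim)) - moment[j]) / noise_var
                + theta[j] / prior_var[j] for j in range(dim)]
        theta = [t - g / lipschitz for t, g in zip(theta, grad)]
        norm = math.sqrt(sum(t * t for t in theta))
        if norm > radius:  # projection onto a Euclidean ball
            theta = [t * radius / norm for t in theta]
    return theta

def exact_posterior_mean(gram, moment, prior_var, noise_var):
    """Solve (G/noise_var + diag(1/prior_var)) theta = b/noise_var by Gaussian elimination."""
    dim = len(moment)
    a = [[gram[j][k] / noise_var + (1.0 / prior_var[j] if j == k else 0.0)
          for k in range(dim)] + [moment[j] / noise_var] for j in range(dim)]
    for col in range(dim):
        pivot = max(range(col, dim), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, dim):
            factor = a[row][col] / a[col][col]
            for k in range(col, dim + 1):
                a[row][k] -= factor * a[col][k]
    theta = [0.0] * dim
    for j in reversed(range(dim)):
        theta[j] = (a[j][dim] - sum(a[j][k] * theta[k] for k in range(j + 1, dim))) / a[j][j]
    return theta

def demo(seed=0, n=200, num_terms=4, smoothness=1.0, noise_var=0.25):
    """Simulate one arm's history and return (PGD estimate, exact posterior mean)."""
    rng = random.Random(seed)
    true_theta = [1.0, 0.5, -0.3, 0.1]  # illustrative coefficients only
    xs = [rng.random() for _ in range(n)]
    ys = [sum(t * p for t, p in zip(true_theta, cosine_basis(x, num_terms)))
          + rng.gauss(0.0, math.sqrt(noise_var)) for x in xs]
    gram, moment = sufficient_stats(xs, ys, num_terms)
    prior_var = prior_variances(num_terms, smoothness)
    return (pgd_posterior_mean(gram, moment, prior_var, noise_var),
            exact_posterior_mean(gram, moment, prior_var, noise_var))
```

In this conjugate Gaussian setting the posterior mean is the minimizer of a strongly convex quadratic, so the iterates approach it geometrically. Whether the same holds for the variance quantities the teacher needs is exactly the premise the review flags as load-bearing.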

If this is right

The learned policy converges to the oracle covariate-dependent Neyman rule as experimental data accumulates.
Efficient ATE inference remains valid even when outcome smoothness is unknown.
The mixture-of-experts gate functions as a hierarchical posterior and selects near-oracle experts.
The policy can be obtained via empirical risk minimization on supervised pretraining data generated by the teacher.
Attention mechanisms are sufficient to track the history statistics needed for the imitation task.
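
The first consequence can be illustrated with a toy sequential experiment. The sketch below is a stand-in, not the paper's procedure: it drops covariates, replaces the Bayesian teacher with plug-in sample standard deviations (Welford's online update), clips probabilities to an assumed range, and reports a plain difference in means purely for illustration. All names and default values are hypothetical.

```python
import math
import random

class RunningVariance:
    """Welford's online mean and variance for one treatment arm."""

    def __init__(self):
        self.count, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, y):
        self.count += 1
        delta = y - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (y - self.mean)

    def std(self):
        # Undefined until the arm has at least two observations.
        return math.sqrt(self.m2 / (self.count - 1)) if self.count > 1 else None

def run_adaptive_experiment(rounds, sigma1, sigma0, effect, seed=0, clip=0.1):
    """Allocate sequentially with plug-in Neyman probabilities.

    Returns the final treatment probability and a difference-in-means estimate.
    """
    rng = random.Random(seed)
    arms = {1: RunningVariance(), 0: RunningVariance()}
    prob = 0.5
    for _ in range(rounds):
        s1, s0 = arms[1].std(), arms[0].std()
        if s1 is None or s0 is None or s1 + s0 == 0.0:
            prob = 0.5  # not enough data yet: fall back to a fair coin
        else:
            prob = min(max(s1 / (s1 + s0), clip), 1.0 - clip)
        if rng.random() < prob:
            arms[1].update(effect + rng.gauss(0.0, sigma1))
        else:
            arms[0].update(rng.gauss(0.0, sigma0))
    return prob, arms[1].mean - arms[0].mean
```

With outcome standard deviations of 3 and 1, the oracle probability is 0.75, and the plug-in probability should drift toward it as rounds accumulate. Valid inference under adaptive allocation generally needs a more careful estimator than the difference in means used here.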

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same imitation approach could be applied to amortize other sequential experimental designs that require online variance estimation.
Deployment would require checking whether the transformer remains stable when the true outcome distributions depart from the Gaussian-series family used in training.
The construction suggests that in-context learning can serve as a general mechanism for amortizing nonparametric Bayesian updating in sequential statistical decisions.
Extensions could test whether the same architecture works for high-dimensional covariates or for designs that optimize criteria other than ATE precision.

Load-bearing premise

Attention-based sufficient statistics plus projected gradient descent inside the transformer can faithfully imitate the nonparametric Bayesian updating step of the teacher for Gaussian-series priors.

What would settle it

A held-out simulation in which the trained transformer produces treatment probabilities that deviate from the teacher's posterior Neyman allocations on new covariate sequences, or in which the resulting ATE estimator shows no variance reduction relative to a fixed equal-allocation design.
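
The two failure conditions can be turned into simple numerical checks. The helpers below are a hedged sketch with hypothetical names: one measures the largest gap between teacher and student treatment probabilities on held-out histories, and the other compares the allocation-dependent variance term, sigma1^2/p + sigma0^2/(1-p), under the student against equal allocation. The inputs (probabilities and standard-deviation pairs) are assumed to come from a simulation the reader supplies.

```python
def allocation_variance(prob, sigma1, sigma0):
    """Per-unit variance term sigma1^2/p + sigma0^2/(1-p) for treatment probability p."""
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return sigma1 ** 2 / prob + sigma0 ** 2 / (1.0 - prob)

def max_imitation_gap(teacher_probs, student_probs):
    """Largest absolute gap between teacher and student treatment probabilities."""
    if len(teacher_probs) != len(student_probs):
        raise ValueError("probability lists must have the same length")
    return max(abs(t - s) for t, s in zip(teacher_probs, student_probs))

def variance_ratio(student_probs, sigma_pairs):
    """Total variance term under the student divided by that under equal allocation.

    A ratio below 1 means the student's allocation improves on the 50/50 design.
    """
    student = sum(allocation_variance(p, s1, s0)
                  for p, (s1, s0) in zip(student_probs, sigma_pairs))
    equal = sum(allocation_variance(0.5, s1, s0) for s1, s0 in sigma_pairs)
    return student / equal
```

For example, with standard deviations 3 and 1 the variance term is 16 at the Neyman probability 0.75 and 20 at 0.5, a ratio of 0.8. A trained policy whose ratio stays near 1, or whose imitation gap does not shrink on new covariate sequences, would count against the paper's claims.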

Original abstract

Adaptive experiments for average treatment effects (ATE) require randomized allocations balancing valid inference with statistical efficiency. The oracle design is a covariate-dependent Neyman rule governed by unknown arm-conditional outcome variances. We investigate whether this sequential variance-estimation and allocation process can be amortized via in-context learning. We introduce Bayesian in-context experimenters: transformer policies trained to imitate a Bayesian posterior Neyman teacher. The teacher updates nonparametric beliefs over potential outcomes using experimental history to assign posterior Neyman treatment probabilities. This design converges to the oracle rule, supporting efficient ATE inference. Transformers constructively implement this mapping through attention-based sufficient statistics and projected gradient descent, imitating Bayesian updating for Gaussian-series priors. To address unknown outcome smoothness, we combine smoothness-indexed experimenters using a mixture-of-experts transformer. The gate acts as a hierarchical posterior over smoothness classes, concentrating on near-oracle experts. By bounding the complexity of the transformer class, we prove this amortized policy can be learned via empirical risk minimization using supervised pretraining. Experiments confirm accurate teacher imitation, adaptive allocation, and improved ATE precision over baselines.
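
The abstract's description of the gate as a hierarchical posterior over smoothness classes can be sketched in a simplified sequence-space model. The code below assumes noisy coefficients z_j distributed as N(0, j^-(2s+1) + 1/n) under smoothness class s and a uniform prior over a small grid of classes; the gate is then the softmax of log prior plus log marginal likelihood. This model, the grid, and the function names are illustrative assumptions, not the paper's construction.

```python
import math

def log_marginal(coeffs, smoothness, n):
    """Log marginal likelihood of coefficients z_j ~ N(0, j^-(2s+1) + 1/n)."""
    total = 0.0
    for j, z in enumerate(coeffs, start=1):
        var = float(j) ** -(2.0 * smoothness + 1.0) + 1.0 / n
        total += -0.5 * (math.log(2.0 * math.pi * var) + z * z / var)
    return total

def gate_weights(coeffs, smoothness_grid, n, log_prior=None):
    """Posterior over smoothness classes: softmax(log prior + log marginal likelihood)."""
    if log_prior is None:
        log_prior = [0.0] * len(smoothness_grid)  # uniform prior over classes
    scores = [lp + log_marginal(coeffs, s, n)
              for lp, s in zip(log_prior, smoothness_grid)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnorm = [math.exp(score - top) for score in scores]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

def mix_allocations(weights, expert_probs):
    """Gate-weighted average of the experts' treatment probabilities."""
    return sum(w * p for w, p in zip(weights, expert_probs))
```

When the observed coefficients decay at the rate implied by one class in the grid, the weights pile onto that class, which is the concentration behaviour the abstract describes. How fast this happens, and whether it is fast enough for efficient inference, is left open by the abstract.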

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper amortizes Bayesian Neyman allocation for adaptive ATE via transformer imitation learning and MoE over smoothness, but the architectural mechanism for faithful posterior imitation rests on a complexity bound rather than a direct convergence argument.

The core contribution is a transformer policy trained by imitation to mimic a Bayesian posterior Neyman teacher for covariate-dependent allocation, with a mixture-of-experts gate that selects among smoothness-indexed experts. This framing lets the authors prove that the policy class is learnable by ERM under supervised pretraining, and the experiments reportedly show tighter ATE estimates than standard baselines when smoothness is unknown.

What stands out is the constructive implementation claim: attention layers maintain sufficient statistics and internal projected gradient steps are said to imitate the nonparametric Bayesian update for Gaussian-series priors. The complexity bound on the transformer class is a concrete piece of theory that supports the amortization story.

The soft spot is the gap between learnability and convergence. The learnability result bounds the size of the function class but does not establish that the specific attention-plus-PGD mechanism converges to the correct infinite-dimensional posterior or that the resulting allocation error vanishes at a rate that preserves asymptotic efficiency. If that gap is not closed in the full proofs or simulations, the claim that the design converges to the oracle rule and delivers efficient inference remains partly aspirational.

The paper is aimed at researchers who already work on adaptive experimental design and want to explore amortized ML policies for it. A reader who cares about practical automation of Neyman allocation under unknown smoothness will find the setup worth examining, even if the imitation fidelity needs more verification.

I would send it to peer review. The framing and the complexity argument are substantive enough to merit referee time, provided the authors can address the mechanism gap.

Referee Report

3 major / 2 minor

Summary. The paper claims that transformers trained via imitation of a Bayesian posterior Neyman teacher can amortize covariate-dependent Neyman allocation for adaptive ATE estimation. The teacher performs nonparametric Bayesian updating over potential outcomes; the transformer implements this via attention-based sufficient statistics and internal projected gradient descent steps for Gaussian-series priors. Unknown smoothness is handled by a mixture-of-experts transformer whose gate acts as a hierarchical posterior over smoothness classes. A complexity bound on the transformer class establishes that the policy is learnable by ERM under supervised pretraining, and experiments are said to confirm accurate imitation, adaptive allocation, and improved ATE precision.

Significance. If the central claims hold, the work would supply an amortized, smoothness-adaptive policy for efficient adaptive experimentation that converges to the oracle Neyman rule without requiring explicit variance estimation at deployment time. The mixture-of-experts construction for hierarchical posterior concentration over smoothness indices and the explicit complexity bound for ERM learnability would be concrete strengths.

major comments (3)

[Proof of learnability (referenced in abstract)] The learnability proof bounds transformer complexity for ERM but does not establish that attention-based sufficient statistics plus internal PGD steps converge (in allocation error) to the nonparametric Bayesian posterior update for Gaussian-series priors at a rate sufficient to preserve asymptotic efficiency of the ATE estimator when smoothness is unknown.
[Teacher construction (abstract and §3)] It is unclear whether the teacher itself depends on fitted variances or other quantities that the transformer is then trained to reproduce; if so, the imitation setup risks circularity that would undermine the claim of convergence to the oracle rule.
[Mixture-of-experts construction (abstract and §4)] The mixture-of-experts gate is asserted to concentrate on near-oracle experts, but no argument is given that this concentration occurs fast enough to avoid bias in the final ATE estimator under smoothness misspecification.

minor comments (2)

[Abstract] The abstract supplies no equations, proof sketches, dataset descriptions, or quantitative results; these must appear in the main text with explicit section references.
[Notation and priors] Clarify the precise definition of Gaussian-series priors and their relation to the smoothness indices used in the mixture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive comments on our manuscript. We address each major comment point by point below, with clarifications and indications of revisions.

Referee: [Proof of learnability (referenced in abstract)] The learnability proof bounds transformer complexity for ERM but does not establish that attention-based sufficient statistics plus internal PGD steps converge (in allocation error) to the nonparametric Bayesian posterior update for Gaussian-series priors at a rate sufficient to preserve asymptotic efficiency of the ATE estimator when smoothness is unknown.

Authors: The complexity bound in the learnability result establishes that the transformer policy class is learnable by ERM under supervised pretraining. However, we agree that this bound does not include explicit rates showing convergence of the attention-based sufficient statistics and internal PGD steps to the nonparametric Bayesian posterior update in allocation error, nor does it address preservation of asymptotic efficiency for the ATE estimator under unknown smoothness. We will revise the relevant section to explicitly note this limitation and discuss its implications.

revision: partial

Referee: [Teacher construction (abstract and §3)] It is unclear whether the teacher itself depends on fitted variances or other quantities that the transformer is then trained to reproduce; if so, the imitation setup risks circularity that would undermine the claim of convergence to the oracle rule.

Authors: The teacher is constructed as an independent nonparametric Bayesian updater that maintains beliefs over potential outcomes and derives posterior Neyman allocations directly from experimental history and the model's posterior variances. It does not depend on any quantities fitted by the transformer. The transformer is trained solely to imitate this fixed teacher via supervised pretraining, so there is no circularity in the setup. We will add a clarifying paragraph in §3 to make this independence explicit.

revision: yes

Referee: [Mixture-of-experts construction (abstract and §4)] The mixture-of-experts gate is asserted to concentrate on near-oracle experts, but no argument is given that this concentration occurs fast enough to avoid bias in the final ATE estimator under smoothness misspecification.

Authors: The manuscript describes the MoE gate as implementing a hierarchical posterior over smoothness classes that concentrates on near-oracle experts, but we acknowledge that no formal rate argument is provided to ensure this concentration is sufficiently rapid to preclude bias in the ATE estimator under misspecification. We will revise §4 to include a discussion of this point, along with additional empirical results from the experiments demonstrating robustness.

revision: partial

Circularity Check

0 steps flagged

No circularity: derivation relies on explicit imitation design and standard ERM complexity bounds

The paper defines the transformer policy as an explicit imitation learner trained on trajectories from a separately specified Bayesian posterior Neyman teacher; the learnability result is a standard uniform convergence bound over a transformer hypothesis class whose complexity is bounded independently of the target posterior. No equation reduces the claimed convergence or efficiency to a fitted parameter or self-citation by construction, and the architectural claim (attention sufficient statistics plus internal PGD) is presented as a constructive mechanism rather than derived from the result itself. The mixture-of-experts gate over smoothness classes is likewise an explicit hierarchical design choice, not a tautological renaming of the oracle allocation.
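
The imitation setup described in this rationale, a student fitted by empirical risk minimization to labels from a separately specified teacher, can be shown in miniature. The sketch below is an assumption-laden toy: the teacher is the Neyman probability sigma1/(sigma1 + sigma0), the student is a one-parameter logistic model in the feature log(sigma1/sigma0), and the loss is cross-entropy against the teacher's probabilities. The paper's transformer class and loss are not specified in the abstract, and all names here are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def teacher_prob(sigma1, sigma0):
    """Stand-in teacher: Neyman probability sigma1 / (sigma1 + sigma0)."""
    return sigma1 / (sigma1 + sigma0)

def build_dataset():
    """Deterministic grid of (feature, teacher label) pairs; feature = log(sigma1/sigma0)."""
    sigmas = [0.5, 1.0, 2.0, 4.0]
    return [(math.log(s1 / s0), teacher_prob(s1, s0))
            for s1 in sigmas for s0 in sigmas]

def empirical_risk(weight, data):
    """Average cross-entropy between teacher probabilities and the student's."""
    total = 0.0
    for z, target in data:
        p = sigmoid(weight * z)
        total -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return total / len(data)

def fit_student(data, steps=2000, lr=0.5):
    """Gradient descent on the empirical risk of the student sigmoid(weight * z)."""
    weight = 0.0
    for _ in range(steps):
        grad = sum((sigmoid(weight * z) - target) * z for z, target in data) / len(data)
        weight -= lr * grad
    return weight
```

Because sigma1/(sigma1 + sigma0) equals sigmoid(log(sigma1/sigma0)), the teacher lies inside this student class and the fitted weight approaches 1. The teacher's labels never depend on the student's parameters, which is the sense in which the setup avoids circularity.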

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review; ledger entries are inferred from stated claims and are therefore provisional.

free parameters (1)

smoothness indices
The mixture-of-experts design requires a discrete set of smoothness classes whose choice is not derived from first principles in the abstract.

axioms (1)

domain assumption: The Bayesian teacher updates nonparametric beliefs over potential outcomes using experimental history to compute posterior Neyman probabilities.
This is the core modeling choice invoked to define the teacher that the transformer imitates.

invented entities (1)

Bayesian in-context experimenter (no independent evidence)
purpose: Transformer policy that amortizes the sequential variance-estimation and allocation process.
New named construct introduced to describe the trained transformer; no independent evidence outside the paper is supplied.

pith-pipeline@v0.9.1-grok · 5723 in / 1463 out tokens · 30295 ms · 2026-07-01T06:41:46.315592+00:00 · methodology

