Efficient estimation of cumulative incidence curves via data fusion with surrogates: application to integrated analysis of vaccine trial and immunobridging data

Bo Zhang; Oliver Dukes; Pan Zhao; Peter B. Gilbert

arxiv: 2604.13265 · v1 · submitted 2026-04-14 · 📊 stat.ME · stat.AP


Pan Zhao, Peter B. Gilbert, Oliver Dukes, Bo Zhang

Pith reviewed 2026-05-10 14:09 UTC · model grok-4.3


keywords: cumulative incidence curves · data fusion · immunobridging · vaccine efficacy · surrogate endpoints · counterfactual estimation · multiply robust estimators · cause-specific incidence


The pith

Fusing data from a historical vaccine efficacy trial with data from immunobridging studies yields estimates of counterfactual cumulative incidence curves for variant vaccines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops statistical methods to combine participant-level records from an original large efficacy trial with data from smaller immunobridging studies that only measure immune responses. This fusion produces estimates of what disease incidence would have been if an updated vaccine had been tested in the original population. The approach includes efficient and multiply robust estimators and extends the estimation to cause-specific curves when pathogens have multiple serotypes. It is applied to real COVID-19 booster data to produce hypothetical incidence curves and to check whether the immune marker fully accounts for protection.

Core claim

We develop methods of inference for the counterfactual cumulative incidence curve using participant-level data from both a historical vaccine efficacy trial and an immunobridging study. We further extend these methods to pathogens with multiple serotypes by estimating cause-specific cumulative incidence curves. We describe the identification assumptions, propose efficient and multiply robust estimators, and assess their finite-sample performance through simulation studies.

What carries the argument

The efficient and multiply robust estimators that fuse participant-level data from the historical efficacy trial and the immunobridging study under the identification assumptions that link the two sources.
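As a rough illustration of the fusion idea, the sketch below implements a simple plug-in estimator. It is not the paper's efficient, multiply robust estimator. It assumes a binary outcome (disease by a fixed time, no censoring), a logistic outcome model, no direct vaccine effect beyond the marker, and the same covariate distribution in both studies. All function names, variable names, and the simulated data-generating process are hypothetical.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def design(x, s):
    """Design matrix with intercept, baseline covariate x, and marker s."""
    return np.column_stack([np.ones_like(x), x, s])

def fit_logistic(Z, y, n_iter=25):
    """Logistic regression by Newton-Raphson; Z must include an intercept column."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = expit(Z @ beta)
        grad = Z.T @ (y - p)
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def fused_plugin_incidence(x_hist, s_hist, y_hist, x_bridge, s_bridge):
    """Fit the outcome model on the historical trial, then average its
    predictions over the markers observed in the immunobridging study."""
    beta = fit_logistic(design(x_hist, s_hist), y_hist)
    return float(np.mean(expit(design(x_bridge, s_bridge) @ beta)))

# Hypothetical simulated data (assumed values, not from the paper).
rng = np.random.default_rng(0)
TRUE_BETA = np.array([-1.0, 0.3, -0.8])   # higher marker, lower risk
n_hist, n_bridge = 20_000, 2_000

# Historical efficacy trial: covariate, marker, and outcome all observed.
x_hist = rng.normal(size=n_hist)
s_hist = 1.0 + 0.5 * x_hist + rng.normal(size=n_hist)
y_hist = rng.binomial(1, expit(design(x_hist, s_hist) @ TRUE_BETA))

# Immunobridging study: updated vaccine shifts the marker; no outcomes observed.
x_bridge = rng.normal(size=n_bridge)
s_bridge = 2.0 + 0.5 * x_bridge + rng.normal(size=n_bridge)

estimate = fused_plugin_incidence(x_hist, s_hist, y_hist, x_bridge, s_bridge)
print(f"observed incidence, original vaccine: {y_hist.mean():.3f}")
print(f"fused plug-in incidence, updated vaccine: {estimate:.3f}")
```

The plug-in is consistent only if the outcome model is correctly specified, which is the weakness the paper's multiply robust estimators are meant to address.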

If this is right

The methods yield estimates of hypothetical cumulative incidence for a bivalent mRNA booster using data from the COVAIL trial.
The methods provide a way to test the assumption of no controlled direct effects of the vaccine beyond the surrogate.
The extension produces cause-specific cumulative incidence curves for multi-serotype pathogens such as dengue or influenza.
Simulation studies confirm good finite-sample performance of the proposed estimators under the stated assumptions.
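For readers less familiar with cause-specific cumulative incidence, the sketch below shows the standard nonparametric Aalen-Johansen estimator on a single sample with right censoring. It is offered only to make the estimand concrete; it is not the paper's fused estimator, and the function name and toy data are hypothetical.

```python
import numpy as np

def aalen_johansen(time, cause, n_causes):
    """Cause-specific cumulative incidence for competing risks.

    time  : observed follow-up times
    cause : 0 = censored, 1..n_causes = event type (e.g. serotype)
    Returns the distinct event times and an array whose row k holds the
    cumulative incidence of each cause at event time k.
    """
    time = np.asarray(time, dtype=float)
    cause = np.asarray(cause, dtype=int)
    event_times = np.unique(time[cause > 0])
    surv = 1.0                      # all-cause event-free probability just before t
    running = np.zeros(n_causes)
    cif = np.zeros((len(event_times), n_causes))
    for k, t in enumerate(event_times):
        at_risk = np.sum(time >= t)
        for j in range(1, n_causes + 1):
            d_j = np.sum((time == t) & (cause == j))
            running[j - 1] += surv * d_j / at_risk
        d_all = np.sum((time == t) & (cause > 0))
        surv *= 1.0 - d_all / at_risk
        cif[k] = running
    return event_times, cif

# Toy data: four participants, one censored at time 2.
times, cif = aalen_johansen([1, 2, 3, 4], [1, 0, 2, 1], n_causes=2)
print(times)
print(cif[-1])  # cause 1: 0.625, cause 2: 0.375 at the last event time
```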

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could shorten the timeline for updating vaccines against new variants by reducing the need for repeated large efficacy trials.
Similar data-fusion strategies might apply to other medical settings where surrogate endpoints are used to approve regimen changes.
If the no-direct-effects assumption holds across variants, the same historical trial data could support repeated immunobridging updates.

Load-bearing premise

The vaccine affects disease risk only through the measured surrogate immune marker, with no remaining direct effects on the clinical endpoint.
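One way to write this premise down, in notation assumed here rather than taken from the paper: let T(a, s) be the event time under regimen a with the marker set to s, X the baseline covariates, a = 1 the original vaccine, and a = 1′ the updated one. The display is a sketch of how the premise, together with randomization and transportability conditions not spelled out here, would link the two data sources.

```latex
% Assumed notation; a sketch, not the paper's exact statement.
% No controlled direct effects: once the marker is fixed at s, the regimen no longer matters.
P\{T(1', s) > t \mid X\} = P\{T(1, s) > t \mid X\} \quad \text{for all } s, t.

% Resulting expression for the counterfactual event-free probability:
P\{T(1') > t\}
  = E\left[ \int P(T > t \mid X, A = 1, S = s)\,
            f_{S \mid X, A = 1'}(s \mid X)\, ds \right].
```

In this reading, the inner conditional probability would be learned from the historical efficacy trial, the marker density from the immunobridging study, and the outer expectation taken over the covariates of the target population. The counterfactual cumulative incidence is one minus this quantity.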

What would settle it

A direct randomized comparison of the updated vaccine versus the original vaccine that produces disease incidence rates different from those predicted by the fused estimators.

Figures

Figures reproduced from arXiv: 2604.13265 by Bo Zhang, Oliver Dukes, Pan Zhao, Peter B. Gilbert.

**Figure 2.** Sampling distributions of the proposed estimators for P(T(A = 1′) > 5) across 4 × 4 = 16 data-generating processes. The true parameter value is indicated in each panel by a dashed red line.

**Figure 3.** Left panel: Boxplots and violin plots of Day 15 (D15) neutralizing antibody titers against Omicron BA.4/BA.5 among Stage-2 Prototype Pfizer-BioNTech vaccinees (blue), Stage-2 Omicron-containing Pfizer-BioNTech vaccinees (orange), and Stage-4 BA.4/5 + Prototype Pfizer-BioNTech vaccinees (purple). Right panel: Cumulative incidence curves for Stage-2 Prototype Pfizer-BioNTech vaccinees (blue) and Stage-2 Omic…

**Figure 4.** Cumulative incidence curves of Stage-2 Prototype Pfizer-BioNTech vaccinees (blue) and Stage-2 Omicron-containing Pfizer-BioNTech vaccinees (orange), along with the counterfactual cumulative incidence curve for Stage-2 Omicron-containing Pfizer-BioNTech vaccinees estimated under the no controlled direct effects assumption (purple). Dashed lines indicate 95% pointwise confidence intervals.

Original abstract

Refined vaccine regimens containing variant-matched inserts are often authorized based on historical phase 3 efficacy trials together with immunobridging studies. Phase 3 trials are essential for establishing immune biomarkers that reliably predict disease risk or vaccine efficacy against clinical endpoints. Once such immune correlates are identified, updated vaccine regimens can be approved through immunobridging designs that compare the immunogenicity of the updated regimen to that of an already-approved vaccine. We develop methods of inference for the counterfactual cumulative incidence curve using participant-level data from both a historical vaccine efficacy trial and an immunobridging study. We further extend these methods to pathogens with multiple serotypes -- such as dengue virus and influenza -- by estimating cause-specific cumulative incidence curves. We describe the identification assumptions, propose efficient and multiply robust estimators, and assess their finite-sample performance through simulation studies. We then apply the proposed methods to (1) estimating the hypothetical cumulative incidence curve for a bivalent mRNA booster and (2) testing a key assumption of no controlled direct effects, using data from the COVID-19 Variant Immunologic Landscape (COVAIL) Trial, a multistage randomized clinical study evaluating the safety and immunogenicity of a second COVID-19 booster dose.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives multiply robust estimators for fusing historical vaccine trial data with immunobridging studies to estimate counterfactual cumulative incidence curves, with the key no-direct-effect assumption as the main thing to watch.


The main contribution is a set of efficient, multiply robust estimators that combine participant-level data from a full historical efficacy trial (where both vaccine assignment and clinical outcome are observed) with an immunobridging study (where only the surrogate marker is observed under new regimens). The authors also extend the approach to cause-specific curves for pathogens with multiple serotypes. The identification assumptions are stated up front, the estimators are derived, finite-sample behavior is checked in simulations, and the methods are applied to the COVAIL trial data to estimate a hypothetical bivalent booster curve while testing the no-controlled-direct-effect assumption in that setting.

This framing is new for the vaccine-update context and builds directly on existing causal and survival tools without obvious circularity or invented entities. The simulations and real-data example give concrete evidence that the approach works when the assumptions hold.

The central soft spot is the no-controlled-direct-effect assumption itself. Any residual vaccine effect on incidence that bypasses the surrogate would bias the fused curves, and while the paper tests the assumption in COVAIL, the simulations do not include sensitivity checks for small violations. That makes the results conditional on a strong condition that may or may not transfer to other settings.

The methods look technically grounded and the citation pattern is appropriate for the subfield. This is for vaccine statisticians and causal-inference researchers who work on surrogate endpoints and immunobridging designs. A reader in that area would get practical value from the estimators and the COVAIL example. It deserves serious referee time because the problem is timely, the methods are tailored to it, and the authors supply both derivation and empirical checks.

Referee Report

2 major / 3 minor

Summary. The paper develops methods for estimating counterfactual cumulative incidence curves by fusing participant-level data from a historical vaccine efficacy trial (where both vaccine assignment and clinical outcomes are observed) with an immunobridging study (where only the surrogate immune marker is observed under new regimens). It extends the framework to pathogens with multiple serotypes via cause-specific cumulative incidence curves, states the required identification assumptions (including no controlled direct effects of the vaccine beyond the surrogate), proposes efficient and multiply robust estimators, evaluates finite-sample performance in simulations, and applies the methods to COVAIL trial data to estimate curves for a bivalent mRNA booster while testing the no controlled direct effects assumption.

Significance. If the identification assumptions hold, the fused estimators would enable more efficient inference for counterfactual vaccine efficacy curves without requiring new large-scale efficacy trials for each updated regimen, which is relevant for regulatory immunobridging. The multiply robust property and the cause-specific extension for multi-serotype pathogens are notable strengths. The simulation studies and the COVAIL application (including assumption testing) provide concrete evidence of applicability, though the central claims rest on the validity of the no controlled direct effects assumption.

major comments (2)

[Identification assumptions and simulation studies] The identification strategy (described in the methods section) relies on the no controlled direct effects assumption to link the historical trial and immunobridging data. While the paper tests this assumption in the COVAIL application, the simulations evaluate performance only under the assumption holding and do not include sensitivity analyses quantifying bias under plausible violations (e.g., unmeasured pathways or serotype-specific effects); this is load-bearing for the validity of all counterfactual curve estimates.
[Estimator derivation] The multiply robust estimators are proposed for the fused data setting. The manuscript should explicitly verify (perhaps via the influence function or asymptotic expansion) whether the multiple robustness property is preserved when the two data sources have different sampling mechanisms and missingness patterns, or whether additional conditions on the nuisance estimators are required.

minor comments (3)

[COVAIL application] In the real-data application, clarify how the error bars or confidence intervals for the estimated counterfactual curves account for the uncertainty from both data sources and the surrogate modeling.
[Extension to multiple serotypes] The notation for cause-specific cumulative incidence functions could be made more consistent across the single-serotype and multi-serotype sections to improve readability.
[Introduction] Add a brief discussion of how the methods compare to existing approaches for surrogate endpoint analysis in vaccine trials (e.g., principal stratification or mediation methods) to better situate the contribution.
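On the first minor comment, one generic way to let interval estimates reflect uncertainty from both data sources is to resample each source independently in a bootstrap. The sketch below shows that idea with a deliberately simple toy estimator (a linear outcome model evaluated at the mean marker level). It is not the paper's inference procedure, which presumably relies on the influence function; all names and simulated values are hypothetical.

```python
import numpy as np

def toy_fused_estimate(hist, bridge):
    """Toy plug-in: linear outcome model fit on the historical trial,
    evaluated at the mean marker level of the immunobridging study."""
    s_hist, y_hist = hist
    slope, intercept = np.polyfit(s_hist, y_hist, deg=1)
    return float(intercept + slope * np.mean(bridge))

def two_source_bootstrap_ci(hist, bridge, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap that resamples each data source separately,
    so the interval reflects sampling variability in both."""
    rng = np.random.default_rng(seed)
    s_hist, y_hist = hist
    n_h, n_b = len(s_hist), len(bridge)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n_h, size=n_h)   # resample historical participants
        j = rng.integers(0, n_b, size=n_b)   # resample bridging participants
        draws[b] = estimator((s_hist[i], y_hist[i]), bridge[j])
    alpha = 1.0 - level
    lo, hi = np.quantile(draws, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)

# Hypothetical simulated data: risk falls linearly with the marker.
rng = np.random.default_rng(1)
s_hist = rng.normal(1.0, 0.5, size=5_000)
y_hist = rng.binomial(1, np.clip(0.3 - 0.05 * s_hist, 0.0, 1.0))
bridge = rng.normal(2.0, 0.5, size=500)      # markers under the updated regimen

point = toy_fused_estimate((s_hist, y_hist), bridge)
lo, hi = two_source_bootstrap_ci((s_hist, y_hist), bridge, toy_fused_estimate)
print(f"estimate {point:.3f}, 95% interval ({lo:.3f}, {hi:.3f})")
```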

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped clarify key aspects of our work. We address each major comment below and indicate the revisions we will make.


Referee: [Identification assumptions and simulation studies] The identification strategy (described in the methods section) relies on the no controlled direct effects assumption to link the historical trial and immunobridging data. While the paper tests this assumption in the COVAIL application, the simulations evaluate performance only under the assumption holding and do not include sensitivity analyses quantifying bias under plausible violations (e.g., unmeasured pathways or serotype-specific effects); this is load-bearing for the validity of all counterfactual curve estimates.

Authors: We agree that sensitivity analyses under violations of the no controlled direct effects assumption would strengthen the manuscript. The current simulations are designed to evaluate consistency and efficiency when the assumption holds, which is the standard approach for establishing the method's properties under correct identification. In the revision, we will add a dedicated simulation study that introduces controlled direct effects (including serotype-specific pathways) and reports the resulting bias, variance, and coverage of the estimators. This will provide a more complete picture of the assumption's practical importance. revision: yes
Referee: [Estimator derivation] The multiply robust estimators are proposed for the fused data setting. The manuscript should explicitly verify (perhaps via the influence function or asymptotic expansion) whether the multiple robustness property is preserved when the two data sources have different sampling mechanisms and missingness patterns, or whether additional conditions on the nuisance estimators are required.

Authors: The multiple robustness property is preserved under the heterogeneous sampling and missingness patterns because the efficient influence function is derived by treating the two data sources as distinct strata with known sampling probabilities. The outcome regression and propensity score estimators are fitted separately within each source, and cross-fitting ensures the required orthogonality. No further conditions on the nuisance estimators are needed beyond those already stated for consistency. We will add an explicit verification, including a sketch of the asymptotic expansion, to the methods section in the revision. revision: yes
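The sensitivity analysis promised in the first response can be previewed with a small calculation. The sketch below isolates the identification bias by using the true outcome model as an oracle plug-in and then adding a direct effect `delta` on the log-odds scale that bypasses the marker. The data-generating values, the logistic form, and the function name are all assumptions made for illustration, not the authors' planned simulation design.

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

def identification_bias(delta, n=200_000, seed=0):
    """Bias of an oracle fused plug-in when the updated vaccine has a
    direct effect `delta` (log-odds scale) that bypasses the marker."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    s_new = 2.0 + 0.5 * x + rng.normal(size=n)   # marker under the updated vaccine
    lp = -1.0 + 0.3 * x - 0.8 * s_new            # risk model valid for the original vaccine
    plug_in = expit(lp).mean()                   # what the fused approach targets
    truth = expit(lp + delta).mean()             # actual incidence with a direct effect
    return float(plug_in - truth)

for delta in (-0.5, -0.25, 0.0, 0.25, 0.5):
    print(f"delta = {delta:+.2f}  bias = {identification_bias(delta):+.4f}")
```

A negative `delta` (extra protection not captured by the marker) makes the fused curve overstate incidence, and a positive `delta` makes it understate incidence; the bias vanishes only when `delta` is zero.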

Circularity Check

0 steps flagged

No circularity: derivation grounded in explicit identification assumptions and independent simulations


The paper states identification assumptions (including no controlled direct effects of vaccine beyond the surrogate), proposes multiply robust estimators for counterfactual cumulative incidence curves via data fusion, evaluates finite-sample performance in separate simulation studies, and applies the methods to COVAIL trial data while testing the key assumption. No equations or steps reduce a claimed prediction or result to a fitted input by construction, and no load-bearing premise collapses to a self-citation chain or ansatz smuggled from prior author work. The central contribution remains independent of its outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard causal identification assumptions for counterfactual estimation and data fusion that are invoked but not enumerated in the abstract; no free parameters or invented entities are described.

axioms (1)

Domain assumption: identification assumptions linking the historical trial and immunobridging data for the counterfactual cumulative incidence.
Required to identify the target curves from the fused observed data; mentioned explicitly in the abstract.



