Multiplier bootstrap and multi-stage estimator

Author: Yen-Chi Chen (University of Washington)
Date: 06/12/2026

Short conclusion

We should use the coupled multiplier bootstrap, i.e., the weight WiW_i of each observation stays the same throughout multiple stages. Do NOT use different weights at different stages for the same observation.


Introduction of multiplier bootstrap

The multiplier bootstrap is a popular method for assessing the uncertainty of an estimator from minimizing a (empirical) risk or solving a system of estimating equations.

We consider a simple regression model where our data consists of an univariate outcome YY and a (possibly multivariate) covariate XX. Our goal is to estimate the conditional mean m(x)=E(YX=x)m(x) = \mathbb{E}(Y|X=x).
Our data consists of IID
(X1,Y1),,(Xn,Yn).(X_1,Y_1),\cdots, (X_n,Y_n).

Suppose we place a model on m(x)m(x) that is indexed by a parameter θ\theta, i.e., our model is mθ(x)m_\theta(x). Our goal is then to estimate θ\theta from the data.
Consider the traditional least square estimator:
θ^n=argminθ1ni=1n(Yimθ(Xi))2,\hat \theta_n = {\sf argmin}_\theta \, \frac{1}{n}\sum_{i=1}^n (Y_i - m_\theta(X_i))^2,
where the quantity R^n(θ)=1ni=1n(Yimθ(Xi))2\hat R_n(\theta) = \frac{1}{n}\sum_{i=1}^n (Y_i - m_\theta(X_i))^2 is the empirical risk.
Under very mild conditions, we know that
n(θ^nθ)dN(0,Σ),\sqrt{n}(\hat \theta_n - \theta^*) \overset{d}{\rightarrow} N(0, \Sigma),
where θ\theta^* is the parameter that minimizes the population risk
θ=argminθR(θ),R(θ)=E[(Y1mθ(X1))2].\theta^* = {\sf argmin}_\theta R(\theta),\qquad R(\theta) = \mathbb{E}\left[(Y_1 - m_\theta(X_1))^2\right].

Conventionally, we work out the score and Hessian of R(θ)R(\theta) and derive the asymptotic covariance matrix Σ\Sigma and construct an estimator of it. Note that this estimator often takes a form of Σ^=U^Ω^U^T\hat \Sigma = \hat U \hat \Omega \hat U^T, so it is often known as a sandwich estimator.

The bootstrap procedure bypasses the need of doing the math to work out the estimator of Σ\Sigma. Instead, we use resampling technique to numerically compute an estimator of Σ\Sigma.

In particular, the multiplier bootstrap uses the following procedure to compute an estimator of Σ\Sigma:

Multiplier bootstrap
  1. We generate IID weights W1,,WnExp(1)W_1,\cdots, W_n\sim {\sf Exp}(1)[1]
  2. We compute the weighted least-square estimator:
    θ^n=argminθ1ni=1nWi(Yimθ(Xi))2.\hat \theta_n^* = {\sf argmin}_\theta \, \frac{1}{n}\sum_{i=1}^n W_i (Y_i - m_\theta(X_i))^2.
  3. Repeat the above two steps BB times, leading to
    θ^n(1),,θ^n(B)\hat \theta_n^{*(1)},\cdots, \hat \theta_n^{*(B)}
    estimators.
  4. We estimate the asymptotic covariance matrix using the sample covariance matrix of the BB bootstrap estimate:
    Σ^B=1Bb=1B(θ^n(b)θˉn,B)(θ^n(b)θˉn,B)T,θˉn,B=1Bb=1Bθ^n(b).\hat \Sigma^*_B = \frac{1}{B} \sum_{b=1}^B (\hat \theta_n^{*(b)} - \bar \theta_{n,B})(\hat \theta_n^{*(b)} - \bar \theta_{n,B})^T, \qquad \bar \theta_{n,B} = \frac{1}{B}\sum_{b=1}^B \hat \theta_n^{*(b)}.

A major benefit of the multiplier bootstrap (and other bootstrap variant) is that we do not need to work out the asymptotic covariance matrix and find an estimator to it. If we are able to compute the estimator once, we can easily modify the code with an extra for loop to assess the uncertainty.

Multiplier bootstrap versus empirical bootstrap

The empirical bootstrap samples with replacement of the original data to create a bootstrap sample and then compute the estimator using the bootstrap sample.

Both empirical and multiplier bootstrap are popular procedure in practice and in most cases, if one method works, we expect the other to work.
In terms of validity, empirical bootstrap is a lot easier to understand why it works:

Why empirical bootstrap works:

The bootstrap sample from the empirical bootstrap is IID from the empirical distribution of the original data. Since we know that the empirical distribution converges to the true CDF under very mild conditions, we expect the bootstrap sample behaves like a new sample from the target distribution.[2]

When multiplier bootstrap is better than empirical bootstrap?

The multipler bootstrap, however, becomes a better approach when we have incomplete observations and the same size is small. We will dsicuss more about this later.

Multi-stage estimator

The story becomes more complicated when we have a multi-stage estimator. In particular, we will consider a very simple missing-at-random (MAR) scenario.

Suppose now the response variable YY can be missing. So somtimes we only observe XX and sometimes we observe both (X,Y)(X,Y). To deal with missingness in our probability model, we introduce a binary response variable Ri{0,1}R_i\in\{0,1\} such that Ri=1R_i=1 if YiY_i is observed.

The (MAR) assumption assumes that the missingness of YY is conditionally independent of the actual value of YY given the covariate XX, i.e.,
YRX.Y\perp R |X.

Now we consider a very simple problem: estimating the marginal mean of YY, i.e., we want to estimate μE(Y)\mu \equiv \mathbb{E}(Y).

Apparently, the naive estimator of using the mean value of observed YY is not going to be consistent because this naive estimator
μ^naive=i=1nRiYij=1nRjPE(RY)E(R)E(Y)μ.\hat \mu_{\sf naive} = \frac{\sum_{i=1}^n R_i Y_i}{\sum_{j=1}^n R_j}\overset{P}{\rightarrow} \frac{\mathbb{E}(RY)}{\mathbb{E}(R)}\neq \mathbb{E}(Y)\equiv \mu.

Under the (MAR), it is known that we can use a regression adjustment method for estimating YY utilizing the fact that
μE(Y)=E(RY+(1R)Y)=E[RY+(1R)E(YR,X)]=E[RY+(1R)E(YX)]=E[RY+(1R)m(X)].\begin{align*} \mu&\equiv \mathbb{E}(Y)\\ & = \mathbb{E}(RY + (1-R)Y)\\ & = \mathbb{E}[RY + (1-R)\mathbb{E}(Y|R,X)]\\ & = \mathbb{E}[RY + (1-R)\mathbb{E}(Y|X)]\\ & = \mathbb{E}[RY + (1-R)m(X)]. \end{align*}
Therefore, the regression adjustment (RA) estimator suggests to use
μ^RA=1ni=1nRiYi+(1Ri)m^(Xi),\hat \mu_{\sf RA} = \frac{1}{n}\sum_{i=1}^n R_i Y_i + (1-R_i) \hat m(X_i),
where m^(Xi)\hat m(X_i) is an estimator of the outcome regression model m(x)=E(YX=x)m(x) = \mathbb{E}(Y|X=x).
Note that the regression model m(x)m(x) is called a nuisance parameter.

Interpretation of RA estimator:

The regression adjustment estimator μ^RA\hat \mu_{\sf RA} has a simple interpretation: when YiY_i is observed, we use its observed value; when YiY_i is missing, we use the outcome regression model m^(Xi)\hat m(X_i) to predict its value.

If we use a parametric model like in the previous section, this estimator will be m^(x)=mθ^n(x)\hat m(x) = m_{\hat \theta_n}(x), where
θ^n=argminθ1ni=1nRi(Yimθ(Xi))2.\hat \theta_n = {\sf argmin}_\theta \,\frac{1}{n}\sum_{i=1}^n R_i (Y_i - m_\theta(X_i))^2.
Namely, we estimate the parameter using the complete data.

Caution

One should be careful that this RA estimator works only if our parameter of interest is the marginal mean μE(Y)\mu \equiv\mathbb{E}(Y). If the parameter of interest is other quantity, we have to use a different outcome regression model.

For μ^RA\hat \mu_{\sf RA}, how can we estimate its uncertainty?

Continue: When multiplier bootstrap is better than empirical bootstrap?

When we have incomplete observations and sample size is small, multiplier bootstrap is generally better than the empirical bootstrap. This is because our bootstrap sample may not contain enoough observations to reliably estimate the nuisance (in the above example, the outcome regression model). When our copmlete observations have a small sample size, the empirical bootstrap sample may end up with no complete observations! So we cannot even produce a bootstrap estimate of the nuisance estimator. On the other hand, the multiplier bootstrap still works in this case.

Two versions of multiplier bootstrap

Recall that we have used data twice here:

Now we want to use the multiplier bootstrap. A common question people asked is:

Since we have two estimators, should we use the same or different weights for each observation for the two estimators θ^n\hat\theta_n and μ^RA\hat \mu_{\sf RA}?

Method 1 (correct): coupled multiplier bootstrap (same weight).

The simple answer to this problem is: we use the same weight for the same observation in both stages.
Namely, we only generate one set of IID random weights W1,,WnW_1,\cdots, W_n. Whenever (Xi,Yi,Ri)(X_i, Y_i, R_i) is used (in any stage), we always use the same weight WiW_i.

Why do we have to use the same weight? This is because the bootstrap attempts to recover the uncertainty of the total estimator. In the original estimator, the same individual is used in both stages. This creates a dependency between the nuisance estimator and the regression adjustment estimator. We want the bootstrap to recover such dependency.

Method 2 (wrong): different weights for different stages.

If we use different weights, say Wi,1W_{i,1} when (Xi,Yi,Ri)(X_i, Y_i, R_i) is used for estimating θ^n\hat \theta_n, and Wi,2W_{i,2} when (Xi,Yi,Ri)(X_i, Y_i, R_i) is used for θ^RA\hat \theta_{\sf RA}, this introduces independence between the two stages, so it underestimate the uncertainty in θ^RA\hat \theta_{\sf RA}.

Here is a simple simulation illustrating this problem; the code for this experiment is at the end of this note:

The black bar indicates the true errors. You can clearly see that the coupled multiplier bootstrap--we use the same weights for the two stages--recovers the true uncertainty while the independent multiplier bootstrap fails to recover it.

While here we consider the regression adjustment method, the same result occurs when we are using the inverse probability weighting estimator. The correct way of performing a multiplier bootstrap is to use the same weight in both stages.

Doubly-robust/one-step estimator

One may be wondering if the same biased result occurs for doubly-robust (DR) estimator. It turns out that if we are using DR estimators, using the different weights actually does not lead to a bias uncertainty estimate. However, using the correct procedure, the same weights across different stages, has a better finite sample performance.

Formally, the DR estimator is
μ^DR=1ni=1nYiRiπβ^(Xi)+mθ^n(Xi)(1Riπβ^n(Xi)),\hat \mu_{\sf DR} = \frac{1}{n}\sum_{i=1}^n \frac{Y_i R_i}{\pi_{\hat \beta}(X_i)} + m_{\hat \theta_n}(X_i) \left(1- \frac{R_i}{\pi_{\hat \beta_n}(X_i)}\right),
where mθ(x)m_\theta(x) is the regression model on m(X)=E(YX=x,R=1)m(X) = \mathbb{E}(Y|X=x, R=1) and πβ(x)\pi_\beta(x) is the propensity score model on π(x)=P(R=1X=x)\pi(x) = P(R=1|X=x). A common model for the propensity score is the logistic regresion: πβ(x)=eβTx1+eβTx\pi_\beta(x) = \frac{e^{\beta^Tx}}{ 1+e^{\beta^Tx}} and we estimate β\beta by a logistic regression of RXR\sim X.

In this case, the estimator is a three-stage estimator since we use data three times:

We consider again two variants of the multiplier bootstrap:

The following shows the result of using these two methods, assuming that both models are correct.

Result

Now both methods are asymptotically valid when both models are correct while the coupled multiplier bootstrap is preferred since it approximates the actual uncertainty better.

Why?

The reason why both methods are valid is due to the fact that when both outcome regression and propensity score models are correct, the errors from these two nuisance models is negligible compared to the errors due to the final averaging process. Thus, while the independent multiplier bootstrap ignores the actual dependency of estimating the nuisance, such dependency contributes little to the overall uncertainty. So an inconsistent estimate of these uncertainties is fine.

However, despite the fact that both methods approximate the actual uncertainty, we still see that method 1, the coupled multipler bootstrap, is preferred since it better recovers the entire uncertainty, so it has a superior finite sample performance.

Appendix: R scripts

These R scripts are written by Gemini 3.1 pro.

For the regression adjustment estimator

# -------------------------------------------------------------------------
# 1. Data Generation
# -------------------------------------------------------------------------
set.seed(42)

n <- 1000
X <- runif(n, 0, 1)
E <- rnorm(n, 0, 1)

# True outcome
Y <- 3 + 4 * X + E

# Missingness mechanism (MAR)
# Note: Prompt stated P(M=1|Y) but formula only depends on X. 
# We implement the formula as given, which correctly satisfies MAR.
prob_missing <- exp(2 * X) / (1 + exp(2 * X))

# Let M = 1 denote missing, M = 0 denote observed
M <- rbinom(n, 1, prob_missing)

# Create the observed Y vector (NA where M == 1)
Y_obs <- Y
Y_obs[M == 1] <- NA

data <- data.frame(X, Y_obs, M)

# -------------------------------------------------------------------------
# 2. Point Estimation (Original Data)
# -------------------------------------------------------------------------

# [Outcome regression]
fit_or <- lm(Y_obs ~ X, data = data)

# [Propensity score] (Included per prompt, though not strictly needed for g-comp)
fit_ps <- glm(M ~ X, data = data, family = binomial(link = "logit"))

# [g-compute] Estimator
# Impute missing Y using the outcome regression
Y_imputed <- data$Y_obs
missing_idx <- which(data$M == 1)
Y_imputed[missing_idx] <- predict(fit_or, newdata = data[missing_idx, ])

# Point estimate for the mean of Y
mu_hat <- mean(Y_imputed)
cat("Point Estimate (mu_hat):", round(mu_hat, 4), "\n\n")

# -------------------------------------------------------------------------
# 3. Multiplier Bootstrap Setup
# -------------------------------------------------------------------------
B <- 500 # Number of bootstrap replicates

# Initialize vectors to store bootstrap estimates
boot_coupled <- numeric(B)
boot_indep <- numeric(B)

for (b in 1:B) {
  
  # Generate weights ~ Exp(1)
  W_coupled <- rexp(n, rate = 1)
  
  # For the independent variant, we need a second set of weights
  W_indep_OR <- rexp(n, rate = 1)
  W_indep_gcomp <- rexp(n, rate = 1)
  
  # =========================================================
  # Variant 1: Coupled Multiplier Bootstrap
  # =========================================================
  # 1. Weighted Outcome Regression
  # (lm automatically drops NAs, so we just pass the full data and weights)
  fit_or_coupled <- lm(Y_obs ~ X, data = data, weights = W_coupled)
  
  # 2. Impute missing values
  Y_imp_coupled <- data$Y_obs
  Y_imp_coupled[missing_idx] <- predict(fit_or_coupled, newdata = data[missing_idx, ])
  
  # 3. Weighted g-computation
  boot_coupled[b] <- sum(W_coupled * Y_imp_coupled) / sum(W_coupled)
  
  # =========================================================
  # Variant 2: Independent Multiplier Bootstrap
  # =========================================================
  # 1. Weighted Outcome Regression (using W1)
  fit_or_indep <- lm(Y_obs ~ X, data = data, weights = W_indep_OR)
  
  # 2. Impute missing values
  Y_imp_indep <- data$Y_obs
  Y_imp_indep[missing_idx] <- predict(fit_or_indep, newdata = data[missing_idx, ])
  
  # 3. Weighted g-computation (using W2)
  boot_indep[b] <- sum(W_indep_gcomp * Y_imp_indep) / sum(W_indep_gcomp)
  
}

# -------------------------------------------------------------------------
# 4. The "Gold Standard" Monte Carlo Truth
# -------------------------------------------------------------------------
# We simulate the true sampling variance by generating 1000 entirely new datasets

M_sims <- 1000
true_sampling_distribution <- numeric(M_sims)

for (m in 1:M_sims) {
  # Re-run the exact data generating process
  X_sim <- runif(n, 0, 1)
  E_sim <- rnorm(n, 0, 1)
  Y_sim <- 3 + 4 * X_sim + E_sim
  prob_missing_sim <- exp(2 * X_sim) / (1 + exp(2 * X_sim))
  M_sim <- rbinom(n, 1, prob_missing_sim)
  
  Y_obs_sim <- Y_sim
  Y_obs_sim[M_sim == 1] <- NA
  data_sim <- data.frame(X = X_sim, Y_obs = Y_obs_sim, M = M_sim)
  
  # Point Estimate
  fit_or_sim <- lm(Y_obs ~ X, data = data_sim)
  Y_imputed_sim <- data_sim$Y_obs
  missing_idx_sim <- which(data_sim$M == 1)
  Y_imputed_sim[missing_idx_sim] <- predict(fit_or_sim, newdata = data_sim[missing_idx_sim, ])
  
  true_sampling_distribution[m] <- mean(Y_imputed_sim)
}

true_se <- sd(true_sampling_distribution)

cat("\n--- The Moment of Truth ---\n")
cat("True Monte Carlo SE (Gold Standard):", round(true_se, 4), "\n")
cat("Coupled Multiplier Bootstrap SE:    ", round(se_coupled, 4), " (Accurate)\n")
cat("Independent Multiplier Bootstrap SE:", round(se_indep, 4), " (Underestimated)\n")

For the doubly-robust estimator

# -------------------------------------------------------------------------
# 1. Data Generation
# -------------------------------------------------------------------------
set.seed(42)

n <- 1000
X <- runif(n, 0, 1)
E <- rnorm(n, 0, 1)

Y <- 3 + 4 * X + E
prob_missing <- exp(2 * X) / (1 + exp(2 * X))

M <- rbinom(n, 1, prob_missing)
R <- 1 - M # R = 1 means Observed, R = 0 means Missing

Y_obs <- Y
Y_obs[R == 0] <- NA

data <- data.frame(X, Y_obs, M, R)

# -------------------------------------------------------------------------
# 2. Point Estimation (Doubly Robust)
# -------------------------------------------------------------------------

# [Outcome regression] m_hat(X)
fit_or <- lm(Y_obs ~ X, data = data)
m_hat <- predict(fit_or, newdata = data)

# [Propensity score] pi_hat(X) = P(R=1 | X)
fit_ps <- glm(R ~ X, data = data, family = binomial(link = "logit"))
pi_hat <- predict(fit_ps, type = "response", newdata = data)

# [Doubly Robust Estimator]
# Note: When R=0, Y_obs is NA. We use ifelse to safely evaluate 0 * NA as 0.
dr_terms <- m_hat + ifelse(data$R == 1, (1 / pi_hat) * (data$Y_obs - m_hat), 0)
mu_hat_dr <- mean(dr_terms)

cat("DR Point Estimate (mu_hat):", round(mu_hat_dr, 4), "\n\n")

# -------------------------------------------------------------------------
# 3. Multiplier Bootstrap Setup
# -------------------------------------------------------------------------
B <- 500

boot_coupled_dr <- numeric(B)
boot_indep_dr <- numeric(B)

for (b in 1:B) {
  
  # Generate weights ~ Exp(1)
  W_coupled <- rexp(n, rate = 1)
  
  # For independent, we draw separate weights for OR, PS, and Final Mean
  W_indep_OR <- rexp(n, rate = 1)
  W_indep_PS <- rexp(n, rate = 1)
  W_indep_FINAL <- rexp(n, rate = 1)
  
  # =========================================================
  # Variant 1: Coupled Multiplier Bootstrap (DR)
  # =========================================================
  # Fit models using the SAME weights
  fit_or_coupled <- lm(Y_obs ~ X, data = data, weights = W_coupled)
  # quasibinomial suppresses warnings for non-integer weights
  fit_ps_coupled <- glm(R ~ X, data = data, weights = W_coupled, family = quasibinomial) 
  
  m_hat_coupled <- predict(fit_or_coupled, newdata = data)
  pi_hat_coupled <- predict(fit_ps_coupled, type = "response", newdata = data)
  
  dr_terms_coupled <- m_hat_coupled + ifelse(data$R == 1, (1 / pi_hat_coupled) * (data$Y_obs - m_hat_coupled), 0)
  
  # Weighted mean
  boot_coupled_dr[b] <- sum(W_coupled * dr_terms_coupled) / sum(W_coupled)
  
  # =========================================================
  # Variant 2: Independent Multiplier Bootstrap (DR)
  # =========================================================
  # Fit models using INDEPENDENT weights
  fit_or_indep <- lm(Y_obs ~ X, data = data, weights = W_indep_OR)
  fit_ps_indep <- glm(R ~ X, data = data, weights = W_indep_PS, family = quasibinomial)
  
  m_hat_indep <- predict(fit_or_indep, newdata = data)
  pi_hat_indep <- predict(fit_ps_indep, type = "response", newdata = data)
  
  dr_terms_indep <- m_hat_indep + ifelse(data$R == 1, (1 / pi_hat_indep) * (data$Y_obs - m_hat_indep), 0)
  
  # Weighted mean using the third independent weight
  boot_indep_dr[b] <- sum(W_indep_FINAL * dr_terms_indep) / sum(W_indep_FINAL)
}

# -------------------------------------------------------------------------
# 4. The "Gold Standard" Monte Carlo Truth
# -------------------------------------------------------------------------
M_sims <- 1000
true_sampling_distribution_dr <- numeric(M_sims)

for (m in 1:M_sims) {
  X_sim <- runif(n, 0, 1)
  E_sim <- rnorm(n, 0, 1)
  Y_sim <- 3 + 4 * X_sim + E_sim
  prob_missing_sim <- exp(2 * X_sim) / (1 + exp(2 * X_sim))
  M_sim <- rbinom(n, 1, prob_missing_sim)
  R_sim <- 1 - M_sim
  
  Y_obs_sim <- Y_sim
  Y_obs_sim[R_sim == 0] <- NA
  data_sim <- data.frame(X = X_sim, Y_obs = Y_obs_sim, R = R_sim)
  
  # Fit on simulated data
  fit_or_sim <- lm(Y_obs ~ X, data = data_sim)
  fit_ps_sim <- glm(R ~ X, data = data_sim, family = binomial)
  
  m_hat_sim <- predict(fit_or_sim, newdata = data_sim)
  pi_hat_sim <- predict(fit_ps_sim, type = "response", newdata = data_sim)
  
  dr_terms_sim <- m_hat_sim + ifelse(data_sim$R == 1, (1 / pi_hat_sim) * (data_sim$Y_obs - m_hat_sim), 0)
  true_sampling_distribution_dr[m] <- mean(dr_terms_sim)
}

# -------------------------------------------------------------------------
# 5. Results
# -------------------------------------------------------------------------
se_coupled_dr <- sd(boot_coupled_dr)
se_indep_dr <- sd(boot_indep_dr)
true_se_dr <- sd(true_sampling_distribution_dr)

cat("--- DR Estimator Standard Errors ---\n")
cat("True Monte Carlo SE:                ", round(true_se_dr, 4), "\n")
cat("Coupled Multiplier Bootstrap SE:    ", round(se_coupled_dr, 4), "\n")
cat("Independent Multiplier Bootstrap SE:", round(se_indep_dr, 4), "\n")

  1. The exponetial random variable can be replaced by other random variable with mean 11 and variance 11. ↩︎

  2. A caveat here is: we need the parameter of interest to be a smooth quantity in the sense that changing the underlying population distribution will smoothly changes the parameter of interest; this is formally known as the Hadamard differentiability. ↩︎

  3. This is due to a property call Donsker class. If we use a nonparametric estimator for m^\hat m, we generally need to avoid using data twice. ↩︎