This document demonstrates the implementation of the Empirical Bootstrap methods discussed in Section 2 of the lecture notes. We will focus on estimating the population median \(\theta\) and constructing various confidence intervals.
We will generate synthetic data from a skewed distribution (Chi-squared) to highlight the differences between the interval methods, particularly how they handle asymmetry.
We simulate \(n=50\) observations from a \(\chi^2_3\) distribution. The true median of a \(\chi^2_k\) distribution is approximately \(k(1 - \frac{2}{9k})^3\) (the Wilson–Hilferty approximation).
n <- 50
df <- 3
X <- rchisq(n, df = df)
# True population median
true_median <- qchisq(0.5, df = df)
# Our estimator: Sample Median
theta_hat <- median(X)
cat("Sample Size (n):", n, "\n")
## Sample Size (n): 50
cat("Sample Median:", round(theta_hat, 4), "\n")
## Sample Median: 1.7681
cat("True Population Median:", round(true_median, 4), "\n")
## True Population Median: 2.366
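As a quick sanity check, the closed-form approximation quoted above can be compared against R's exact quantile function; nothing here is assumed beyond base R.

```r
# Check the approximation k * (1 - 2/(9k))^3 against the exact median
k <- 3
approx_median <- k * (1 - 2 / (9 * k))^3
exact_median  <- qchisq(0.5, df = k)
cat("Approximate median:", round(approx_median, 4), "\n")  # ~2.3815
cat("Exact median:", round(exact_median, 4), "\n")         # ~2.366
```

For \(k = 3\) the approximation is off by only about 0.015, which is why it is a common shortcut.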
As described in Section 2, we generate \(B\) bootstrap samples by sampling with replacement from the original data \(X_1, \dots, X_n\). For each sample, we compute the bootstrap estimator \(\hat{\theta}^*_n\).
B <- 2000 # Number of bootstrap replicates
# Generate B bootstrap samples and compute medians
# This creates the distribution of \hat{\theta}^*
boot_theta <- replicate(B, {
  X_star <- sample(X, size = n, replace = TRUE)
  median(X_star)
})
# Visualizing the Bootstrap Distribution
hist(boot_theta, breaks = 30, main = "Bootstrap Distribution of Sample Median",
     xlab = expression(hat(theta)^"*"), freq = FALSE, col = "lightblue")
abline(v = theta_hat, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Original Est."), col = c("red"), lty = 2, lwd = 2)
We can estimate the variance and Mean Squared Error (MSE) of our estimator using the bootstrap replicates.
# Bootstrap Mean
theta_bar_star <- mean(boot_theta)
# Bootstrap Variance
var_boot <- sum((boot_theta - theta_bar_star)^2) / (B - 1)
# Bootstrap MSE
# Note: this averages squared deviations of the bootstrap replicates from the
# original estimate \hat{\theta}_n, which stands in for the unknown \theta;
# it is a plug-in proxy for the true MSE of \hat{\theta}_n.
mse_boot <- mean((boot_theta - theta_hat)^2)
cat("Bootstrap Variance estimate:", round(var_boot, 5), "\n")
## Bootstrap Variance estimate: 0.22474
cat("Bootstrap MSE estimate:", round(mse_boot, 5), "\n")
## Bootstrap MSE estimate: 0.22836
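Because the data are simulated and the true density is known, we can also compare the bootstrap variance with the asymptotic variance of the sample median, \(1/(4 n f(\theta)^2)\). This check is only possible in a simulation setting; the block restates \(n\), the degrees of freedom, and the true median so it runs on its own.

```r
# Asymptotic variance of the sample median: 1 / (4 * n * f(theta)^2),
# where f is the true chi-squared density (known only because we simulated).
n <- 50; df <- 3                          # values from the simulation above
true_median <- qchisq(0.5, df = df)
f_at_median <- dchisq(true_median, df = df)
var_asymptotic <- 1 / (4 * n * f_at_median^2)
cat("Asymptotic variance of the median:", round(var_asymptotic, 5), "\n")
```

The bootstrap estimate will typically not match this exactly: the resampled median can only take values among the order statistics of the original sample, so the bootstrap variance of a median is comparatively noisy at \(n = 50\).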
We will now construct the \(1-\alpha\) confidence intervals (\(\alpha = 0.05\)) using the four methods described in Section 2.1.
alpha <- 0.05
This uses the \(\alpha/2\) and \(1-\alpha/2\) percentiles of the bootstrap distribution directly. It is transformation invariant, but it implicitly assumes that the bootstrap distribution of \(\hat{\theta}^*_n\) mirrors the sampling distribution of \(\hat{\theta}_n\).
# Quantiles of the bootstrap distribution
ci_percentile <- quantile(boot_theta, probs = c(alpha/2, 1 - alpha/2))
print(ci_percentile)
## 2.5% 97.5%
## 1.228389 2.872790
This method enforces symmetry around the original estimate \(\hat{\theta}_n\). We compute the absolute deviations \(D^{*(b)} = |\hat{\theta}^{*(b)}_n - \hat{\theta}_n|\) and take their \(1-\alpha\) quantile \(s_{1-\alpha}\).
# 1. Compute absolute deviations
D_star <- abs(boot_theta - theta_hat)
# 2. Find the (1 - alpha) quantile of these deviations
s_val <- quantile(D_star, probs = 1 - alpha)
# Using the strict notation from lecture notes:
ci_symmetric <- c(theta_hat - s_val, theta_hat + s_val)
print(ci_symmetric)
## 95% 95%
## 0.7497969 2.7863820
This relies on the asymptotic normality of the estimator. We use the standard normal critical value \(z_{1-\alpha/2}\) and the bootstrap estimate of the standard error.
z_score <- qnorm(1 - alpha/2)
se_boot <- sqrt(var_boot) # Derived from standard deviation of boot_theta
ci_normal <- c(theta_hat - z_score * se_boot,
               theta_hat + z_score * se_boot)
print(ci_normal)
## [1] 0.8389374 2.6972415
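As a cross-check on the hand-rolled intervals, the percentile and normal constructions are also available from the boot package (a recommended package bundled with R). This is a sketch: the `exists()` guard regenerates a sample only so the chunk can run standalone, and the endpoints will differ slightly from those above because fresh resamples are drawn.

```r
library(boot)  # recommended package, bundled with R

# Regenerate a sample only if this chunk is run outside the session above
if (!exists("X")) { set.seed(1); X <- rchisq(50, df = 3) }

med_fun <- function(data, idx) median(data[idx])   # statistic for boot()
boot_out <- boot(X, statistic = med_fun, R = 2000)
boot.ci(boot_out, conf = 0.95, type = c("norm", "perc"))
```

`boot.ci` also offers the basic and BCa types, which can be worth comparing on skewed data like this.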
This method generally offers the best coverage accuracy (\(O(n^{-1})\)). However, it requires a standard error estimate \(\hat{\sigma}\) for each bootstrap sample.
Note: for the median there is no simple plug-in \(\hat{\sigma}\); the asymptotic standard error involves the density \(f(\theta)\) at the median, so each bootstrap sample would need a kernel density estimate or an inner (double) bootstrap, both computationally expensive.
To illustrate the mechanics of the studentized interval without this complexity, we apply it to the sample mean of the same dataset, where \(\hat{\sigma} = s/\sqrt{n}\) is straightforward to compute.
# --- Switching to Mean for Studentized Demonstration ---
mu_hat <- mean(X)
sigma_hat <- sd(X) / sqrt(n) # Standard error of the mean
# Generate Studentized Bootstrap Statistics (T*)
t_star_replicates <- replicate(B, {
  # 1. Resample
  X_star <- sample(X, size = n, replace = TRUE)
  # 2. Compute bootstrap estimators
  mu_star <- mean(X_star)
  sigma_star <- sd(X_star) / sqrt(n)  # SE estimate for this specific bootstrap sample
  # 3. Compute t-statistic
  (mu_star - mu_hat) / sigma_star
})
# Determine quantiles of T*
# Note: the quantiles are swapped in the interval formula -- the upper
# quantile of T* gives the lower endpoint, and vice versa.
eta_lower <- quantile(t_star_replicates, probs = alpha/2)
eta_upper <- quantile(t_star_replicates, probs = 1 - alpha/2)
# Construct the interval
ci_studentized_mean <- c(mu_hat - sigma_hat * eta_upper,
                         mu_hat - sigma_hat * eta_lower)
cat("Studentized Interval for the MEAN:", ci_studentized_mean, "\n")
## Studentized Interval for the MEAN: 2.05061 3.495127
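For completeness, here is a minimal sketch of the double bootstrap mentioned earlier, applied to the median: an inner bootstrap estimates the standard error for each outer resample. The replicate counts `B_outer` and `B_inner` are hypothetical choices kept deliberately small, so this illustrates the mechanics rather than giving an accurate interval.

```r
# Sketch: studentized interval for the MEDIAN via a double bootstrap.
# WARNING: O(B_outer * B_inner) resamples -- expensive at realistic sizes.
if (!exists("X")) { set.seed(1); X <- rchisq(50, df = 3) }
n <- length(X); theta_hat <- median(X); alpha <- 0.05

B_outer <- 200   # outer resamples (small, for illustration only)
B_inner <- 50    # inner resamples used to estimate each SE

# SE of the median for the original sample, via a plain bootstrap
se_hat <- sd(replicate(B_inner, median(sample(X, n, replace = TRUE))))

t_star_med <- replicate(B_outer, {
  X_star  <- sample(X, size = n, replace = TRUE)
  # inner bootstrap: SE of the median within this particular resample
  se_star <- sd(replicate(B_inner, median(sample(X_star, n, replace = TRUE))))
  (median(X_star) - theta_hat) / se_star
})

eta <- quantile(t_star_med, probs = c(alpha/2, 1 - alpha/2))
ci_studentized_median <- c(theta_hat - se_hat * eta[2],
                           theta_hat - se_hat * eta[1])
print(ci_studentized_median)
```

In practice one would use far larger replicate counts, or a kernel density estimate of \(f(\hat{\theta})\) in place of the inner bootstrap.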
Finally, we visualize the intervals for the median computed in Sections 3.1–3.3.
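One compact way to draw the three median intervals on a common axis with base graphics (a sketch; the setup block merely recreates the objects from the sections above if the chunk is run standalone):

```r
# Standalone setup: recreate the intervals if not already in the session
if (!exists("ci_percentile")) {
  set.seed(1)
  n <- 50; X <- rchisq(n, df = 3)
  theta_hat <- median(X); true_median <- qchisq(0.5, df = 3); alpha <- 0.05
  boot_theta <- replicate(2000, median(sample(X, n, replace = TRUE)))
  ci_percentile <- quantile(boot_theta, probs = c(alpha/2, 1 - alpha/2))
  s_val <- quantile(abs(boot_theta - theta_hat), probs = 1 - alpha)
  ci_symmetric <- c(theta_hat - s_val, theta_hat + s_val)
  ci_normal <- theta_hat + c(-1, 1) * qnorm(1 - alpha/2) * sd(boot_theta)
}

# One horizontal segment per method, plus the point estimate and true value
cis <- rbind(Percentile = ci_percentile,
             Symmetric  = ci_symmetric,
             Normal     = ci_normal)
plot(NULL, xlim = range(cis), ylim = c(0.5, 3.5), yaxt = "n",
     xlab = expression(theta), ylab = "",
     main = "95% Bootstrap CIs for the Median")
axis(2, at = 1:3, labels = rownames(cis), las = 1)
segments(x0 = cis[, 1], y0 = 1:3, x1 = cis[, 2], lwd = 2)
abline(v = theta_hat, col = "red", lty = 2)     # original estimate
abline(v = true_median, col = "blue", lty = 3)  # true median (simulation only)
```

On skewed data the percentile interval typically sits asymmetrically around \(\hat{\theta}_n\), while the symmetric and normal intervals are centred on it by construction.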