UW STAT 581: Advanced Theory of Statistical Inference I (2026 Autumn) Chapter 1 Elementary Decision Theory

Decision rule. We consider a stochastic decision in the form of the conditional probability distribution $D(a|x)$. We call $D(a|x)$ a decision rule and denote $\mathcal{D}$ to be the collection of possible decision rules. When $\mathcal{A}$ is finite, we denote $\rho(a|x)$ to be the conditional probability of taking action $a$ when we observe $X=x$. When $\mathcal{A}$ is continuous, we denote $\rho(a |x)$ to be the conditional probability density of taking action $a$ when we observe $X=x$.

The decision rule includes non-random decision rules that we assign a probability of $1$ to a specific value $\delta(x) \in\mathcal{A}$ when $X=x$. For such decision rule, the underlying probability measure is a point mass at $\delta(x)$, so we will just denote such rule as $\delta(x)$.

Statistical decision theory tries to find the best decision rule inside $\mathcal{D}$. To evaluate the performance of a rule $D$, we consider the loss function \[L:\Theta\times \mathcal{A} \rightarrow \mathbb{R}.\] In the above decision model, we have two layers of randomness, the sampling randomness in $X$ from $p_\theta$, and the randomness in the decision rule $D(a|x)$. So the loss will be a random quantity, which is not ideal for evaluating the performance of a rule.

To resolve this problem, we consider the risk function, which is the expected loss: \[R(\theta, D) = \int_\mathcal{X}\int_{\mathcal{A}} L(\theta, a) D(da|x) p_\theta(x)dx = \begin{cases} \int \sum_{a\in\mathcal{A}} L(\theta, a) \rho(a|x) p_\theta(x)dx,\qquad &\mbox{if $\mathcal{A}$ is finite}\\ \int L(\theta, a) \rho(a|x)da \,p_\theta(x)dx,\qquad &\mbox{if $\mathcal{A}$ is continuous}. \end{cases}\]

With the risk function, we can then compare decision rules and discuss the optimal decision rule.

Example 1 (Point estimate). The conventional point estimation can be viewed as a decision problem. Suppose we observe $X_1,\cdots, X_n\sim N(\mu, 1)$ and our goal is to estimate the unknown parameter $\mu \in \mathbb{R}$. In this case, the action space $\mathcal{A} = \mathbb{R}$ is our estimate of $\mu$. A common loss function for this problem is a squared loss: \[L(\mu, a) = ( \mu-a)^2.\] The sample mean is a non-random decision rule that $\delta(x) = \bar X_n=\frac{1}{n}\sum_{i=1}^n X_i$. Namely, $D(a|x)$ is a point mass at $a = \frac{1}{n}\sum_{i=1}^n x_i$.

For a given decision rule $D$, its risk depends on the underlying parameter $\theta$, which is somewhat not ideal since the parameter $\theta$ is generally not observed. To remove the effect of $\theta$ in the risk function, we consider two frameworks:

Admissibility. A decision rule $D$ is called inadmissible if there exists another decision rule $D'$ such that $R(\theta, D') \leq R(\theta, D)$ for some $\theta$ and there is at least one $\theta' \in\Theta$ such that $R(\theta', D') < R(\theta', D)$. If a decision rule is not inadmissible, it is admissible.

Clearly, a decision rule $D$ we pick should be admissible, otherwise why not just choose the other rule $D'$ that making $D$ inadmissible? However, later we shall see that some conventional estimator, such as the sample mean, may not be admissible!

Example 3 (Bus waiting problem). Now we examine a simple example on bus waiting problem. Suppose I live near Seattle Children’s hospital. One morning I wake up very late, and the class will start in $7$ minutes. Fortunately, there is a bus stop just outside my home, so I have two possible ways to school: taking the bus or walking to school. Walking to school takes a constant $9$ minutes so I will always be late for $2$ minutes. Taking the buss may be faster but there is a huge uncertainty here. To simplify the problem, we assume that there are only four possible scenarios: no traffic (bus takes $3$ mins), light traffic (bus takes $6$ mins), regular traffic (bus takes $9$ mins), and heavy traffic (but takes $12$ mins). When I go outside, I can see the traffic and then make the decision on how to commute to school. My goal is to minimize the amount of late time.

In this case, the action space has two possible elements, so we denote it as a binary number: \[A = \begin{cases} 0,\qquad\mbox{if we take the bus},\\ 1,\qquad\mbox{if we walk}. \end{cases}\] The parameter space is the traffic scenario, which we denote it as \[\theta = \begin{cases} 3,\qquad\mbox{if no traffic},\\ 6,\qquad\mbox{if light traffic},\\ 9,\qquad\mbox{if regular traffic},\\ 12,\qquad\mbox{if heavy traffic}. \end{cases}\] And the loss function is the following table:

$L(\theta, a)$	$\theta = 3$	$\theta = 6$	$\theta = 9$	$\theta = 12$
$A=0$	0	0	2	5
$A=1$	2	2	2	2

Suppose my perception of the traffic outside of my home is a binary random variable $X$ such that \[X = \begin{cases} 0,\qquad\mbox{if no so many traffic},\\ 1,\qquad\mbox{if I see many traffic}. \end{cases}\] And for simplicity, we use a simple Bernoulli to model the distribution of $X$ given parameter $\theta$: \[X\sim {\sf Ber}(\theta/15).\]

Under this setup, now we investigate how to find a good decision rule.

For simplicity, we consider a non-random decision rule $D_\delta(x)$ that place a point mass at $\delta(x)\in\{0,1\}.$ For many problems, analyzing non-random decision rule is useful since a stochastic decision rule can be viewed as a randomized non-random decision rule.

One thing you may immediately notice is: the loss function is in fact independent of the data–it is purely a characteristic of the action and the parameter. However, it will be a random quantity if we have choose a decision rule $\delta(x)$. For a given $\theta$ and a decision rule $D_\delta(x)$, the risk is \[R(\theta, D_\delta) = \sum_{x=0,1} L(\theta, \delta(x)) P_\theta(x) =L(\theta, \delta(1)) \left(\frac{\theta}{15}\right) + L(\theta, \delta(0)) \left(1- \frac{\theta}{15}\right) .\] A straightforward calculation shows that \[\begin{align*} R(3,D_\delta) &= L(3,\delta(1))\times0.2 + L(3,\delta(0))\times0.8\\ R(6,D_\delta) &= L(6,\delta(1))\times0.4 + L(6,\delta(0))\times0.6\\ R(9,D_\delta) &= L(9,\delta(1))\times0.6 + L(9,\delta(0))\times0.4\\ R(12,D_\delta) &= L(12,\delta(1))\times0.8+ L(12,\delta(0))\times0.2. \end{align*}\]

Now we run into a problem: the risk value depends on $\theta$! Clearly, when $\theta = 3,6,9$, the optimal strategy is to choose $\delta(x) = 0$ regardless of $x=0$ or $x=1$. On the other hand, when $\delta=12$, the optimal strategy is to choose $\delta(x) = 1$, again regardless of $x=0$ or $x=1$.

The problem is: we do NOT know $\theta$, so we need a way to remove its effect and design a good rule $\delta(x)$ .

The Bayesian framework and the minimax framework offer solution to this problem. In the Bayesian framework, we impose a prior on $\theta$, so we can average these risks to obtain a good decision rule (known as the Bayes rule). In the minimax framework, we consider the worst case $\theta$ and try to find a rule to minimize this worst case scenario.

You can show that the minimax rule is $\delta_{MM}(x) = 1$ (we always walk to school) while the Bayes rule depends on the prior distribution. Suppose we consider a uniform prior, i.e., $P(\theta = 3)=P(\theta = 6)=P(\theta = 9)=P(\theta = 12)=\frac{1}{4}$, our Bayes rule is \[\delta_\pi(x) = \begin{cases} 0,\qquad\mbox{if $X=0$},\\ 1,\qquad\mbox{if $X=1$}. \end{cases}\] Namely, if we do not see traffic, we take the bus. If we see traffics, we walk to school.

You can easily show that both the minimax and the above Bayes rule are admissible. An inadmissible rule is: \[\delta_{\sf bad}(x) = \begin{cases} 1,\qquad\mbox{if $X=0$},\\ 0,\qquad\mbox{if $X=1$}. \end{cases}\] This rule $\delta_{\sf bad}(x)$ is dominated by the Bayes rule, i.e., $R(\theta, \delta_{\pi}(x))\leq R(\theta, \delta_{\sf bad}(x))$ for all $\theta$ and when $\theta=3,6,12$, we have $R(\theta, \delta_{\pi}(x))< R(\theta, \delta_{\sf bad}(x))$.

(Neyman-Pearson) Hypothesis testing

The hypothesis testing can be viewed as a decision problem. However, it is a constrained decision problem that is not in the conventional Bayesian or minimax framework.

In the hypothesis testing problem, we partition the parameter space into two subsets: $\Theta = \Theta_0 \cup \Theta_1$ and $\Theta_0 \cap \Theta_1 = \emptyset$, where $\Theta_0$ is the scenario where the null hypothesis is correct. Our action $A\in\{0,1\}$ is the decision to reject the null hypothesis or not. $A=1$ when we reject the null. The data is the random variable $X$ from a model $p_\theta$, so our decision rule of the test can be viewed as a binary random variable $D(X) \in \{0,1\}$ whose distribution depends on the value of $X$. Note that $D(X)$ can be non-random.

To associate the type-1 and type-2 errors to the loss function, we consider the following losses: \[L(\theta, a) = I(a = 1) I(\theta\in \Theta_0) + \ell_1 I(a = 0) I(\theta\in \Theta_1),\] where $\ell_1 >0$ is a constant. Namely, if we falsely reject null while it is true, the loss is $1$. If we fail to reject the null while it is false, the loss is $\ell_1$.

For a given decision rule, $D$, the type-1 error is \[\sup_{\theta \in \Theta_0}P(\mbox{reject } H_0; \theta) = \sup_{\theta \in \Theta_0} P(D(X)=1; \theta) = \sup_{\theta \in \Theta_0} \mathbb{E}\left[L(\theta, D)\right].\] The type-2 error is \[P(\mbox{fail to reject }H_0; \theta) = P(D(X)=0; \theta) =\mathbb{E}\left[L(\theta, D)\right]\] when $\theta \in\Theta_1$.

The Neymann-Pearson test framework tries to find the optimal decision rule such that \[\min_{D\in\mathcal{D}} \mathbb{E}\left[L(\theta, D)\right]\] for $\theta \in \Theta_1$ subject to the constraint that \[\sup_{\theta \in \Theta_0} \mathbb{E}\left[L(\theta, D)\right]\leq \alpha,\] i.e., the type-1 error is controlled at $\alpha$.

Point estimation: non-random decision rule

In point estimation problem, i.e., our goal is to estimate $\theta$, the action space reduces to the parameter space (our action is our estimator). In this case, it is very common that the loss function is convex in $a\in\mathcal{A}$ and $\mathcal{A}$ is a convex set. For instance, the square loss $L(\theta, a) = (\theta - a)^2$ or absolute loss $L(\theta, a) = |\theta - a|$ are both convex loss function.

Let $A(x)$ be the outcome of the random decision rule $D(\cdot |x)$. Clearly, $\delta(x) = \mathbb{E}(A(x))$. You can easily show that due to the Jensen’s inequality, the risk is \[\begin{equation*} R(\theta, D) = \int \mathbb{E}[L(\theta, A(x))]p(x)dx \geq \int L(\theta, \mathbb{E}(A(x))) \equiv \int L(\theta, \delta(x)) p(x)dx = R(\theta, \delta). \label{eq::point} \end{equation*}\] So the average action always has a lower risk for any $\theta$, which completes the proof.

This lays the foundation for why in point estimation, we are mostly considering non-random decision rule as our estimator. This echoes the idea of Rao-Blackwell’s theorem that once we know the sufficient statistics, including other information (statistics) will not improve the estimator and could even worsen the accuracy!

Therefore, in most of this note, we will focus on non-random decision rule. When considering a non-random decision rule, we will just use $\delta(x)$ in place of $D_\delta$ for abbreviation.

Bayesian inference

When using the Bayesian framework, we consider the Bayes risk given a prior distribution and the model generates our data $p_\theta(x)$ becomes a conditional distribution $p(x|\theta)$. For simplicity, we assume that the prior distribution $\Pi(\theta)$ has a density function $\pi(\theta)$ and write the expectation \[\mathbb{E}(\phi(\theta)) = \int \phi(\theta) \Pi(d\theta) = \int \phi(\theta) \pi(\theta) d\theta.\]

Now consider a non-random decision rule $D_\delta(x)$ that place a point mass at $\delta(x)$, the Bayes risk can be written as \[\begin{align*} R_\pi(D_\delta) &= \int_\Theta R(\theta, D_\delta) \pi(\theta)d\theta \\ &= \int_\Theta\int_\mathcal{X} L(\theta, \delta(x)) p(x|\theta)dx \pi(\theta) d\theta\\ & =\int_\mathcal{X} \left[\int_\Theta L(\theta, \delta(x)) \pi(\theta|x)d\theta\right] p(x)dx, \end{align*}\] where $p(x) = \int p(x|\theta) \pi(\theta) d\theta$ is the marginal probability of $X$. The above form is useful because the quantity inside the bracket, \[Q(a|x) = \int_\Theta L(\theta, a) \pi(\theta|x)d\theta\] can often be optimized separately for each $x$, making it easy to obtain the Bayes rule.

A key quantity in the above analysis is the conditional distribution $\pi(\theta|x),$ which is known as the posterior distribution and is a critical concept in the Bayesian inference. So here we will briefly introduce the Bayesian inference.

Let $X$ be the random variable representing our data and $p_\theta(x)$ be the model that generates $X$. In Bayesian inference, the parameter $\theta$ is viewed as a random variable from a prior distribution $\pi$. So the statistical model $p_\theta(x ) = p(x|\theta)$ can be viewed as a conditional distribution.

The prior distribution $\pi$ can be viewed as our belief of how likely $\theta$ to be at different values and the posterior distribution \[\pi(\theta|x) \equiv p(\theta|x) = \frac{p(x|\theta) \pi(\theta)}{p(x)}\] represents our belief of $\theta$ after seeing the data $X=x$.

Since the posterior distribution is a conditional distribution of $\theta$ given $x$, the value of $x$ is fixed for this distribution, so we have \[\pi(\theta|x) \propto p(x|\theta) \pi(\theta) \propto f(\theta, x),\] where $f(\theta,x)$ is called the kernel of the posterior distribution. Clearly, the kernel is not unique but once we know the kernel function, we can quickly obtain the posterior distribution via \[\pi(\theta|x) = \frac{f(\theta,x)}{ c_f(x)},\qquad c_f(x) = \int f(\theta,x) d\theta.\] A useful trick: we can just utilize the kernel function to find out what family the posterior distribution belongs to.

The prior $\pi(\theta)$ is called a conjugate prior to a statistical model $p(x|\theta)$ if the resulting posterior distribution and prior distribution belong to the same parametric family. Using a conjugate prior reduces the computation drastically since we only need to adjust the parameter values.

Example 5 (Conjugate prior 1: Beta-Binomial). Consider $X|\theta \sim {\sf Bin}(n, \theta)$ and $\theta \sim {\sf Beta}(\alpha, \beta)$, where $N,\alpha,\beta$ are given. Then the posterior distribution is \[\pi(\theta|x) \propto p(x|\theta) \pi(\theta) \propto {n\choose x} \theta^{x} (1-\theta)^{n-x} \cdot \frac{1}{B(\alpha,\beta)} \theta^{\alpha-1} (1-\theta)^{\beta-1} \propto \theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1}.\] So a kernel function to this posterior distribution is $f(\theta,x) = \theta^{\alpha+x-1} (1-\theta)^{\beta+n-x-1}$ and it is clear that this belongs to the form of a PDF of a Beta distribution with parameter $(\alpha+x, \beta+n-x)$. Thus, the posterior distribution of $\theta$ given $X=x$ is Beta distribution with parameter $(\alpha+x, \beta+n-x)$, which belongs to the same family as the prior distribution.

Example 6 (Conjugate prior 2: Poisson-Gamma). Consider $X|\theta \sim {\sf Poisson}(\theta)$ and $\theta \sim {\sf Gamma}(\alpha, \lambda)$. Then the posterior distribution is \[\pi(\theta|x) \propto p(x|\theta) \pi(\theta) \propto \frac{1}{x!} \theta^x e^{-\theta} \cdot \frac{\lambda^\alpha\theta^{\alpha-1}}{\Gamma(\alpha)} e^{-\lambda \theta} \propto \theta^{x+\alpha-1} e^{-\theta(\lambda+1)}.\] A kernel function is $f(\theta,x) = \theta^{x+\alpha-1} e^{-\theta(\lambda+1)}$, which clearly shows that it is a Gamma distribution with parameter $(\alpha+x, \lambda+1)$.

Example 7 (Conjugate prior 3: Multinomial-Dirichlet). Suppose $\theta\in\mathbb{R}^K$ satisfying $\theta_j \geq 0$ and $\sum_{j=1}^K\theta_j=1$ (this is called a $K$-simplex). Given $\theta$, the random vector $X\in\{0,1,\cdots, N\}^K$ is from a multinomial distribution ${\sf Mult}(N, \theta)$ if $\sum_{j=1}^K X_j = N$ and its joint PMF is \[p(x_1,\cdots, x_K|\theta) = \frac{N!}{x_1!\cdots x_K!} \prod_{j=1}^K\theta_j^{x_j}.\] You can easily see that the Binomial distribution is a special case of the multinomial distribution when $K=2$.

To place a prior on feasible values of $\theta$ ($K$-simplex), we consider the Dirichlet distribution: $\theta \sim {\sf Dir}(\alpha)$, where $\alpha\in \mathbb{R}_{> 0}^K$ and the PDF of $\theta$ is \[\pi(\theta) = \frac{1}{D(\alpha)} \prod_{j=1}^K \theta_j^{\alpha_j - 1},\] when $\theta_j \geq 0$ and $\sum_{j=1}^K \theta_j = 1$ and the density is $0$ outside this feasible range. The Dirichlet distribution can be viewed as a generalized Beta distribution–when $K=2$, the Dirichlet distribution reduces to Beta distribution.

In this case, the posterior distribution of $\theta$ given $X$ is \[\pi(\theta|x) \propto p(x|\theta) \pi(\theta) \propto \theta_j^{X_j + \alpha_j - 1},\] so it is Dirichlet distribution with parameter $\alpha+ X$.

Example 8 (Conjugate prior 4: Normal-normal). Suppose we have random variable $X|\theta\sim N(\theta,\sigma^2)$ and we place a prior distribution on $\theta$ as $\theta\sim N(\mu,\tau^2)$. We assume that $\sigma^2, \mu,\tau^2$ are known. In this case, the posterior distribution of $\theta$ given $X$ is \[\begin{align*} \pi(\theta|x) \propto p(x|\theta) \pi(\theta) \propto e^{-\frac{1}{2\sigma^2}(x-\theta)^2} \cdot e^{-\frac{1}{2\tau^2} (\theta-\mu)^2}, \end{align*}\] which is a normal distribution \[\begin{equation} \theta|X \sim N\left(\frac{1/\tau^2}{1/\tau^2 + 1/\sigma^2}\mu +\frac{1/\sigma^2}{1/\tau^2 + 1/\sigma^2}X, \frac{1}{1/\tau^2+1/\sigma^2}\right). \label{eq::NN} \end{equation}\]

Bayes rule

Recall that in the beginning of the previous section, we have the following decomposition of the Bayes risk for a non-random decision rule $\delta(x)$: \[R_\pi(\delta) =\int_\mathcal{X} \left[\int_\Theta L(\theta, \delta(x)) \pi(\theta|x)d\theta\right] p(x)dx.\]

Theorem 10 shows that a simple way to find a Bayes rule is to consider the conditional quantity $Q(a|x) = \int_\Theta L(\theta, a) \pi(\theta|x)d\theta$ and choose $\delta(x)$ that minimizes the risk. Specifically, for each $x$, consider \[\begin{equation} \delta_{\pi}(x) = {\sf argmin}_{a\in\mathcal{A}}\int_\Theta L(\theta, a) \pi(\theta|x)d\theta. \label{eq::BR} \end{equation}\] Then $\delta_{\pi}(x)$ is the Bayes rule for the prior distribution $\pi$.

Example 11 (Mean square loss). Consider the point estimation problem where we want to estimate $\Psi(\theta)\in\mathbb{R}$, a transformed quantity of the underlying parameter. Our action is our estimate. Consider the squared loss $L(\theta, a) = (\Psi(\theta) -a)^2$.

What will our Bayes rule be in this case?

By equation $\eqref{eq::BR}$, we only need to consider the conditional quantity \[Q(a|x) = \int_\Theta (\Psi(\theta) -a)^2 \pi(\theta|x)d\theta.\] Clearly, this is a quadratic function of $a$, so the minimizer of it satisfies the first-order condition: \[\frac{\partial}{\partial a} Q(a|x) = 0 = \int_\Theta \Psi(\theta) \pi(\theta|x)d\theta - a.\] So we conclude the Bayes rule is \[\delta_\pi(x) = \int_\Theta \Psi(\theta) \pi(\theta|x)d\theta ,\] which is the posterior mean.

Example 12 (Beta-Binomial revisited). We now revisit the Beta-Binomial problem again. Consider $X|\theta \sim {\sf Bin}(n, \theta)$ and $\theta \sim {\sf Beta}(\alpha, \beta)$, where $N,\alpha,\beta$ are given. Then the posterior distribution is a Beta distribution with parameter $(\alpha+x, \beta+n-x)$.

Suppose we are interested in estimating $\theta$.

Under the square loss, the Bayes rule will be the posterior mean of ${\sf Beta}(\alpha+x, \beta+n-x)$, i.e., \[\delta_\pi(x) = \frac{\alpha+x}{\alpha+\beta+n} = \left(\frac{\alpha}{\alpha+\beta}\right) \frac{\alpha+\beta}{\alpha+\beta+n} + \left(\frac{x}{n}\right) \frac{n}{\alpha+\beta+n} = \mbox{prior mean} \cdot W_n + \hat \theta_{MLE} \cdot (1-W_n),\] is a convex combination between the prior mean of $\theta$ and the maximum likelihood estimator $\hat \theta_{MLE} = x/n$ and the proportion of the prior $W_n = \frac{\alpha+\beta}{\alpha+\beta+n}$ is shrinking to $0$ as the sample size $n\rightarrow\infty$.

Minimax rule

For Bayes rule, Theorems 9 and 10 are powerful and useful for finding a practical rule. The minimax rule, unfortunately, does not have such nice direct theoretical results. However, if we can change the problem of searching a minimax rule into searching a Bayes rule, we can utilize the above two theorems.

Recall that in the minimax framework, we consider a minimax risk of a decision rule $D$ \[R^*(D) = \sup_{\theta \in\Theta} R(\theta, D).\] The minimax rule is $D_{MM}$ such that $R^*(D)$ is minimized, i.e., for any other rule $D'$, we have \[R^*(D_{MM}) \leq R^*(D').\] When we consider only the non-random rules, then the minimax rule is $\delta_{MM}(x)$ with \[\begin{equation} \sup_{\theta \in\Theta} R(\theta, \delta)\equiv R^*(\delta_{MM}) \leq R^*(\delta')\equiv \sup_{\theta \in\Theta} R(\theta, \delta') \end{equation}\] for any other non-random decision rules $\delta'$.

Before we proceed, we first introduce the concept of least favorable prior. A prior $\pi_0$ is called the least favorable prior if its Bayes risk \[R_{\pi_0}(\delta_{\pi_0}) = \sup_\pi R_{\pi}(\delta_\pi).\] Namely, this prior is like the worst prior that maximizes the Bayes risk.

Minimaxity. For any other decision rule $\delta$, we immediately have \[\begin{align*} \sup_{\theta \in \Theta}R(\theta, \delta) &\geq \int R (\theta, \delta) \pi_0(\theta)d\theta\\ &\geq \int R (\theta, \delta_{\pi_0}) \pi_0(\theta)d\theta\\ & \equiv R_{\pi_0}(\delta_{\pi_0})\\ & = \sup_{\theta\in\Theta}R(\theta,\delta_{\pi_0}). \end{align*}\] Thus, $\delta_{\pi_0}$ is minimax.

Least favorable prior. Consider another prior distribution $\rho$. Then its Bayes risk \[\begin{align*} R_\rho(\delta_\rho) &\equiv \int R(\theta, \delta_\rho)\rho(\theta) d\theta\\ & \leq \int R(\theta, \delta_{\pi_0})\rho(\theta) d\theta\\ &\leq \sup_{\theta\in\Theta} R(\theta, \delta_{\pi_0})\\ & = R_{\pi_0}(\delta_{\pi_0}). \end{align*}\]

Theorem 13 is a critical result that offers a simple way to find a minimax rule via a Bayes rule. The follow Corollary offers an even simpler way to find a minimax rule.

Example 15 (Minimax rule in Beta-Binomial). We now revisit the Beta-Binomial problem again. Consider $X|\theta \sim {\sf Bin}(n, \theta)$ and $\theta \sim {\sf Beta}(\alpha, \beta)$, where $N,\alpha,\beta$ are given. Then the posterior distribution is a Beta distribution with parameter $(\alpha+x, \beta+n-x)$. Suppose again we are interested in estimating $\theta$.

Its risk under the square loss is \[\begin{align*} R(\theta, \delta_\pi) & = \mathbb{E}((\theta - \delta_\pi(X))^2|\theta)\\ & = (\theta - \mathbb{E}(\delta_\pi(X)|\theta))^2 + {\sf Var}(\delta_\pi(X)|\theta)\\ & = \left(\theta - \frac{\alpha+n\theta}{\alpha+\beta+n}\right)^2 + \frac{n \theta (1-\theta)}{n^2} \frac{n^2}{(\alpha+\beta+n)^2}\\ & = \frac{1}{(n+\alpha+\beta)^2} \left[[(\alpha+\beta+n)\theta - \alpha-n\theta]^2 +n\theta (1-\theta)\right]\\ & = \frac{1}{(n+\alpha+\beta)^2} \left[\alpha^2 + (n-2\alpha(\alpha+\beta))\cdot \theta + ((\alpha+\beta)^2 -n) \cdot \theta^2 \right]. \end{align*}\] To find the minimax rule, we can change our prior’s parameter so that the above risk is a constant. This requires \[n-2\alpha(\alpha+\beta) = 0,\qquad (\alpha+\beta)^2 -n = 0,\] which leads to the solution $\alpha = \beta = \sqrt{n}/2$!

Namely, if we choose the prior to be ${\sf Beta}(\sqrt{n}/2, \sqrt{n}/2)$, its risk function will be \[R(\theta, \delta_\pi) = \frac{\alpha^2}{(n+\alpha+\beta)^2},\] a constant. By Corollary 14, this Bayes rule is the minimax rule.

Finding minimax rule via a sequence of Bayes rules

Theorem 13 shows that we can find a minimax rule via a Bayes rule. This result can be further strengthen to allow a sequence of Bayes rules.

Theorem 16 is often used to verify if a given estimator/rule is minimax, so it is more like a confirmatory approach. In comparison, Theorem 13 shows a direct way to create a minimax estimator/rule from Bayes rule, so Theorem 13 is more like a constructive approach.

We can also generalize the definition of being least favorable to the sequence of priors and Theorem 16 also shows that this sequence of prior is least favorable.

For any other decision rule $\delta$, we immediately have \[\begin{align*} \sup_{\theta \in \Theta}R(\theta, \delta) \geq \int R (\theta, \delta) \pi_k(\theta)d\theta \geq \int R (\theta, \delta_{\pi_k}) \pi_k(\theta)d\theta \equiv r_k . \end{align*}\]

Taking limit of $k\rightarrow \infty$ in both sides leads to \[\begin{align*} \sup_{\theta \in \Theta}R(\theta, \delta) \geq \lim_{k\rightarrow \infty} r_k \equiv r_\infty = \sup_{\theta\in\Theta}R(\theta,\delta_0). \end{align*}\] Thus, $\delta_0$ is minimax.

Theorem 16 suggests another way to find minimax rule when we have a candidate rule (estimator). For a candidate rule/estimator, if we can show that its risk is the limit of a sequence of prior distribution’s risk, then this rule/estimator is minimax.

Example 17 (Sample mean is minimax under Gaussian: known variance). An application of Theorem 16 is to show that the sample mean is minimax in a univariate Gaussian model. Suppose our data is IID $X_1,\cdots, X_n \sim N(\theta, \sigma^2)$, where $\sigma^2$ is known and $\theta$, the population mean, is our parameter of interest.

Consider the sample mean $\bar X_n$ as an estimator of the population mean $\theta$. It is obvious that the sample mean has a risk $R(\theta, \bar X_n) = \frac{\sigma^2}{n}$.

Will the sample mean $\bar X_n$ be a minimax rule/estimator under the square loss $L(\theta, a) = (\theta-a)^2$? The answer is YES!

We will utilize the normal-normal conjugate prior. Suppose we have a prior $\theta\sim N(\mu, \tau^2)$, where $\mu,\tau^2$ are given. We denote this prior distribution as $\phi_{\mu,\tau^2}$.

Using the fact that the sample mean $\bar X_n|\theta \sim N(\theta, \sigma^2/n)$ and equation $\eqref{eq::NN}$, the posterior distribution of $\theta$ given the sample mean $\bar X_n$ is a normal distribution \[\theta|\bar X_n \sim N\left(\frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\mu +\frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\bar X_n, \frac{1}{1/\tau^2+n/\sigma^2}\right).\] Thus, the Bayes rule, which is the posterior mean of $\theta$ given $\bar X_n$ is \[\delta_{\phi_{\mu,\tau^2}}(\bar X_n) = \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2}\mu +\frac{n/\sigma^2}{1/\tau^2 + n/\sigma^2}\bar X_n\] and it has a Bayes risk \[\begin{align*} R_{\phi_{\mu,\tau^2}}( \delta_{\phi_{\mu,\tau^2}}) & = \mathbb{E}\left[(\theta - \delta_{\phi_{\mu,\tau^2}}(\bar X_n))^2\right]\\ & = \mathbb{E}\left[\mathbb{E}[(\theta - \mathbb{E}(\theta|\bar X_n))^2|\bar X_n]\right]\\ & = \mathbb{E}\left[{\sf Var}(\theta|\bar X_n)\right]\\ &= \frac{1}{1/\tau^2 + n/\sigma^2}. \end{align*}\]

Thus, consider a sequence of prior distributions \[N(\mu, 1), N(\mu, 4), N(\mu, 9), N(\mu, 16),\cdots, N(\mu, k^2),\cdots\] and their Bayes risks will be approaching \[\lim_{k\rightarrow\infty}\frac{1}{1/k^2 + n/\sigma^2} = \frac{\sigma^2}{n} = R(\theta, \bar X_n),\] which is the risk of the sample mean. By Theorem 16, the sample mean is a minimax rule for square loss.

The above example shows that the sample mean is minimax when $\sigma^2$ is known. Now we generalize this result to cases where $\sigma^2$ is known. Before we proceed, we first introduce a useful lemma and generalize the risk to a nonparametric model $\mathcal{P}$. The set $\mathcal{P}$ is a collection of distributions. The loss function can be generalized to be $L:\mathcal{P} \times \mathcal{A}\rightarrow \mathbb{R}$, i.e., our loss is $L(P, a)\in\mathbb{R}$ for a distribution $P\in\mathcal{P}$ and an action $a\in\mathcal{A}$. For models indexed by a parameter $\theta$, the above notation just reduces back to our original definition. The risk function of a decision rule $\delta$ is then $R(P, \delta) = \mathbb{E}_{X\sim P}(L(P, \delta(X))) = \int L(P, \delta(x)) P(dx)$.

This implies that there exist $\delta^*$ such that \[\sup_{P\in\mathcal{P}_2}R(P, \delta^*)< \sup_{P\in\mathcal{P}_2}R(P, \delta).\]

Using the fact that $\mathcal{P}_1\subset \mathcal{P}_2$, we have \[\sup_{P\in\mathcal{P}_1}R(P, \delta^*)\leq \sup_{P\in\mathcal{P}_2}R(P, \delta^*).\]

Thus, we conclude that \[\sup_{P\in\mathcal{P}_1}R(P, \delta^*)\leq \sup_{P\in\mathcal{P}_2}R(P, \delta^*) < \sup_{P\in\mathcal{P}_2}R(P, \delta) = \sup_{P\in\mathcal{P}_1}R(P, \delta),\] which contradicts to the fact that $\delta$ is minimax on $\mathcal{P}_1$.

Example 19 (Sample mean is minimax under Gaussian: unknown variance). Now we come back to the Gaussian sample mean problem. We already know that the sample mean $\bar {X}_n$ is minimax for the model \[\mathcal{P}_1 = \{N(\mu,\sigma^2): \mu\in\mathbb{R}, \sigma = \sigma_0\}\] for a known $\sigma_0$. Moreover, the sample mean achieves the risk \[\sup_{P\in\mathcal{P}_1}R(P, \delta) = \frac{\sigma^2_0}{n}.\]

Suppose we do not know $\sigma^2$ but we have an upper bound about it. This leads to the model \[\mathcal{P}_2 = \{N(\mu,\sigma^2): \mu\in\mathbb{R}, \sigma \leq \sigma_0\}.\] Clearly, $\mathcal{P}_1\subset \mathcal{P}_2$ and you can easily show that the maximal risk under $\mathcal{P}_2$ is \[\sup_{P\in\mathcal{P}_2}R(P, \delta) = \frac{\sigma^2_0}{n} = \sup_{P\in\mathcal{P}_1}R(P, \delta).\]

Thus, by Lemma 18, $\bar X_n$ is also minimax under $\mathcal{P}_2$.

Admissibility

A rule $\delta$ is inadmissible if there exists another rule $\delta'$ such that $R(\theta, \delta')\leq R(\theta, \delta)$ for all $\theta$ and $R(\theta, \delta')< R(\theta, \delta)$ for some $\theta$. When a rule is not inadmissible, it is admissible. Generally, checking the admissibility of a rule is very hard but here are some useful theories for this task.

Example 22 (Normal-normal example). Consider again the $n$ IID normal-normal example under square loss. The Bayes rule is the posterior mean, which is \[\delta_\pi(\bar X_n)= \frac{1/\tau^2}{1/\tau^2 + n/\sigma^2} \mu + \frac{n/\sigma^2}{ 1/\tau^2 + n/\sigma^2} \bar X_n.\] Clearly, this risk is finite.

Moreover, you can easily see that $Q$ is the dominating measure of any $P_\theta$ since they are both Gaussian. So the Bayes rule, the posterior mean, is unique Bayes and hence admissible.

Admissibility of sample mean ($d<3$)

The previous result shows that the posterior mean under the normal-normal model is admissible. This almost tells us the admissibility of the sample mean. But we are still missing a piece: the result holds only for a fixed $\tau^2$. Unlike minimaxity, admissibility cannot be achieved via a sequence of prior distributions (Section 4.1). So the question remains: is the sample mean admissible?

We will prove this by contradiction. The highlight idea is: while we cannot apply the minimax theory from Theorem 16 to admissibility, the fact that sample mean can be viewed as a limiting case of a sequence of Bayes rules is useful.

Without loss of generality, we just assume $\sigma^2 = 1$. Clearly, the risk of $\bar X_n$ is \[R(\theta,\bar {X}_n) = \frac{1}{n}.\] Assume that $\bar X_n$ is not admissible, so there exists another rule $\delta^*$ such that \[R(\theta, \delta^*) \leq \frac{1}{n} -\epsilon\] for some $\epsilon>0$ and any $\theta \in [\theta_0,\theta_1]$, where $\theta_0<\theta_1$ are some fixed quantity. The existence of this interval is due to the continuity of the risk function under the normal model.

Consider a prior $\pi$ on $\theta$ as $\theta \in N(0,\tau^2)$. We already know that the Bayes rule under this prior $\delta_\pi$ is the posterior mean and it has a Bayes risk \[r_\tau \equiv \int_\Theta R(\theta, \delta_\pi) \pi(\theta)d\theta = \frac{1}{1/\tau^2 + n}.\]

Now we consider the risk of $\delta^*$ under such prior: \[r^* \equiv \int_\Theta \pi(\theta)d\theta\]

The fact that $\delta_\pi$ is the Bayes rule implies that \[\begin{equation} r_\tau \leq r^*. \label{eq::AD1} \end{equation}\]

Our strategy is very simple: we will show that $r^*<r_\tau$ when $\tau$ is sufficiently large, which contradicts to the fact that $\delta_\pi$ is the Bayes rule, i.e., equation $\eqref{eq::AD1}$ will be violated.

Since both $r_\tau, r^*$ are related to $\frac{1}{n}$, the risk of the sample mean, we consider the following ratio: \[\begin{align*} \frac{ \frac{1}{n}-r^*}{ \frac{1}{n}-r_\tau} & = \frac{\frac{1}{\sqrt{2\pi \tau^2}}\int (\frac{1}{n} - R(\theta, \delta^*)) e^{-\frac{1}{2\tau^2} \theta^2}d\theta}{\frac{1}{n} - \frac{1}{1/\tau^2 + n}}\\ & = \frac{n (1+n\tau^2)}{\sqrt{2\pi \tau^2}}\int \left(\frac{1}{n} - R(\theta, \delta^*)\right) e^{-\frac{1}{2\tau^2} \theta^2}d\theta\\ & \geq \frac{n (1+n\tau^2)}{\sqrt{2\pi \tau^2}}\int\epsilon e^{-\frac{1}{2\tau^2} \theta^2}d\theta\\ & \geq \frac{n (1+n\tau^2)}{\sqrt{2\pi \tau^2}}\epsilon \int_{\theta_0}^{\theta_1} e^{-\frac{1}{2\tau^2} \theta^2}d\theta. \end{align*}\] When $\tau$ is sufficiently large, $e^{-\frac{1}{2\tau^2} \theta^2}\geq \frac{1}{2}$ for any $\theta \in [\theta_0,\theta_1]$.

Therefore, the above inequality becomes \[\frac{ \frac{1}{n}-r^*}{ \frac{1}{n}-r_\tau} \geq \frac{n (1+n\tau^2)}{\sqrt{2\pi \tau^2}}\epsilon\int_{\theta_0}^{\theta_1} \frac{1}{2}d\theta = \frac{n (1+n\tau^2)}{2\sqrt{2\pi \tau^2}}\epsilon(\theta_1-\theta_0) > 1\] when $\tau$ is sufficiently large (notice that there is $\tau^2$ in the numerator and $\sqrt{\tau^2} = \tau$ in the denominator). As a result, we conclude that \[\frac{ \frac{1}{n}-r^*}{ \frac{1}{n}-r_\tau}>1 \Leftrightarrow \frac{1}{n}-r^* > \frac{1}{n}-r_\tau \Leftrightarrow r^* < r_\tau,\] which contradicts to equation $\eqref{eq::AD1}$ that $\delta_\pi$ is the Bayes estimator under $N(0,\tau^2)$. Therefore, such $\delta^*$ does not exist, so the sample mean $\bar X_n$ is admissible.

Remark. You can also show that the sample mean is admissible when $d=2$. See page 170 of

Inadmissibility of sample mean ($d\geq 3$)

For a multivariate $\theta\in\mathbb{R}^d$ and a multivariate estimator/action $\hat \theta\in \mathbb{R}^d$, we define the square loss to be \[L(\theta, \hat \theta) = \sum_{j=1}^d (\theta_j - \hat \theta_j)^2 = \|\theta - \hat \theta\|^2.\]

To prove Theorem 24, it suffices to provide an estimator that is better than the sample mean in the sense that the risk is not worse than sample mean and there exists some parameter $\theta$ that this estimator will be better than the sample mean.

To simplify the problem, we consider a simple Gaussian model at $n=1$ and $\sigma^2=1$: \[X \sim N(\theta, \mathbf{I}_d),\] where $\mathbf{I}_d$ is the $d\times d$ identity matrix and $X,\theta \in\mathbb{R}^d$. Our goal is to estimate $\theta$. The sample mean in this scenario corresponds to $X$.

Here we introduce the fast James-Stein shrinkage estimator (JS estimator): \[\begin{equation} \hat \theta_{JS} = \begin{cases} \left(1-\frac{d-2}{\|X\|^2}\right)X,\quad &\mbox{if $X\neq 0$}\\ 0 ,\quad &\mbox{if $X=0$}. \end{cases} \end{equation}\] It turns out that JS estimator has a smaller risk than the sample mean $\bar X_n$ for estimating the population mean!

Now we show that $\hat \theta_{JS}$ has a smaller risk than $X$, i.e., \[R(\theta, \hat \theta_{JS})< R(\theta, X) \equiv \mathbb{E}\left[\|\theta - X\|^2\right].\]

The risk of $\hat \theta_{JS}$ can be decomposed as \[\begin{equation} \begin{aligned} R(\theta, \hat \theta_{JS})& \equiv \mathbb{E}\left[\left\|\left(1- \frac{d-2}{\|X\|^2}\right) X -\theta\right\|^2\right]\\ & = \mathbb{E}\left[\left\|-\frac{d-2}{\|X\|^2} X + (X-\theta)\right\|^2\right]\\ & = \mathbb{E}\left[\left\|-\frac{d-2}{\|X\|^2} X + (X-\theta)\right\|^2\right]\\ & = \mathbb{E}\left[\frac{(d-2)^2}{\|X\|^2}\right] - 2(d-2) \mathbb{E}\left[\frac{X^T (X-\theta)}{\|X\|^2}\right] + \mathbb{E}\left[\|X-\theta\|^2\right]\\ & = \mathbb{E}\left[\frac{(d-2)^2}{\|X\|^2}\right] - 2(d-2) \mathbb{E}\left[\frac{X^T (X-\theta)}{\|X\|^2}\right] +R(\theta,X). \end{aligned} \label{eq::S1} \end{equation}\] Thus, all we need is to show that the first two terms are negative, i.e., we need \[\begin{equation} \mathbb{E}\left[\frac{(d-2)^2}{\|X\|^2}\right] < 2(d-2) \mathbb{E}\left[\frac{X^T (X-\theta)}{\|X\|^2}\right]. \label{eq::S2} \end{equation}\]

To show equation $\eqref{eq::S2}$, we use the following famous Stein’s lemma:

Note that this Lemma is a straight forward result using integration by parts under Gaussian model, so we omit its proof.

Now consider $g(y) = \frac{y}{\|y\|^2}$, so $g_j(y) = \frac{y_j}{\|y\|^2}$. Clearly, \[\nabla \cdot g(y) = \sum_{j=1}^d \frac{\partial}{\partial y_j } g_j(y) = \sum_{j=1}^d \frac{1}{\|y\|^2} - \frac{2y_j^2}{\|y\|^4} = \frac{d}{\|y\|^2} - \frac{2\|y\|^2}{\|y\|^4} = \frac{(d-2)}{\|y\|^2}.\] Lemma 26 implies that \[\begin{align*} \mathbb{E}\left[\frac{X^T (X-\theta)}{\|X\|^2}\right] = \mathbb{E}\left[(X-\theta)^T g(X)\right] = \mathbb{E}\left[ \frac{(d-2)}{\|X\|^2}\right]. \end{align*}\] As a result, we conclude that the righ-hand-side of equation $\eqref{eq::S2}$ \[\begin{align*} 2(d-2) \mathbb{E}\left[\frac{X^T (X-\theta)}{\|X\|^2}\right] = 2 (d-2)\mathbb{E}\left[ \frac{(d-2)}{\|X\|^2}\right] = 2 (d-2)^2 \mathbb{E}\left[ \frac{1}{\|X\|^2}\right] > \mathbb{E}\left[ \frac{(d-2)^2}{\|X\|^2}\right] \end{align*}\] so equation $\eqref{eq::S2}$ holds.

Thus, $R(\theta, \hat \theta_{JS})< R(\theta, X)$ for all $\theta$ in the Gaussian model, so the sample mean is NOT admissible.

Note that while JS estimator $\hat \theta_{JS}$ dominates the sample mean $X$, the JS estimator is inadmissible as well. See pages 356-357, and Section 5.7, pages 376-389 in

\(L(\theta, a)\)	\(\theta = 3\)	\(\theta = 6\)	\(\theta = 9\)	\(\theta = 12\)
\(A=0\)	0	0	2	5
\(A=1\)	2	2	2	2

UW STAT 581: Advanced Theory of Statistical Inference I (2026 Autumn)
Chapter 1 Elementary Decision Theory

Elementary Decision Theory