Statistical Power#
Power is not covered in [FPP07]. Here’s a brief treatment drawn from a free textbook. I choose to cover it because it’s a natural extension of hypothesis testing and has grown in importance as experiments have become more common in tech and academic research. Power helps you make sense of your conclusion when you fail to reject the null hypothesis. With a pitifully small sample, you’re unlikely to reject the null. Power tells us how likely we are to reject the null if we assume the alternative is true. That is, it helps us understand the potential for a false negative.
Type I and II Errors#
A Type I error is made by rejecting the null hypothesis when the null hypothesis is true. A Type II error is made by failing to reject the null hypothesis when the null hypothesis is false. While some mistakes, like snooping, are a matter of bad practice and can be avoided, Type I and II errors are unavoidable.
|  | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Don’t Reject | True Negative | False Negative (Type II Error) |
| Reject | False Positive (Type I Error) | True Positive |
Hypothesis testing involves a binary decision. If we compare this to the deliberations of a judge, it is the null hypothesis that is on trial. Rejecting the hypothesis is akin to judging it to be guilty. Failing to reject the null hypothesis is akin to acquitting it, and this might be considered a negative result. Continuing the judicial analogy, a Type I error is convicting the true, innocent null hypothesis. A Type II error lets the false, crooked null hypothesis off the hook.
Statisticians use \(\alpha\) and \(\beta\) to denote the Type I and II conditional error rates. I call them conditional error rates to emphasize that each is a conditional probability.
The value \(\alpha\) is familiar, being directly related to the confidence level. A test has an associated power level, \(1-\beta\). To pinpoint power or \(\beta\), it is typical to propose a single alternative. For example, while the null might specify that \(\mu=0\), the alternative considered below is \(\mu=2\).
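To make the word “conditional” concrete, here is a minimal simulation sketch (the numbers are my own illustrative choices, using the alternative centered at two that appears below): it estimates \(\alpha\) as the rejection rate when the null holds and \(\beta\) as the miss rate when the alternative holds.

```python
# Illustrative simulation (assumed setup, not from the text): a one-sided
# z-test at alpha = 0.05, with the alternative z-statistic centered at 2.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
z_crit = norm.ppf(0.95)              # one-sided critical value, about 1.645
reps = 100_000

z_null = rng.normal(0, 1, reps)      # z-statistics drawn when H0 is true
z_alt = rng.normal(2, 1, reps)       # z-statistics drawn when the alternative is true

type_i = np.mean(z_null > z_crit)    # P(reject | H0 true): should be near alpha
type_ii = np.mean(z_alt <= z_crit)   # P(don't reject | H0 false): this is beta
print(f"alpha-hat = {type_i:.3f}, beta-hat = {type_ii:.3f}")
```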
Power#
The power of a test is the probability it will reject the null hypothesis if the null is false.
For a given statistical test, the power depends on the significance level \(\alpha\), the effect size, and the sample size \(n\).
First, the significance level \(\alpha\) influences the power because it determines how liberal or conservative you are in rejecting the null. A high \(\alpha\) means you will reject the null more often.
Fig. 59 As \(\alpha\) increases, \(\beta\) decreases and power increases. The power is the unshaded region under the orange curve.#
Above, we simplify the world to consider a single alternative hypothesis, under which the \(z\)-statistic is actually drawn from a distribution centered at two. Suppose you wanted 95% power in the illustration above (\(\beta = 0.05\)). According to the null hypothesis, a \(z\)-statistic will be drawn from a standard normal distribution (the top panel). To force \(\beta = 0.05\), the vertical line must sit 1.645 standard deviations below the mean of the alternative distribution. Accordingly, our critical value is \(z^\star = 2-1.645 = 0.355\). A \(z\)-table shows that this corresponds to \(\alpha \approx 0.361\).
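Here is a quick sketch of that calculation in code, assuming (as above) that the \(z\)-statistic under the alternative follows a normal distribution centered at two with standard deviation one:

```python
# Reproduce the calculation above: fix beta = 0.05 and find the implied
# critical value and Type I error rate.
from scipy.stats import norm

effect, beta = 2.0, 0.05               # alternative mean and target Type II rate

z_star = effect - norm.ppf(1 - beta)   # critical value: 2 - 1.645, about 0.355
alpha = 1 - norm.cdf(z_star)           # implied Type I rate under N(0, 1), about 0.361
print(f"z* = {z_star:.3f}, alpha = {alpha:.3f}")
```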
Second, this analysis was based on an effect size corresponding to the alternative distribution being centered at two. A bigger effect size spreads the two distributions further apart. What if we have a smaller effect size? For a fixed \(\alpha\), the power will be lower. What \(\alpha\) would be required to keep 95% power if the test statistic were actually drawn from a distribution centered at 1.645? Why is this unrealistic?
Smaller Effect Size
This would require \(\alpha = 0.5\) and \(z^\star = 0\). This is not realistic because \(\alpha = 0.5\) is unreasonably high: with \(z^\star = 0\), the critical value sits at the center of the null distribution, so a true null would be rejected half the time. Achieving 95% power is simply not realistic in this world.
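A quick check of those numbers, under the same assumptions as before but with the smaller effect size:

```python
# With the alternative centered at 1.645 and 95% power still demanded
# (beta = 0.05), the critical value falls to 0 and alpha rises to 0.5.
from scipy.stats import norm

effect, beta = 1.645, 0.05
z_star = effect - norm.ppf(1 - beta)   # 1.645 - 1.645 = 0
alpha = 1 - norm.cdf(z_star)           # P(z > 0 | H0) = 0.5
print(f"z* = {z_star:.3f}, alpha = {alpha:.3f}")
```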
Third, a higher \(n\) increases power by lowering the standard error and thus making the sampling distribution of the sample mean narrower. This is because the standard error of that distribution is \(\text{SE} = \frac{\text{SD}}{\sqrt{n}}\). With less overlap between the null and alternative distributions, greater power is achieved.
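Here is a hedged sketch of that third point for a one-sided test of a mean; the null mean, true mean, and SD below are illustrative assumptions, not values from this section:

```python
# Power grows with n because SE = SD / sqrt(n) shrinks.
import numpy as np
from scipy.stats import norm

mu0, mu_true, sd, alpha = 0.0, 2.0, 10.0, 0.05   # assumed values for illustration

for n in (25, 100, 400):
    se = sd / np.sqrt(n)                      # standard error shrinks as n grows
    crit = mu0 + norm.ppf(1 - alpha) * se     # rejection threshold for the sample mean
    power = 1 - norm.cdf((crit - mu_true) / se)
    print(f"n = {n:4d}: SE = {se:.2f}, power = {power:.3f}")
```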
In real-world applications, a power analysis is usually done before launching an experiment. While 95% is the typical confidence level, 80% is the typical power chosen. A power analysis takes these targets as given and then finds the required number of observations, \(n\).
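Here is a minimal sketch of such a power analysis for a one-sided \(z\)-test of a mean; the SD guess and the smallest effect worth detecting (delta) are my own illustrative choices:

```python
# Fix alpha and the target power, guess the SD, choose the minimum effect of
# interest, then solve for the required sample size n.
import numpy as np
from scipy.stats import norm

alpha, power = 0.05, 0.80
sd, delta = 10.0, 2.0                          # assumed SD and minimum effect of interest

# n is chosen so the critical value sits norm.ppf(power) standard errors
# below the alternative mean, giving the target rejection probability.
n = ((norm.ppf(1 - alpha) + norm.ppf(power)) * sd / delta) ** 2
print(int(np.ceil(n)))                         # round up to the next whole observation
```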
Is power relevant in the world of big data?#
Power increases with sample size, but that doesn’t mean you never have to worry about power in big data applications. After all, a small percentage increase in revenue for a company the size of Amazon amounts to a very big number, yet such a small relative change can still be hard to detect.
Ronny Kohavi provides some back-of-the-envelope calculations (the details of which we can ignore) on LinkedIn:
If you think large companies with a massive userbase (Amazon, Google) have an easy time detecting tiny changes in A/B tests, you’re wrong! … The largest companies cannot power experiments with enough users to detect a $10M loss.
And now, you can be a gadfly in any meeting by deploying these three questions:
Is the difference statistically significant?
Is a proposed explanation for the statistically significant difference causal or is it confounded?
Is a result statistically insignificant because of low power?
Exercises#
Exercise 51
Ippei bought a special coin from Jontay for gambling purposes. Jontay tells him the coin is weighted to come up heads 60% of the time. Jontay is a scammer and sold a regular coin that comes up heads 50% of the time. Ippei tests the coin with a \(z\)-test after 96 flips. Ippei starts with a null hypothesis that \(p=0.6\) and tests the alternative that \(p< 0.6\). Ippei uses a 95% confidence level (\(\alpha = 0.05\)). What test statistic does Jontay expect Ippei to find? Is the power above or below 50%?