More About Significance#
Important Readings
[FPP07], Chapter 29
A test of significance is table stakes in business or academia for almost any empirical inquiry. Test results are just information, though: they still have to be interpreted, critiqued, and put to use wisely. In this section, we’ll discuss ways tests can be run improperly.
[FPP07] presents a few rules:
Transparency. Describe the data, the test, and the P-value instead of only the conclusion.
Restraint. Don’t snoop, that is, don’t decide what hypotheses to test after seeing the data.
Importance. Statistical significance and importance are not the same.
Model. The model should match the data.
Design. Statistical significance does not substitute for good study design.
Let’s elaborate on some of these.
Data Snooping#
Snooping undermines your asserted confidence levels. One way of snooping is to determine if you will run a left- or right-tailed \(z\)-test only after calculating your test statistic. Let’s think through the consequences.
Set \(\alpha = 0.05\). This means if the null is true, you will only reject the null 5% of the time.
Calculate \(z\).
Assuming the null is true, half the time you find \(z > 0\). Use a right-tailed test with critical value 1.645.
The other half of the time, you find \(z\leq 0\). Use a left-tailed test with a critical value of -1.645.
What is the chance of rejecting the null now, assuming the null to be true? It is 5% + 5% = 10%. This sort of rule is the same as running a two-tailed test with a 90% confidence level.
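If you want to verify that arithmetic, below is a minimal simulation sketch (my own, using NumPy, with 100 fair-coin tosses per experiment so the null is true by construction; none of these choices come from the text above). It applies the snooped rule of picking the tail after seeing the sign of \(z\):

```python
import numpy as np

rng = np.random.default_rng(0)

n_experiments = 100_000   # repetitions of the whole experiment
n_tosses = 100            # fair-coin tosses per experiment, so the null is true

rejections = 0
for _ in range(n_experiments):
    heads = rng.binomial(n_tosses, 0.5)
    # z-statistic for the sample proportion against the null p = 0.5
    z = (heads / n_tosses - 0.5) / np.sqrt(0.5 * 0.5 / n_tosses)
    if z > 0:
        rejections += z > 1.645    # snooped choice: right-tailed test
    else:
        rejections += z < -1.645   # snooped choice: left-tailed test

# Close to 0.10 rather than the asserted 0.05.
print(f"Rejection rate under the null: {rejections / n_experiments:.3f}")
```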
Peeking#
There are still many more ways to snoop. One subtle mistake is to stop collecting data once your P-value dips below \(\alpha\). For example, a company might start an A/B test with a plan to run it for a week, but stop it sooner if they see a low P-value. Without more advanced statistical calculations, this undermines your asserted confidence level.
This can be demonstrated in Google Sheets. In the spreadsheet below, I simulate 100 tosses of a fair coin in each tab. For each toss, I calculate the \(z\)-statistic up to that point. The \(z\)-statistics greater than 1.645 (a right-tailed test with \(\alpha = 0.05\)) are highlighted in yellow. We should expect about one of these twenty experiments to end with a \(z\)-statistic above 1.645. However, several more show a significant \(z\)-statistic at some point between the first and last coin toss. Someone who stops the experiment as soon as \(z > 1.645\) will get a false positive much more than 5% of the time.
The \(z\)-statistics are calculated “correctly,” but the mistake is in the stopping procedure. The procedure, in some sense, selectively drops observations: the tosses that would have come after the early stop. This turns out to be all take and no give. The experiment is never stopped early in favor of the null, even though the \(z\)-statistic might have settled back down to a more moderate value had the tosses continued. The solution is to exercise restraint. Use a fixed experiment length (or learn fancier methods for sequential hypothesis tests).
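The same effect is easy to reproduce outside a spreadsheet. Here is a rough simulation sketch (again my own, in NumPy, repeating the 100-toss experiment 10,000 times; the repetition count is my choice) comparing a fixed-length test to a rule that stops as soon as the running \(z\)-statistic exceeds 1.645:

```python
import numpy as np

rng = np.random.default_rng(0)

n_experiments = 10_000
n_tosses = 100  # the planned experiment length, as in the spreadsheet

fixed_rejections = 0    # test once, after all 100 tosses
peeking_rejections = 0  # stop as soon as the running z-statistic exceeds 1.645

for _ in range(n_experiments):
    tosses = rng.integers(0, 2, size=n_tosses)   # fair coin, so the null is true
    heads = np.cumsum(tosses)                    # running count of heads
    n = np.arange(1, n_tosses + 1)
    z = (heads / n - 0.5) / np.sqrt(0.25 / n)    # running z-statistic after each toss
    fixed_rejections += z[-1] > 1.645
    peeking_rejections += np.any(z > 1.645)

print(f"Fixed-length test:     {fixed_rejections / n_experiments:.3f}")    # close to 0.05
print(f"Stop when significant: {peeking_rejections / n_experiments:.3f}")  # far above 0.05
```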
Multiple Hypotheses#
Suppose I design an experiment where I flip a coin 500 times to see if it is weighted. After every 25 flips, I paint the coin a new color, dividing my data into 20 subsets. Chance is lumpy, so while \(\hat{p}\) will be close to 0.5, I will find streaks in the data where the sample proportion is far from 0.5. This is demonstrated via simulation in Fig. 57.
Fig. 57 The maroon and orange coins look to be rigged, but only by chance.#
This sort of repeated testing inflates the chance of rejecting the null. Risk compounds. If you conduct \(n\) independent hypothesis tests, each at \(\alpha = 0.05\), the chance of at least one false rejection is

\[
1 - 0.95^n.
\]

The probability is 5% for \(n=1\) and it gets bigger for \(n> 1\). P-values are calculated assuming this isn’t happening.
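To see how quickly the risk compounds, here is a short calculation sketch (plain Python, plugging a few values of \(n\) into the formula above):

```python
# Chance of at least one false rejection across n independent tests, each at alpha = 0.05.
alpha = 0.05
for n in [1, 5, 10, 20]:
    at_least_one = 1 - (1 - alpha) ** n
    print(f"n = {n:2d} tests: {at_least_one:.1%} chance of at least one false positive")
```

With 20 looks at the data, as with the 20 painted coins above, you are more likely than not to stumble on at least one “significant” streak.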
Practical Importance#
Statistical significance relies on a P-value crossing an arbitrary threshold. As previously mentioned ((Aside) Big Picture), we should be skeptical of significant results necessarily being important and insignificant results being unimportant.
Below are two similar boxes. We can say they are similar because a draw shows a 1 with 50% chance from the first box and 50.1% chance from the second. With enough draws made with replacement, the sample average from the second box will be higher, and a two-sample test will judge the difference statistically significant. Is that important? Maybe. In some settings, a 0.1 percentage point difference in averages is important. In other settings it is not.
Fig. 58 The difference between these boxes is easily overstated if you will only make one draw from each.#
Imagine you can select one ticket from a single box and you win a dollar if you draw a 1. You might have 95% confidence that the second box is better, but with a single draw from each, the draws are the same 50% of the time, the first box is better 24.95% of the time, and the second box is better 25.05% of the time. Statistical significance can therefore become a shiny distraction, leading to practical significance being put aside or to snooping and other bad statistical practices.
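Those single-draw figures come from a two-line calculation; here is a sketch (plain Python, using the 50% and 50.1% boxes described above):

```python
# Single-draw comparison of the two boxes described above:
# the first box pays off (draws a 1) with probability 0.500, the second with 0.501.
p1, p2 = 0.500, 0.501

same        = p1 * p2 + (1 - p1) * (1 - p2)  # both draws give the same result
box1_better = p1 * (1 - p2)                  # first box draws a 1, second does not
box2_better = (1 - p1) * p2                  # second box draws a 1, first does not

print(f"Same result:       {same:.2%}")         # 50.00%
print(f"First box better:  {box1_better:.2%}")  # 24.95%
print(f"Second box better: {box2_better:.2%}")  # 25.05%
```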
Model and Design#
All of our results assume sampling done with replacement. Simple random sampling is done without replacement, and this is okay if we aren’t sampling too much of the entire population. The concept of statistical significance does not apply to census data taken from our population of interest, because there is no sampling variability left to explain. Nonrandom sampling, like taking a convenience sample, can also introduce selection bias. Not that it’s especially convenient, but I shouldn’t predict the success of new Taco Bell menu items by polling a sample of Met Gala attendees.
A test of significance does not ensure a study was well designed, and publication in a peer-reviewed journal might not add much. Recall that we introduced the hierarchy of evidence in the Types of Evidence section, and that we could do so even before introducing hypothesis testing. Observational studies are not as trustworthy because of the potential for confounding, and a hypothesis test does nothing to solve this problem. Adding a hypothesis test in such a case might be called “intellectual desperation.”
Subjectivity and Argument#
[FPP07] concludes with an important reminder. Irrespective of the bottom line of your research, tests of significance answer “one and only one question: How easy is it to explain the difference between the data and what is expected on the null hypothesis, on the basis of chance variation alone?” Bridging the gap between that and your bottom line requires deeper understanding and some principled argument. From [Abe12]:
Some subjectivity in statistical presentations is unavoidable, as acknowledged even by the rather stuffy developers of statistical hypothesis testing … Data analysis should not be pointlessly formal. It should make an interesting claim; it should tell a story that an informed audience will care about, and it should do so by intelligent interpretation of appropriate evidence from empirical measurements or observations.