The Chi-Square Test#

Important Readings

We used \(z\)-tests for proportions, or when our box model had just two categories and could be cast as a 0-1 box.

The simple 0-1 days are gone. Now we live in a world where the box might contain many values. The NBA lottery chooses among fourteen teams or a gambler might throw a six-sided die. At work, engineers randomize over who is supposed to conduct code reviews. A name is randomly selected, as if from a box, and that person is stuck with the onerous task of double-checking a colleague’s code. The box below considers a three-colleague example.

_images/boxCodeReview.svg

Suppose that over 30 draws, we expect the work to be split evenly but that Velma does 10 reviews, Marvin is tasked with 15, and Paul only does 5.

Paul is more senior and, only after seeing the data, we’re suspicious that the game has been rigged to lower his workload. A \(z\)-test could be applied here if we consider Paul vs. Everyone. Let \(p\) be the proportion of times Paul is selected. Our null is \(p = \frac{1}{3}\) and the alternative is \(p < \frac{1}{3}\).

The resulting \(z\)-statistic is

\[z = \dfrac{ \frac{5}{30} - \frac{1}{3} }{\sqrt{\frac{\frac{1}{3}\cdot\frac{2}{3}}{30}}} \approx -1.94.\]

This is beyond the critical value of -1.645 for a left-tailed test at the 5% significance level. We’re tempted to declare that Paul is cheating. But the next section should give us pause.
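The arithmetic behind the \(z\)-statistic can be checked with a short Python sketch. The variable names are illustrative, and the use of `scipy.stats.norm` to look up the left-tailed critical value is our own choice rather than something the chapter prescribes.

```python
from math import sqrt
from scipy.stats import norm

n = 30                 # total number of code reviews drawn
p_null = 1 / 3         # Paul's chance of selection under the null
p_hat = 5 / n          # observed proportion of reviews assigned to Paul

# Standard error of the sample proportion, computed under the null
se = sqrt(p_null * (1 - p_null) / n)
z = (p_hat - p_null) / se

# Left-tailed critical value at the 5% significance level
critical = norm.ppf(0.05)

print(f"z = {z:.2f}, critical value = {critical:.3f}")
print("Reject the null" if z < critical else "Fail to reject the null")
```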

The \(\chi^2\) Test#

The previous application of the \(z\)-test might have been too opportunistic. With enough colleagues, someone is probably going to get a bad break. If, in practice, we only became suspicious of Paul after seeing the data, that should also raise red flags. To address the fairness of this kind of random name drawing, we should ask the question “are all names equally likely” instead of asking if a particular name comes up with a one-third chance.

\[ H_0: \hspace{10pt} \text{all names are equally likely} \]
\[ H_A: \hspace{10pt} \text{not all names are equally likely} \]

In this setup, we don’t reduce this to a Paul vs. Everyone binary. Instead, we calculate a \(\mathbf{\chi^2}\)-statistic,

\[\chi^2 = \sum \frac{\text{(observed frequency - expected frequency)}^2}{\text{expected frequency}}.\]

The sum is taken over all possible names. Note we don’t convert anything to proportions, but we work directly with the counts and expected counts. With our data,

\[\chi^2 = \frac{(10-10)^2 + (15-10)^2 + (5-10)^2}{10} = 5.\]

There are two degrees of freedom because once we know two of the values from 10, 15, and 5, we know the last. The \(\chi^2\) statistic comes from a \(\chi^2\) distribution with two degrees of freedom. More generally, the statistic comes from a \(\chi^2\) distribution with degrees of freedom equal to the number of terms minus one.
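As a rough check on the calculation above, the statistic can be computed both by hand and with `scipy.stats.chisquare`. This is a minimal sketch; the list names are ours, and the expected counts assume the null of equally likely names.

```python
from scipy.stats import chisquare

observed = [10, 15, 5]    # reviews done by Velma, Marvin, and Paul
expected = [10, 10, 10]   # 30 draws split evenly under the null

# Sum of (observed - expected)^2 / expected over all names
stat_by_hand = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same statistic from SciPy
result = chisquare(f_obs=observed, f_exp=expected)

# Degrees of freedom: number of terms minus one
dof = len(observed) - 1

print(f"By hand: {stat_by_hand:.2f}")
print(f"SciPy:   {result.statistic:.2f}")
print(f"Degrees of freedom: {dof}")
```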

Using the \(\chi^2\) Distribution#

The \(\chi^2\) distribution is characterized by a degrees of freedom parameter.

Critical Values#

Critical values can be found using a table like the one in Fig. 56. These work like the critical value tables for \(t\)-tests.

_images/chi2Table.svg

Fig. 56 \(\chi^2\) critical values#

When our test statistic is greater than the critical value, we should reject the null hypothesis. In the example above, the test statistic is \(\chi^2 = 5\) and there are 2 degrees of freedom. Using \(\alpha = 0.05\), the critical value is 5.9915. Our test statistic is smaller, so we fail to reject the null. The \(\chi^2\)-test turns out to be more charitable toward Paul.

Recall that if the statistic is zero, that means the observed counts perfectly matched the expected counts. Therefore, only a large test statistic indicates data that deviates from what you would expect according to the null hypothesis.
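If a printed table is not at hand, the critical value can also be looked up in software. The sketch below assumes SciPy is available and uses `chi2.ppf`, the inverse of the cumulative distribution function, with \(\alpha = 0.05\) and 2 degrees of freedom as in the example.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 2
statistic = 5   # chi-square statistic from the code review example

# Value with area alpha to its right under the chi-square curve
critical = chi2.ppf(1 - alpha, dof)

print(f"Critical value: {critical:.4f}")
print("Reject the null" if statistic > critical else "Fail to reject the null")
```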

Finding P-values#

We can find the P-value from the appropriate \(\chi^2\) curve. This is the area to the right of the statistic. Finding it requires a calculator, and there is no analog to the 68-95 rule that can help us estimate a P-value. And while the \(t\) distribution approaches the standard normal curve with enough degrees of freedom, a \(\chi^2\) distribution does not: it is right-skewed, and its center and spread keep growing with the degrees of freedom.

In Google Sheets, the P-value can be found with =CHISQ.DIST.RT(5,2). The calculator embedded below also gives the P-value.

Assuming a null of equal chances, our data or data more extreme would only arise about 8% of the time. This doesn’t fall below the typical 5% threshold.

_images/chi2Pval.svg

Python Interactive#

Use this Colab notebook to find the area to the right of a value for a given degrees of freedom.
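For readers working locally, the same right-tail area can be sketched in a few lines. This is not the notebook's code; the helper name `right_tail_area` is hypothetical, and it simply wraps SciPy's survival function `chi2.sf`.

```python
from scipy.stats import chi2

def right_tail_area(statistic, dof):
    """Area under the chi-square curve to the right of `statistic`."""
    return chi2.sf(statistic, dof)

# Code review example: statistic of 5 with 2 degrees of freedom
p_value = right_tail_area(5, 2)
print(f"P-value: {p_value:.4f}")
```

The result should agree with the roughly 8% figure reported for the code review example.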

Tests of Independence#

Below are two contingency tables. Each summarizes a different data set, where a single observation is a person’s nationality and primary language. The single-variable summaries are the same for both. Both have 140 observations in Germany, 280 in America, and 70 in Belgium. And both have 140 German speakers, 280 English speakers, and 70 Dutch speakers. They differ in how the two variables are related.

In Table 14, the variables are independent. In aggregate, 4 of 7 people speak English. This ratio is the same within each country, so that nationality isn’t related to language. In Table 15, the opposite is true. In aggregate, 4 of 7 people speak English, but they are all found in America.

Table 14 Contingency Table - Independence#

| | Germany | America | Belgium | Total |
|---|---|---|---|---|
| German | 40 | 80 | 20 | 140 |
| English | 80 | 160 | 40 | 280 |
| Dutch | 20 | 40 | 10 | 70 |
| Total | 140 | 280 | 70 | 490 |

Table 15 Contingency Table - Dependence#

| | Germany | America | Belgium | Total |
|---|---|---|---|---|
| German | 140 | 0 | 0 | 140 |
| English | 0 | 280 | 0 | 280 |
| Dutch | 0 | 0 | 70 | 70 |
| Total | 140 | 280 | 70 | 490 |

These examples are stark. A test of independence can be used to make the analysis more principled.

The hypotheses#

\[ H_0: \text{Categorical variable $X$ is independent of categorical variable $Y$} \]
\[ H_A: \text{Categorical variable $X$ isn't independent of categorical variable $Y$}\]

The idea of the test is as follows.

Assume the null hypothesis (independence).

  1. We can use the row and column totals to estimate the expected counts.

  2. Devise a statistic that compares expected counts to actual counts.

  3. Compute the degrees of freedom as (#Rows - 1) \(\times\) (#Columns - 1).

  4. Compare the statistic to a \(\chi^2\) distribution with the right degrees of freedom. Calculate P-value as the probability a \(\chi^2\) random variable would be greater than the statistic.

  5. Reject the null if the P-value is below \(\alpha\).

The expected values for each cell can be calculated as

\[\text{Expected Value} = \dfrac{\text{Row Total}\times \text{Column Total}}{\text{Table Total}}.\]

The statistic is calculated as

\[\chi^2 = \sum_{\text{all cells}}\frac{(\text{Obs - Exp})^2}{\text{Exp}}.\]

If the observed values are those from Table 15, then the expected values are those in Table 14. The sum is \(\chi^2 = 980\). And the degrees of freedom are calculated as \((3-1)\times(3-1) = 4.\) The statistic is extreme, based on Fig. 56, so we can reject the null of independence. This shouldn’t be that surprising because we had a reasonable amount of data and the data didn’t look independent.
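The calculation just described can be reproduced with NumPy. This is a minimal sketch using the counts from Table 15; the array and variable names are our own.

```python
import numpy as np

# Observed counts from Table 15 (rows: German, English, Dutch speakers;
# columns: Germany, America, Belgium)
observed = np.array([[140,   0,  0],
                     [  0, 280,  0],
                     [  0,   0, 70]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
table_total = observed.sum()

# Expected count per cell: row total x column total / table total
expected = np.outer(row_totals, col_totals) / table_total

# Chi-square statistic summed over all cells
chi2_stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) x (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(f"Chi-squared statistic: {chi2_stat:.0f}, degrees of freedom: {dof}")
```

The printed expected counts should match Table 14.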

Did we ever use the sample size though? No, but kind of. The degrees of freedom don’t depend on the sample size, so the critical value doesn’t change. But the statistic grows with \(n\) if we fix the observed proportions.
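To see how the statistic grows with \(n\), the sketch below multiplies every count in a table by ten, which keeps the observed proportions fixed. The 2-by-2 table is made up purely for illustration, and `chi2_statistic` is a hypothetical helper, not a library function.

```python
import numpy as np

def chi2_statistic(observed):
    """Chi-square statistic for a contingency table of counts."""
    observed = np.asarray(observed, dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return ((observed - expected) ** 2 / expected).sum()

# Made-up table with 100 observations
small_table = np.array([[30, 20],
                        [20, 30]])

# Same proportions, ten times the sample size
large_table = 10 * small_table

small_stat = chi2_statistic(small_table)
large_stat = chi2_statistic(large_table)

print(f"n = {small_table.sum()}: statistic = {small_stat:.1f}")
print(f"n = {large_table.sum()}: statistic = {large_stat:.1f}")
```

Both tables have one degree of freedom, so they share a critical value, yet the statistic for the larger table is ten times as big.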

The Assumptions#

The necessary assumptions are similar to proportion tests.

  1. Counted (qualitative) Data

  2. Independence (e.g. individuals being counted are sampled independently)

  3. Randomization (to generalize results)

  4. Expected Cell Frequency (rule of thumb: expectation of at least 5 for each cell [DVVB19])

Constructing Contingency Tables in Google Sheets#

A contingency table can be made using the pivot table functionality in Google Sheets or Excel. The video below shows how to do this (no audio).
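For readers who prefer Python to a spreadsheet, one possible alternative is `pandas.crosstab`. The sketch below is not the method shown in the video; the six records and the column names `nationality` and `language` are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: one row per person
records = pd.DataFrame({
    "nationality": ["Germany", "America", "America", "Belgium", "Germany", "America"],
    "language":    ["German",  "English", "English", "Dutch",   "English", "German"],
})

# Count each (language, nationality) combination, with row and column totals
table = pd.crosstab(records["language"], records["nationality"],
                    margins=True, margins_name="Total")
print(table)
```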

Application: The Evolution of Taylor Swift#

This spreadsheet includes a row for each of the 530 Taylor Swift tracks currently on Spotify. Our research question is: have the themes in Swift’s songs changed since she left country for pop? We are therefore interested in two variables, genre and song theme. Both are qualitative variables and somewhat subjective. We will consider every song on 1989 or a later album to be pop. Earlier songs are country. A random sample of 82 songs is labeled by ChatGPT, which suggests the following categories:

  1. Love and relationships

  2. Self-discovery and empowerment

  3. Storytelling and reflection

  4. Other

To be able to construct a contingency table, these categories must be mutually exclusive and exhaustive. A song like Love Story is labeled as “love and relationships” despite involving, as the track name suggests, a story. After this, we have the two essential columns.

Next, the contingency table can be constructed.

Table 16 Contingency Table - Song Themes by Album Genre#

| Genre | Love and Relationships | Self-discovery and Empowerment | Storytelling and Reflection | Other | Total |
|---|---|---|---|---|---|
| Country | 8 | 0 | 8 | 5 | 21 |
| Pop | 9 | 10 | 15 | 27 | 61 |
| Total | 17 | 10 | 23 | 32 | 82 |

The null specifies that the two variables are independent. Under the null, the data would look like the following. Recall that the expected values are found as \(\frac{\text{Row Total}\times \text{Column Total}}{\text{Grand Total}}\).

Table 17 Expected Frequencies Under Null of Independence#

| Genre | Love and Relationships | Self-discovery and Empowerment | Storytelling and Reflection | Other | Total |
|---|---|---|---|---|---|
| Country | 4.35 | 2.56 | 5.89 | 8.20 | 21 |
| Pop | 12.65 | 7.44 | 17.11 | 23.80 | 61 |
| Total | 17 | 10 | 23 | 32 | 82 |

With this, we can calculate the \(\chi^2\)-statistic.

Table 18 Contribution of Each Cell, (Observed - Expected)² / Expected#

| Genre | Love and Relationships | Self-discovery and Empowerment | Storytelling and Reflection | Other | Total |
|---|---|---|---|---|---|
| Country | \(\frac{(8 - 4.35)^2}{4.35} = 3.06\) | \(\frac{(0 - 2.56)^2}{2.56} = 2.56\) | \(\frac{(8 - 5.89)^2}{5.89} = 0.76\) | \(\frac{(5 - 8.20)^2}{8.20} = 1.25\) | 7.63 |
| Pop | \(\frac{(9 - 12.65)^2}{12.65} = 1.05\) | \(\frac{(10 - 7.44)^2}{7.44} = 0.88\) | \(\frac{(15 - 17.11)^2}{17.11} = 0.26\) | \(\frac{(27 - 23.80)^2}{23.80} = 0.43\) | 2.62 |

Together, the \(\chi^2\)-statistic is about 7.63 + 2.62 = 10.25.

Is the test statistic large enough to lead us to reject the null? The table has 2 rows and 4 columns, so there are \((2-1)\times(4-1) = 3\) degrees of freedom, and the critical value at \(\alpha = 0.05\) would be 7.81 according to Fig. 56. We can reject the null because our statistic is greater. We have evidence that genre and song theme are related. It seems like Swift’s genre shift also included thematic shifts.
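The table lookup can be cross-checked in software. The sketch below assumes SciPy is available and takes the hand-computed statistic of 10.25 as given; the degrees of freedom follow from the 2 rows and 4 columns of the contingency table.

```python
from scipy.stats import chi2

statistic = 10.25          # hand-computed chi-square statistic
n_rows, n_cols = 2, 4      # genres by themes
dof = (n_rows - 1) * (n_cols - 1)

# Critical value at the 5% significance level and the right-tail P-value
critical = chi2.ppf(0.95, dof)
p_value = chi2.sf(statistic, dof)

print(f"Degrees of freedom: {dof}")
print(f"Critical value: {critical:.2f}")
print(f"P-value: {p_value:.3f}")
print("Reject the null" if statistic > critical else "Fail to reject the null")
```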

Can we trust the analysis? Our calculations are correct, but we should think about the assumptions. Our sample is small. In particular, the expected cell count for Country and Self-Discovery is just 2.56. A rule of thumb is that each expected cell count should be at least five. Our test fails this rule of thumb, so we shouldn’t take the analysis that seriously. Further, even if we leave aside the sample size, we must include the usual caveats about causality. There’s nothing in the data or the test that shows the genre drives different song themes. Swift’s transition to pop also came with her maturation as an artist. Perhaps artists progress to different song themes over time and that is related to the genre shift only coincidentally.
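The rule of thumb can be checked mechanically. This is a minimal sketch using the observed counts from Table 16; the threshold of five comes from the rule of thumb above, while the variable names are our own.

```python
import numpy as np

# Observed counts from Table 16 (rows: Country, Pop)
observed = np.array([[8,  0,  8,  5],
                     [9, 10, 15, 27]])

# Expected counts under the null of independence
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Flag cells whose expected count falls below the rule-of-thumb minimum
threshold = 5
too_small = expected < threshold

print(expected.round(2))
print(f"Cells with expected count below {threshold}: {too_small.sum()}")
print(f"Smallest expected count: {expected.min():.2f}")
```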

Python Code#

The Python code below repeats the same steps. The \(\chi^2\)-statistic is slightly different due to rounding.

import pandas as pd
from scipy.stats import chi2_contingency

# Observed data
data = {"love and relationships": [8, 9],
        "self discovery and empowerment": [0, 10],
        "storytelling and reflection": [8, 15],
        "other": [5, 27]}

# DataFrame for observed data
df = pd.DataFrame(data, index=["Country", "Pop"])

# Performing the chi-squared test for independence
chi2, p, dof, expected = chi2_contingency(df, correction=False)
print(f"Chi-squared statistic: {chi2:.2f}")
print(f"P-value: {p:.2f}")
print(f"Degrees of freedom: {dof:g}")
pd.DataFrame(expected, columns=df.columns, index=df.index).round(2)

Output:

Chi-squared statistic: 10.24
P-value: 0.02
Degrees of freedom: 3

| | love and relationships | self discovery and empowerment | storytelling and reflection | other |
|---|---|---|---|---|
| Country | 4.35 | 2.56 | 5.89 | 8.2 |
| Pop | 12.65 | 7.44 | 17.11 | 23.8 |

Exercises#

Exercise 49

Which test should you use for each of the following? Your choices are one-sample \(z\)-test, one-sample \(t\)-test, two-sample \(z\)-test, \(\chi^2\) test, \(\chi^2\) test of independence, or consult a statistician instead.

  1. Does enrollment in early morning classes lower students’ course grades?

  2. Does enrollment in early morning classes lower the likelihood of future STEM course enrollment?

  3. Are married people more likely to be registered Republicans?

  4. Are party affiliation and marital status related or are they independent?

  5. Is a six-sided die fair?

Exercise 50

Travis’s playlist contains 4 songs by Taylor Swift, 2 by Pavarotti, and 4 by Little Richard. He listens to 100 songs on shuffle mode, resulting in Swift being played 30 times, Pavarotti being played 25 times, and Little Richard being played 45 times. Use a \(\chi^2\)-test to determine if shuffle mode is randomizing over each song with equal probability.