Normal Distribution

Important Readings

[FPP07], Chapter 5

The Normal Curve#

The normal curve can be used as an ideal histogram, as if we had an enormous collection of observations for a continuous quantitative variable. The normal curve is symmetric and bell-shaped (though not all symmetric and bell-shaped distributions necessarily follow the normal curve).

The standard normal curve, shown below, is centered at zero and has a standard deviation of one. The units on the \(x\)-axis are called standard units. For the standard normal curve, one standard unit is the same as one SD.

_images/normalcurve.svg — Fig. 17 The standard normal curve with an average of zero and a standard deviation of one.#

The normal curve extends to positive and negative infinity, but there is negligible area underneath the curve beyond just a few standard units.
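To put a number on "negligible", here is a minimal sketch that computes the area in both tails beyond a few standard units. It assumes Python's standard-library `statistics.NormalDist` as the stand-in for the standard normal curve; the helper name `area_beyond` is ours, not part of the notes.

```python
from statistics import NormalDist

# Standard normal curve: average 0, SD 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_beyond(z):
    """Area under the standard normal curve more than z standard units from zero (both tails)."""
    # By symmetry, the two tails are equal, so double the left tail.
    return 2 * standard_normal.cdf(-z)

for z in (3, 4, 5):
    print(f"area beyond +/-{z} standard units: {area_beyond(z):.7f}")
```

Beyond three standard units the combined tail area is already well under 1%, and it shrinks quickly after that.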

Standard Units#

A value is converted to standard units by calculating how many standard deviations it is above or below the average. The result is also called a standardized value or a \(z\)-score.

\[\text{value in standard units} = \dfrac{\text{value} - \text{average}}{\text{SD}}.\]

Suppose IQ scores have an average of 100 and an SD of 15. Someone scoring 130 is \((130 - 100)/15 = 2\) SDs above the average, so their IQ is 2 in standard units. Standardized data will necessarily have an average of zero and an SD of one.
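The conversion can be sketched in a few lines of Python. The function names and the small list of scores are made up for illustration, and we assume the SD is computed with the divide-by-\(n\) formula (`statistics.pstdev`).

```python
from statistics import fmean, pstdev

def to_standard_units(value, average, sd):
    """How many SDs the value is above (+) or below (-) the average."""
    return (value - average) / sd

# The IQ example: average 100, SD 15, score 130.
iq_z = to_standard_units(130, average=100, sd=15)
print(iq_z)  # 2.0

def standardize(values):
    """Convert every value in a data set to standard units."""
    average = fmean(values)
    sd = pstdev(values)  # assumption: SD with divisor n
    return [to_standard_units(v, average, sd) for v in values]

scores = [85, 100, 100, 115, 130]  # hypothetical data, for illustration only
z_scores = standardize(scores)

# Up to floating-point rounding, the average is 0 and the SD is 1.
print(fmean(z_scores), pstdev(z_scores))
```

Whatever data set is passed to `standardize`, the last line should print values very close to 0 and 1, as long as the values are not all identical.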

_images/sleep_standardized_histogram.svg — Fig. 18 Data from the American Time Use Survey. Note that the shape of the histogram is not at all changed by converting to standard units; the axes are simply rescaled.#

Converting values to standard units does not make the data follow a normal curve. The sleep data from the American Time Use Survey shown above happens to be reasonably close to the normal curve both before and after the conversion. Skewed data, however, stays skewed after it is standardized.

_images/earnings_standardized_histogram.svg — Fig. 19 Data from the American Time Use Survey. Again, the shape of the histogram is not at all changed by converting to standard units; the axes are simply rescaled.#
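One way to check that standardizing leaves the shape alone is to compare a measure of skew before and after. The sketch below uses made-up, right-skewed values (not the survey data) and a simple skewness measure, the average of the cubed standard units; both the data and the helper names are assumptions for illustration.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Convert values to standard units (SD computed with divisor n)."""
    average, sd = fmean(values), pstdev(values)
    return [(v - average) / sd for v in values]

def skewness(values):
    """Average of the cubed standard units: near 0 if symmetric, positive for a long right tail."""
    return fmean([z ** 3 for z in standardize(values)])

# Hypothetical right-skewed data: most values are small, one is very large.
earnings = [10, 12, 15, 18, 20, 25, 30, 45, 120]

skew_before = skewness(earnings)
skew_after = skewness(standardize(earnings))

print(f"skewness before standardizing: {skew_before:.4f}")
print(f"skewness after standardizing:  {skew_after:.4f}")
```

The two printed values agree up to rounding: the long right tail is still there, only the axis labels have changed.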

The 68-95 rule#

As suggested by the bell shape, values far from zero are rare according to the normal curve. About 68% of the area under the normal curve is between -1 and 1. About 95% of the area is between -2 and 2. And about 99.7% of the area is between -3 and 3.
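These areas can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist`, whose `cdf` method gives the area to the left of a value; `area_within` is a name we made up.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # defaults to average 0, SD 1

def area_within(z):
    """Area under the standard normal curve between -z and z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

for z in (1, 2, 3):
    print(f"between -{z} and {z}: {area_within(z):.1%}")
```

The printed percentages are close to 68%, 95%, and 99.7%, which is where the rule's round numbers come from.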

Recall that we can think of the normal curve as an idealized histogram. If data follows the normal curve, then about 68% of the data will be within one SD of the average and about 95% will be within two SDs of the average. We can do some accounting to find how much data is found in the extremes.

How much data is more than two SDs above or below the average?
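The accounting for this question is a single subtraction. A minimal sketch follows, with a cross-check against the exact tail area from `statistics.NormalDist` (the variable names are ours).

```python
from statistics import NormalDist

# The 68-95 rule: about 95% of the data is within two SDs of the average.
percent_within_two_sds = 95.0

# Everything that is not within two SDs must be in one of the two extremes.
percent_outside_rule = 100.0 - percent_within_two_sds

# Cross-check: exact area in both tails of the standard normal curve.
percent_outside_exact = 2 * NormalDist().cdf(-2) * 100

print(f"68-95 rule: about {percent_outside_rule:.0f}%")
print(f"exact area: about {percent_outside_exact:.2f}%")
```

So roughly 5% of the data is more than two SDs from the average; the exact area is slightly smaller because 95% is itself a rounded figure.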

The question above considers both extremes: the data far above the average and the data far below it. To consider a single extreme, we can use the fact that the normal curve is symmetric, so there is just as much area more than two SDs above the average as there is more than two SDs below the average.

Assume IQ scores follow the normal curve with an average of 100 and an SD of 15. What percentage of people score over 130?
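One way to work through this question is to convert 130 to standard units and then split the leftover area evenly between the two extremes. The sketch below does that and compares the result with the exact area from `statistics.NormalDist`; the variable names are assumptions for illustration.

```python
from statistics import NormalDist

average, sd = 100, 15

# Step 1: convert the cutoff to standard units.
z = (130 - average) / sd  # 2.0, so "over 130" means "more than two SDs above average"

# Step 2: about 5% of the area is more than two SDs from the average,
# and by symmetry half of that is in the upper extreme.
percent_over_130_rule = (100 - 95) / 2

# Cross-check: exact area to the right of 130.
percent_over_130_exact = (1 - NormalDist(mu=average, sigma=sd).cdf(130)) * 100

print(f"standard units: {z}")
print(f"68-95 rule: about {percent_over_130_rule}%")
print(f"exact area: about {percent_over_130_exact:.2f}%")
```

The rule gives about 2.5%; the exact area is a little under that, again because the 95% in the rule is rounded.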

Finding Areas Under the Normal Curve#

When the 68-95 rule is not enough to figure out the area under a normal curve for some region, you can use a \(z\)-table or a calculator.
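As a rough idea of what such a calculator does, here is a minimal sketch: convert both endpoints to standard units, then subtract the areas to the left of each. It assumes `statistics.NormalDist` for the standard normal curve; the function names and the endpoints 90 and 120 are made up for the example.

```python
from statistics import NormalDist

def to_standard_units(value, average, sd):
    """How many SDs the value is above (+) or below (-) the average."""
    return (value - average) / sd

def area_between(low, high, average, sd):
    """Area under the normal curve with the given average and SD, between low and high."""
    standard_normal = NormalDist()  # average 0, SD 1
    z_low = to_standard_units(low, average, sd)
    z_high = to_standard_units(high, average, sd)
    # Area left of the upper endpoint minus area left of the lower endpoint.
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

# IQ scores (average 100, SD 15): what fraction fall between 90 and 120?
area = area_between(90, 120, average=100, sd=15)
print(f"{area:.1%}")
```

Neither endpoint here is a whole number of SDs from the average, which is exactly the situation where the 68-95 rule alone does not give an answer.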

Tables#

Tables come in two varieties, either giving an interior area or the cumulative area to the left.

_images/interior_ztable.svg — Fig. 20 The cell value gives the area (as a fraction between 0 and 1) within \(\pm z\) standard units of zero, where \(z\) is the value implied by the row and column values. These are the same areas as shown in the table on page A-105.#

_images/cumulative_ztable.svg — Fig. 21 The cell value gives the area (as a fraction between 0 and 1) to the left of the value implied by the row and column values.#
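The two kinds of table carry the same information. Because the curve is symmetric, the area outside \(\pm z\) is split evenly between the two tails, so for \(z \ge 0\) the interior area equals twice the cumulative area minus one. The sketch below checks this numerically, assuming `statistics.NormalDist`; the helper names are ours.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # average 0, SD 1

def cumulative_area(z):
    """Area to the left of z, as in a cumulative table."""
    return standard_normal.cdf(z)

def interior_area(z):
    """Area between -z and z, as in an interior table (z >= 0)."""
    return cumulative_area(z) - cumulative_area(-z)

def interior_from_cumulative(cumulative):
    """Convert a cumulative-table entry to the matching interior-table entry."""
    return 2 * cumulative - 1

def cumulative_from_interior(interior):
    """Convert an interior-table entry to the matching cumulative-table entry."""
    return (1 + interior) / 2

z = 1.5
print(f"interior area within +/-{z}: {interior_area(z):.4f}")
print(f"from the cumulative entry:   {interior_from_cumulative(cumulative_area(z)):.4f}")
print(f"cumulative area left of {z}:  {cumulative_area(z):.4f}")
print(f"from the interior entry:     {cumulative_from_interior(interior_area(z)):.4f}")
```

Each pair of printed values matches, so either table can be used to answer the same questions.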

Why care?#

The normal distribution has many nice properties, many of which we are not covering in these notes. Its inclusion is justified because the normal distribution is reasonably close to data like hours of sleep, and because it will sneak up on us again as we consider certain random processes and sample statistics later in the course. The video below shows a normal distribution arising from the random bouncing of some beads.

A beautiful visual demonstration of how mathematical patterns emerge from random events.
(using a Galton board) https://t.co/wpTSNF1aLn pic.twitter.com/lXOhk72Poz
— Lionel Page (@page_eco) September 9, 2019
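The bouncing beads can be imitated with a short simulation. This is only a sketch under simplifying assumptions that are ours, not taken from the video: each bead bounces left or right with equal chance at every row of pegs, and its final position is the number of rightward bounces. The number of rows, the number of beads, and the random seed are arbitrary choices.

```python
import random
from statistics import fmean, pstdev

def drop_bead(rows, rng):
    """Final position of one bead: the number of rightward bounces over all rows."""
    return sum(rng.random() < 0.5 for _ in range(rows))

rng = random.Random(0)  # fixed seed so the run is repeatable
rows, beads = 100, 10_000
positions = [drop_bead(rows, rng) for _ in range(beads)]

average = fmean(positions)
sd = pstdev(positions)

# Fraction of beads landing within two SDs of the average position.
within_two_sds = sum(abs(p - average) <= 2 * sd for p in positions) / beads

print(f"average position: {average:.2f}")
print(f"SD of positions:  {sd:.2f}")
print(f"within two SDs:   {within_two_sds:.1%}")
```

With these settings, the fraction within two SDs should come out roughly in line with the 68-95 rule, even though no normal curve was used to generate the positions.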

Exercises#

Exercise 19

Which of the following has the highest value?

A. The 16%ile of a normal curve with average 1.0 and SD 0.5.
B. The 68%ile of a normal curve with average -0.5 and SD 1.0.
C. The 84%ile of a normal curve with average -1.0 and SD 1.0.
D. The 50%ile of a normal curve with average 0.0 and SD 10.

Exercise 20

Do you expect the 68-95 rule to hold more closely in data set A or B? Verify your answer.

A. 0.25, 0.25, 0.25, 1, 1, 1, 1, 1, 2, 2, 16
B. -2, -2, -2, 0, 0, 0, 0, 0, 1, 1, 4

Exercise 21

Students from high schools A and B compete for admission to college U. U admits any student with an SAT score above a threshold. School A has a lower average SAT than School B. Assume SAT scores follow a normal curve for each school. Is it possible that A has a higher admission rate? What if the admission rate for School B is over 50%? Construct an example if possible.