(normal)=
# Normal Distribution

```{admonition} Important Readings
:class: seealso
- {cite}`freedman2007statistics`, Chapter 5
```

## The Normal Curve

The normal curve can be used as an ideal histogram, as if we had an enormous collection of observations for a continuous quantitative variable. The normal curve is symmetric and bell-shaped (though not all symmetric and bell-shaped distributions necessarily follow the normal curve). 

The *standard* normal curve, shown below, is centered at zero and has a standard deviation of one. The units on the $x$-axis are called **standard units**. For the standard normal curve, one standard unit equals one SD.

```{figure} images/normalcurve.svg
:width: 90%
:name: normalcurve

The standard normal curve with an average of zero and a standard deviation of one.
```

The normal curve extends to positive and negative infinity, but there is negligible area underneath the curve beyond just a few standard units.


## Standard Units

A value is converted to standard units by calculating how many standard deviations it is from the average. This is also called a *standardized* value or a *$z$-score*. 

$$\text{value in standard units} = \dfrac{\text{value} - \text{average}}{\text{SD}}.$$

For example, suppose the average IQ is 100 and the SD is 15. Someone scoring 130 is $(130 - 100)/15 = 2$ SDs above the average, so their IQ is 2 in standard units. Standardized data will necessarily have an average of zero and an SD of one.
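
The conversion can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the helper name `to_standard_units` and the `scores` array are made up for the example, and the SD is the population SD (dividing by $n$), which is NumPy's default.

```python
import numpy as np

def to_standard_units(values):
    """Convert values to standard units: (value - average) / SD."""
    values = np.asarray(values, dtype=float)
    # np.std defaults to the population SD (divides by n)
    return (values - values.mean()) / values.std()

# A single IQ score, using the stated average of 100 and SD of 15
iq_z = (130 - 100) / 15

# Hypothetical scores, made up for illustration
scores = np.array([70, 85, 100, 115, 130])
z = to_standard_units(scores)

print(iq_z)               # standard units for an IQ of 130
print(z.mean(), z.std())  # average is (numerically) zero, SD is one
```

Whatever the original units, the standardized values end up with an average of zero and an SD of one, up to floating-point rounding.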


```{figure} images/sleep_standardized_histogram.svg
:width: 90%
:name: sleep_data_standard

Data from the American Time Use Survey. Note the shape of the histogram is not at all changed by converting to standard units--the axes are simply rescaled. 
```

Converting values to standard units does *not* make the data follow a normal curve. The sleep data above, from the American Time Use Survey, kept the same shape after standardizing. Likewise, if we standardize skewed data, the standardized data keep the skew.

```{figure} images/earnings_standardized_histogram.svg
:width: 90%
:name: income_data_standard

Data from the American Time Use Survey. Again, the shape of the histogram is not at all changed by converting to standard units--the axes are simply rescaled. 
```
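
One way to check that standardizing leaves the shape alone is to compare a measure of skewness before and after. The sketch below uses synthetic right-skewed data (a lognormal sample standing in for earnings), not the survey data, and relies on `scipy.stats.skew`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# Synthetic right-skewed data; an assumption for illustration, not the survey data
earnings = rng.lognormal(mean=10, sigma=1, size=10_000)

# Convert to standard units using the population SD
standardized = (earnings - earnings.mean()) / earnings.std()

skew_before = skew(earnings)
skew_after = skew(standardized)

print(skew_before, skew_after)  # the two values agree up to rounding
```

Skewness is computed from standardized deviations, so shifting and rescaling the data cannot change it. The standardized data have an average of zero and an SD of one, yet remain just as skewed.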

## The 68-95 Rule

As suggested by the bell shape, values far from zero are rare according to the normal curve. About **68%** of the area under the normal curve is between -1 and 1. About **95%** of the area is between -2 and 2. And about **99.7%** of the area is between -3 and 3. 
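
These areas can be checked numerically. The sketch below assumes SciPy is available and uses `scipy.stats.norm.cdf`, the cumulative area to the left of a value under the standard normal curve; the helper name `area_within` is made up for the example.

```python
from scipy.stats import norm

def area_within(k):
    """Area under the standard normal curve between -k and k."""
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard unit(s): {area_within(k):.4f}")
# within 1 standard unit(s): 0.6827
# within 2 standard unit(s): 0.9545
# within 3 standard unit(s): 0.9973
```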

Recall that we can think of the normal curve as an idealized histogram. If data follow the normal curve, then about 68% of the data will be within one SD of the average and about 95% will be within two SDs of the average. We can do some accounting to find how much data is found in the extremes.

How much data is more than two SDs above or below the average?

```{dropdown} More than two SDs from the average

About 5% of the data will be more than two SDs above or below the average. If about 95% of the data is within two SDs, there must be 100% - 95% = 5% left beyond two SDs.

```

In the above, we considered both extremes: the data far above the average and the data far below it. We can also consider a single extreme. Because the normal curve is symmetric, there is just as much area more than two SDs above the average as there is more than two SDs below it.


Assume IQ scores follow the normal curve with an average of 100 and SD=15. What percentage of people score over 130? 

```{dropdown} IQ above 130

A score of 130 is 2 standard units, since (130 - 100)/15 = 2. About 5% of the data is greater than 2 or below -2. This area is split evenly between the two tails. Therefore, about 2.5% of the data is greater than 2 standard units (or 130 IQ points).

```
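
The tail accounting can be compared with a direct computation. This sketch assumes SciPy and uses `scipy.stats.norm.sf`, which gives the area to the right of a value under the standard normal curve; the variable names are illustrative.

```python
from scipy.stats import norm

average, sd = 100, 15
z = (130 - average) / sd        # 130 IQ points is 2 standard units

rule_estimate = (1 - 0.95) / 2  # 68-95 rule: half of the leftover 5%
upper_tail = norm.sf(z)         # area to the right of z
both_tails = 2 * upper_tail     # symmetry: the lower tail has the same area

print(f"rule: {rule_estimate:.3f}, computed: {upper_tail:.4f}")
```

The computed upper tail is about 2.3%, slightly below the 2.5% from the rule, because the area within two standard units is a little more than 95%.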


## Finding Areas Under the Normal Curve

When the 68-95 rule is not enough to figure out the area under a normal curve for some region, you can use a $z$-table or a calculator. Below is a calculator.

```{code-cell} ipython3
from bokeh.io import show, output_notebook
from bokeh.layouts import column
from bokeh.models import CustomJS, Slider, ColumnDataSource, Div
from bokeh.plotting import figure
from scipy.stats import norm
import numpy as np
from IPython.display import display, HTML

# Render Bokeh output inline in the notebook
output_notebook(hide_banner=True)

# Standard normal curve data
x = np.linspace(-5, 5, 1000)
y = norm.pdf(x)
source = ColumnDataSource(data=dict(x=x, y=y, y_shaded=y))

# Bokeh plot setup (16:9 aspect ratio)
width = 450
height = int((9 / 16) * width)
plot = figure(title="Area under Standard Normal Curve", tools="save",
              x_range=[-5, 5], y_range=[0, max(y) * 1.1], width=width, height=height)

plot.line('x', 'y', source=source, line_width=2)
shaded_area = plot.varea(x='x', y1=0, y2='y_shaded', source=source, alpha=0.3)

# Hide the y-axis and the grid lines
plot.yaxis.visible = False
plot.ygrid.visible = False
plot.xgrid.visible = False

# Text showing the shaded area
area_text = Div(text="Area: 100.000%")

# Sliders for the left and right limits
w = int(width * 0.8)
left_slider = Slider(start=-5, end=5, value=-5, step=0.1, title="Left Limit", width=w)
right_slider = Slider(start=-5, end=5, value=5, step=0.1, title="Right Limit", width=w)

# JavaScript callback: reshade the curve and recompute the area in the browser
callback = CustomJS(args=dict(source=source, area_text=area_text,
                              left_slider=left_slider, right_slider=right_slider), code="""
    function erf(x) {
        // Numerical approximation of the error function
        const a1 =  0.254829592;
        const a2 = -0.284496736;
        const a3 =  1.421413741;
        const a4 = -1.453152027;
        const a5 =  1.061405429;
        const p  =  0.3275911;

        const t = 1.0 / (1.0 + p * Math.abs(x));
        const y = 1 - (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t * Math.exp(-x * x);

        return x >= 0 ? y : -y;
    }

    function norm_cdf(value) {
        // Cumulative area to the left of value under the standard normal curve
        return (1.0 + erf(value / Math.sqrt(2))) / 2.0;
    }

    const data = source.data;
    const left_limit = left_slider.value;
    const right_limit = right_slider.value;
    const x = data['x'];
    const y = data['y'];
    const y_shaded = data['y_shaded'];

    // Shade only the part of the curve between the two limits
    for (let i = 0; i < x.length; i++) {
        y_shaded[i] = (x[i] >= left_limit && x[i] <= right_limit) ? y[i] : 0;
    }

    const total_area = norm_cdf(right_limit) - norm_cdf(left_limit);
    area_text.text = 'Area: ' + (total_area * 100).toFixed(3) + '%';
    source.change.emit();
""")

left_slider.js_on_change('value', callback)
right_slider.js_on_change('value', callback)

layout = column(left_slider, right_slider, area_text, plot)

# CSS to center the output
style = """
<style>
.output {
    display: flex;
    align-items: center;
    justify-content: center;
}
</style>
"""

# Apply the style, then show the plot
display(HTML(style))
show(layout)
```

### Tables

Tables come in two varieties: one gives the interior area between $-z$ and $z$, and the other gives the cumulative area to the left of $z$.



```{figure} images/cumulative_area_interior.svg
:width: 40%
:name: cumulative_area_interior

```

```{figure} images/tikz/interior_ztable.svg
:width: 90%
:name: interior_ztable

The cell value gives the area (as a fraction between 0 and 1) within $\pm z$ units, where $z$ is the value implied by the row and column values. These are the same areas as shown in the table on page A-105.
```


```{figure} images/cumulative_area_normal.svg
:width: 40%
:name: cumulative_area_normal

```

```{figure} images/tikz/cumulative_ztable.svg
:width: 90%
:name: cumulative_ztable

The cell value gives the area (as a fraction between 0 and 1) to the left of the value implied by the row and column values.
```
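
The two kinds of table are related by symmetry: for $z \geq 0$, the cumulative area to the left of $z$ is one half plus half of the interior area between $-z$ and $z$. The sketch below illustrates this with SciPy's `norm.cdf`; the function names are made up for the example and are not taken from either table.

```python
from scipy.stats import norm

def cumulative_area(z):
    """Area to the left of z under the standard normal curve."""
    return norm.cdf(z)

def interior_area(z):
    """Area between -z and z under the standard normal curve (z >= 0)."""
    return norm.cdf(z) - norm.cdf(-z)

def cumulative_from_interior(interior):
    """Convert an interior-table entry (z >= 0) to the area left of z."""
    return 0.5 + interior / 2

z = 1.5
print(interior_area(z), cumulative_area(z))
print(cumulative_from_interior(interior_area(z)))  # matches cumulative_area(z)
```

Either table can therefore answer the same questions; the interior table just needs the extra halving step when a one-sided area is wanted.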

## Why care? 

The normal distribution has many nice properties, most of which we aren't covering in these notes. We include it for two reasons: it is reasonably close to data like hours of sleep, and it will sneak up on us again when we consider certain random processes and sample statistics later in the course. The video below shows a normal distribution arising from the random bouncing of some beads.


<div style="display: flex; justify-content: center;">
    <blockquote class="twitter-tweet" data-media-max-width="560">
        <p lang="en" dir="ltr">A beautiful visual demonstration of how mathematical patterns emerge from random events. <br>(using a Galton board)<a href="https://t.co/wpTSNF1aLn">https://t.co/wpTSNF1aLn</a> <a href="https://t.co/lXOhk72Poz">pic.twitter.com/lXOhk72Poz</a></p>&mdash; Lionel Page (@page_eco) <a href="https://twitter.com/page_eco/status/1171014160370388994?ref_src=twsrc%5Etfw">September 9, 2019</a>
    </blockquote> 
</div>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>


## Exercises

   
```{exercise-start}
:label: zscores
```

Which of the following has the highest value? 

A. The 16\%ile of a normal curve with average 1.0 and SD 0.5. <br>
B. The 68\%ile of a normal curve with average -0.5 and SD 1.0. <br>
C. The 84\%ile of a normal curve with average -1.0 and SD 1.0. <br>
D. The 50\%ile of a normal curve with average 0.0 and SD 10. <br>

```{exercise-end}
```

```{exercise-start}
:label: log68
```

Do you expect the 68-95 rule to hold more closely in data set A or B? Verify your answer.

A. 0.25, 0.25, 0.25, 1, 1, 1, 1, 1, 2, 2, 16 <br>
B. -2, -2, -2, 0, 0, 0, 0, 0, 1, 1, 4 <br>


```{exercise-end}
```


```{exercise-start}
:label: schoolsnorm
```

Students from high schools A and B compete for admission to college U. U admits any student with an SAT score above a threshold. School A has a lower average SAT than School B. Assume SAT scores follow a normal curve for each school. Is it possible that A has a higher admission rate? What if the admission rate for School B is over 50%? Construct an example if possible. 


```{exercise-end}
```
