Getting Started

Statistics refers to both a discipline and the values calculated from data. We can define each once we have defined data. Data are mostly what you think they are, but our definition might be broader than you expect: data are a collection of values together with their context. Context is why a good data set comes with documentation; it includes things like what was measured and how the data were collected. Values need not be numbers. They can include text, images, smells, and audio. If it’s something Google or the NSA might collect, it’s probably data.[1]

Data is good. Everyone likes to say they are “data driven.” The trouble with data is, without any machinery for making sense of it, we’re left with a disorienting jumble of numbers. The discipline of statistics helps make that data useful. Below are a few definitions for the discipline.

  1. “The technology of extracting meaning from data.” ([Han08])

  2. “A way of reasoning and a set of tools that help us understand the world.” ([DVVB19])

  3. “The art of making numerical conjectures about puzzling questions.” ([FPP07])

  4. “The discipline that concerns the collection, organization, analysis, and interpretation of data.” (Wikipedia)

The first definition above is presented as a “working definition,” and it usefully emphasizes the notion of extraction. Our textbook, [FPP07], emphasizes that this is a numerical endeavor. Extracting meaning isn’t just a matter of solving equations, though. Judging a large data set by its average, or by anything else we calculate from it, is a reductive exercise, as is any kind of summary (e.g., a map simplifies spatial information). Deciding whether a summary oversimplifies what’s in the data, or whether it applies in more general settings, requires argument. Indeed, [Abe12] emphasizes both the quantitative endeavor and logical rhetoric: “The purpose of statistics is to organize a useful argument from quantitative evidence, using a form of principled rhetoric.” This is all to say that we have to use both sides of our brain going forward, and we have to be comfortable with some ambiguity.

Finally, we can define statistic in its second use. A statistic is a calculation made from data, and per [Han08], a “numerical fact or summary.” A statistic can be judged by how well it says something useful.

Consider the following two gambles.

  • A. 100% chance of winning $1 million.

  • B. 89% chance of winning $1 million, 10% chance of winning $5 million, and 1% chance of winning nothing.

Gamble A pays $1 million on average and Gamble B pays $1.39 million on average (0.89 × $1 million + 0.10 × $5 million + 0.01 × $0). Gamble B is better according to the average, but no statistician would insist that this makes Gamble B better than Gamble A. The average, as a statistic, does not communicate relevant information about risk. Someone else might look at these two gambles and decide that the important statistic to compute is the worst-case winnings. Statistics won’t help us decide which gamble is better. But statistics can help sensitize us to what is lost and what judgments are implicit in choosing one statistic over another.
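To make the comparison concrete, here is a minimal Python sketch that computes both statistics mentioned above for the two gambles. The names `gamble_a`, `gamble_b`, `expected_payout`, and `worst_case` are our own illustrative choices, and representing each gamble as a list of (probability, payout) pairs is an assumption made only for this example.

```python
# Each gamble is a list of (probability, payout in millions of dollars) pairs,
# copied from the two gambles described above.
gamble_a = [(1.00, 1.0)]
gamble_b = [(0.89, 1.0), (0.10, 5.0), (0.01, 0.0)]


def expected_payout(gamble):
    """Average payout: each payout weighted by its probability."""
    return sum(p * x for p, x in gamble)


def worst_case(gamble):
    """Smallest payout that has a nonzero chance of occurring."""
    return min(x for p, x in gamble if p > 0)


for name, gamble in [("A", gamble_a), ("B", gamble_b)]:
    print(f"Gamble {name}: average = {expected_payout(gamble):.2f}M, "
          f"worst case = {worst_case(gamble):.2f}M")
```

The two statistics rank the gambles in opposite orders: the average favors B, while the worst case favors A. Neither calculation is wrong; each one simply keeps different information about the same gamble.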

All of this can be disappointing to someone who is looking for statistics to resolve questions. Some questions can’t be resolved, and others might be resolved only slowly. In the 1890s, some of the first reports were published associating smoking with lung cancer. The United States Surgeon General’s Report on Smoking and Health was not published until 1964. This wasn’t a case of averages failing to adequately capture risk, as in the gambles above; rather, certain standards of evidence needed to be met. Not all types of studies are equally reliable, and reliability comes down to much more than just sample size.

Exercises

The data below come from Spotify. Song popularity is based “in the most part, on the total number of plays the track has had and how recent those plays are.”[2] Artist popularity is derived from song popularities. The following exercises emphasize the inherent ambiguity in finding a good statistic.

Exercise 1

Without using the artist-popularity column, use the data to develop a statistic that can be used to rank the artists on popularity.

Exercise 2

Briefly, what are some important caveats regarding what “popularity” actually means here? Hint: Artists come from different time periods and Garth Brooks’ entire catalog is not on Spotify.

Exercise 3

Use the data to develop a statistic that can be used to find the biggest one-hit wonder.