Illinois State University Mathematics Department

MAT 312: Probability and Statistics for Middle School Teachers
Dr. Roger Day (day@ilstu.edu)

### Course Introduction: Probability, Statistics, & Data Analysis

The Birth Day Problem
The Birth Month Problem
The Raisin Bran Problem
Fundamental Characteristics of a Data Distribution
Working Definitions for Probability, Statistics, and Data Analysis
Historical Notes
Types of Data
Data Representations

The situations described here are intended to generate discussion of some fundamental concepts we will explore through probability, statistics, and data analysis.

The Birth Day Problem

Look around the room.

• Estimate the probability that at least two people in the room share a common birth day (same day and month, not necessarily the same year).
• What do you think is the minimum number of people needed in the room for the probability that at least two of them share a common birth day to be at least 50%?

Write a sentence or two to explain your response to each of these questions.
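Once you have written down your estimates, you can check them against a direct calculation. The following is a minimal sketch, assuming 365 equally likely birth days (no leap years, no seasonal birth patterns). The function names `prob_shared_birthday` and `smallest_group` are illustrative only and are not part of the course materials.

```python
from fractions import Fraction


def prob_shared_birthday(people, days=365):
    """Probability that at least two of `people` share a birth day.

    Assumes each of the `days` possible birth days is equally likely.
    """
    if people > days:
        return Fraction(1)  # more people than days: a match is certain
    p_all_different = Fraction(1)
    for k in range(people):
        # The (k+1)-th person must avoid the k birth days already taken.
        p_all_different *= Fraction(days - k, days)
    return 1 - p_all_different


def smallest_group(threshold=Fraction(1, 2), days=365):
    """Smallest group size whose match probability reaches `threshold`."""
    people = 1
    while prob_shared_birthday(people, days) < threshold:
        people += 1
    return people


if __name__ == "__main__":
    for n in (10, 20, 30, 40):
        print(n, "people:", round(float(prob_shared_birthday(n)), 3))
    print("smallest group for at least 50%:", smallest_group())
```

The calculation works through the complement: it is easier to find the probability that everyone has a different birth day and subtract that from 1.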

The Birth Month Problem

Look around the room.

• Estimate the probability that at least two people in the room share a common birth month (same month, not necessarily the same year).

Write a sentence to explain how you came up with this estimate.
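An estimate like this can also be tested by simulation. The sketch below assumes that each of the 12 months is equally likely, which is only approximately true because months differ in length. The function name `estimate_shared_month` and the default number of trials are choices made for this illustration.

```python
import random


def estimate_shared_month(people, trials=20_000, seed=0):
    """Estimate P(at least two of `people` share a birth month) by simulation.

    Assumes each of the 12 months is equally likely.
    """
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    hits = 0
    for _ in range(trials):
        months = [rng.randrange(12) for _ in range(people)]
        if len(set(months)) < people:  # a repeated month shrinks the set
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (2, 5, 8, 13):
        print(n, "people:", estimate_shared_month(n))
```

With 13 or more people the estimate is exactly 1, because there are only 12 months to go around.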

The Raisin Bran Problem

Suppose the Kick-a-Poo Milling Company made this claim about their Raisin Bran:

In our 20-ounce box of raisin bran, we average 143 raisins per box.

Now suppose you just opened a 20-ounce box of Kick-a-Poo Raisin Bran and accurately counted 174 raisins. Would this number of raisins be a rare occurrence? Would it seem unusual to you? Write a sentence to explain your response.

For the Raisin Bran Problem, the emphasis is on the lack of necessary information. Although we know the "typical" or "representative" number of raisins in a box (the average is 143 raisins per box), we know nothing about how the counts in individual boxes vary around that typical value. Relatedly, we know nothing about the overall distribution of raisin counts across a large number of boxes. We need more information before we can draw conclusions about how rare 174 raisins in a box would be.

Activities to emphasize the variation in the raisins data set as well as the overall shape of the distribution of number of raisins include the following:

1. Create several data sets that share the same typical (average) value but show different patterns of deviation from it.
2. Create a data set to represent a sample from the raisin bran production line.
3. To gather evidence for a problem similar to this one, analyze the distribution of the number of raisins in single-serving size boxes of raisin bran cereal (or of the number of raisins in snack-size boxes of raisins or of the number of raisins in single-serving instant oatmeal with raisins).
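The first activity can be sketched in a few lines of code. The three data sets below are invented for illustration (they are not measurements from any cereal box); each has an average of 143 but a different pattern of deviation. The helper names `describe` and `distance_in_sds` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical raisin counts, invented for illustration only.
# Each data set has the same average, 143 raisins per box.
boxes = {
    "tight": [141, 142, 143, 143, 144, 145],
    "wide": [110, 125, 140, 146, 161, 176],
    "skewed": [130, 132, 134, 136, 138, 188],
}


def describe(counts):
    """Average, lowest, highest, and standard deviation of the counts."""
    return {
        "mean": mean(counts),
        "low": min(counts),
        "high": max(counts),
        "sd": pstdev(counts),
    }


def distance_in_sds(count, counts):
    """How many standard deviations `count` lies from the average."""
    return (count - mean(counts)) / pstdev(counts)


if __name__ == "__main__":
    for name, counts in boxes.items():
        summary = describe(counts)
        print(name, summary, "174 is",
              round(distance_in_sds(174, counts), 1), "sd from the mean")
```

A box with 174 raisins lies far outside the "tight" data set but within the range of the "wide" one, even though both average 143. The average alone cannot tell us whether 174 is rare.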

Here is data for one of the suggestions in #3 above. Students counted the number of raisins in single-serving boxes of raisins and used the data to help answer the following question:

Is it unusual for a box of raisins to contain 86 raisins?

Fundamental Characteristics of a Data Distribution

The Raisin Bran Problem provides an excellent setting through which to introduce the three important characteristics of a distribution: location, spread, and shape. Exploration of these characteristics is the objective of much of our study of statistics.

• location: describes the anchor point or center of a distribution; the value that is most characteristic or typical of the entire set of values; often is a single value called a measure of central tendency.
• spread: describes the variability of the values in a distribution, or how the actual values in a data set deviate from some average value; typically is a single value, such as the range, the variance, the midspread, the standard deviation, or the mean deviation.
• shape: describes the overall visual appearance of a distribution; we use descriptions such as bell-shaped or normal distribution, uniform distribution, symmetrical, asymmetrical, single-peaked, multipeaked, or skewed; we may describe a shape by referring to its outliers or gaps.
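As a rough illustration of these three characteristics, the sketch below computes several measures of location and spread and prints a simple dot plot to suggest shape. The counts are hypothetical, invented for this example. The midspread is computed here as the difference between the medians of the upper and lower halves of the data; other quartile conventions exist and can give slightly different values. Variance and standard deviation use the population formulas (dividing by n).

```python
from statistics import mean, median, pstdev, pvariance

# Hypothetical raisin counts for ten small boxes (invented for illustration).
counts = [25, 26, 26, 27, 28, 28, 28, 29, 30, 33]


def hinges(values):
    """Lower and upper quartiles as medians of the lower and upper halves."""
    data = sorted(values)
    n = len(data)
    lower, upper = data[: n // 2], data[(n + 1) // 2:]
    return median(lower), median(upper)


def summarize(values):
    center = mean(values)
    q1, q3 = hinges(values)
    return {
        # location
        "mean": center,
        "median": median(values),
        # spread
        "range": max(values) - min(values),
        "midspread": q3 - q1,
        "variance": pvariance(values),
        "standard deviation": pstdev(values),
        "mean deviation": mean([abs(v - center) for v in values]),
    }


def dot_plot(values):
    """Print one row per possible value, with one * per occurrence (shape)."""
    for value in range(min(values), max(values) + 1):
        print(f"{value:>3} {'*' * values.count(value)}")


if __name__ == "__main__":
    for name, number in summarize(counts).items():
        print(f"{name}: {number}")
    dot_plot(counts)
```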

Working Definitions for Probability, Statistics, and Data Analysis

The problem situations posed earlier help give us a bit of insight into just what we mean by probability, statistics, and data analysis. Here are practical definitions of these three elements of mathematics that are the focus of the course.

• Probability: the study of how frequently an event occurs in relation to all possible alternative events
    • empirical probability (also called experimental probability): probability determined on the basis of existing evidence
    • theoretical probability: probability determined by considering all situations that could happen
• Statistics: the science and art of gathering, analyzing, and making inferences from data
• Data Analysis: the breakdown of data into its important component parts
    • Exploratory Data Analysis: analyzing data to learn more about it; characterized by skepticism and openness--skeptical of summary measures that may conceal aspects of or misrepresent the data and open to unanticipated patterns in the data that may be very revealing
    • Confirmatory Data Analysis: data analysis that relies on the use of mathematical models of relationships, together with probability tests of numerical measures to confirm or refute hypotheses
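The difference between empirical and theoretical probability can be illustrated with a pair of dice. This is only a sketch: the example (the sum of two fair six-sided dice) and the function names are assumptions made for illustration. The theoretical value comes from listing all 36 equally likely outcomes; the empirical value comes from simulated rolls.

```python
import random
from fractions import Fraction
from itertools import product


def theoretical_prob_sum(target):
    """Count favorable outcomes among all 36 equally likely rolls of two dice."""
    outcomes = list(product(range(1, 7), repeat=2))
    favorable = [roll for roll in outcomes if sum(roll) == target]
    return Fraction(len(favorable), len(outcomes))


def empirical_prob_sum(target, rolls=10_000, seed=1):
    """Relative frequency of `target` in simulated rolls of two dice."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(rolls):
        if rng.randint(1, 6) + rng.randint(1, 6) == target:
            hits += 1
    return hits / rolls


if __name__ == "__main__":
    print("theoretical:", theoretical_prob_sum(7))
    print("empirical:  ", empirical_prob_sum(7))
```

The two values should be close but will rarely match exactly; the empirical value changes with the number of rolls and with the seed.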

Historical Notes

As fields of study in the mathematical sciences, probability and statistics emerged from real-world roots. The following brief historical chronology is based on an article in the Mathematics Teacher, November 1991, pages 623-630: "A Brief Look at the History of Probability and Statistics" written by James E. Lightner.

Probability had twin roots in the solution of gambling problems and in the handling of statistical data related to such quantitative instruments as mortality tables and insurance rates.

3500 BC: Egypt -- "hounds and jackals" -- moving counters according to rules associated with outcome on astragali [bone in human foot, used as cubical die]

1200 BC: evidence of cubical marked dice evolving from much cruder bones (present arrangement of pips dates to 1400 BC)

no link between gaming and mathematics -- concept of randomness contrary to thinking of the time--God (or many Gods) directed earthly events

late 15th-early 16th centuries: first truly mathematical treatment of probabilities--Italian mathematicians considered mathematical chances in certain gambling games, including dice

1654: correspondence between two mathematicians, Pascal and Fermat, on the "problem of points" credited with giving rise to the science of mathematical probability:

Two players bet equal amounts that a chosen number on a die will turn up three times before the opponent's choice turns up three times. After some time, one player's number has shown twice while the other's has shown once. How should they divide the stakes in the game?
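One way to approach this question is to divide the stakes in proportion to each player's chance of winning if play continued. The sketch below follows that idea. It assumes a fair die, so that each roll showing one of the two chosen numbers is equally likely to favor either player, and it ignores rolls that show neither number. The function name `prob_first_wins` is hypothetical.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def prob_first_wins(a_needs, b_needs):
    """Chance that the first player wins, given how many more appearances
    of their number each player still needs.

    Assumes every decisive roll favors each player with probability 1/2.
    """
    if a_needs == 0:
        return Fraction(1)
    if b_needs == 0:
        return Fraction(0)
    half = Fraction(1, 2)
    return (half * prob_first_wins(a_needs - 1, b_needs)
            + half * prob_first_wins(a_needs, b_needs - 1))


if __name__ == "__main__":
    # The leader needs 1 more appearance; the opponent needs 2 more.
    leader_share = prob_first_wins(1, 2)
    print("leader's share of the stakes:", leader_share)
    print("opponent's share of the stakes:", 1 - leader_share)
```

Under these assumptions the opponent wins only if their number appears on both of the next two decisive rolls, so the stakes divide 3 to 1 in favor of the leader.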

1657: first formal writing about probability: "On Reasoning In Games Of Chance" by Christiaan Huygens, Dutch physicist

1713: first published book devoted entirely to probability: "The Art of Conjecturing" by Jakob Bernoulli

"The theory of probabilities is at bottom nothing but common sense reduced to calculus; it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which ofttimes they are unable to account . . . It teaches us to avoid the illusions which often mislead us; . . . there is no science more worthy of our contemplations nor a more useful one for admission to our system of public education." (Laplace, 1812: Analytical Theory of Probability)

Statistics was rooted in the processing of data.

• accurate and systematic counting of economic wealth, population, plunder of war goes back to antiquity: for example, census in Israel
• significant statistical investigation began when merchants (particularly those representing insurance companies) needed probabilistic estimates of events

1662: John Graunt--first person to draw statistical inferences from analyses of mass data--"Natural and Political Observations Made Upon the Bills of Mortality" (Graunt sometimes called the father of statistics)--info drawn from yearly and weekly reports of number of burials in various London church parishes (records arose as early as 1532)--his observations included that there were more male births than female births, that women tended to live longer than men, and that the number of persons dying was fairly constant from year to year

1693: "Degrees of Mortality of Mankind" by Oxford mathematician Edmond Halley--careful study of annuities

1795?: Gauss: explication of the normal curve

1829: beginning of statistical analysis of census data by Quetelet in Belgium

1865: Mendel relates probability to genetics and hybridization

1877: Galton discovers law of regression and the correlation coefficient

1894: Karl Pearson applied probability to biology--created biometrics

1970s: John Tukey develops Exploratory Data Analysis

Types of Data

There are four different types of data, based on the sorts of qualitative or quantitative comparisons that can be made among the data. The first two data types are typically called qualitative data and the last two are called quantitative data.

• Nominal data can be classified into categories. The data are typically labels rather than numbers. Examples include hair color (brown, black, blonde) and make of car (Ford, Honda, Dodge).
• Ordinal data can be rank ordered and may be labels or numbers. Examples include hair length (bald, short, long) and size of car (compact, mid size, full size).
• Interval data can be compared based on differences between the values. These values are always numbers, but the scale value zero (0) does not take on significant meaning. Examples include air temperature and oil temperature.
• Ratio data can be compared according to multiples of values. These data are always numbers, and the scale value zero (0) is significant. Examples include hair length and fuel tank capacity.
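A short calculation can show why multiples are meaningful for ratio data but not for interval data. This is an illustrative sketch; the specific temperatures and tank sizes are made up. Changing temperature units from Celsius to Fahrenheit moves the zero point, so "twice as hot" does not survive the change, while equal differences stay equal. Changing fuel capacity from gallons to liters keeps zero at zero, so "twice as large" survives.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def gallons_to_liters(gallons):
    """Convert US gallons to liters."""
    return gallons * 3.785411784


if __name__ == "__main__":
    # Interval data: the ratio depends on the units chosen.
    print("20 C / 10 C =", 20 / 10)
    print("68 F / 50 F =", c_to_f(20) / c_to_f(10))

    # Interval data: equal differences remain equal after conversion.
    print("F differences:", c_to_f(30) - c_to_f(20), c_to_f(20) - c_to_f(10))

    # Ratio data: the ratio is the same in either unit.
    print("20 gal / 10 gal =", 20 / 10)
    print("in liters       =", gallons_to_liters(20) / gallons_to_liters(10))
```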

Data Representations

A data set can be displayed in its entirety or it can be summarized using one or more summary elements. Such representations are primarily visual or primarily numerical. Early in the course, we focus on developing methods for creating data representations with respect to these four perspectives:

• Visual Displays: Visual data representations that preserve the values in a data set.
• Visual Summaries: Visual representations of data that provide summary information about the data set.
• Numerical Displays: Numerical data representations that preserve the values in a data set.
• Numerical Summaries: Numerical representations of data that provide summary information about the data set.
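To make the distinction between a display and a summary concrete, the sketch below builds one of each for a small, hypothetical data set (the counts are invented for illustration). A stem-and-leaf plot is a visual display, because every value can be read back from it. A five-number summary is a numerical summary, because it reports only a few landmark values. The quartiles here are medians of the lower and upper halves of the data; other conventions exist.

```python
from collections import defaultdict
from statistics import median


def stem_and_leaf(values):
    """Lines of a stem-and-leaf plot (tens digit | ones digits).

    Intended for non-negative whole numbers.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}")
    return lines


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    data = sorted(values)
    n = len(data)
    lower, upper = data[: n // 2], data[(n + 1) // 2:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


if __name__ == "__main__":
    # Hypothetical raisin counts, invented for illustration only.
    counts = [25, 26, 26, 27, 28, 28, 28, 29, 30, 33]
    print("\n".join(stem_and_leaf(counts)))
    print(five_number_summary(counts))
```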

Several examples of visual displays and visual summaries are shown for the raisins data. You should soon be able to create such visual representations for 1- and 2-variable data sets. You should also know how to use your calculator to help do that. Several examples of numerical displays and numerical summaries also are shown for the raisins data. You will develop skills for calculating or determining such numerical representations for 1- and 2-variable data sets. You should also know how to use your calculator to help do this.