Illinois State University Mathematics Department
MAT 312: Probability and Statistics for Middle School Teachers
Dr. Roger Day (firstname.lastname@example.org)
One-Variable Data Analysis: Initial Explorations
Here are the semester exam scores for 21 students in a college statistics course:
Final Exam Scores: College Statistics Course
51 46 31 35 37 51 56 51 43 48 52 33 42 37 27 57 65 36 37 55 42
What do these data tell us about the performance of the group? By looking at the raw data, in the form presented here, it is not easy to make statements that describe group performance. Without a more careful look, we can't be sure of the highest or lowest score in the group. With the data in raw form, it is also difficult to determine whether the scores cluster near certain values or are evenly distributed from low to high.
Here's where data analysis provides a helping hand in our attempts to make sense of these data. Using techniques of data analysis, we can manipulate these data to better address questions about the group's performance. The manipulations include carrying out calculations, showing the data in different formats, making comparisons within and beyond these data, and similar techniques. Our goal here is to describe, illustrate, practice, apply, and expand these techniques.
One of the first techniques you might apply is to order the test scores from greatest to least. The table to the right shows the ordered list of scores. What is revealed by ordering these scores?
At least three characteristics of the data can be revealed by examining an ordered list:
- The largest and the smallest data values are found at the top and bottom of an ordered list. Here, the score of 65 points is the largest value in the data set and the score of 27 points is the smallest value in the data set.
- Repeated data values are easier to identify in an ordered list. Here, a scan of the list shows that a score of 51 points appears three times, a score of 42 appears twice, and a score of 37 appears three times.
- By comparing neighboring values as we scan an ordered list, we can identify gaps between values. Here, a significant gap occurs between scores of 57 points and 65 points, and a moderate gap exists between 37 points and 42 points.
Here is another unordered data set. The table shows the top 25 pitchers in NCAA Division I softball, as determined by earned run average (ERA), for the 2003 season. Concentrate on the column labeled Appearances. This tells the number of games in which each player appeared.
- Order the Appearances data from least to greatest.
- Use the ordered Appearances data to describe the three characteristics just discussed.
Here are the Appearances data, ordered from least to greatest.
21 27 29 29 30 32 32 32 33 33 36 38 38 39 40 40 41 43 44 46 48 48 48 51 52
- We see that 21 is the fewest appearances and 52 is the most appearances among the pitchers in the top 25 ERA rankings.
- Three of the top 25 pitchers appeared in 32 games while three others appeared in 48 games. Two players each appeared in 29 games, 33 games, 38 games, and 40 games. These are the data values that repeat in this data set.
- The only significant gap that occurs is between the player with 21 appearances and the player with 27 appearances.
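The three observations above can also be checked by computation. The following Python sketch uses the Appearances data exactly as listed in the text. The name GAP_THRESHOLD and its value of 5 are assumptions made only for illustration; the text does not define how large a gap must be to count as significant.

```python
from collections import Counter

# Appearances data as listed in the text
appearances = [21, 27, 29, 29, 30, 32, 32, 32, 33, 33, 36, 38, 38,
               39, 40, 40, 41, 43, 44, 46, 48, 48, 48, 51, 52]

# Ordering the data puts the extremes at the two ends of the list
ordered = sorted(appearances)
smallest, largest = ordered[0], ordered[-1]

# Repeated values: keep only the values that occur more than once
repeats = {value: count
           for value, count in Counter(ordered).items()
           if count > 1}

# Gaps: the difference between each pair of neighboring ordered values
gaps = [(low, high, high - low) for low, high in zip(ordered, ordered[1:])]

# Hypothetical cutoff for calling a gap "large" (an assumption, not from the text)
GAP_THRESHOLD = 5
large_gaps = [gap for gap in gaps if gap[2] >= GAP_THRESHOLD]

print("smallest:", smallest, "largest:", largest)
print("repeated values and their counts:", repeats)
print("large gaps (low, high, size):", large_gaps)
```

With this cutoff, the only gap reported is the one between 21 and 27 appearances. A different cutoff would flag a different set of gaps, so the choice remains a judgment call.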
These examples illustrate several important characteristics of one-variable data sets. Some of these characteristics have been named or defined so we can communicate more effectively about them. Some related characteristics are described here as well.
- Clusters are isolated groups of points.
- Gaps are large spaces between data points.
- Outliers are data values substantially larger or smaller than the other values in the data set.
- The maximum value of a data set is the largest value in the set. It is also called the upper extreme.
- The minimum value of a data set is the smallest value in the set. It is also called the lower extreme.
- The range of a data set is the difference between the maximum and minimum values of a data set.
- The mode of a data set is the value that occurs most often in the data set. A data set can have more than one mode, and if no data value occurs more often than any other, there is no mode.
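As an illustration of the last four definitions, here is a minimal Python sketch applied to the final exam scores from the start of this section. The helper name modes is hypothetical. It follows the convention stated above: it returns every value that occurs most often, and an empty list when no value occurs more often than any other.

```python
from collections import Counter

# Final exam scores as listed at the start of this section
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]

def modes(data):
    """Return the mode(s) of data in increasing order, or [] if there is no mode."""
    counts = Counter(data)
    highest = max(counts.values())
    # If every value occurs equally often, no value occurs more often
    # than any other, so by the definition above there is no mode.
    if all(count == highest for count in counts.values()):
        return []
    return sorted(value for value, count in counts.items() if count == highest)

minimum = min(scores)            # lower extreme
maximum = max(scores)            # upper extreme
data_range = maximum - minimum   # range = maximum - minimum

print("minimum:", minimum)
print("maximum:", maximum)
print("range:", data_range)
print("mode(s):", modes(scores))
```

For these scores the minimum is 27, the maximum is 65, and the range is 65 - 27 = 38. The scores 37 and 51 each occur three times, so the data set has two modes.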
Return to the NCAA Division I Softball data on pitchers. For the column labeled Runs:
- Order the data set from least to greatest.
- Identify any clusters, gaps, and outliers.
- Determine the minimum and maximum values of this set and calculate the range of these data.
Final Exam Scores, Ordered from Greatest to Least
65 57 56 55 52 51 51 51 48 46 43 42 42 37 37 37 36 35 33 31 27
In the next few sections, we look more formally at visual and numerical tools for one-variable data analysis.