Illinois State University Mathematics Department

 MAT 312: Probability and Statistics for Middle School Teachers Dr. Roger Day (day@math.ilstu.edu)

### Numerical Representations of Data

In the last section we introduced a variety of visual representations for data sets, including visual displays of data and visual summaries of data. In this section, we continue to build our data analysis tools as we identify numerical representations for a data set.

Measures of Central Tendency (Location)

We have previously mentioned that one important aspect of the distribution of a data set is its location. We typically represent location with one or more measures of central tendency to describe the center of the distribution. Three often-used numeric measures of central tendency are the mean, the median, and the mode.

The mean represents the arithmetic average of all the data values. That is, we determine the mean by dividing the sum of the data values by the total number of data values. In essence, we "pile up" all the values and then "distribute them equally" over all elements.

The median represents the physical middle of a data set. To determine the median, we arrange the data values in ascending or descending order and then make a physical count to find the middle value. When there are an odd number of data values, this results in exactly one data value to serve as the median. With an even number of data values, this physical count to the middle will put us between two data values. In this case, we find the arithmetic average (mean) of these two middle values and that number is the median of the data set.

The mode represents the value that occurs most frequently in a distribution. In many cases, there is more than one mode in a data set. If no value appears more than any other, there is no mode.

Let's return to a data set we used previously, the 21 scores from a college statistics course, together with a stem plot of that data.

Final Exam Scores: College Statistics Course

51 46 31 35 37 51 56 51 43 48 52 33 42 37 27 57 65 36 37 55 42

The stem plot helps us to quickly move to the middle score, 43. This is the median. By counting leaves, we see that 37 and 51 both occur three times, so both 37 and 51 are modes. Because there are two modes, we say that the distribution is bimodal. To determine the mean, a bit more work is required. We add the 21 scores and divide by 21, resulting in a mean of 44.38, to the nearest hundredth of a unit.
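As a check on the three values just described, here is a minimal sketch in Python using the standard `statistics` module. The variable names are our own choices for illustration; the scores are the 21 values listed above.

```python
from statistics import mean, median, multimode

# The 21 final exam scores listed in the text.
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]

exam_mean = mean(scores)                 # sum of the values divided by the count
exam_median = median(scores)             # middle value of the ordered list
exam_modes = sorted(multimode(scores))   # every value tied for most frequent

print(round(exam_mean, 2), exam_median, exam_modes)
```

Note that `multimode` reports all values tied for the highest frequency, which matches our description of a bimodal distribution.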

Which of these measures of central tendency is most effective in representing the location of the distribution of these exam scores? This is a question you should ask with virtually any data set you explore.

There are no hard and fast rules, but here is a comment to keep in mind: The mean is more affected by extreme values in a data set than are the median and the mode. Why is this so? Because the mean is affected in this way, we say that the mean is a non-resistant statistic, or that the mean is not a robust statistic. The median, however, is robust. The median is resistant to being affected by extreme values.
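One way to explore this is to change a single value and watch what happens. The sketch below uses the 21 exam scores from earlier in this section and replaces the top score, 65, with 650. That replacement is a hypothetical value of our own, not part of the course data.

```python
from statistics import mean, median

# The 21 final exam scores used earlier in this section.
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]

# Hypothetical change (not from the course data): the top score 65 becomes 650.
altered = [650 if s == 65 else s for s in scores]

mean_shift = mean(altered) - mean(scores)        # how far the mean moves
median_shift = median(altered) - median(scores)  # how far the median moves

print(round(mean_shift, 2), median_shift)
```

The mean moves because every value contributes to the sum. The median stays put because the middle position in the ordered list has not changed.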

Let's compare a random sample of salaries from two companies. For which company would you prefer to work?

The mean of each distribution is \$25,000. The medians differ. For Company X it is \$13,476 and for Company Y it is \$25,335.

Consider the effectiveness of these three measures of central tendency:

• When might it be most appropriate to know what measurement occurs most often? In retail sales, there is often talk about the items for which we have had the most sales. What shoe size has been most popular this month? What waist size do we need to reorder? These are situations where the mode is put to effective use.
• When can the mean provide an effective measure of the location of a distribution? When the distribution is close to being a symmetric distribution. In a symmetric distribution, either there are no significant extreme values or there are offsetting extreme values. The mean is often used in education, economics, meteorology, and in inferential statistics.
• The median is useful when distributions are not close to symmetric, that is, when extreme values tend to pull the mean up or down significantly. In education, exam scores are often reported as percentiles, which are essentially position reports: How many students at the 20th percentile? How many at the 70th percentile? The housing industry summarizes home prices using the median rather than the mean. Why might that be?

Recall again the three significant components used to describe a distribution: location, spread, and shape. Measures of variability help us to describe the spread or dispersion of a distribution in order to address questions about the variability among the elements in the data set. Here is an example to help us see how the degree of spread affects the character of a distribution.

The following values represent scores of a common exam given to ten students in each of two schools.

Here, the mean of each set of scores is 71.7 and, perhaps surprisingly, the median of each set of data is 72. Do the data sets truly differ in any way? Examine the line plots of the distributions, shown below.

Using terms we introduced earlier, we say that there is a large gap in the data for School I. The gap is between two clusters of data. One cluster is around 50 or 51 and another is around 91 or 92. We may say that 98 is an outlier, as there is a small gap between the four values clustered from 89 to 92 and the single value 98. For School II, there are no large gaps, no outliers, and essentially only one cluster of data. The middle of the distribution is in the low 70s.

We can begin to describe the spread or dispersion of the data by returning to one of our location measures. In this case, let's use the median, 72, as the location of the center of each data set. How far out from this value do we need to move to capture values? For example, if we go out 6 units in each direction, we have scores ranging from 66 to 78. For School I, no values fall in that range of scores. For School II, however, 8 of the 10 scores, or 80% of the values, are within that range. By going out four times that distance, or 24 units in each direction, we cover the scores 48 to 96. This clearly captures all of the School II values (100% of them) yet contains only 80% of the School I scores.

This example helps to conceptualize the idea of determining the dispersion of values within a distribution. We begin with a location point, a value that somehow represents the center of a distribution, and we move out from that location. As we have with measures of location, we need to agree on ways to represent the spread.

One way is to use the method just described for the schools' data. We could determine how far from the center of the data we must move to capture a particular portion of the data. Our report would be something like this: For Data Set A, the middle 10% of the data are captured with an interval 12 units long, the middle 50% of the data require an interval 32 units long, and to get 90% of the data contained the interval is 51 units long.

We could reverse that process, and begin with common lengths for the intervals and determine the percentage of the data set within them. Our report might look like this: For Data Set B, a 10-unit interval contains 18.5% of the data, a 20-unit interval contains 43% of the data, and a 50-unit interval contains 88.2% of the data.
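The second kind of report, where we fix the interval and find the share of data inside it, can be sketched as follows. The function name `fraction_within` is our own, and we apply it to the 21 exam scores from earlier in this section, with the median as the center.

```python
from statistics import median

# The 21 final exam scores used earlier in this section.
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]

def fraction_within(data, center, distance):
    """Return the fraction of values no more than `distance` units from `center`."""
    inside = [x for x in data if abs(x - center) <= distance]
    return len(inside) / len(data)

center = median(scores)                      # 43 for these scores
near = fraction_within(scores, center, 6)    # scores from 37 to 49
wide = fraction_within(scores, center, 24)   # scores from 19 to 67

print(round(near, 3), wide)
```

Here 8 of the 21 scores fall within 6 units of the median, and all 21 fall within 24 units.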

One measure of spread that resembles this in some ways is called the five-number summary. It anchors the location of the data with the median and then moves out in 25-percentile increments from the median. Thus, the five-number summary reports the five values that correspond to the 0th percentile, the 25th percentile, the 50th percentile, the 75th percentile, and the 100th percentile in an ordered data set. For the schools' data shown previously, here are the five-number summaries:

| School | 0th percentile | 25th percentile | 50th percentile | 75th percentile | 100th percentile |
|---|---|---|---|---|---|
| I | 47 | 51 | 72 | 91 | 98 |
| II | 65 | 67 | 72 | 77 | 79 |

There are two important points to make:

1. Certain names have become standard for each of these values. We know one already, the median. The smallest and largest values are, appropriately, called the lower extreme and the upper extreme, respectively. The 25th percentile is called the lower quartile and the 75th percentile is called the upper quartile. The last two values also can be referred to as the lower hinge and the upper hinge.
2. We already know how to determine the median and the extreme values. To find the hinges, return to your ordered list, the one you used to determine the median, and draw a line to split the data set into two equal parts. If the original data set contained an even number of values, the line is drawn between the two middle values in the data set. If the data set held an odd number of values, the line is drawn through the middle value so that it is in neither of the two parts created. Now, to find the lower hinge, determine the median of the lower portion of data. Likewise for the upper hinge. You will again have to use one method if a portion of data has an even number of elements and another method if it has an odd number of elements, just as you did to determine the median of the data set.

Return once again to the scores from a college statistics exam and generate the five-number summary for that set. The stem plot for that data provides an already-ordered data set.

The median is the middle value, the eleventh from either extreme. The median value is 43. This set has an odd number of elements (n = 21) so our line goes through 43. We are left with two portions of data, each with ten elements. Determine the median of each portion, using the method for an even-numbered data set. The lower hinge is between 36 and 37. Thus, 36.5 is the lower hinge. The upper hinge is between 51 and 52. Therefore, 51.5 is the upper hinge. Our five-number summary is 27, 36.5, 43, 51.5, 65.
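The hinge procedure described above can be sketched in Python. This is one possible implementation, assuming at least two data values; the function name `five_number_summary` is our own. It follows the rule given in this section: when the number of values is odd, the middle value belongs to neither half.

```python
from statistics import median

def five_number_summary(data):
    """Return (lower extreme, lower hinge, median, upper hinge, upper extreme)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the dividing line
    upper = ordered[half + n % 2:]    # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])

# The 21 final exam scores used earlier in this section.
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]

summary = five_number_summary(scores)
print(summary)
```

Be aware that statistical software often uses other quartile conventions, so a calculator or spreadsheet may report slightly different values for the 25th and 75th percentiles than the hinges found this way.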

How is this information helpful in describing the spread in the distribution of the data set? Most importantly, it provides key values from which we can calculate the spread in various parts of the distribution. There are four descriptions of spread we can determine by subtracting pairs of values from the five-number summary:

1. The range of a distribution is the difference between the lower and upper extremes.
2. The midspread of a distribution is the difference between the upper hinge and the lower hinge. The midspread describes the location of the middle 50% of a distribution.
3. The lowspread of a distribution is the difference between the lower extreme and the median. The lowspread describes the location of the lower 50% of a distribution.
4. The highspread of a distribution is the difference between the upper extreme and the median. The highspread describes the location of the upper 50% of a distribution.

From the five-number summary for the exam scores from a college statistics course, we can determine that the range is 38, the midspread is 15, the lowspread is 16, and the highspread is 22. It is most helpful to represent these values along an accurately drawn scale, as shown here.
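The four subtractions are simple enough to do by hand, but a short sketch makes the pairing of values explicit. The variable names are ours; the five values are the five-number summary of the exam scores.

```python
# Five-number summary of the exam scores, as found in the text.
low, lower_hinge, med, upper_hinge, high = 27, 36.5, 43, 51.5, 65

spreads = {
    "range": high - low,                      # upper extreme minus lower extreme
    "midspread": upper_hinge - lower_hinge,   # upper hinge minus lower hinge
    "lowspread": med - low,                   # median minus lower extreme
    "highspread": high - med,                 # upper extreme minus median
}

print(spreads)
```

Notice that the lowspread and the highspread always add up to the range.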

This visual representation can help us interpret the numerical values we have calculated. The spread of the data appears to be fairly consistent throughout most of the distribution. The lowspread and midspread are close in value, the lower three quartiles have similar spread, and the median is shifted only slightly left from the center of the middle 50% of the distribution. There is an exception in the upper 25% of the data. The exam score 65, perhaps considered an outlier, pulls, or spreads out, the upper end of the distribution. The highspread is greater than either the lowspread or the midspread, and the range of scores in the top 25% of the distribution is greater than for any of the other quartiles of data.

By enhancing this visual representation a bit further, we can create what Tukey called a box-and-whiskers plot. As illustrated here, the five-number summary, together with a scale of values, is used to create a two-dimensional representation of the distribution. The box represents the middle 50% of the data, ranging from the lower hinge to the upper hinge. In the box, the median is represented as a segment that divides the location of the middle 50% of the data into two quartiles. From each end of the box, whiskers extend to the extreme values. Each whisker therefore represents the location of 25% of the data. We will eventually use additional techniques to modify how the locations of the lowest and highest quartiles are represented.

Here is a back-to-back box-and-whiskers plot to represent the number of home runs hit in the American and National Leagues, two data sets we previously represented with back-to-back stem plots.

With back-to-back box plots, we not only have a visual representation of each distribution, we have a visual tool for meaningful comparisons of distributions. Based on the box plots of the two distributions, we can describe the distributions individually and we can identify similarities and differences between the two distributions.

It is important to consider the strengths and weaknesses of a box-and-whiskers plot for describing a data set. Unlike line plots and stem plots, called visual displays, box plots do not preserve each element of a data set. We can determine a distribution's five-number summary (an example of a numeric summary of a distribution) from a box plot, but cannot resurrect every value in a distribution. Likewise, from a box plot there is no way to determine how many individual values are contained in a distribution. Each box plot provides a visual summary of a distribution, not a visual display of each observed value in a distribution. A box plot provides a better picture of the extremes of a distribution as compared to a stem plot, so box plots are particularly effective in representing characteristics of distributions that are not symmetric.

A way to enhance the visual representation provided in a box plot is to modify the creation of the whiskers. Rather than draw a segment from a hinge to an extreme point, which may lead to an incorrect conclusion that a data set has values all along the whisker, we will establish a more specific procedure:

1. Use the five-number summary to determine the midspread, also called the interquartile range.
2. Multiply the midspread by 1.5.
3. From each of the two hinges, move out a distance equal to the value calculated in Step #2, that is, move out 1.5 midspreads. These locations mark the two inner fences of the distribution.
4. Now determine the most extreme values in your data set that remain on or within each inner fence. Extend each whisker, as a solid segment, from a hinge to that mark.
5. Another pair of marks, called the outer fences, are located a distance of 3 midspreads from each hinge. Mark each data value between an inner and outer fence with an asterisk (*) or a closed circle (•). Mark each data value beyond either outer fence with an open circle (o).

Here is a data set and a box plot created from it that employs this procedure for representing extremes in the distribution. The values represent daily high temperatures in degrees Fahrenheit for Bloomington-Normal, Illinois, for the first three weeks of January 1994.

The five-number summary for this data set is -11, 6, 9, 13, 36. From the upper and lower hinges we determine the midspread to be 7. We then calculate 1.5 times the midspread as 10.5, and 3 times the midspread as 21. This makes the inner fences -4.5 (6 - 10.5) and 23.5 (13 + 10.5) and the outer fences -15 (6 - 21) and 34 (13 + 21). We use all of this information to create the box plot shown below.

The two closed circles in the lower tail of the distribution represent the temperatures -11 degrees and -9 degrees. They are between the inner and outer fences in the lower portion of the distribution. The two closed circles in the upper tail of the distribution represent the temperatures 24 degrees and 31 degrees. They are between the inner and outer fences in the upper portion of the distribution. The temperature 36 degrees is represented by an open circle in the upper tail, beyond the outer fence. What does this visual representation tell us about the distribution of temperatures in Bloomington-Normal during the first three weeks of January 1994?
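The fence calculations and the marking of extreme values can be sketched as follows. The function names `fences` and `classify` are our own. The treatment of a value that lands exactly on an outer fence is an assumption on our part, since the procedure above only specifies "on or within" for the inner fences.

```python
def fences(lower_hinge, upper_hinge):
    """Return the (lower, upper) inner fences and the (lower, upper) outer fences."""
    midspread = upper_hinge - lower_hinge
    inner = (lower_hinge - 1.5 * midspread, upper_hinge + 1.5 * midspread)
    outer = (lower_hinge - 3 * midspread, upper_hinge + 3 * midspread)
    return inner, outer

def classify(value, inner, outer):
    """Describe where a value sits relative to the fences."""
    if inner[0] <= value <= inner[1]:
        return "within inner fences"
    # Assumption: a value exactly on an outer fence counts as between the fences.
    if outer[0] <= value <= outer[1]:
        return "between inner and outer fences"
    return "beyond outer fence"

# Hinges 6 and 13 come from the temperature data's five-number summary.
inner, outer = fences(6, 13)
labels = {t: classify(t, inner, outer) for t in (-11, -9, 24, 31, 36)}

print(inner, outer)
print(labels)
```

Only the five temperatures named in the text are classified here, because the full list of daily temperatures is not reproduced in this section.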

We have discussed how the values in a distribution's five-number summary can be used to help describe the variability in a data set. The five-number summary itself provides a numeric summary of a distribution. By comparing various pairs of values in a five-number summary, we generate other numeric summaries that describe the dispersion of several parts of a distribution: the range, the lowspread, the midspread, and the highspread. Finally, we can represent the five-number summary by creating a box-and-whiskers plot. This provides us a visual summary of a distribution.

There are other ways to describe the spread or dispersion of a data set. One method is based on determining to what degree the values in a data set differ from the mean of that data set. These measurements are called deviations from the mean, or simply deviation scores. Deviation scores are an important component in several methods of describing the variability in a data set. We will explore those methods with the following data set.

The mean number of suicides during this ten-year period is 12.8 suicides. In the table below, a deviation score represents how each data value differs from the mean.

To represent an average deviation, we could sum the deviations and divide by the number of values. When we try this with the values in the third column above we get a sum of 0. This is, in fact, the sum we will get for any set of deviations from the mean, because of the arithmetic we carry out to determine the mean. So instead of determining the average deviation from those shown in the third column, we use absolute deviation. For each data value, absolute deviation is just the absolute value of the actual deviation from the mean of the distribution. The last column in the table above shows the absolute deviations. We now sum this column and divide by the number of values to get a mean deviation (that is, the arithmetic average of the absolute deviations) of 3.6 suicides.
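To see why the deviations from the mean always sum to 0, we can write out the arithmetic. The notation here is our own choice: $x_1, \dots, x_n$ are the data values and $\bar{x}$ is their mean. The second line states the mean deviation in the same notation.

```latex
% The deviations from the mean sum to zero, because \sum x_i = n\bar{x}.
\sum_{i=1}^{n}(x_i-\bar{x})
  = \sum_{i=1}^{n}x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0,
\qquad\text{since } \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i .

% Mean deviation: the arithmetic average of the absolute deviations.
\text{mean deviation} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i-\bar{x}\rvert
```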

When comparing distributions of similar data, we can use the mean deviation to compare the variability among the distributions. If we have 10 years of suicide data from various counties in Illinois, we can use the mean deviation of each data set to compare the variability in suicides among the counties. Here is another example. Which city has less variation in monthly high temperature?

The mean deviation for Franklin is 9.33 degrees and for Jackson it is 19.75 degrees. Based on the mean deviation as a measure of variability in a distribution, Franklin's monthly high temperature shows less variation.

Another way to describe variation in a distribution is to use the variance. The variance is the mean of the squared deviations. When we determined the mean deviation, we avoided a zero sum for the deviations by using absolute value. Here, we square each deviation in order to avoid a zero sum. Thus, to determine the variance of a data set, we find the mean, calculate each deviation from the mean, square each deviation, and then find the arithmetic average of these squared deviations. For the data set with suicide information from McLean County, here is the calculation of the variance.

The last column represents the squared deviations. The column sums to 153.8. This is often referred to as the sum of squares, or the sum of the squared deviations. To find the mean of this column, we divide 153.8 by 10, resulting in a variance of 15.38.

In order to make the units of measure consistent, we must reverse the squaring process we used to generate the variance. Thus, we take the square root of the variance, SQRT(15.38) = 3.92. The square root of the variance, a value expressed in the same units of measure as the values in the data set, is called the standard deviation. It provides yet another measure of the variability of a distribution. Here are symbolic definitions for variance and standard deviation, together with shortcut formulas for easier calculations.
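The definitions can be written in standard notation as follows. The symbols are the conventional ones and are an assumption on our part: $\mu$ and $\sigma$ for the population mean and standard deviation over $N$ values, and $\bar{x}$ and $s$ for the sample mean and standard deviation over $n$ values. In each line, the second expression is the shortcut form.

```latex
% Population variance and standard deviation (N values, mean \mu)
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2
         = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2,
\qquad \sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation (n values, mean \bar{x})
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
    = \frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2
      - \frac{1}{n}\Bigl(\sum_{i=1}^{n}x_i\Bigr)^{2}\right),
\qquad s = \sqrt{s^2}
```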

Note the distinction between the sample variance and the population variance. To determine the sample variance, we divide the sum of squares by a value one less than the number of values in the data set (n - 1). For the population variance, we divide by the actual number of values in the data set (n). Why the distinction? When we draw a sample from a set of values and determine the variance of that sample, that variance may not represent the true variance of the entire population of values. To allow for that possible error, we divide the sum of squares by a number smaller than n, and therefore calculate a sample variance that is larger than when calculating the population variance.
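To see the effect of the two divisors, we can start from the sum of squares reported earlier for the ten years of McLean County data, 153.8, and divide it both ways. This is a small sketch; the variable names are ours.

```python
from math import sqrt

sum_of_squares = 153.8   # sum of the squared deviations reported in the text
n = 10                   # ten years of data

population_variance = sum_of_squares / n        # divide by n
sample_variance = sum_of_squares / (n - 1)      # divide by n - 1

population_sd = sqrt(population_variance)
sample_sd = sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2))
print(round(population_sd, 2))
```

Dividing by the smaller number always gives the larger variance, and the difference between the two shrinks as the number of values grows.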

Properties of the standard deviation provide us means by which we can further describe the characteristics of a data set. When the data is from a bell-shaped, or normal, distribution of values, the standard deviation helps us describe where chunks of data are located. About 68% of the values in the data set will be within one standard deviation of the mean (that is, from $\bar{x} - s$ to $\bar{x} + s$, where $\bar{x}$ is the mean and $s$ is the standard deviation). About 95% of the values will be within two standard deviations of the mean (that is, from $\bar{x} - 2s$ to $\bar{x} + 2s$), and virtually all values in the set will be within three standard deviations of the mean (that is, from $\bar{x} - 3s$ to $\bar{x} + 3s$).
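These percentages can be checked against an idealized normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later); the helper name `share_within` is our own. Real data sets that are only roughly bell-shaped will match these figures only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: round(share_within(k), 3) for k in (1, 2, 3)}
print(shares)
```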

In the next section, we explore further various characteristics of the shape of a distribution.