Illinois State University Mathematics Department

 MAT 312: Probability and Statistics for Middle School Teachers
 Dr. Roger Day (day@ilstu.edu)

### Visual Displays and Visual Summaries of Data

In the last section, we began to explore ways of analyzing one-variable data sets. Here we start to look more closely at visual and numerical representations of data that can be used as tools for data analysis. We begin by describing and creating visual displays and visual summaries to represent a data set.

Visual Displays of Data

Line Plots

A line plot provides an ordered display of all the values in a data set. Arranged along a scale, each data value is represented by an X, a dot (.), or a similar symbol (such as *, #, or @). Here again are the final exam scores for 21 students enrolled in a college statistics course.

 Final Exam Scores: College Statistics Course
 51 46 31 35 37 51 56 51 43 48 52 33 42 37 27 57 65 36 37 55 42

Arranged in a line plot, the data set looks like this:

A line plot provides information about clusters, gaps, and outliers in the data set.

• Clusters are isolated groups of points.
• Gaps are large spaces between data points.
• Outliers are data values substantially larger or smaller than other data points.

Line plots provide a quick way to organize the elements in a one-variable data set and are generally most effective when the set contains no more than 50 elements.

Summary: Constructing a Line Plot

1. Determine the largest and smallest values in the data set.
2. Create and label a scale that spans these largest and smallest values.
3. For each individual data value, place a distinctive mark (an X, a dot, or a similar symbol) directly above that position on the scale.
4. When more than one element of the set represents the same value, use more than one mark at that position on the scale.
5. The number of distinctive marks on the line plot should be the same as the number of data values in the set.
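
The steps above can be sketched in a few lines of Python. This is only an illustration, assuming whole-number data and a single-character mark; the function name `line_plot` and its `mark` parameter are our own choices, not part of any standard library.

```python
from collections import Counter

def line_plot(values, mark="X"):
    """Return a text line plot: one mark per data value, stacked above a scale."""
    counts = Counter(values)
    low, high = min(values), max(values)        # step 1: smallest and largest values
    positions = range(low, high + 1)            # step 2: one column per scale position
    rows = []
    # Steps 3 and 4: work from the tallest stack down; a column gets a mark
    # on a given level when at least that many data values sit at that position.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(mark if counts[p] >= level else " " for p in positions)
        rows.append(row.rstrip())
    rows.append("-" * len(positions))           # the scale line
    gap = len(positions) - len(str(low)) - len(str(high))
    rows.append(f"{low}{' ' * gap}{high}")      # label only the two ends of the scale
    return "\n".join(rows)

scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]
plot = line_plot(scores)
print(plot)

# Step 5: the number of marks should equal the number of data values.
print("marks:", plot.count("X"), "data values:", len(scores))
```

Printed in a fixed-width font, the stacks of marks make clusters and gaps easy to spot, and the final count gives a quick check that no value was dropped.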

Stem-and-Leaf Plots

John Tukey, the mathematician responsible for developing Exploratory Data Analysis during the 1970s, invented the stem-and-leaf plot. A stem-and-leaf plot provides a graphical way to represent an entire data set. As the name indicates, we represent each data point using two parts, a stem and a leaf. Consider the following set of test scores.

 Test Scores
 92 87 91 85 76 87 98 90 70 54

To create a stem-and-leaf plot, we use the tens digit of each score as a stem and the units digit as a leaf. In this case, the test scores can be represented as 9 | 2, 8 | 7, 9 | 1, and so on.

We now want to create a tree that contains all the stems and leaves. We need to first decide how to order the stems, either ascending or descending. Suppose we order them ascending. The tens digit 5 is the smallest and the largest is 9. Our collection of stems displayed in ascending order looks like this:

Notice that although there was no test score represented with a stem of 6, we included that in the progression of stems in order to maintain the equal increments between stems.

What about the leaves? How do we arrange these in relation to the stems? We typically order them from least to greatest horizontally (although ordering is not initially necessary), and we include multiple copies of the same value wherever they appear. Here are the ten test scores arranged in a stem-and-leaf plot.
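
For readers who want to check a hand-drawn plot, here is a minimal Python sketch of the construction just described. It assumes non-negative whole-number scores with the tens digit as the stem; the function name `stem_and_leaf` is our own.

```python
def stem_and_leaf(values):
    """Return the lines of an ordered stem-and-leaf plot (stem = tens, leaf = units)."""
    stems = {}
    for v in sorted(values):                    # sorting first gives ordered leaves
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    lines = []
    # List every stem from smallest to largest, even a stem with no leaves,
    # so the increments between stems stay equal.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

test_scores = [92, 87, 91, 85, 76, 87, 98, 90, 70, 54]
plot_lines = stem_and_leaf(test_scores)
print("\n".join(plot_lines))
print("Key: 9 | 2 = 92")
```

Notice that the stem 6 is printed with no leaves and that the repeated score 87 contributes two leaves to the stem 8.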

The stem-and-leaf display has several useful characteristics.

• It provides a visual representation from which the entire numerical data set can be reclaimed.
• As a visual representation, it can provide a picture of the three important components of a distribution: location, spread, and shape.

Consider the following example. The values in the plot represent the percentage of the total number of student loans that were in default in each of the 50 states and the District of Columbia for some year.

Refer to this plot to comment on the location, spread, and shape of the distribution of values:

1. What data values characterize the distribution?
2. How are the values dispersed?
3. What shape does the distribution take on?

Returning to our general discussion of stem-and-leaf plots, we see in the last example that the stems can be chosen to fit the data. For the default-student-loans data, the stems are units digits. What if we had used the tens digit to create the stems? Here is that plot:

Notice that we dropped, or truncated, the digit in the tenths position of each value. In a stem-and-leaf plot, leaves typically are single digits. Rather than truncating, we could have rounded the values to retain more accuracy, but that does little to improve the overall usefulness of the graphical representation.

By using the tens digits as the stems, we have collapsed the plot to the point that it provides little value in representing the shape of the distribution. As a rule, we want to have from 5 to 20 stems in a stem-and-leaf plot. In a situation where data values range from 10 to 16, we may increment the stems by 1, using 10 |, 11 |, 12 |, and so on. If there are enough data elements, we can also stretch the stems into finer increments. In this case, we may use two-line stems, in which each line covers half as many values as a one-line stem. Here is what the two-line stems look like:

In this plot, tenths digits 0 through 4 are leaves for stems that are numbers; tenths digits 5 through 9 are leaves where the stems are dots. Here, the values 14.2, 14.5, and 14.8 would be represented as follows:

Dots (bullets) or some other non-numeric symbol can be used as stems to "spread out" the data for better views of a distribution. As another example, consider the ordered data set below, showing heights in centimeters of eighth grade girls:

 Heights (cm) of 8th-Grade Girls
 141 143 143 145 145 146 146 148 150 151 152 153 153 154 155 156 157 157 157 157 158 159

Here are three different stem-and-leaf plots of this data set.

The use of five-line stems (plot III) allows us to represent values from 140 to 159, where units digits 0 and 1 are leaves of the numeric stem, units digits 2 and 3 are leaves of the second line, units digits 4 and 5 are leaves of the third line, units digits 6 and 7 are leaves of the fourth line, and units digits 8 and 9 are leaves of the fifth line. With each data set, we need to determine the most effective representation to use.
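
The following Python sketch shows one way to generate such plots for the heights data. It is an illustration under stated assumptions: we take the three plots to use one-line, two-line, and five-line stems, we label every continuation line with a dot, and the function name `split_stem_plot` and its `lines_per_stem` parameter are our own.

```python
def split_stem_plot(values, lines_per_stem):
    """Stem-and-leaf plot with each tens-based stem split into 1, 2, or 5 lines."""
    if lines_per_stem not in (1, 2, 5):
        raise ValueError("lines_per_stem must be 1, 2, or 5")
    leaves_per_line = 10 // lines_per_stem      # 10, 5, or 2 units digits per line
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault((stem, leaf // leaves_per_line), []).append(leaf)
    out = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for part in range(lines_per_stem):
            # The first line of each stem shows the number; later lines show a dot.
            label = str(stem) if part == 0 else "."
            leaves = " ".join(map(str, rows.get((stem, part), [])))
            out.append(f"{label:>3} | {leaves}".rstrip())
    return out

heights = [141, 143, 143, 145, 145, 146, 146, 148, 150, 151, 152,
           153, 153, 154, 155, 156, 157, 157, 157, 157, 158, 159]
for n in (1, 2, 5):
    print(f"{n}-line stems (key: 14 | 1 = 141 cm)")
    print("\n".join(split_stem_plot(heights, n)))
    print()
```

Comparing the three printouts side by side shows the trade-off: one-line stems give only two rows for these data, while five-line stems spread the same 22 values over ten rows.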

There are shortcomings to stem-and-leaf plots. There is a limit to the size of the data set. A stem-and-leaf plot with very few data values, fewer than 10 perhaps, provides us little information about the distribution. On the other hand, if a data set has more than 50 or 60 values, a stem-and-leaf plot may also be ineffective. It may be cumbersome to limit the number of stems used and at the same time reveal meaningful information about the shape of the distribution. As we discuss other ways to represent data sets, we can make further comparisons.

Summary: Constructing a Stem-and-Leaf Plot

1. Determine the largest and smallest values in the data set.
2. Decide on an effective way to create the stems (that is, using one digit, two digits, using a two- or three-line stem, and so on).
3. List the stems in a column from least to greatest or from greatest to least.
4. Create and arrange the leaves associated with each stem; each leaf represents an element from the data set.
5. For an ordered plot, arrange the leaves in order from least to greatest.
6. Include a key to indicate what is represented by a stem and leaf, for example, 2 | 1 = 2.1.

A back-to-back stem-and-leaf plot can be used to compare two data sets. In the data set below, we compare league-leading numbers of home runs in the National League to league-leading numbers of home runs in the American League for a given season. Notice the use of two-line stems and the way that the leaves come out from the stems on the left side of the tree.

• What benefit is there in creating and using a back-to-back plot?
• What information about the differences between the two data sets is revealed through the plot?
• What more would you like to know that is not apparent by examining the back-to-back plot?

Visual Summaries of Data

Histograms

Another way to visually represent the distribution of a data set is to use a histogram. Histograms differ from the two previous types of visual representations in that histograms do not preserve every element of a data set. With a line plot or a stem-and-leaf plot, every element of a data set can be recovered in its original numeric form. This is not the case with histograms or with box-and-whisker plots, another visual summary of a single-variable data set that we will soon discuss.

Here are two examples. On the left is an absolute frequency histogram and on the right is a relative frequency histogram. What distinguishes the two types of frequency histograms?

The key terms are absolute and relative. An absolute frequency histogram shows the actual number of cases within each measurement class (each bar of the histogram). There were 60 students who passed the exam on the first attempt, and 25 students who required six attempts. The number of students is determined by the height of a measurement class as read on the vertical scale. To determine the total number of students represented by the absolute frequency histogram, we can sum the number associated with each measurement class.

The relative frequency histogram has the same shape as the first, but the vertical scale shows values ranging from 0 to no more than 1. The vertical scale on a relative frequency histogram shows the portion of some whole group represented by each measurement class. Here, for example, relative to the entire group of students represented, 0.10 or 10% of them required five attempts to pass the exam. The value 0.10 or 10% is determined just as when reading an absolute frequency histogram: read from the vertical scale to determine the height of each measurement class. What must be true about the sum of the heights of all measurement classes in a relative frequency histogram? Why is that?

Let's return to a previously generated data set, 21 exam scores from a college statistics course, and create a relative frequency histogram of the data set.

 Final Exam Scores: College Statistics Course
 51 46 31 35 37 51 56 51 43 48 52 33 42 37 27 57 65 36 37 55 42

A stem plot from the data set will be useful for creating a histogram.

We first need to determine how many measurement classes (or buckets) to include in the histogram and the common width of each measurement class. As described above, each bar in a histogram represents a measurement class. Typically we use from 5 to 20 measurement classes and we choose them so that each data point is in exactly one measurement class.

With the stem plot as a guide, we might choose to use 5 measurement classes. Each stem spans ten values, so our measurement classes would be 20 to 29, 30 to 39, 40 to 49, 50 to 59, and 60 to 69.

We now count the number of scores in each measurement class, determine the total number of elements in the data set, and calculate the relative frequency of data elements in each measurement class. A tally sheet provides an effective tool for organizing a count, although for our case we can simply refer to the stem plot and count the leaves for each stem.
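
The count and the relative frequencies can also be checked with a short Python sketch. It assumes the five measurement classes chosen above (20 to 29 through 60 to 69); the variable names are our own.

```python
scores = [51, 46, 31, 35, 37, 51, 56, 51, 43, 48, 52,
          33, 42, 37, 27, 57, 65, 36, 37, 55, 42]
class_starts = [20, 30, 40, 50, 60]   # lower end of each measurement class
width = 10                            # each class spans ten values

# Tally: each score lands in exactly one class, found from its tens digit.
counts = {start: 0 for start in class_starts}
for s in scores:
    counts[(s // width) * width] += 1

total = len(scores)
relative = {start: counts[start] / total for start in class_starts}

for start in class_starts:
    print(f"{start} to {start + width - 1}: "
          f"count {counts[start]:>2}, relative frequency {relative[start]:.3f}")
print("total:", total)
```

The largest relative frequency, 7 out of 21 or about 0.333, is a useful guide when choosing the top of the vertical scale.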

We use this information to build the relative frequency histogram. Here are some decisions that remain.

• What physical width should we use for each bar in the histogram? This depends on the paper width and the number of measurement classes in your histogram.
• How will we create the horizontal scale? Here, values 20, 30, 40, 50, 60, and 70 represent the measurement class boundaries. We will use the convention often used for histograms, including those created with the TI-83, that when a data value falls on a measurement class boundary, it is included in the second, or greater of the two, measurement classes. On the histogram below, for instance, if a score of 30 had been in the data set, it would be included in the second measurement class.
• What scale should we use for the vertical axis? We know this value will never be greater than 1. Why is that? In the histogram shown here, for the vertical scale I decided to use 10 intervals and I chose the maximum value, 0.35, for easy division without wasting space vertically.
• Are we done? We need to complete the histogram by including appropriate labels for the measurement class boundaries and for the horizontal and vertical axes.
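
The boundary convention described above amounts to treating each measurement class as including its lower boundary but not its upper one. A minimal Python sketch, assuming the boundaries 20, 30, ..., 70 used here (the function name `class_index` and its parameters are our own):

```python
def class_index(value, lower=20, width=10, n_classes=5):
    """Return the 0-based index of the measurement class that contains value.

    Class k covers lower + k*width up to, but not including, lower + (k+1)*width,
    so a value on a boundary falls in the greater of the two classes.
    """
    if not lower <= value < lower + width * n_classes:
        raise ValueError(f"{value} is outside the histogram's scale")
    return int((value - lower) // width)

for v in (29.9, 30, 39, 40):
    k = class_index(v)
    print(f"{v} -> measurement class {k + 1}, "
          f"from {20 + 10 * k} up to but not including {20 + 10 * (k + 1)}")
```

With this rule a score of 30 is counted in the second measurement class, and every value on the scale belongs to exactly one class.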

What does this visual display tell us? We can use the histogram, or its associated tally sheet, to address questions such as the following:

• What portion of the class earned scores of at least 40?
• What portion of the class earned scores less than 60?
• What portion of the class earned scores from 30 to 59?

Summary: Constructing Relative Frequency Histograms

• Determine the extreme values of the data set.
• Create from 5 to 20 equal intervals, or measurement classes, that together span the smallest through the largest data values.
• The measurement classes must be chosen so that each measurement (data value) is in exactly one class.
• In general, it is better to use a relatively small number of measurement classes when you have a small number of data values and to use a relatively large number of measurement classes with a data set with a large number of elements.
• Record each value in one and only one measurement class, count the total number of values in the set, and determine the relative frequency of data values in each measurement class.
• Scale and label the vertical and horizontal axes.
• Create a vertical scale that best uses the available space. Use the measurement class with the largest relative frequency as a guide.
• Divide the horizontal axis into the chosen number of measurement classes, each of equal width.
• Build each measurement class rectangle. Each width should be equal and each height should rise to the relative frequency for that class.

Relative frequency histograms provide a straightforward way to visualize the values of a data set. The procedure for creating an absolute frequency histogram is similar to the procedure just described, without the additional steps required to compute relative frequencies. The vertical scale should be constructed to represent actual values rather than relative frequencies.

Compare: Stem Plots and Relative Frequency Histograms.

• Stem plots preserve numeric elements of a data set.
• Stem plot divisions (the stems) are determined by our base-10 number system.
• Histograms allow for user choice in constructing the class widths.
• Histograms take more time to construct by hand than stem plots.
• Stem plots, in general, are a better tool for data sets with small numbers of elements, no more than 50 to 60 values.

A third type of histogram is called a cumulative frequency histogram. It differs from the other two types of histograms in that each measurement class shows the cumulative total of cases through that measurement class. Thus, instead of representing only the cases in one measurement class, each vertical bar shows all of the cases in or below that measurement class. Here is a cumulative frequency histogram for the placement exam data.

Notice that the vertical scale shows relative values. The third bar from the left shows that 60% of the students required no more than three attempts to pass the exam. We could also have used absolute values on the vertical scale to show the actual number of students. You should be able to recover the values from a relative or absolute frequency histogram when shown a cumulative frequency histogram.
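
To see how cumulative values are built and how the original frequencies can be recovered from them, consider this Python sketch. Because the individual placement exam counts are not listed in these notes, the sketch assumes the class counts of the 21 final exam scores instead (1, 7, 5, 7, and 1 for the classes 20 to 29 through 60 to 69).

```python
from itertools import accumulate

# Assumed data: class counts for the 21 final exam scores, not the placement exam.
labels = ["20-29", "30-39", "40-49", "50-59", "60-69"]
counts = [1, 7, 5, 7, 1]
total = sum(counts)

# Each cumulative value is the running total through that measurement class.
cumulative = list(accumulate(counts))
cumulative_rel = [c / total for c in cumulative]

# Recover the per-class counts by subtracting consecutive cumulative totals.
recovered = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for label, c, r in zip(labels, cumulative, cumulative_rel):
    print(f"through {label}: {c:>2} scores, cumulative relative frequency {r:.3f}")
print("recovered counts:", recovered)
```

The last cumulative relative frequency is always 1, and the subtraction step is exactly what we do by eye when reading the height difference between adjacent bars.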

In addition to the visual representations described already, we can also use bar graphs and circle graphs to represent data sets. A bar graph is very similar to an absolute frequency histogram, for it typically shows the actual number of items in each of several categories. A circle graph (or pie chart), as the name implies, is circular in shape. With the circle representing the whole, each section of the circle represents a part of the whole. Therefore, a circle graph is very similar to a relative frequency histogram. Here are a bar graph and a circle graph for the placement exam information used to introduce histograms.

Bar graphs and circle graphs are often used to represent nominal or ordinal data, such as the number of students in each of several mathematics courses (nominal data) or the number of sweaters in each of several sizes (ordinal data).

As mentioned above, box-and-whisker plots provide yet another visual summary of a data set. Before looking at box-and-whisker plots, we need to consider some numerical representations of data. That is the focus of the next section of notes on one-variable data analysis.