Illinois State University Mathematics Department

MAT 312: Probability and Statistics for Middle School Teachers

Dr. Roger Day (day@math.ilstu.edu)

### Putting It All Together: Location, Spread, and Shape For One-Variable Data Sets

Throughout the previous sections, we have described and illustrated characteristics of location, spread, and shape for one-variable data sets. Here, we consider all three of these characteristics as we explore and analyze specific data sets.

Example #1

Suppose a manufacturing process creates pipes of length 50 cm. Because the process isn't perfect, some of the pipes may measure less than 50 cm and some may measure more than 50 cm. Every day during the manufacturing process, a random sample of pipes is pulled from the entire group that has been manufactured and their lengths are analyzed. Here are the lengths from a random sample of 30 pipes pulled from those manufactured on a recent day. Assume that the sample pipe lengths are from a population of pipe lengths that has a normal distribution.

The pipe buyer has been assured by the manufacturer that 95% of all pipe lengths are within 0.02 cm of 50 cm. Does this sample support that claim? Write a concise paragraph to respond to this question and include evidence to support your claim.

 50.032 49.993 49.928 50.037 50.017 49.988 50.009 49.988 50.018 49.99 50.013 50.018 49.995 50 50.04 49.973 50.002 50.024 50.034 50.055 50.011 50.034 49.993 49.986 49.958 49.975 49.999 50.023 50.044 50

The mean of the sample is 50.0059 cm with a sample standard deviation of 0.0273 cm. Here are those values represented within a picture of the normal distribution.

The manufacturer claims that 95% of the lengths fall in the range from 49.98 cm to 50.02 cm. From our knowledge of the normal distribution, about 95% of the values lie within two standard deviations of the mean, so the claim corresponds to a mean of 50.00 cm and a standard deviation of about 0.01 cm. However, the sample has a mean of 50.0059 cm and a sample standard deviation of 0.0273 cm. If the population of pipe lengths is normally distributed with those values, about 95% of the pipe lengths fall in the range from 49.9513 cm to 50.0605 cm (the mean plus or minus two standard deviations), an interval nearly three times as wide as the one claimed.

The sample suggests that the manufacturing process has more variation than the manufacturer claims, so it does not support the claim.
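The arithmetic behind this conclusion can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the variable names (`lengths`, `TARGET`, `TOLERANCE`) are our own labels, not part of the original problem.

```python
from statistics import mean, stdev

# Pipe lengths (cm) from the sample of 30 listed above.
lengths = [
    50.032, 49.993, 49.928, 50.037, 50.017, 49.988, 50.009, 49.988,
    50.018, 49.990, 50.013, 50.018, 49.995, 50.000, 50.040, 49.973,
    50.002, 50.024, 50.034, 50.055, 50.011, 50.034, 49.993, 49.986,
    49.958, 49.975, 49.999, 50.023, 50.044, 50.000,
]

TARGET = 50.0     # cm, nominal pipe length
TOLERANCE = 0.02  # cm, half-width of the manufacturer's claimed 95% band

x_bar = mean(lengths)
s = stdev(lengths)  # sample standard deviation (divides by n - 1)
low, high = x_bar - 2 * s, x_bar + 2 * s

# Direct count of sample pipes that fall inside the claimed band.
within = sum(1 for x in lengths if abs(x - TARGET) <= TOLERANCE)

print(f"mean = {x_bar:.4f} cm, s = {s:.4f} cm")
print(f"mean +/- 2s: {low:.4f} cm to {high:.4f} cm")
print(f"{within} of {len(lengths)} pipes are within {TOLERANCE} cm of {TARGET} cm")
```

The direct count gives a second line of evidence: only 17 of the 30 sampled pipes (about 57%) are within 0.02 cm of 50 cm, well short of 95%.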

Example #2

Here are data about the land speed of 32 different animals. Describe the location, spread and shape of the distribution of these data.

Land Speed of Various Animals, in Miles Per Hour

| Animal | Speed | Animal | Speed | Animal | Speed | Animal | Speed |
|---|---|---|---|---|---|---|---|
| Cheetah | 70 | Coyote | 43 | Mule Deer | 35 | Human | 28 |
| Pronghorn Antelope | 61 | Gray Fox | 42 | Jackal | 35 | Elephant | 25 |
| Wildebeest | 50 | Hyena | 40 | Reindeer | 32 | Black Mamba Snake | 20 |
| Lion | 50 | Zebra | 40 | Giraffe | 32 | 6-Lined Race Runner | 18 |
| Thomson's Gazelle | 50 | Mongolian Wild Ass | 40 | White-Tailed Deer | 30 | Wild Turkey | 15 |
| Quarter Horse | 48 | Greyhound | 39 | Wart Hog | 30 | Squirrel | 12 |
| Elk | 45 | Whippet | 36 | Grizzly Bear | 30 | Pig (domestic) | 11 |
| Cape Hunting Dog | 45 | Rabbit (domestic) | 35 | Cat (domestic) | 30 | Chicken | 9 |

Most of these measurements are for maximum speeds over approximate quarter-mile distances. Exceptions include the lion and the elephant, whose speeds were clocked in the act of charging; the whippet, which was timed over a 200-yard run (of 13.6 seconds); and the black mamba and six-lined race runner, which were measured over very small distances.

Source: The World Almanac and Book of Facts, 1994, p. 175.

On the left below is a TI-83 screen shot showing both a modified box-and-whiskers plot and a histogram for the data shown above. On the calculator screen, the horizontal scale ranges from 5 to 75 miles per hour (mph), with tick marks every 5 units. The vertical scale ranges from 0 to 15, with tick marks every 2 units. The two additional screen shots show numerical statistics calculated by the TI-83.

The median and mean of the data set are very close, at or near 35 mph. The mode is 30 mph. Based on these numerical summaries and the visual summaries (box plot and histogram), we can safely anchor the distribution at 35 mph. Because the difference between the mean and median is so close to 0, there is little skewness in this distribution.
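As a quick check on these measures of location, here is a minimal sketch using Python's standard `statistics` module. The list name `speeds` is our own; the values are the 32 speeds from the table.

```python
from statistics import mean, median, mode

# Land speeds (mph) of the 32 animals in the table above.
speeds = [70, 61, 50, 50, 50, 48, 45, 45, 43, 42, 40, 40, 40, 39, 36, 35,
          35, 35, 32, 32, 30, 30, 30, 30, 28, 25, 20, 18, 15, 12, 11, 9]

center_mean = mean(speeds)
center_median = median(speeds)
center_mode = mode(speeds)  # most frequent value

print(f"mean = {center_mean:.2f} mph")
print(f"median = {center_median} mph")
print(f"mode = {center_mode} mph")
print(f"mean - median = {center_mean - center_median:.2f} mph")
```

The mean and median differ by less than 0.2 mph, which is the basis for describing the distribution as having little skewness.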

The 5-number summary for the data is 9-29-35-44-70, resulting in a range of 61, a lowspread of 26, a midspread of 15, and a highspread of 35. These values indicate that the data are more highly concentrated in the middle of the range of speeds and more spread out at the ends of the distribution. This is supported by the histogram as well, for it shows more values in the middle of the distribution (24 of the 32 values, or 75% of the data, range from 25 to 54 mph) than at either end of the distribution. The TI-83 modified box plot indicates one outlier value, at 70 mph, the speed of the cheetah. You can verify that this speed is more than 1.5 midspreads beyond the 75th percentile.
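The verification suggested above can be sketched in a few lines of Python. We assume here that each quartile is the median of the lower or upper half of the sorted data, a convention that reproduces the 5-number summary reported above. The function name `five_number_summary` is our own.

```python
from statistics import median

# Land speeds (mph) of the 32 animals in the table above.
speeds = [70, 61, 50, 50, 50, 48, 45, 45, 43, 42, 40, 40, 40, 39, 36, 35,
          35, 35, 32, 32, 30, 30, 30, 30, 28, 25, 20, 18, 15, 12, 11, 9]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); quartiles are medians of the halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values below the median position
    upper = data[len(data) - half:]   # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

minimum, q1, med, q3, maximum = five_number_summary(speeds)

lowspread = med - minimum
midspread = q3 - q1
highspread = maximum - med

# 1.5-midspread rule for flagging outliers.
lower_fence = q1 - 1.5 * midspread
upper_fence = q3 + 1.5 * midspread
outliers = [x for x in speeds if x < lower_fence or x > upper_fence]

print("5-number summary:", minimum, q1, med, q3, maximum)
print("lowspread, midspread, highspread:", lowspread, midspread, highspread)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

The upper fence is 44 + 1.5(15) = 66.5 mph, so the cheetah at 70 mph is the only value flagged. The pronghorn antelope at 61 mph falls inside the fence.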

Although it is not perfectly normal, the distribution is somewhat mound-shaped. A perfectly normal distribution for a population with a mean of 35.19 and a standard deviation of 13.82 would have 22 of 32 values (approximately 68% of the data) within one standard deviation of the mean, in the range from 21 mph to 49 mph. Here, there are 21 such values. Likewise, for a perfectly normal distribution, 30 values (approximately 95% of the data) would be within two standard deviations of the mean, in the range from 7 mph to 63 mph. Here, 31 values fall in that range.
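The comparison with the 68% and 95% benchmarks can be reproduced with the sketch below. It assumes the standard deviation of 13.82 is the population form (dividing by n), since `pstdev` gives that value for these data. The helper name `count_within` is our own.

```python
from statistics import mean, pstdev

# Land speeds (mph) of the 32 animals in the table above.
speeds = [70, 61, 50, 50, 50, 48, 45, 45, 43, 42, 40, 40, 40, 39, 36, 35,
          35, 35, 32, 32, 30, 30, 30, 30, 28, 25, 20, 18, 15, 12, 11, 9]

mu = mean(speeds)
sigma = pstdev(speeds)  # population standard deviation (divides by n)

def count_within(values, center, half_width):
    """Count the values no farther than half_width from center."""
    return sum(1 for x in values if abs(x - center) <= half_width)

n = len(speeds)
within_1 = count_within(speeds, mu, sigma)
within_2 = count_within(speeds, mu, 2 * sigma)

print(f"mean = {mu:.2f}, sigma = {sigma:.2f}")
print(f"within 1 sd: {within_1} of {n} (normal model: about {0.68 * n:.1f})")
print(f"within 2 sd: {within_2} of {n} (normal model: about {0.95 * n:.1f})")
```

The observed counts of 21 and 31 sit close to the normal-model values of roughly 22 and 30, which supports the "somewhat mound-shaped" description.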

Example #3

The 2000 presidential election was hotly contested and highly controversial. A Federal Election Commission website provides a variety of data related to this election.

• From this website, generate a table to show the percent of popular vote, by state (including the District of Columbia), earned by Al Gore and George W. Bush. Round percentages to the nearest hundredth of a percent.

Analyze the percentages calculated and present a report of your analysis.

• Create at least two different visual displays and two different visual summaries of the data. For at least one visual display and at least one visual summary, your visual representation should include both data sets for comparative purposes.
• Report on where the center of each data set resides as well as on the variability of each data set.
• Describe the overall shape of the distributions.

Here are a dot plot that compares the data and a back-to-back stem-and-leaf plot. You can use your calculator to create a histogram as well as a box plot. Try showing the two box plots on the same screen for comparison.

Numerical Summaries

| | mean | median | standard deviation | 5-number summary |
|---|---|---|---|---|
| Bush | 49.63 | 50.42 | 10.30 | 8.95--43.97--50.42--56.84--67.76 |
| Gore | 46.01 | 46.46 | 10.10 | 26.34--40.9--46.46--50.63--85.16 |

For each candidate, the mean and median are quite similar, so either measure can be seen as the center location of the data. The standard deviation for Bush is slightly higher than for Gore, likely because of the extremely low percentage for the District of Columbia for Bush (8.95%).

These distributions are mound-shaped and skewed. The middle 50% of the data for Gore is more highly compressed than for Bush (a midspread of 9.73 percentage points versus 12.87).
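The spread comparison can be read straight from the reported 5-number summaries. The sketch below uses only those summaries, not the state-by-state data, and applies the same 1.5-midspread rule used in Example #2 to ask whether each candidate's minimum or maximum would be flagged as an outlier. The function name `describe` and the dictionary layout are our own.

```python
# 5-number summaries (percent of popular vote) as reported in the table above.
summaries = {
    "Bush": (8.95, 43.97, 50.42, 56.84, 67.76),
    "Gore": (26.34, 40.90, 46.46, 50.63, 85.16),
}

def describe(summary):
    """Midspread plus outlier flags for the minimum and maximum."""
    minimum, q1, med, q3, maximum = summary
    midspread = q3 - q1
    return {
        "midspread": midspread,
        "low_outlier": minimum < q1 - 1.5 * midspread,
        "high_outlier": maximum > q3 + 1.5 * midspread,
    }

results = {name: describe(s) for name, s in summaries.items()}

for name, r in results.items():
    print(f"{name}: midspread = {r['midspread']:.2f}, "
          f"minimum flagged = {r['low_outlier']}, "
          f"maximum flagged = {r['high_outlier']}")
```

Under this rule, Bush's minimum (8.95%) and Gore's maximum (85.16%) are both flagged. A check on the summaries alone cannot say how many other values lie beyond the fences; that requires the full table of percentages.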

Example #4

The dot plots shown here represent lengths of steel rods created by machines A, B, C, and D at a manufacturing plant. The rods are to have length 4.7 inches with an error allowance of 0.1 inches above or below that value. Any rod outside these specifications is not delivered to the buyer.

• Describe each distribution in terms of its location and its spread. Justify your description.

Machine A: The distribution is anchored just beyond 4.7 inches (median) and it appears to be almost uniformly distributed from just greater than 4.6 inches to just less than 4.8 inches.

Machine B: This distribution is also anchored just beyond 4.7 inches (median) but it appears to be widely distributed from just greater than 4.5 inches to approximately 4.95 inches. There are modest gaps at around 4.65 inches and 4.75 inches, with a cluster in between and just below the gap at 4.65 inches. Except for the gaps, the data is close to being uniformly distributed.

Machine C: This distribution is anchored at about 4.75 inches (median) but it appears to be bimodal, with data values distributed from 4.5 inches to just less than 4.9 inches. There are clusters of values around the 4.6 inch length and from 4.75 inches to 4.8 inches. Except for the gaps, the data is close to being uniformly distributed. The data are more spread out in the lower 50% of the distribution compared to the upper 50% of the distribution.

Machine D: This distribution is anchored at about 4.73 inches (median) and it appears to be almost mound-shaped and symmetrical, with data values distributed from just greater than 4.6 inches to just less than 4.9 inches. The data are more compressed in the middle 50% of the distribution compared to the lower 25% or upper 25% of the distribution.

• Of the four machines, which, if any, need not be checked or altered? Why is that?

Machine A seems to be performing according to specifications. Machine D is not far from that, but is anchored a bit high and spread out more than allowed by the error allowance. Both Machines B and C need attention.

• Which machine seems most stable in production? Which is least stable? How can you tell?

Machine A seems quite stable, given its distribution as almost uniform. Machine D represents a sample we could expect from a normally distributed population, which is what we might expect from this process. Machines B and C seem least stable, with Machine B showing a large variation and Machine C showing inconsistent production across a distribution that is outside the tolerance levels.

• Which machine produces rods farthest from the target length? How did you determine that?

Machine B has the greatest deviation from the target length of 4.7 inches. We can see this by comparing each value in the distribution to the desired location of 4.7 inches.
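If the individual rod lengths were read off a dot plot, the comparison to the target could be made numerical along the following lines. This is only a sketch: the function `summarize_machine` is our own, and the list `hypothetical_rods` contains made-up values for illustration, not the data for any of the four machines.

```python
from statistics import mean, median

TARGET = 4.7     # inches, desired rod length
ALLOWANCE = 0.1  # inches, allowed error above or below the target

def summarize_machine(lengths, target=TARGET, allowance=ALLOWANCE):
    """Summarize how far a machine's rods fall from the target length."""
    deviations = [abs(x - target) for x in lengths]
    # The small epsilon guards against floating-point error at the boundary.
    rejected = sum(1 for d in deviations if d > allowance + 1e-9)
    return {
        "median": median(lengths),
        "mean_abs_deviation": mean(deviations),
        "rejected": rejected,
    }

# HYPOTHETICAL values for illustration only; not read from the dot plots.
hypothetical_rods = [4.62, 4.68, 4.70, 4.71, 4.75, 4.83]

summary = summarize_machine(hypothetical_rods)
print(f"median = {summary['median']:.3f} in")
print(f"mean absolute deviation from target = {summary['mean_abs_deviation']:.3f} in")
print(f"rods outside the allowance: {summary['rejected']} of {len(hypothetical_rods)}")
```

Running the same function on each machine's lengths would let you rank the machines by mean absolute deviation from 4.7 inches and by the number of rods rejected. The mean absolute deviation is one reasonable choice of measure here, not the only one.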