Illinois State University Mathematics Department

MAT 312: Probability and Statistics for Middle School Teachers

Dr. Roger Day (day@math.ilstu.edu)

### Introduction to Multivariate Data Analysis

We have devoted much attention to the essential characteristics of a one-variable, or univariate, data set. We have considered concepts and skills used to describe the characteristics of location, spread, and shape. We have learned how to generate measures of central tendency, measures of variation, and descriptions of the shape of a distribution. Perhaps most importantly, our emphasis has been on exploring the data and generating visual and numeric representations to help determine the very nature of a data set.

We now focus our attention on two-variable relationships, or bivariate data sets. The data values of a bivariate data set are ordered pairs. Here is a bivariate data set. It shows the total points and total personal fouls for members of two professional basketball teams.

Does there appear to be any relationship between a player's points scored and his personal fouls? When presented with data sets with two or more variables, or multivariate data, a common question focuses on whether or not some relationship exists among the data.

As we extend our data exploration to multivariate data sets, we begin to explore relationships between the data. Once again, we strive to describe three essential characteristics of relationships: direction, strength, and shape. Concepts and skills associated with these characteristics can help us to effectively describe a multivariate data set.

Exploring Relationships Through Scatter Plots

To explore the data, we will create scatter plots of bivariate data. A scatter plot provides a first look at how two sets of data may relate to each other. We create a scatter plot with a traditional two-dimensional coordinate system, or what you may call an xy-plane. The data pairs that make up the data set are plotted as ordered pairs on the coordinate axes. Here is a scatter plot of the fouls-points data pairs from the table of values above.

Note these components of the scatter plot.

• The horizontal and vertical axes are labeled according to the data being used.
• The scales on each axis are clearly marked.
• It is not required that the scales for each axis be identical.
• At times, we may begin one or both axes at non-zero values.
• All dots or points that represent data pairs are exactly the same size.
• If two or more data pairs are equal, indicate that by including the appropriate numeral on the scatter plot.
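
The last guideline above can be checked before plotting by counting how often each data pair occurs. The following is a minimal sketch, not part of the original handout; the `pairs` list holds made-up (fouls, points) values and `pair_counts` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical (fouls, points) pairs, invented for illustration only.
pairs = [(2, 10), (3, 14), (2, 10), (5, 21), (3, 14), (2, 10)]

def pair_counts(data):
    """Return each distinct data pair with the number of times it occurs."""
    return Counter(data)

counts = pair_counts(pairs)
for (x, y), n in sorted(counts.items()):
    # A numeral is needed on the plot only when a pair occurs more than once.
    label = f"plot once, labeled {n}" if n > 1 else "plot once, no label"
    print(f"({x}, {y}): {label}")
```

Any pair with a count greater than 1 is plotted as a single point with that count written beside it.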

A scatter plot gives us a visual display of the bivariate data set. It may reveal characteristics of shape, direction, and strength not apparent from the raw data. We now look at examples to help describe each of these characteristics.

The scatter plots that follow show relationships between pairs of data sets. By the direction of the relationship, we describe how the data pairs increase or decrease with respect to each other. Figure (a) shows the speed of a bicycle during the first 60 feet of travel. It shows that as distance increases, so does speed. We use the word positive to describe the direction of the relationship between distance and speed, because as the values in one data set increase, so do their associated values in the other data set.

Figure (b) shows the temperature of an object that has been placed in a freezer. We see that as time increases, the temperature decreases. We use the word negative to describe the direction of the relationship between time and temperature, because as the values in one data set increase, the associated values in the other data set decrease.

Some bivariate sets of data appear to have neither a positive nor a negative relationship. That is the situation revealed in figure (c) above. It shows the elevation above sea level and the annual rainfall for several cities throughout the world. As elevation increases, the associated rainfall values show neither a corresponding increase nor a corresponding decrease, so there is no conclusive direction to the relationship.
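
Direction can also be checked numerically. The sketch below uses the sign of the Pearson correlation coefficient, a measure this handout has not introduced, as one possible check. The data values are invented to mimic the three situations described above, and the cutoff of 0.3 is an arbitrary assumption rather than a standard rule.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def direction(xs, ys, cutoff=0.3):
    """Label the direction; the cutoff is an assumed, adjustable threshold."""
    r = pearson_r(xs, ys)
    if r > cutoff:
        return "positive"
    if r < -cutoff:
        return "negative"
    return "no clear direction"

# Hypothetical data sets, invented for illustration only.
distance = [0, 10, 20, 30, 40, 50, 60]
speed = [0, 4, 7, 9, 11, 12, 13]
minutes = [0, 5, 10, 15, 20]
temperature = [70, 52, 40, 33, 30]
elevation = [100, 500, 900, 1300, 1700]
rainfall = [40, 10, 45, 12, 38]

print(direction(distance, speed))
print(direction(minutes, temperature))
print(direction(elevation, rainfall))
```

A numeric label like this supplements the scatter plot rather than replacing it, because the sign alone says nothing about the shape of the relationship.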

For an overall look at the strength of the relationship of a bivariate data set, we can apply to a scatter plot a tool called the ellipse test. Examples are shown in the next plots. In each case, an ellipse is used to fully capture the points shown in the scatter plot. In figure (a), the ellipse is long and narrow. Its major axis is much longer than its minor axis. In (b), the major axis is longer than the minor axis, but not to the same degree as in (a). In (c), however, the ellipse may be better described as a circle, because the axes show little difference in length.

The ratio of the axis lengths of an ellipse that surrounds the data points of a scatter plot provides a rough measure of the strength of the linear relationship in a bivariate data set. The higher the ratio of major axis to minor axis (that is, the greater the difference in the two lengths), the stronger the relationship. As the ratio approaches 1:1 (that is, as the lengths grow closer to being equal to each other), the relationship grows weaker. When a circle is required to capture the plotted points, the linear relationship between the data pairs shows virtually no strength. Direction of the relationship, too, is impossible to establish in this case. Be aware, however, that these ratios can be distorted by differences in the scales of the horizontal and vertical axes.
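
The ellipse test is meant to be done by eye, but a rough numeric analog may help show why the axis ratio tracks strength and why scale matters. The sketch below swaps in a different technique: instead of an ellipse drawn around every point, it uses the ellipse implied by the spread of the data (the principal axes of the covariance matrix). The function name `axis_ratio` and all data values are assumptions made for illustration.

```python
import math

def axis_ratio(xs, ys):
    """Ratio of major to minor axis length for the ellipse implied by the
    spread of the data (square roots of the covariance-matrix eigenvalues)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    syy = sum((y - mean_y) ** 2 for y in ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Closed-form eigenvalues of the 2-by-2 matrix [[sxx, sxy], [sxy, syy]].
    center = (sxx + syy) / 2
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    major = center + half_gap
    minor = center - half_gap
    if minor <= 1e-12 * major:
        return math.inf  # points lie (numerically) on a single line
    return math.sqrt(major / minor)

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5, 6]
tight_ys = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]   # points hug a line
loose_ys = [3, 1, 5, 2, 6, 4]               # points are widely scattered

print(axis_ratio(xs, tight_ys))
print(axis_ratio(xs, loose_ys))
# Same loose data with the vertical values rescaled by a factor of 10.
print(axis_ratio(xs, [10 * y for y in loose_ys]))
```

The third call illustrates the caution in the paragraph above: rescaling one variable changes the ratio even though the underlying relationship has not changed.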

Many of the relationships revealed in scatter plots appear to be linear, as illustrated below in plot (a), and can indeed be justified as such. Many other relationships, however, do not seem to follow a straight line when plotted. We have at our disposal many other mathematical models to characterize the shape of relationships, shapes such as quadratic (b), exponential (c), and periodic (d), among many others.
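
For equally spaced horizontal values, each of these shapes leaves a recognizable numeric signature: constant first differences for linear, constant second differences for quadratic, constant ratios for exponential, and repeating values for periodic. The sketch below illustrates this with made-up model rules; the specific formulas are assumptions chosen only to show each shape, and real data would match these patterns only approximately.

```python
import math

xs = list(range(9))  # equally spaced horizontal values 0 through 8

# Hypothetical model rules, chosen only to illustrate each shape.
linear = [2 * x + 1 for x in xs]
quadratic = [x ** 2 - 4 * x + 5 for x in xs]
exponential = [3 * 2 ** x for x in xs]
# Sine values repeat every 4 steps here; rounding removes tiny float noise.
periodic = [round(math.sin(math.pi * x / 2), 6) for x in xs]

def differences(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

def ratios(values):
    """Ratios of consecutive values (values must be non-zero)."""
    return [b / a for a, b in zip(values, values[1:])]

print(differences(linear))                  # constant first differences
print(differences(differences(quadratic)))  # constant second differences
print(ratios(exponential))                  # constant ratios
print(periodic[0:4] == periodic[4:8])       # one cycle matches the next
```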

We have provided examples to help describe what we mean by the direction, strength, and shape of a relationship. To help reinforce these concepts, draw a separate scatter plot to represent each of the following relationships. In doing so, try to identify real-world contexts that fit these conditions.

1. Draw a scatter plot to show a moderately strong positive relationship with a constant rate of increase for the data pairs.
2. Draw a scatter plot to show a weak positive relationship where the data values represented on the vertical axis increase much more quickly than the data values represented on the horizontal axis.
3. Draw a scatter plot to show a perfect negative linear relationship.
4. Draw a scatter plot to show a relationship for which neither direction nor strength can be determined.