Illinois State University Mathematics Department
MAT 312: Probability and Statistics for Middle School Teachers
Dr. Roger Day (firstname.lastname@example.org)
Linear Models for Two-Variable Relationships
We have used scatter plots to represent two-variable data sets. A scatter plot may help reveal information about the direction, strength, and shape of possible relationships between two data sets.
Our goal here is to use linear models to describe such relationships. Our development progresses from an informal visual approach to a process-oriented approach that depends on attempting to specify and meet criteria for a line that "best fits" a relationship.
Positioning a Spaghetti Line to Establish a Linear Model
Here is a data set that identifies the age in years for 24 couples who applied for marriage licenses one summer.
As usual, we first create a scatter plot of the data.
There certainly appears to be a strong positive relationship between the variables in the data set. If we place a piece of uncooked spaghetti on the scatter plot, we see that a straight line provides a reasonable model for the general pattern revealed in the scatter plot: as husbands' ages increase, so do wives' ages.
We can generate an equation for the spaghetti line we've positioned. To do so, we need to identify two points on the spaghetti line. For instance, the spaghetti line shown below appears to contain the points (19,16) and (51,50). This leads to the spaghetti-line equation y=(17/16)x-67/16, where x represents a husband's age and y represents a wife's age at the time the marriage license was filed.
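The slope and y-intercept of the line through two points can be checked with exact fractions. This is a minimal sketch; the helper name line_through is an assumption made for illustration and is not part of the original notes.

```python
from fractions import Fraction

def line_through(p, q):
    """Return (slope, intercept) of the line through points p and q.

    Hypothetical helper for illustration; assumes the two points
    have different first coordinates.
    """
    (x1, y1), (x2, y2) = p, q
    slope = Fraction(y2 - y1, x2 - x1)   # rise over run
    intercept = y1 - slope * x1          # solve y1 = slope*x1 + b for b
    return slope, intercept

# The two points read from the spaghetti line in the notes.
slope, intercept = line_through((19, 16), (51, 50))
print(slope, intercept)  # 17/16 -67/16
```

Working in fractions rather than decimals keeps the slope and intercept in the same form the notes use.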
How is this equation useful? The slope, 17/16, tells us the constant change apparent in the (husband age, wife age) relationship: For each 1-unit increase in the age of a husband, the corresponding change in the age of a wife is 17/16 units. In practical terms, this indicates that as the age of a husband at time of marriage increases by a year, the age of a wife at time of marriage increases by just over a year.
The y-intercept is also apparent from the equation. It is (-67/16), or approximately -4.2. In the context of this situation, this is meaningless, for it says that when x=0, y=-4.2, or that when a man is 0 years old, the corresponding age of his spouse will be -4.2 years. Using the spaghetti-line equation would be meaningless at this point on the graph, for this is outside the reality of the situation.
The equation can also be used to predict the age of a wife for a given husband age. If a man is 46 years old when a marriage license is filed, this spaghetti-line equation predicts that his spouse's age will be just over 44.5 years. Given that the data set contains ages accurate only to the nearest year, we'd likely report the woman's predicted age as 45 years.
This type of prediction is called interpolation, for the prediction falls within the range of values found in the data set from which we generated the spaghetti-line equation. We could also use the spaghetti-line equation to make an extrapolation, a prediction outside the range of the values in the original data set. If a man is 94 years old, our spaghetti-line equation predicts the woman he marries will be 96 years old. In general, it is riskier to extrapolate from a data set than it is to interpolate within it. Why do you think that might be?
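Both predictions can be reproduced by evaluating the spaghetti-line equation directly. A small sketch follows; the function name predict_wife_age is an assumption for illustration, and the slope 17/16 and intercept -67/16 are the values discussed in these notes.

```python
def predict_wife_age(husband_age):
    """Evaluate the spaghetti line y = (17/16)x - 67/16 (hypothetical helper)."""
    return 17 / 16 * husband_age - 67 / 16

print(predict_wife_age(46))  # 44.6875, reported as 45 (interpolation)
print(predict_wife_age(94))  # 95.6875, reported as 96 (extrapolation)
```

The function has no way of knowing whether an input lies inside the range of the original data, so judging whether a prediction is an interpolation or an extrapolation remains up to the person using it.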
Is there anything special about the position of the spaghetti line we've considered here? Not particularly. It was simply a spaghetti line that seemed to "fit" the scatter plot of the data.
There are two factors within the spaghetti line we can adjust: the slope of the line and its relative position up and down without changing the slope. Here are a few more spaghetti lines drawn on the same scatter plot. The red and the blue spaghetti lines have the same slope as the original line. We might say we have shifted the original up or down. The green and black spaghetti lines, however, show a change in slope compared to the original line.
How will we decide which of the many possible spaghetti lines to use as a model for the positive linear relationship between age at marriage of husbands and wives that appears in the scatter plot? As with many decisions we make, we'll need to establish some criteria and evaluate our choices and results based on those criteria. The next section describes one method we can use. In subsequent notes, others will be illustrated.
The first process we will explore and practice is called the Median-Median Line of best fit. As the name implies, it is based on use of medians. The process involves the determination of physical locations in the data set that are representative of the data.
As you recall from our discussion of 1-variable data analysis, the median is a resistant or robust statistic because it is not significantly influenced by extreme, or outlier, values. Another measure of location, the mean of a data set, is not resistant, for the mean of a data set is strongly influenced by outlier values, simply based upon how the mean is calculated.
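A quick numerical comparison shows the difference in resistance. The values below are made up purely for illustration and do not come from any data set in these notes.

```python
from statistics import mean, median

values = [3, 4, 5, 6, 7]          # made-up illustrative values
with_outlier = [3, 4, 5, 6, 70]   # same list with one extreme value

print(mean(values), median(values))              # 5 5
print(mean(with_outlier), median(with_outlier))  # 17.6 5
```

Replacing a single value moves the mean from 5 to 17.6, while the median stays at 5.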
For that reason, a median-median line of best fit may be your choice because you do not want a fit line to be strongly influenced by extreme or outlier values. This, of course, will make more sense when you learn about another best-fit technique that does depend on the mean, a technique called least-squares regression.
To illustrate how to calculate a median-median line for a two-variable data set, we will use the data pairs shown in the table below. The values represent nine subjects' optical reaction to a light stimulus before and after eating. The higher the number, the slower the reaction.
We begin, as usual, with a scatter plot, shown above to the right of the data table. Here are the steps we take to create a Median-Median line.
Divide the points in the scatter plot into three groups.
We position vertical lines to separate the points into three sets, with as close as possible to an equal number of points in each set.
The scatter plot below shows the two vertical red lines positioned on the original scatter plot, with three data points in each group.
When it is not possible to divide the points equally, we should make the division as equitable as possible, trying to maintain symmetry in our division. In a set of 10 ordered pairs, for instance, it is best to place 4 points in the middle group and 3 points in each of the outside groups. For a set of 20 points, place 7 points in each of the first and last groups and 6 points in the middle group.
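The symmetric division rule described above can be sketched as a small function. The name group_sizes is an assumption for illustration: one leftover point goes to the middle group, and two leftover points go one to each outside group.

```python
def group_sizes(n):
    """Return (left, middle, right) group sizes for n points (hypothetical helper)."""
    base, extra = divmod(n, 3)
    if extra == 0:
        return base, base, base        # divides evenly
    if extra == 1:
        return base, base + 1, base    # one leftover point goes in the middle
    return base + 1, base, base + 1    # two leftover points go on the outsides

print(group_sizes(9))   # (3, 3, 3)
print(group_sizes(10))  # (3, 4, 3)
print(group_sizes(20))  # (7, 6, 7)
```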
Determine the median-median point in each of the three groups of data.
A median-median point is an ordered pair that represents the physical middle of a group. We work our way from left to right within a group and identify the median value among the first elements in the ordered pairs. Likewise, we work our way top to bottom within a group and identify the median value among the second elements of the ordered pairs.
In our example, the left-most group contains the ordered pairs (1,4), (2,3), and (3,5). We see from the ordered pairs or from the scatter plot that the value 2 is the median first value and that 4 is the median second value. Thus, the median-median point for the first group is (2,4). In a similar manner, we determine the other two median-median points to be (5,4) and (8,7).
Notice in our example that for the left group, the median-median point was not one of the data points, whereas for the middle and right groups the median-median points were actual data values. It is not required that a median-median point be one of the actual data values, but it may occur.
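Finding a median-median point amounts to taking two separate medians, one for each coordinate. Here is a minimal sketch using the left-most group from the example; the function name median_median_point is an assumption for illustration.

```python
from statistics import median

def median_median_point(group):
    """Return (median of first coordinates, median of second coordinates)."""
    xs = [x for x, _ in group]
    ys = [y for _, y in group]
    return median(xs), median(ys)

left_group = [(1, 4), (2, 3), (3, 5)]
print(median_median_point(left_group))  # (2, 4)
```

Because the two medians are found independently, the result (2, 4) need not be one of the points in the group.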
Create a line using the two outside median-median points.
Use these two ordered pairs to create an equation for this line. In our example, we determine the equation for the line containing the points (2,4) and (8,7). You should be able to show that this line has the equation y=0.5x+3.
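One way to verify this equation is to compute the slope from the two outside median-median points and then use point-slope form with (2,4):

```latex
\begin{align*}
m &= \frac{7 - 4}{8 - 2} = \frac{3}{6} = 0.5 \\
y - 4 &= 0.5\,(x - 2) \\
y &= 0.5x + 3
\end{align*}
```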
Identify an ordered pair one-third the distance from the line to the middle median-median point.
This requires a few steps. First, use the equation to determine the ordered pair on that line that has the same first coordinate as does your middle median-median point. Here, when we let x=5 in the equation, y is 5.5. Thus, the desired point on the line is (5,5.5).
We now measure the distance from the line to the middle median-median point. This is just the difference of the y coordinates, 5.5-4, or 1.5 units.
Calculate one-third of this distance and move that far from the line toward the median-median point. Here, one-third of 1.5 is 0.5 units, so we move 0.5 units down from the line toward the median-median point. This results in the coordinate (5,5).
Write the equation of the line that goes through the new point with slope the same as the original line.
Graphically, what we are doing is sliding or shifting the original line one-third of the way from its original position toward the middle median-median point. This maintains the same slope as determined by the outside median-median points yet acknowledges that the middle group of data points represents one-third of the entire set.
Our original equation was y=0.5x+3, and we move down 0.5 units. The slope remains the same and we decrease the y-intercept by 0.5. This results in a median-median line with equation y=0.5x+2.5.
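The one-third shift can be checked with a few lines of arithmetic. This sketch uses a signed gap (the middle point's y-value minus the line's y-value), which is an assumption about how to organize the calculation; it makes the same code work whether the middle point lies above or below the line.

```python
slope, intercept = 0.5, 3.0   # line through the outside median-median points
mid_x, mid_y = 5, 4           # middle median-median point

y_on_line = slope * mid_x + intercept   # 5.5, the point on the line above x = 5
gap = mid_y - y_on_line                 # -1.5, negative because the point is below the line
new_intercept = intercept + gap / 3     # shift the line one-third of the gap

print(f"y = {slope}x + {new_intercept}")  # y = 0.5x + 2.5
```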
To practice the technique of determining the equation for a Median-Median Line of best fit, use the data set below. These data contain the test scores of 10 students on a writing exam and on a mathematics exam.
You can also practice by determining the Median-Median Line equation for the couples' ages data.
Student ID   Writing   Mathematics
1            50        43
2            59        56
3            78        56
4            93        90
5            55        60
6            90        86
7            57        64
8            81        94
9            54        45
10           56        52
A median-median line is useful when the data contain outliers that would strongly influence the mean of either the first or second elements in the two-variable data set. The median-median technique relies on the determination and use of the physical middle in each of three portions of the data. Median-median points from the two outside groups are used to determine the slope of the best-fit line. The line's final position up and down is then altered by moving it one-third the distance from its original position toward the middle median-median point. This uses the fact that the middle group of data points accounts for one-third of the data points in the entire set.
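The whole procedure summarized above can be sketched as a single function, which may be handy for checking a hand calculation. This is an illustration under stated assumptions, not a definitive implementation: the function name median_median_line is hypothetical, the data must contain at least three points, and no tied first coordinates may fall on the boundary between two groups.

```python
from statistics import median

def median_median_line(points):
    """Return (slope, intercept) of a median-median line.

    Sketch only: assumes at least three points and no tied first
    coordinates at the boundaries between groups.
    """
    pts = sorted(points)                       # order by first coordinate
    base, extra = divmod(len(pts), 3)
    outer = base + (1 if extra == 2 else 0)    # size of each outside group
    groups = [pts[:outer], pts[outer:len(pts) - outer], pts[len(pts) - outer:]]

    # One median-median point per group.
    summary = [(median(x for x, _ in g), median(y for _, y in g)) for g in groups]
    (x1, y1), (x2, y2), (x3, y3) = summary

    slope = (y3 - y1) / (x3 - x1)              # from the two outside points
    intercept = y1 - slope * x1
    gap = y2 - (slope * x2 + intercept)        # signed distance to the middle point
    return slope, intercept + gap / 3          # shift one-third of the way

# (writing, mathematics) scores from the practice data set in these notes.
scores = [(50, 43), (59, 56), (78, 56), (93, 90), (55, 60),
          (90, 86), (57, 64), (81, 94), (54, 45), (56, 52)]
print(median_median_line(scores))
```

Passing the three median-median points from the worked example, (2,4), (5,4), and (8,7), as if they were a three-point data set returns a slope of 0.5 and an intercept of 2.5, in agreement with the equation y=0.5x+2.5 found by hand.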