Illinois State University Mathematics Department

MAT 312: Probability and Statistics for Middle School Teachers
Dr. Roger Day (day@ilstu.edu)

### Fitting Lines to Scatter Plots Using Least-Squares Linear Regression

Fitting Lines to Scatter Plots of Data

In earlier notes, we described two ways to determine an equation for a linear model of a two-variable data set. The first, called a spaghetti line, is simply an eyeballing technique: we place a straight line on a scatter plot using our best visual judgment about where the line belongs. We mentioned at least two criteria we might take into account in placing a spaghetti line:

1. Place the line so that about half the points in the scatter plot are above the line and about half the points are below the line.
2. Position the line so that it is close to as many points as possible. That is, make the distances from the line to the points as small as possible.

The second technique we practiced for positioning a line of best fit on a scatter plot was called the Median-Median Line. As its name implies, the median-median line is based on identifying representative points whose coordinates are the medians of the x-values and the y-values after the data are partitioned into three groups by vertical lines. The median-median points from the two outside groups determine the slope of the median-median line. We then slide that line one-third of the way from its original position toward the middle median-median point, thereby acknowledging that the middle group carries one-third of the weight of the entire data set.

In these notes, we present another technique for determining a line of best fit for a scatter plot of data. This technique, called least-squares linear regression, or the least-squares line of best fit, is based on positioning a line so as to minimize the sum of all the squared vertical distances from the line to the actual data points. Your challenge in mastering this material is not only to understand and be able to carry out the technique but also to compare its strengths and weaknesses with those of the other best-fit techniques you are learning about. As with other methods we're learning, least-squares linear regression can be carried out with a calculator.

Least-Squares Linear Regression

Before describing the technique used to determine the equation of a least-squares regression line, we need to look at three important component parts of the process. These are residuals, sum-of-squares error, and the centroid. We will again use the optical reaction to stimulus data we used to develop the median-median line.

One of the criteria we previously identified to judge the goodness of fit of a linear model was the vertical distance from each point in the plot to the line representing the linear model of the data. We have a particular name for these distances when a model is positioned on a scatter plot: they are called residual values, or simply residuals.

The line in the plot shown above is the median-median line we calculated previously. Its equation is y=0.5x+2.5, where x is the optical reaction score before eating and y is the optical reaction score after eating. The table below shows the original data set, the predicted y values for each original x value (symbolized as y', pronounced "y prime"), and the residual value for each data point, y-y'.

| reaction score prior to eating meal (x) | reaction score after eating meal (y) | predicted after-meal reaction score (y' = 0.5x + 2.5) | residual value (y - y') |
|---|---|---|---|
| 3 | 5 | 4 | 1 |
| 4 | 6 | 4.5 | 1.5 |
| 2 | 3 | 3.5 | -0.5 |
| 1 | 4 | 3 | 1 |
| 5 | 4 | 5 | -1 |
| 7 | 8 | 6 | 2 |
| 9 | 5 | 7 | -2 |
| 6 | 2 | 5.5 | -3.5 |
| 8 | 7 | 6.5 | 0.5 |
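The residual calculation in the table can be sketched in a few lines of code. This is an illustrative sketch only; the data and the line y' = 0.5x + 2.5 come from the notes, while the variable and function names are my own:

```python
# Optical reaction data from the notes
x = [3, 4, 2, 1, 5, 7, 9, 6, 8]  # reaction score before eating
y = [5, 6, 3, 4, 4, 8, 5, 2, 7]  # reaction score after eating

def predict(x_val):
    """Predicted after-meal score from the median-median line y' = 0.5x + 2.5."""
    return 0.5 * x_val + 2.5

# residual = observed minus predicted, y - y'
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
print(residuals)  # [1.0, 1.5, -0.5, 1.0, -1.0, 2.0, -2.0, -3.5, 0.5]
```

The printed list matches the residual column of the table above.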

The residual values provide us some measure of how well the line fits the data, that is, the goodness of fit. Based on the informal criteria we've already identified, we'd like the residual values to be as small as possible and for about half of them to be positive and half negative. The latter condition corresponds to our criterion that about half the points should be on each side of the line.

To compare more than one of the many possible lines of fit, we could compare the residuals for each possible line. One way to carry out that comparison is to simply sum the residuals for a particular line of fit and compare that to the sum of the residuals for another possible line of fit. A problem with this, however, is that negative and positive residuals tend to balance or counteract each other, and so the sums may not reveal as much about the goodness of fit as we would like. How have we dealt with this problem in the past?

Two options are readily apparent: use either the sum of the absolute values of the residuals or the sum of the squares of the residuals. Both options eliminate the problem with negative residual values. In statistics, we have traditionally used the second option. We call this calculation the sum of the squared error terms, or sum-of-squares error, abbreviated SSE. This provides a quantitative measure with which to compare the goodness of fit for two or more potential lines of fit for a data set: the smaller the SSE, the better the fit. The first table below extends our previous table to show the SSE for the median-median line of best fit. The next table shows the same calculations for a different prediction equation, one some of us could have found by eyeballing a spaghetti line on the original scatter plot.

| reaction score prior to eating meal (x) | reaction score after eating meal (y) | predicted after-meal reaction score (y' = 0.5x + 2.5) | residual value (y - y') | square of each residual (y - y')^2 |
|---|---|---|---|---|
| 3 | 5 | 4 | 1 | 1 |
| 4 | 6 | 4.5 | 1.5 | 2.25 |
| 2 | 3 | 3.5 | -0.5 | 0.25 |
| 1 | 4 | 3 | 1 | 1 |
| 5 | 4 | 5 | -1 | 1 |
| 7 | 8 | 6 | 2 | 4 |
| 9 | 5 | 7 | -2 | 4 |
| 6 | 2 | 5.5 | -3.5 | 12.25 |
| 8 | 7 | 6.5 | 0.5 | 0.25 |

SSE = 26

| reaction score prior to eating meal (x) | reaction score after eating meal (y) | predicted after-meal reaction score (y' ≈ 0.4286x + 2.5714) | residual value (y - y') | square of each residual (y - y')^2 |
|---|---|---|---|---|
| 3 | 5 | 3.8571 | 1.1429 | 1.3061 |
| 4 | 6 | 4.2857 | 1.7143 | 2.9388 |
| 2 | 3 | 3.4286 | -0.4286 | 0.1837 |
| 1 | 4 | 3 | 1 | 1 |
| 5 | 4 | 4.7143 | -0.7143 | 0.5102 |
| 7 | 8 | 5.5714 | 2.4286 | 5.8980 |
| 9 | 5 | 6.4286 | -1.4286 | 2.0408 |
| 6 | 2 | 5.1429 | -3.1429 | 9.8776 |
| 8 | 7 | 6 | 1 | 1 |

SSE = 24.7551
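Comparing lines by SSE is easy to automate. In this sketch, the second line's slope and intercept (roughly 3/7 and 18/7) are recovered from the predicted values in the table above; the function name is my own:

```python
# Optical reaction data from the notes
x = [3, 4, 2, 1, 5, 7, 9, 6, 8]
y = [5, 6, 3, 4, 4, 8, 5, 2, 7]

def sse(slope, intercept):
    """Sum of squared residuals for the line y' = slope*x + intercept."""
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

print(sse(0.5, 2.5))             # median-median line: 26.0
print(round(sse(3/7, 18/7), 4))  # spaghetti line: 24.7551
```

The smaller SSE identifies the better fit under the least-squares criterion.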

Our calculations show that the second equation, y' ≈ 0.4286x + 2.5714, provides a slightly better fit, based on the criterion that we seek to minimize SSE. It is important to emphasize the last statement: one equation is better than the other based on a specific criterion. The first equation, the median-median line, may be a better equation based on some other criterion, such as resistance to outliers.

Another important location in a two-variable data set is called the centroid of the data. The centroid is the ordered pair determined by the mean of each variable in the data set. For the optical stimulus response measures, the centroid is (5, 4.89). The centroid provides a measure of location for the two-variable data set, because each of its two coordinates is the mean of a one-variable data set. Here is the scatter plot of the optical stimulus response measures with the centroid included.
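The centroid calculation itself is just a pair of means. A minimal sketch using the data from the notes:

```python
# Optical reaction data from the notes
x = [3, 4, 2, 1, 5, 7, 9, 6, 8]
y = [5, 6, 3, 4, 4, 8, 5, 2, 7]

# centroid: the mean of each variable, (x-bar, y-bar)
centroid = (sum(x) / len(x), sum(y) / len(y))
print(round(centroid[0], 2), round(centroid[1], 2))  # 5.0 4.89
```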

Another strategy for determining a line of best fit to model the data is to look at several lines that contain the centroid, each with a different slope. We calculate the SSE for each line and use the line with the smallest SSE. The plot below shows seven different lines that contain the centroid. The table that follows the plot shows the SSE for each of the seven lines.

| line # | centroid | another point on line | slope | SSE |
|---|---|---|---|---|
| 1 | (5, 4.89) | (0, 4.89) | 0 | 28.89 |
| 2 | (5, 4.89) | (0, 4) | 0.178 | 24.39 |
| 3 | (5, 4.89) | (0, 3) | 0.378 | 23.85 |
| 4 | (5, 4.89) | (0, 2) | 0.578 | 28.12 |
| 5 | (5, 4.89) | (0, 1) | 0.778 | 37.19 |
| 6 | (5, 4.89) | (0, 0) | 0.978 | 51.05 |
| 7 | (5, 4.89) | (1, 0) | 1.222 | 74.52 |
| 8 | (5, 4.89) | (0, 3.39) | 0.300 | 23.49 |

Among the seven lines plotted above, the table shows that line 3 has the smallest SSE. I used my spreadsheet and the y-intercept values for lines 2 and 3 to try to find another line through the centroid with a smaller SSE. By guessing and checking I found line 8, with slope 0.300 and y-intercept approximately 3.39, which has an SSE of 23.49.
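The spreadsheet's trial-and-check search can be sketched as a simple scan: every candidate line passes through the centroid, so only the slope varies. The step size of 0.001 is an arbitrary choice for this sketch, not something specified in the notes:

```python
# Optical reaction data from the notes
x = [3, 4, 2, 1, 5, 7, 9, 6, 8]
y = [5, 6, 3, 4, 4, 8, 5, 2, 7]
xbar, ybar = sum(x) / len(x), sum(y) / len(y)  # centroid coordinates

def sse_through_centroid(slope):
    """SSE for the line through the centroid with the given slope."""
    intercept = ybar - slope * xbar
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# scan slopes from 0 to 1 in steps of 0.001 and keep the smallest SSE
best_sse, best_slope = min(
    (sse_through_centroid(m / 1000), m / 1000) for m in range(0, 1001)
)
print(round(best_slope, 3), round(best_sse, 2))  # 0.3 23.49
```

The scan lands on the same line 8 found by guessing and checking: slope 0.300 with SSE about 23.49.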

How will we know when we have found a line of fit that gives us the smallest SSE, assuming that we want to meet that criterion? As I just described, I used a spreadsheet and a trial-and-check process to zoom in on the smallest SSE I could find. Even with a spreadsheet, this is not the most efficient process!

It turns out that we can call on another branch of mathematics, Calculus, to help us out. Our goal is to minimize SSE, and the tools of Calculus are well developed for determining conditions that minimize a specified characteristic. Calculus shows us that the line with the smallest SSE contains the centroid of the data. Calculus also provides formulas we can use to calculate the slope and y-intercept of the desired line of best fit. Although you can rely directly on the formulas, your calculator has a built-in routine for determining the line of best fit that satisfies the least-squares criterion. As you might have already guessed, this line is called the least-squares line of best fit because it is the best-fit line based on the least-squares criterion.
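The closed-form result from Calculus is the standard pair of least-squares formulas: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and intercept = ȳ − slope · x̄. Applied to the data from the notes, a sketch looks like this (the derivation itself is deferred, as the notes describe):

```python
# Optical reaction data from the notes
x = [3, 4, 2, 1, 5, 7, 9, 6, 8]
y = [5, 6, 3, 4, 4, 8, 5, 2, 7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n  # centroid coordinates

# standard least-squares formulas
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # = 18
sxx = sum((xi - xbar) ** 2 for xi in x)                       # = 60
slope = sxy / sxx                # 18/60 = 0.3
intercept = ybar - slope * xbar  # 44/9 - 1.5, about 3.3889
print(round(slope, 3), round(intercept, 2))  # 0.3 3.39
```

The formulas guarantee the line passes through the centroid, and the result agrees with line 8 from the guess-and-check search: slope 0.300, y-intercept about 3.39.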