Illinois State University Mathematics Department

MAT 312: Probability and Statistics for Middle School Teachers

Dr. Roger Day (day@ilstu.edu)

### Judging the Goodness of Fit of a Proposed Model

Refresher

In our study of two-variable data sets, we have identified important characteristics of relationships between data sets (direction, strength, shape) and we have generated numerical and visual representations of such relationships (centroid, median-median points, scatter plot). After a brief review of linear relationships (finding slope and intercepts, determining equations, making real-world interpretations), we went on to consider a variety of ways to generate linear models, or lines of best fit, for two-variable relationships. These included spaghetti lines, median-median lines, and least-squares linear regression lines.

Judging Goodness of Fit

We conclude this discussion of two-variable data analysis by bringing together criteria for judging how good and how appropriate a model of best fit is for a given data set. We will mention and illustrate four criteria:

1. scatter plot with ellipse test or with proposed model graphed on the scatter plot,
2. sum-of-the-squared errors (SSE),
3. correlation coefficient (r), and
4. residual plots.

We will use the data set below to help focus on each of these criteria.

Wholesale Price for Used Ford LTD Automobiles

| Age of Car (Model Year) | Value |
| --- | --- |
| 10 (1989) | 2600 |
| 9 (1990) | 3320 |
| 8 (1991) | 3560 |
| 7 (1992) | 4315 |
| 6 (1993) | 4905 |
| 5 (1994) | 6005 |
| 4 (1995) | 7585 |
| 3 (1996) | 9235 |
| 2 (1997) | 11010 |
| 1 (1998) | 14025 |

A scatter plot of these data reveals a strong negative relationship: value falls as age increases. This ought to motivate us to determine one or more models for the relationship that appears to exist. Our focus has been on linear models, so we begin by generating a least-squares linear regression model as well as a median-median line of best fit. We have also generated several non-linear models. The information in the table below is from a TI-83 calculator using the data set above.

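| Type of Model, Equation of Model, and Correlation Coefficient (r) | Scatter Plot with Graph of Model | Residual Plot | SSE |
| --- | --- | --- | --- |
| Least-squares linear regression [TI-83 screen] | [graph] | [graph] | 10,568,230.61 |
| Median-median line [TI-83 screen] | [graph] | [graph] | 11,258,096.94 |
| Exponential regression [TI-83 screen] | [graph] | [graph] | 1,043,349.51 |
| Logarithmic regression [TI-83 screen] | [graph] | [graph] | 663,968.45 |
| Power regression [TI-83 screen] | [graph] | [graph] | 17,095,492.69 |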
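The two linear fits can be sketched in a few lines of Python. This is only an illustration, not the calculator's own routine: the function names are hypothetical, and the median-median procedure assumes the common convention of sorting by x, splitting the points into three groups with equal-sized outer groups, taking the slope from the two outer median points, and averaging the three intercepts.

```python
from statistics import mean, median

# Data from the handout: car age in years and wholesale value.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [14025, 11010, 9235, 7585, 6005, 4905, 4315, 3560, 3320, 2600]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def median_median_line(xs, ys):
    """Return (slope, intercept) of a median-median line (convention assumed above)."""
    points = sorted(zip(xs, ys))
    n = len(points)
    outer = n // 3 + (1 if n % 3 == 2 else 0)  # size of the two outer groups
    groups = [points[:outer], points[outer:n - outer], points[n - outer:]]
    # Median point of each group: (median of x-values, median of y-values).
    medians = [(median([x for x, _ in g]), median([y for _, y in g])) for g in groups]
    (x_left, y_left), _, (x_right, y_right) = medians
    slope = (y_right - y_left) / (x_right - x_left)
    # Average the intercepts of the three parallel lines through the median points.
    intercept = mean([y - slope * x for x, y in medians])
    return slope, intercept


ls_slope, ls_intercept = least_squares_line(ages, values)
mm_slope, mm_intercept = median_median_line(ages, values)
print(f"least-squares: slope {ls_slope:.2f}, intercept {ls_intercept:.2f}")
print(f"median-median: slope {mm_slope:.2f}, intercept {mm_intercept:.2f}")
```

Both slopes come out negative, in line with the negative relationship described above.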

In addition to the two linear models we've discussed in class (least-squares linear regression line and median-median line), the table above provides information for three non-linear models, including an exponential model, a logarithmic model, and a power model. In courses such as MAT 207 you may have studied properties and characteristics of these and other non-linear models. While such characteristics are vitally important in selecting and then justifying the use of a particular model of best fit, we will not take up a discussion of those characteristics here.

Each row of the table shows information about one model. The left entry in each row shows TI-83 screens displaying the name of the model that has been generated, its equation, and, when calculated by the TI-83, its correlation coefficient. The correlation coefficient provides one measure of goodness of fit. The closer this value is to 1 or to -1, the better the fit. The value of r ranges from -0.957 to -0.997 for the four models for which correlation coefficients have been computed. If we used only this measure of goodness of fit, all four models would be judged to show strong negative relationships.
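For readers who want to reproduce the correlation coefficients away from the calculator, here is a minimal sketch. It assumes that r for each non-linear model is the ordinary correlation computed on the transformed (linearized) data: x against ln y for the exponential model, ln x against y for the logarithmic model, and ln x against ln y for the power model. The helper name `pearson_r` is hypothetical.

```python
import math
from statistics import mean

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [14025, 11010, 9235, 7585, 6005, 4905, 4315, 3560, 3320, 2600]


def pearson_r(xs, ys):
    """Return the correlation coefficient r for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


ln_ages = [math.log(x) for x in ages]
ln_values = [math.log(y) for y in values]

# Assumption: non-linear models are judged on their linearized data.
r = {
    "linear": pearson_r(ages, values),
    "exponential": pearson_r(ages, ln_values),
    "logarithmic": pearson_r(ln_ages, values),
    "power": pearson_r(ln_ages, ln_values),
}
for name, value in r.items():
    print(f"{name:>12}: r = {value:.3f}")
```

Under that assumption the four values fall inside the range quoted above, with the linear model farthest from -1 and the logarithmic model closest.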

The second entry in each row shows a scatter plot of the original data set upon which a graph of the best-fit model has been superimposed. In our check of this visual representation, the perfect fit will show all points of the scatter plot (the original data set) precisely on the graph of the best-fit model. The exponential and logarithmic regression models appear to be closest to meeting this ideal. We also see in the graphs of the linear models that the lines appear to separate the points in the scatter plot into distinct clusters, with some clusters being below the line of best fit and some above it. This sort of pattern in the scatter plot showing the graph of the best-fit model ought to raise your level of concern. The next entry in each row focuses on this.

That entry is called the residual plot. This is a scatter plot of (x, residual) for each model. The first element in each ordered pair is the x-value from the original data set; in this case, the age of the car. The second element in each ordered pair is the residual associated with that age of car and the particular model of best fit in question, that is, the observed value minus the value the model predicts. This plot is very important to examine in making judgements about the appropriateness of any best-fit model. If a pattern is detected in the residual plot, you should take great caution in accepting that model as appropriate to represent the data. In contrast, a residual plot that shows no discernible pattern, only a random scattering of points, provides evidence to support a judgement that the model is appropriate for the data set.
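The (x, residual) pairs for the least-squares line can be computed directly. The sketch below takes each residual as observed value minus predicted value and prints the sign of each one so that any pattern is easy to spot without drawing the plot. The variable names are illustrative only.

```python
from statistics import mean

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [14025, 11010, 9235, 7585, 6005, 4905, 4315, 3560, 3320, 2600]

# Least-squares line through the data.
x_bar, y_bar = mean(ages), mean(values)
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, values))
         / sum((x - x_bar) ** 2 for x in ages))
intercept = y_bar - slope * x_bar

# Residual = observed value - value predicted by the model.
residuals = [y - (intercept + slope * x) for x, y in zip(ages, values)]

for x, e in zip(ages, residuals):
    print(f"age {x:2d}: residual {e:10.2f}")

# One character per car, youngest to oldest: "+" above the line, "-" below.
signs = "".join("+" if e > 0 else "-" for e in residuals)
print(signs)
```

For this data set the signs run positive, then negative, then positive again. That U-shaped pattern is the kind of structure that should make us cautious about a linear model.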

While varying degrees of patterns or randomness could be argued for the residual plots displayed in the table, it seems that the logarithmic and exponential models generate the least patterning in the residual plots while the linear models lead to residual plots with more obvious patterns. As you progress in your modeling capabilities and understanding, you may also learn how specific patterns in residual plots can be used to revise and improve your models of best fit.

The last entry in each row shows the sum of the squared residual values for the best-fit model, abbreviated SSE (sum of squares error). This is the criterion we discussed as the basis for generating a least-squares regression line of best fit. We can calculate the SSE for any proposed model and compare the results. The smaller the SSE, the better the model fits the data by this criterion. For the models above, the logarithmic and exponential models have much smaller SSE than the other models.
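The SSE comparison can be reproduced with a short script. This is a sketch under stated assumptions: the non-linear models are fitted by least squares on transformed data (ln y on x, y on ln x, ln y on ln x), the median-median line uses groups of 3, 4, and 3 points, and SSE is always measured in the original units of value. The names `fit_line`, `sse`, and `models` are hypothetical.

```python
import math
from statistics import mean, median

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [14025, 11010, 9235, 7585, 6005, 4905, 4315, 3560, 3320, 2600]


def fit_line(us, vs):
    """Least-squares (slope, intercept) for v as a linear function of u."""
    u_bar, v_bar = mean(us), mean(vs)
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
             / sum((u - u_bar) ** 2 for u in us))
    return slope, v_bar - slope * u_bar


def sse(predict):
    """Sum of squared residuals of a model, in the original units."""
    return sum((y - predict(x)) ** 2 for x, y in zip(ages, values))


ln_ages = [math.log(x) for x in ages]
ln_values = [math.log(y) for y in values]
models = {}

s, i = fit_line(ages, values)
models["linear"] = lambda x, s=s, i=i: i + s * x

# Median-median line: median points of the groups of 3, 4, and 3 cars.
pts = sorted(zip(ages, values))
meds = [(median([x for x, _ in g]), median([y for _, y in g]))
        for g in (pts[:3], pts[3:7], pts[7:])]
s = (meds[2][1] - meds[0][1]) / (meds[2][0] - meds[0][0])
i = mean([y - s * x for x, y in meds])
models["median-median"] = lambda x, s=s, i=i: i + s * x

s, i = fit_line(ages, ln_values)          # exponential: ln y = i + s*x
models["exponential"] = lambda x, s=s, i=i: math.exp(i + s * x)

s, i = fit_line(ln_ages, values)          # logarithmic: y = i + s*ln x
models["logarithmic"] = lambda x, s=s, i=i: i + s * math.log(x)

s, i = fit_line(ln_ages, ln_values)       # power: ln y = i + s*ln x
models["power"] = lambda x, s=s, i=i: math.exp(i + s * math.log(x))

sse_by_model = {name: sse(f) for name, f in models.items()}
for name, value in sorted(sse_by_model.items(), key=lambda kv: kv[1]):
    print(f"{name:>14}: SSE = {value:,.2f}")
```

Sorting by SSE puts the logarithmic model first and the exponential model second, with the two linear models and the power model far behind.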

Having generated, analyzed, and compared various numerical and visual representations, it is appropriate to summarize the information and, if possible, make judgements about which model or models might be most appropriate for the data.

Using the information in the table, we can strongly argue that the logarithmic model is the most appropriate for the data set on car age and car value. This model generates the correlation coefficient closest to -1, has a graph that comes closest to passing through all points in the scatter plot of the data, has a residual plot with a random appearance, and produces the smallest SSE. Our conclusions are relative to the other models produced and displayed in the table. You might generate other models that surpass the logarithmic model on these criteria, and you are encouraged to look for them.

Another factor to keep in mind is the context of the problem situation. For car age and value, a general trend might be that cars lose value in proportion to their current value. That could account for the rapid loss of value in early years and the leveling off in value as the car increases in age. It is unlikely that a car will have a negative value, so extrapolating with a negative-sloped linear model will eventually be inappropriate, whereas an exponential model levels off toward zero without ever becoming negative. A logarithmic model also flattens as age increases, but it continues to decrease without bound, so it too will eventually predict negative values if extrapolated far enough.
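To see how context interacts with extrapolation, the sketch below evaluates three of the fitted models at ages beyond the data. It assumes the same fitting approach as a calculator-style regression on transformed data (ln y on x for the exponential model, y on ln x for the logarithmic model). The ages 12, 15, and 20 are chosen only for illustration.

```python
import math
from statistics import mean

ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
values = [14025, 11010, 9235, 7585, 6005, 4905, 4315, 3560, 3320, 2600]


def fit_line(us, vs):
    """Least-squares (slope, intercept) for v as a linear function of u."""
    u_bar, v_bar = mean(us), mean(vs)
    slope = (sum((u - u_bar) * (v - v_bar) for u, v in zip(us, vs))
             / sum((u - u_bar) ** 2 for u in us))
    return slope, v_bar - slope * u_bar


lin_slope, lin_intercept = fit_line(ages, values)
exp_slope, exp_intercept = fit_line(ages, [math.log(y) for y in values])
log_slope, log_intercept = fit_line([math.log(x) for x in ages], values)


def linear(x):
    return lin_intercept + lin_slope * x


def exponential(x):
    return math.exp(exp_intercept + exp_slope * x)


def logarithmic(x):
    return log_intercept + log_slope * math.log(x)


# Predicted values for cars older than any in the data set.
for age in (12, 15, 20):
    print(f"age {age}: linear {linear(age):9.0f}, "
          f"exponential {exponential(age):7.0f}, "
          f"logarithmic {logarithmic(age):7.0f}")
```

With these fits the line already predicts a negative value at age 12, and the exponential curve stays positive at every age. The fitted logarithmic curve remains positive at age 15 but has dropped below zero by age 20, so a good fit within the data does not by itself justify extrapolating far beyond it.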