Problems With A Point

Finding the line of best fit

When Barbara, Carmen, and Denice looked at the following plot, they thought the points lay more or less in a line. They each drew a different line, using different ideas about what line would fit the data best.
You might have seen this in the problem set, “Tennis, anyone?”
[Figure: Top Earnings, WTA (Women’s Tennis) 2001]

When statisticians and other mathematicians want to fit a line (or another curve) to data, they often use numerical methods and strict guidelines about what the curve of best fit would be.

Carmen found a line with equation y = -0.16x + 2.44. Denice found a line with equation y = -0.19x + 2.58. Neither line passes exactly through any of the points. For each point, the difference

    actual value - predicted value

is called a residual.


For example, Carmen’s line predicts the ninth-ranked player’s earnings to be $1 million: -0.16(9) + 2.44 = 1. The actual earnings were $0.946385 million. The residual is 0.946385 - 1, which is -0.053615. Denice’s line predicts the earnings to be $0.87 million, since -0.19(9) + 2.58 = 0.87, so the residual is 0.946385 - 0.87, which is 0.076385.
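
The arithmetic for the ninth-ranked player can be checked with a few lines of Python. This is only a sketch: the helper name `predict` is made up for illustration, and the slopes, intercepts, and actual earnings are the ones quoted above (in millions of dollars).

```python
def predict(slope, intercept, rank):
    """Predicted earnings (millions of dollars) on the line y = slope * x + intercept."""
    return slope * rank + intercept

actual = 0.946385  # actual earnings of the ninth-ranked player, in millions

# residual = actual value - predicted value
carmen_residual = actual - predict(-0.16, 2.44, 9)
denice_residual = actual - predict(-0.19, 2.58, 9)

print(round(carmen_residual, 6))  # -0.053615 (point is below Carmen's line)
print(round(denice_residual, 6))  # 0.076385 (point is above Denice's line)
```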

  1. A residual is calculated as actual value - predicted value. What does the sign of the residual tell you?
  2. What does the absolute value of the residual represent?
  3. Calculate the absolute values of the residuals and complete the following table. For the last row, find the sums of the last two columns.

    Earnings are in millions of dollars.

    Rank   Actual       Predicted Earnings     |Residuals|
           Earnings     Carmen     Denice      Carmen     Denice
    1      2.522610
    2      2.238624
    3      2.086263
    4      1.832242
    5      1.465116
    6      1.314659
    7      1.155716
    8      0.995704
    9      0.946385
    10     0.867702
    Sum:

  4. Which line, Carmen’s or Denice’s, has the lower sum of the absolute values of the residuals? Why would that line be better to use for predicting players’ earnings than the other?

    Using the absolute values of the residuals gives you a way to decide whether one line is better than another. However, it’s difficult to use absolute values to generalize the method so that you can find the best line for any data set.

    One line of “best” fit is the one that makes the sum of the squares of the residuals as small as possible. A formula for this line was found using techniques that minimize that sum. The resulting line is called the least squares regression line. (“Regression” here simply means that some numerical criterion is used to decide how well the line fits the data. This is “least squares regression” because the criterion is to make the sum of the squares of the residuals as small as possible.) The calculations are not really complicated, but even with just a few data points they can get messy. Fortunately, graphing calculators and statistical software can apply the formula to find the least squares regression line easily.
    Suppose you have a data set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. The standard deviation of the x variable is $S_x$ and of the y variable is $S_y$. Let $\bar{x}$ and $\bar{y}$ be the means of the x and y values, respectively. Then the least squares regression line is $y = a + bx$, where

    \[ b = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{(S_x)^2} \]

    and

    \[ a = \bar{y} - b\bar{x}. \]

  5. Use a graphing calculator or other software to find the least squares regression line. (This is sometimes just called linear regression.)
  6. Use the line you just found to complete the following table, similar to the one in problem 3.
    You can copy Denice’s predicted earnings column from problem 3, but you’ll need to calculate values for the other columns.

    Rank   Actual       Predicted Earnings      Residuals²
           Earnings     Denice     Regression   Denice     Regression
    1      2.522610
    2      2.238624
    3      2.086263
    4      1.832242
    5      1.465116
    6      1.314659
    7      1.155716
    8      0.995704
    9      0.946385
    10     0.867702
    Sum:
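
If you would rather use software than a graphing calculator for problem 5, the formula for b and a can be applied directly. The sketch below assumes plain Python with only the standard library; the function name `least_squares_line` is invented for illustration. It takes the intercept as the mean of y minus b times the mean of x, and uses the sample standard deviation (the one that divides by n - 1) for Sx.

```python
from statistics import mean, stdev

ranks = list(range(1, 11))
earnings = [2.522610, 2.238624, 2.086263, 1.832242, 1.465116,
            1.314659, 1.155716, 0.995704, 0.946385, 0.867702]  # millions of dollars

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x = stdev(xs)  # sample standard deviation, divides by n - 1
    # b = 1/(n-1) * sum of (x_i - x_bar)(y_i - y_bar), divided by S_x squared
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1) / s_x ** 2
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares_line(ranks, earnings)
print(f"y = {b:.5f}x + {a:.5f}")  # y = -0.19135x + 2.59492
```

Statistical packages and calculators report the same line, though they may display a different number of decimal places.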






Answers

  1. If the residual is negative, the predicted value is too high. (That is, the point is below the line.) If the residual is positive, the predicted value is too low. (The point is above the line.)
  2. It is the (vertical) distance from the point to the line, that is, the distance from the predicted point to the actual point.
  3. The table is as follows:
    Teacher’s Note: You might have students complete both tables in pairs, or using spreadsheets or the data list capabilities of a graphing calculator.

    Earnings are in millions of dollars.

    Rank   Actual       Predicted Earnings     |Residuals|
           Earnings     Carmen     Denice      Carmen     Denice
    1      2.522610     2.28       2.39        0.242610   0.132610
    2      2.238624     2.12       2.20        0.118624   0.038624
    3      2.086263     1.96       2.01        0.126263   0.076263
    4      1.832242     1.80       1.82        0.032242   0.012242
    5      1.465116     1.64       1.63        0.174884   0.164884
    6      1.314659     1.48       1.44        0.165341   0.125341
    7      1.155716     1.32       1.25        0.164284   0.094284
    8      0.995704     1.16       1.06        0.164296   0.064296
    9      0.946385     1.00       0.87        0.053615   0.076385
    10     0.867702     0.84       0.68        0.027702   0.187702
    Sum:                                       1.269861   0.972631

  4. Denice’s line has a lower sum. That means the total distance between the predicted and actual points is smaller than for Carmen’s line, so overall Denice’s line comes closer to the points than Carmen’s.
  5. The line (keeping five decimal places for each parameter) is y = -0.19135x + 2.59492.
  6. The table is as follows:
    To calculate the squares of the residuals, you subtract each predicted value from the corresponding actual (observed) value, then square the result.

    Rank   Actual       Predicted Earnings      Residuals²
           Earnings     Denice     Regression   Denice     Regression
    1      2.522610     2.39       2.4036       0.01759    0.01417
    2      2.238624     2.20       2.2122       0.00149    0.00070
    3      2.086263     2.01       2.0209       0.00582    0.00428
    4      1.832242     1.82       1.8295       0.00015    0.00001
    5      1.465116     1.63       1.6382       0.02719    0.02995
    6      1.314659     1.44       1.4468       0.01571    0.01747
    7      1.155716     1.25       1.2555       0.00889    0.00995
    8      0.995704     1.06       1.0641       0.00413    0.00468
    9      0.946385     0.87       0.8728       0.00583    0.00542
    10     0.867702     0.68       0.6814       0.03523    0.03470
    Sum:                                        0.12203    0.12132
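
For teachers who prefer a script to a spreadsheet, the column sums in both answer tables can be reproduced as follows. This is a sketch in plain Python: the names `lines` and `residuals` are chosen for illustration, and the regression line uses the rounded coefficients from answer 5, so the last digit of a sum may differ slightly from a calculator's result.

```python
ranks = range(1, 11)
earnings = [2.522610, 2.238624, 2.086263, 1.832242, 1.465116,
            1.314659, 1.155716, 0.995704, 0.946385, 0.867702]  # millions of dollars

# (slope, intercept) for each line discussed in the problem
lines = {
    "Carmen": (-0.16, 2.44),
    "Denice": (-0.19, 2.58),
    "Regression": (-0.19135, 2.59492),  # rounded coefficients from answer 5
}

def residuals(slope, intercept):
    """Residuals (actual - predicted) for every ranked player."""
    return [y - (slope * x + intercept) for x, y in zip(ranks, earnings)]

# Sum of absolute residuals (problem 3) and of squared residuals (problem 6)
sum_abs = {name: sum(abs(r) for r in residuals(*line)) for name, line in lines.items()}
sum_sq = {name: sum(r * r for r in residuals(*line)) for name, line in lines.items()}

for name in lines:
    print(f"{name:<10}  sum |r| = {sum_abs[name]:.6f}   sum r^2 = {sum_sq[name]:.5f}")
```

The regression line should give the smallest sum of squared residuals of the three, which is exactly the property that defines it.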






