Finding the line of best fit
When Barbara, Carmen, and Denice looked at the following plot,
they thought the points lay more or less in a line. They each drew a
different line, using different ideas about what line would fit the
data best.
You might have seen this in the problem set, “Tennis, anyone?”
[Plot: Top Earnings, WTA (Women’s Tennis) 2001]
When statisticians and other mathematicians want to fit a line (or
another curve) to data, they often use numerical methods
and strict guidelines about what the curve of best fit would
be.
Carmen found a line with equation y = -0.16x + 2.44. Denice found
a line with equation y = -0.19x + 2.58. Neither line passes exactly
through any of the points. The difference between a player’s actual
earnings and the earnings a line predicts is called a residual.
For example, Carmen’s line predicts the ninth-ranked player’s
earnings to be $1 million: -0.16(9) + 2.44 = 1. The actual earnings
were $0.946385 million. The residual is 0.946385 - 1, which is
-0.053615. Denice’s line predicts the earnings to be $0.87 million,
since -0.19(9) + 2.58 = 0.87, so the residual is 0.946385 - 0.87,
which is 0.076385.
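The residual arithmetic for the ninth-ranked player can be checked with a short Python sketch. The function names `predict` and `residual` are made up for this illustration; the slopes, intercepts, and actual earnings come from the text.

```python
def predict(slope, intercept, rank):
    """Earnings (in $ millions) that a line predicts for a given rank."""
    return slope * rank + intercept


def residual(actual, predicted):
    """Residual = actual value - predicted value."""
    return actual - predicted


# Actual earnings of the ninth-ranked player, in $ millions (from the text).
actual_ninth = 0.946385

carmen_pred = predict(-0.16, 2.44, 9)   # Carmen's line: y = -0.16x + 2.44
denice_pred = predict(-0.19, 2.58, 9)   # Denice's line: y = -0.19x + 2.58

carmen_residual = residual(actual_ninth, carmen_pred)
denice_residual = residual(actual_ninth, denice_pred)

# Round for display only; floating-point arithmetic is not exact.
print(round(carmen_pred, 2), round(carmen_residual, 6))
print(round(denice_pred, 2), round(denice_residual, 6))
```

A negative residual here means the line predicted more than the player actually earned.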
- A residual is calculated as actual value - predicted value.
What does the sign of the residual tell you?
- What does the absolute value of the residual represent?
- Calculate the absolute values of the residuals and complete
the following table. For the last row, find the sums of the
last two columns.
| Rank | Actual earnings ($ millions) | Predicted earnings: Carmen | Predicted earnings: Denice | Absolute residual: Carmen | Absolute residual: Denice |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.522610 | | | | |
| 2 | 2.238624 | | | | |
| 3 | 2.086263 | | | | |
| 4 | 1.832242 | | | | |
| 5 | 1.465116 | | | | |
| 6 | 1.314659 | | | | |
| 7 | 1.155716 | | | | |
| 8 | 0.995704 | | | | |
| 9 | 0.946385 | | | | |
| 10 | 0.867702 | | | | |
| | | | Sum: | | |
- Which line, Carmen’s or Denice’s, has a lower sum of the
absolute values of the residuals? Why would that line be
better to use for predicting players’ earnings than the
other?
Using the absolute value of the residuals gives you a way to
decide if one line is better than another. However, it’s difficult
to use absolute values to generalize the method so that you can
find the best line to use for any data set.
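As a rough sketch of this comparison, the sum of the absolute residuals can be computed for any candidate line in Python. The helper name `sum_abs_residuals` is hypothetical; the ranks, earnings, and the two lines are taken from the problem.

```python
# Ranks 1 through 10 and actual 2001 earnings in $ millions (from the table).
ranks = list(range(1, 11))
earnings = [2.522610, 2.238624, 2.086263, 1.832242, 1.465116,
            1.314659, 1.155716, 0.995704, 0.946385, 0.867702]


def sum_abs_residuals(slope, intercept, xs, ys):
    """Add up |actual - predicted| over all data points for one line."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


carmen_total = sum_abs_residuals(-0.16, 2.44, ranks, earnings)
denice_total = sum_abs_residuals(-0.19, 2.58, ranks, earnings)

# The line with the smaller total comes closer to the points overall.
print("Carmen:", round(carmen_total, 6))
print("Denice:", round(denice_total, 6))
```

This makes it easy to score a given line, but it does not say how to search for the line with the smallest possible total, which is the difficulty described above.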
One line of “best” fit is the one that makes the sum of the squares
of the residuals as small as possible. There is a formula for this
line, derived by minimizing the sum of the squared residuals. The
resulting line is called the least squares regression line.
(“Regression” simply means that some numerical way is used to decide
how well the line fits the data. This is “least squares regression”
because the way to decide is to make the sum of the squares of the
residuals the least possible.) The calculations are not really
complicated, but even with just a few data points, they can get
messy. Fortunately, graphing calculators and statistical software
can use the formula to find the least squares regression line
easily.
Suppose you have a data set {(x1, y1), ..., (xn, yn)}. The standard
deviation in the x variable is Sx and in the y variable is Sy. Let
x̄ and ȳ be the means of the x and y values, respectively, and let r
be the correlation coefficient of the data. Then the least squares
regression line is y = a + bx, where a = ȳ - b x̄ and b = r(Sy/Sx).
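Here is a minimal Python sketch of the least squares calculation, assuming the usual closed form b = r(Sy/Sx) and a = ȳ - b x̄, where r is the correlation coefficient. The function name `least_squares_line` and the three-point data set are made up for illustration; they are not part of the problem set.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + bx."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation coefficient r, written out from its definition.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Hypothetical toy data, chosen so the arithmetic is easy to check by hand.
toy_x = [1, 2, 3]
toy_y = [1, 3, 2]
a, b = least_squares_line(toy_x, toy_y)
print(a, b)  # prints 1.0 0.5, that is, the line y = 1 + 0.5x
```

The same function can be applied to the rank and earnings lists to compare with what a graphing calculator reports.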
- Use a graphing calculator or other software to find the least
squares regression line. (This is sometimes just called linear
regression.)
- Use the line you just found to complete the following table,
similar to the one in problem 3.
You can copy Denice’s predicted earnings column from problem 3, but
you’ll need to calculate values for the other columns.
| Rank | Actual earnings ($ millions) | Predicted earnings: Denice | Predicted earnings: Regression | Squared residual: Denice | Squared residual: Regression |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.522610 | | | | |
| 2 | 2.238624 | | | | |
| 3 | 2.086263 | | | | |
| 4 | 1.832242 | | | | |
| 5 | 1.465116 | | | | |
| 6 | 1.314659 | | | | |
| 7 | 1.155716 | | | | |
| 8 | 0.995704 | | | | |
| 9 | 0.946385 | | | | |
| 10 | 0.867702 | | | | |
| | | | Sum: | | |
Answers
- If the residual is negative, the predicted value is too high.
(That is, the point is below the line.) If the residual is
positive, the predicted value is too low. (The point is above
the line.)
- The (vertical) distance from the point to the line. Or, the
distance from the predicted point to the actual point.
- The table is as follows:
Teacher’s Note: You might have students complete both tables in
pairs, or using spreadsheets or the data list capabilities of a
graphing calculator.
| Rank | Actual earnings ($ millions) | Predicted earnings: Carmen | Predicted earnings: Denice | Absolute residual: Carmen | Absolute residual: Denice |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.522610 | 2.28 | 2.39 | 0.242610 | 0.132610 |
| 2 | 2.238624 | 2.12 | 2.20 | 0.118624 | 0.038624 |
| 3 | 2.086263 | 1.96 | 2.01 | 0.126263 | 0.076263 |
| 4 | 1.832242 | 1.80 | 1.82 | 0.032242 | 0.012242 |
| 5 | 1.465116 | 1.64 | 1.63 | 0.174884 | 0.164884 |
| 6 | 1.314659 | 1.48 | 1.44 | 0.165341 | 0.125341 |
| 7 | 1.155716 | 1.32 | 1.25 | 0.164284 | 0.094284 |
| 8 | 0.995704 | 1.16 | 1.06 | 0.164296 | 0.064296 |
| 9 | 0.946385 | 1.00 | 0.87 | 0.053615 | 0.076385 |
| 10 | 0.867702 | 0.84 | 0.68 | 0.027702 | 0.187702 |
| | | | Sum: | 1.269861 | 0.972631 |
- Denice’s line has a lower sum. That means the total distance
between the predicted and actual points is smaller than for
Carmen’s line, so overall, Denice’s line comes closer to the
points than Carmen’s.
- The line (with each parameter rounded to five decimal places) is
y = -0.19135x + 2.59492.
- The table is as follows:
To calculate the squares of the residuals, you subtract each
predicted value from the corresponding actual (observed) value, then
square the result.
| Rank | Actual earnings ($ millions) | Predicted earnings: Denice | Predicted earnings: Regression | Squared residual: Denice | Squared residual: Regression |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.522610 | 2.39 | 2.4036 | 0.01759 | 0.01417 |
| 2 | 2.238624 | 2.20 | 2.2122 | 0.00149 | 0.00070 |
| 3 | 2.086263 | 2.01 | 2.0209 | 0.00582 | 0.00428 |
| 4 | 1.832242 | 1.82 | 1.8295 | 0.00015 | 0.00001 |
| 5 | 1.465116 | 1.63 | 1.6382 | 0.02719 | 0.02995 |
| 6 | 1.314659 | 1.44 | 1.4468 | 0.01571 | 0.01747 |
| 7 | 1.155716 | 1.25 | 1.2555 | 0.00889 | 0.00995 |
| 8 | 0.995704 | 1.06 | 1.0641 | 0.00413 | 0.00468 |
| 9 | 0.946385 | 0.87 | 0.8728 | 0.00583 | 0.00542 |
| 10 | 0.867702 | 0.68 | 0.6814 | 0.03523 | 0.03470 |
| | | | Sum: | 0.12203 | 0.12132 |
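The sums of squared residuals can be reproduced with a short Python sketch. The helper name `sum_squared_residuals` is hypothetical; the data, Denice’s line, and the rounded regression line y = -0.19135x + 2.59492 come from the problem and its answers.

```python
# Ranks 1 through 10 and actual 2001 earnings in $ millions (from the table).
ranks = list(range(1, 11))
earnings = [2.522610, 2.238624, 2.086263, 1.832242, 1.465116,
            1.314659, 1.155716, 0.995704, 0.946385, 0.867702]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Add up (actual - predicted)^2 over all data points for one line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


denice_sse = sum_squared_residuals(-0.19, 2.58, ranks, earnings)
regression_sse = sum_squared_residuals(-0.19135, 2.59492, ranks, earnings)

print("Denice:    ", round(denice_sse, 5))
print("Regression:", round(regression_sse, 5))
```

Because this sketch squares unrounded residuals, the totals may differ from the table’s sums in the last decimal place. The regression line’s total should be the smaller of the two, since least squares regression is defined by minimizing exactly this quantity.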