Honors Precalculus: Linear Regression
with Vertical Least Squares Example
Mt Lebanon HS 2004-5
David Kosbie
Note: this example uses, but does not derive, several equations for linear regression with vertical least squares. If you are interested, you can learn more about the derivation of these equations from MathWorld (http://mathworld.wolfram.com/LeastSquaresFitting.html).
Also, for the most part, computations are rounded to the nearest hundredth or to two significant digits, whichever is more precise. This produces minor discrepancies between the answers here and the values produced by your calculator's built-in linear regression function.
Problem: Without using a calculator (except for simple arithmetic), given the following table of (x,y) points, find the line of best fit, ŷ, and find its correlation coefficient, R. Also, find the correlation coefficient for z(x) = 2x + 1, and show that this line is not as good a fit as ŷ.
i | x_i | y_i |
1 | 1 | 3 |
2 | 2 | 5 |
3 | 4 | 8 |
Note: the equation z(x) = 2x + 1 was used to generate this nearly-linear data: the first two points lie on z(x), and the third point very nearly lies on z(x) -- (4,8) does not lie on z(x), but (4,9) does. So we would expect z(x) to be a good approximation for ŷ. This problem demonstrates that it's not the best, however, since its correlation coefficient will be smaller than ŷ's.
Step 1: Find x̄
x̄ = (∑x_i)/n
  = (1 + 2 + 4)/3
  = 7/3
  ≈ 2.33
Step 2: Find ȳ
ȳ = (∑y_i)/n
  = (3 + 5 + 8)/3
  = 16/3
  ≈ 5.33
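As an aside (outside the problem's no-calculator rule), Steps 1 and 2 are easy to double-check with a few lines of Python:

```python
# Double-check of Steps 1-2: the means x-bar and y-bar.
xs = [1, 2, 4]
ys = [3, 5, 8]

n = len(xs)
x_bar = sum(xs) / n   # 7/3, about 2.33
y_bar = sum(ys) / n   # 16/3, about 5.33
```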
Step 3: Compute SSxx and SSxy:
Note that:
SSxx = ∑(x_i - x̄)²
SSxy = ∑(x_i - x̄)(y_i - ȳ)
Step 3.1: First add the columns (x_i - x̄) and (y_i - ȳ):
i | x_i | y_i | x_i - x̄ | y_i - ȳ |
1 | 1 | 3 | 1 - 2.33 = -1.33 | 3 - 5.33 = -2.33 |
2 | 2 | 5 | 2 - 2.33 = -0.33 | 5 - 5.33 = -0.33 |
3 | 4 | 8 | 4 - 2.33 = 1.67 | 8 - 5.33 = 2.67 |
Step 3.2: Use these new columns to easily compute two more columns:
i | x_i | y_i | x_i - x̄ | y_i - ȳ | (x_i - x̄)² | (x_i - x̄)(y_i - ȳ) |
1 | 1 | 3 | -1.33 | -2.33 | 1.77 | 3.10 |
2 | 2 | 5 | -0.33 | -0.33 | 0.11 | 0.11 |
3 | 4 | 8 | 1.67 | 2.67 | 2.79 | 4.46 |
Step 3.3: Add a "sum" row (just for the final two columns):
i | x_i | y_i | x_i - x̄ | y_i - ȳ | (x_i - x̄)² | (x_i - x̄)(y_i - ȳ) |
1 | 1 | 3 | -1.33 | -2.33 | 1.77 | 3.10 |
2 | 2 | 5 | -0.33 | -0.33 | 0.11 | 0.11 |
3 | 4 | 8 | 1.67 | 2.67 | 2.79 | 4.46 |
sum | - | - | - | - | 4.67 | 7.67 |
Step 3.4: Use these sums for SSxx and SSxy:
SSxx = ∑(x_i - x̄)² = 4.67
SSxy = ∑(x_i - x̄)(y_i - ȳ) = 7.67
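Again as an aside, a short Python check of SSxx and SSxy, computed straight from the definitions (it agrees with the table's 4.67 and 7.67 after rounding):

```python
# Double-check of Step 3: SSxx and SSxy from their definitions.
xs = [1, 2, 4]
ys = [3, 5, 8]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

SSxx = sum((x - x_bar) ** 2 for x in xs)                       # 14/3, about 4.67
SSxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 23/3, about 7.67
```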
Step 4: Compute "a" (the slope of ŷ):
a = SSxy / SSxx
= 7.67 / 4.67
= 1.64
Step 5: Compute "b" (the y-intercept of ŷ):
First, since ŷ is a line, we know that:
ŷ = ax + b
Next, we note that the point (x̄, ȳ) sits on ŷ, so we substitute this point:
ȳ = a·x̄ + b
From above, we determined that:
a = 1.64, x̄ = 2.33, and ȳ = 5.33
So we substitute these values into ȳ = a·x̄ + b to get:
5.33 = (1.64)(2.33) + b
     = 3.82 + b
Thus:
b = 5.33 - 3.82
  = 1.51
Thus, we know that:
ŷ = 1.64x + 1.51
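Another Python aside checks Steps 4 and 5. Note that with unrounded intermediate values the intercept comes out exactly b = 1.5; our 1.51 differs only because we rounded x̄, ȳ, and a to hundredths along the way, as the note at the top of this handout warned.

```python
# Double-check of Steps 4-5: slope a and intercept b, without rounding
# the intermediate values.
xs = [1, 2, 4]
ys = [3, 5, 8]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
SSxx = sum((x - x_bar) ** 2 for x in xs)
SSxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

a = SSxy / SSxx        # slope: 23/14, about 1.64
b = y_bar - a * x_bar  # intercept: exactly 1.5 (we got 1.51 from rounding)
```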
Step 6: Compute SSdev, SSres, and SSres_z
Here SSres_z denotes the sum of the squared residuals computed with z(x) rather than ŷ(x).
We do all three of these together since they are nearly identical. In each case, we are summing the squares of the vertical distances from y_{i} to some other line.
For SSdev, we use vertical distances to the line y = ȳ = 5.33
For SSres, we use vertical distances to the line y = ŷ = 1.64x + 1.51
For SSres_z, we use vertical distances to the line y = z(x) = 2x + 1
Thus, note that:
SSdev = ∑(y_i - ȳ)²
SSres = ∑(y_i - ŷ_i)²
SSres_z = ∑(y_i - z_i)²
Step 6.1: First add the columns ȳ, ŷ_i, and z_i:
(Of course, we don't really need the column ȳ, as it is constant, but this reinforces the fact that all three values -- SSdev, SSres, and SSres_z -- are computed in the same way.)
i | x_i | y_i | ȳ | ŷ_i = 1.64x_i + 1.51 | z_i = 2x_i + 1 |
1 | 1 | 3 | 5.33 | (1.64)(1) + 1.51 = 1.64 + 1.51 = 3.15 | 2(1) + 1 = 2 + 1 = 3 |
2 | 2 | 5 | 5.33 | (1.64)(2) + 1.51 = 3.28 + 1.51 = 4.79 | 2(2) + 1 = 4 + 1 = 5 |
3 | 4 | 8 | 5.33 | (1.64)(4) + 1.51 = 6.56 + 1.51 = 8.07 | 2(4) + 1 = 8 + 1 = 9 |
We can pause briefly to reflect on these numbers. Notice, as we already knew, that z is a perfect fit at x=1 and x=2, and is off by only one at x=4. So z is a pretty good fit. However, ŷ looks pretty good, too: 3.15 is really close to 3, 4.79 is really close to 5, and 8.07 is really, really close to 8. Which is better? Let's continue...
Step 6.2: Add columns for the squares of the vertical differences:
i | x_i | y_i | ȳ | ŷ_i | z_i | (y_i - ȳ)² | (y_i - ŷ_i)² | (y_i - z_i)² |
1 | 1 | 3 | 5.33 | 3.15 | 3 | (3 - 5.33)² = (-2.33)² = 5.43 | (3 - 3.15)² = (-0.15)² = 0.022 | (3 - 3)² = (0)² = 0 |
2 | 2 | 5 | 5.33 | 4.79 | 5 | (5 - 5.33)² = (-0.33)² = 0.11 | (5 - 4.79)² = (0.21)² = 0.044 | (5 - 5)² = (0)² = 0 |
3 | 4 | 8 | 5.33 | 8.07 | 9 | (8 - 5.33)² = (2.67)² = 7.13 | (8 - 8.07)² = (-0.07)² = 0.005 | (8 - 9)² = (-1)² = 1 |
Step 6.3: Add a "sum" row (just for the final three columns):
i | x_i | y_i | ȳ | ŷ_i | z_i | (y_i - ȳ)² | (y_i - ŷ_i)² | (y_i - z_i)² |
1 | 1 | 3 | 5.33 | 3.15 | 3 | 5.43 | 0.022 | 0.0 |
2 | 2 | 5 | 5.33 | 4.79 | 5 | 0.11 | 0.044 | 0.0 |
3 | 4 | 8 | 5.33 | 8.07 | 9 | 7.13 | 0.005 | 1.0 |
sum | - | - | - | - | - | 12.67 | 0.071 | 1.0 |
Step 6.4: Use these sums for SSdev, SSres, and SSres_z:
SSdev = ∑(y_i - ȳ)² = 12.67
SSres = ∑(y_i - ŷ_i)² = 0.071
SSres_z = ∑(y_i - z_i)² = 1.0
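As before, a quick Python aside confirms these three sums (using the rounded line ŷ = 1.64x + 1.51 from Step 5, so the totals match the table above):

```python
# Double-check of Step 6: the three sums of squared vertical distances.
xs = [1, 2, 4]
ys = [3, 5, 8]
y_bar = sum(ys) / len(ys)

def yhat(x):           # the fitted line from Step 5
    return 1.64 * x + 1.51

def z(x):              # the comparison line z(x)
    return 2 * x + 1

SSdev   = sum((y - y_bar) ** 2 for y in ys)                # about 12.67
SSres   = sum((y - yhat(x)) ** 2 for x, y in zip(xs, ys))  # about 0.071
SSres_z = sum((y - z(x)) ** 2 for x, y in zip(xs, ys))     # exactly 1.0
```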
We pause again to reflect on these numbers. Think about the figures on the bottom of page 327. Each of these numbers (12.67, 0.071, and 1.0) represents the sum of the areas of the gray squares for a different line. The smaller this total area, the better the fit of that line (right?). So we can see that ŷ, with SSres at only 0.071, seems like a very good fit indeed! But we don't know for sure until we compute R, so let's continue.
Step 7: Compute R and R_z
R² = (SSdev - SSres) / SSdev
   = (12.67 - 0.071) / 12.67
   = 12.599 / 12.67
   = 0.994
R = √0.994
  = 0.997
R_z² = (SSdev - SSres_z) / SSdev
     = (12.67 - 1.0) / 12.67
     = 11.67 / 12.67
     = 0.921
R_z = √0.921
    = 0.960
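And one more Python aside for Step 7, starting from the sums in Step 6.4:

```python
# Double-check of Step 7: R and R_z from the sums in Step 6.4.
from math import sqrt

SSdev, SSres, SSres_z = 12.67, 0.071, 1.0

R2 = (SSdev - SSres) / SSdev      # about 0.994
R = sqrt(R2)                      # about 0.997
R2_z = (SSdev - SSres_z) / SSdev  # about 0.921
R_z = sqrt(R2_z)                  # about 0.960
```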
Step 8: Reflection
Step 8.1: Check our work
To remind you, our results are:
ŷ = 1.64x + 1.51 with R² = 0.994 and R = 0.997.
Here is what the TI-83's linear regression function returns on this data set:
LinReg
y=ax+b
a=1.642...
b=1.5
r²=0.9943...
r=0.9971...
These are almost exactly the same answers as ours! Whew!
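If you're curious where the small differences come from, the whole computation can be redone in exact rational arithmetic using Python's fractions module (an aside, of course -- the hand computation above rounds to hundredths). The exact answers match the TI-83's output:

```python
# Exact (no-rounding) version of the whole regression, using fractions.
from fractions import Fraction as F

xs = [F(1), F(2), F(4)]
ys = [F(3), F(5), F(8)]
n = len(xs)
x_bar = sum(xs) / n   # exactly 7/3
y_bar = sum(ys) / n   # exactly 16/3

SSxx = sum((x - x_bar) ** 2 for x in xs)
SSxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

a = SSxy / SSxx       # exactly 23/14 = 1.642857..., the TI-83's a=1.642...
b = y_bar - a * x_bar # exactly 3/2, the TI-83's b=1.5

SSdev = sum((y - y_bar) ** 2 for y in ys)
SSres = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
R2 = 1 - SSres / SSdev  # exactly 529/532 = 0.99436..., the TI-83's r²=0.9943...
```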
Step 8.2: Is z(x) a good fit?
We found that R_z = 0.96, so yes, z(x) = 2x + 1 is in fact a good fit for the data!
Step 8.3: Is ŷ a good fit?
We found that R = 0.997, so ŷ = 1.64x + 1.51 is a great fit for the data!
Step 8.4: Which is better, z(x) or ŷ(x)?
Since R > R_z, we conclude that even though both functions are good fits for the data, ŷ is the better fit.
Step 8.5: Is this result expected?
Yes! After all, ŷ is supposed to be the best fit. (Again, we've not proven this -- check out the link to MathWorld, above, for that proof.)