The least squares algorithm is a fundamental method in statistics and data analysis for finding the best-fitting line through a set of data points. Given such a collection, we want to find a linear model y equals m x plus b that best represents the relationship between the x and y variables.
Different lines can be drawn through the same set of points, but which one is the best? The residuals, shown as vertical distances from each point to the line, help us measure how well a line fits the data. Our goal is to find the line that minimizes these prediction errors.
To understand how we measure the quality of a fit, we need to examine residuals. A residual is the difference between the actual y value and the predicted y value from our line. Each data point has its own residual, which can be positive or negative.
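Written as a formula, and using e sub i as a label for the i-th residual (a symbol chosen here for convenience), the residual under a candidate line with slope m and intercept b is:

```latex
e_i = y_i - \hat{y}_i = y_i - (m x_i + b)
```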
But why do we square the residuals instead of just adding them up? First, squaring prevents positive and negative errors from canceling each other out. Second, it penalizes larger errors more heavily than smaller ones. Summing these squared residuals gives us the Sum of Squared Errors, or SSE.
Different lines produce different SSE values. Here we compare our good fitting line with a poor fitting line. The poor line has much larger residuals and therefore a much higher SSE value. The best fitting line is the one that minimizes this sum of squared errors.
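As a rough sketch of this comparison in Python (the data values and the two candidate lines below are made-up placeholders, not the worked example that follows later):

```python
def sse(xs, ys, m, b):
    """Sum of squared errors of the candidate line y = m*x + b on the data (xs, ys)."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Placeholder data that roughly follows y = 2x + 1 (illustrative values only).
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

good_fit = sse(xs, ys, m=2.0, b=1.0)   # a line close to the trend
poor_fit = sse(xs, ys, m=0.5, b=4.0)   # a line that ignores the trend

print(good_fit, poor_fit)  # the poorly chosen line yields a much larger SSE
```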
To find the optimal slope and intercept, we need to minimize the sum of squared errors function. We start by expressing SSE as a function of both m and b, where each residual is y i minus m x i minus b, and we square and sum all these terms.
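In symbols, for n data points, this objective function is:

```latex
\mathrm{SSE}(m, b) = \sum_{i=1}^{n} \left( y_i - m x_i - b \right)^2
```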
Next, we take partial derivatives of SSE with respect to both m and b. The partial derivative with respect to m involves the x i terms, while the partial derivative with respect to b is proportional to the sum of all residuals. Setting both partial derivatives equal to zero gives us the conditions for minimization.
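Written out, the two conditions are:

```latex
\frac{\partial \, \mathrm{SSE}}{\partial m} = -2 \sum_{i=1}^{n} x_i \left( y_i - m x_i - b \right) = 0,
\qquad
\frac{\partial \, \mathrm{SSE}}{\partial b} = -2 \sum_{i=1}^{n} \left( y_i - m x_i - b \right) = 0
```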
These conditions lead to the normal equations. By rearranging the partial derivative equations, we get a system of two linear equations in two unknowns: m and b. The first equation relates the sum of x i y i products to the parameters, while the second relates the sum of y values.
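In symbols, the normal equations are:

```latex
m \sum_{i=1}^{n} x_i^2 + b \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i,
\qquad
m \sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i
```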
Solving this system of equations gives us the final least squares formulas. The slope m is calculated using the covariance of x and y divided by the variance of x. The intercept b is simply the mean of y minus m times the mean of x. These formulas guarantee the minimum sum of squared errors.
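Written compactly, with x-bar and y-bar denoting the sample means, these formulas are:

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m \bar{x}
```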
Let's work through a complete example with five data points. We have x values from 1 to 5, and corresponding y values. First, we calculate the means: x-bar equals 3.0 and y-bar equals 5.62.
Next, we calculate the numerator of the slope formula by finding the sum of the products of deviations from the means. Each x minus x-bar is multiplied by the corresponding y minus y-bar. This gives us 16.32.
For the denominator, we calculate the sum of squared deviations of x from its mean. Each x minus x-bar is squared and summed, giving us 10. Now we can calculate the slope: m equals 16.32 divided by 10, which equals 1.632.
The intercept is calculated as y-bar minus m times x-bar, which gives us 0.724. Our final least squares line is y equals 1.632 x plus 0.724. This line passes through the mean point and achieves a smaller sum of squared residuals than any other possible line.
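As a quick Python sketch of this calculation (the function name fit_least_squares is just an illustrative label; since the individual y values are not listed above, the check below simply reuses the summary numbers from this walkthrough):

```python
def fit_least_squares(xs, ys):
    """Closed-form simple linear regression: returns the pair (slope m, intercept b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # slope numerator
    s_xx = sum((x - x_bar) ** 2 for x in xs)                       # slope denominator
    m = s_xy / s_xx
    return m, y_bar - m * x_bar

# Re-check the walkthrough arithmetic from its stated summary statistics.
x_bar, y_bar = 3.0, 5.62
s_xy, s_xx = 16.32, 10.0
m = s_xy / s_xx          # 1.632
b = y_bar - m * x_bar    # approximately 0.724
print(m, b)
```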
The least squares regression line has several important properties. First, it always passes through the point x-bar, y-bar, which is the center of the data. This is a fundamental geometric property that ensures the line is anchored at the data's center point.
Second, the sum of all residuals equals zero. This means that the positive and negative deviations from the line perfectly balance out. Third, the line minimizes the sum of squared vertical distances from all points to the line.
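The first two properties follow directly from the condition that the partial derivative with respect to b equals zero, since dividing that sum by n gives the mean relationship:

```latex
\sum_{i=1}^{n} (y_i - m x_i - b) = 0
\quad \Longrightarrow \quad
\sum_{i=1}^{n} e_i = 0
\quad \text{and} \quad
\bar{y} = m \bar{x} + b
```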
For interpretation, the slope tells us how much y changes for each unit increase in x. The intercept represents the y-value when x equals zero. In our example, the slope of 1.632 means y increases by about 1.63 units for each unit increase in x.
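To make a prediction from the fitted line, we just plug in an x value; using x equals 4 (an arbitrary value inside the observed range) with the slope and intercept from our example:

```latex
\hat{y} = 1.632 \cdot 4 + 0.724 = 7.252
```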
R-squared measures how well our line fits the data by comparing explained variance to total variance. Total variance is measured from each point to the overall mean, while explained variance is from the predicted values to the mean. An R-squared of 0.94 means our model explains 94% of the variance in the data.
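In symbols, with y-hat denoting the predicted values, this ratio can be written as:

```latex
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\mathrm{SSE}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```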