Linear regression is a fundamental statistical method used to model the relationship between variables. It finds the best-fit line through the data points and uses that line to make predictions. The linear equation y = mx + b represents this relationship, where y is the predicted value, m is the slope, x is the input variable, and b is the y-intercept. The goal is to find a line that passes as close as possible to all data points, minimizing the overall vertical distance between the line and the observed values.
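To make the equation concrete, here is a minimal sketch in Python; the slope and intercept values are purely illustrative and not yet fitted to any data.

    # Predict y from x for a given slope m and intercept b (illustrative values).
    def predict(x, m, b):
        return m * x + b

    print(predict(4.0, m=2.0, b=-0.8))  # prints 7.2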
Residuals are the key to understanding how well our regression line fits the data. A residual is the difference between an observed value and the predicted value from our line. We draw vertical lines from each data point to the fitted line to visualize these residuals. Positive residuals occur when the actual value is above the line, while negative residuals occur when it's below. Since positive and negative residuals can cancel each other out, we use squared residuals to ensure all errors contribute positively to our measure of fit quality.
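The sketch below computes residuals for a small set of stand-in points against a candidate line; both the points and the line are illustrative assumptions, not data given in the text.

    # Residual = observed value minus predicted value; stand-in data and candidate line.
    xs = [1, 2, 3, 4, 5]
    ys = [0.2, 3.7, 6.2, 7.7, 8.2]
    m, b = 2.0, -0.8

    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    squared_residuals = [r ** 2 for r in residuals]
    print(residuals)               # positive above the line, negative below
    print(sum(squared_residuals))  # 3.5: squaring keeps every error positive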
The least squares method provides the mathematical foundation for linear regression. We define a cost function that measures the sum of squared residuals for any given slope and intercept. This creates a surface where each point represents a different candidate line and its corresponding error. The goal is to find the minimum point on this surface, which corresponds to the optimal line parameters. The contour lines show different levels of error, with the center representing the best fit. Arrows along the negative gradient point toward the minimum, showing how an optimization algorithm such as gradient descent would navigate downhill to the optimal solution.
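As a sketch of how an optimizer would descend this surface, the code below runs plain gradient descent on the sum of squared residuals; the data, starting point, learning rate, and iteration count are all illustrative assumptions, and the closed-form solution follows in the next step.

    # Sum of squared residuals as a function of the line parameters m and b.
    def ssr(m, b, xs, ys):
        return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    # Partial derivatives of the SSR with respect to m and b.
    def ssr_gradient(m, b, xs, ys):
        dm = sum(-2 * x * (y - (m * x + b)) for x, y in zip(xs, ys))
        db = sum(-2 * (y - (m * x + b)) for x, y in zip(xs, ys))
        return dm, db

    xs = [1, 2, 3, 4, 5]
    ys = [0.2, 3.7, 6.2, 7.7, 8.2]
    m, b, learning_rate = 0.0, 0.0, 0.01   # arbitrary start and step size
    for _ in range(5000):
        dm, db = ssr_gradient(m, b, xs, ys)
        m -= learning_rate * dm            # step against the gradient, downhill
        b -= learning_rate * db
    print(round(m, 2), round(b, 2), round(ssr(m, b, xs, ys), 2))  # approaches the least squares solution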
Now we derive the formulas for the optimal regression coefficients. The slope is the sum of the products of the deviations from the means, divided by the sum of the squared x-deviations: m = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)². The intercept is the y-mean minus the slope times the x-mean: b = ȳ - m·x̄. Let's work through a concrete example with five data points. First, we calculate the means: x̄ = 3 and ȳ = 5.2. Then we compute the deviations from these means and apply the formulas. The slope comes out to 2.0 and the intercept to -0.8, giving us the final regression line y = 2.0x - 0.8.
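The original five points are not listed in the text, so the sketch below uses a hypothetical set chosen to reproduce the stated summary numbers (x̄ = 3, ȳ = 5.2, slope 2.0, intercept -0.8); the closed-form formulas themselves are the standard least squares ones above.

    # Closed-form least squares on a hypothetical five-point dataset.
    xs = [1, 2, 3, 4, 5]
    ys = [0.2, 3.7, 6.2, 7.7, 8.2]

    x_bar = sum(xs) / len(xs)   # 3.0
    y_bar = sum(ys) / len(ys)   # 5.2

    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # 20.0
    denominator = sum((x - x_bar) ** 2 for x in xs)                      # 10.0
    m = numerator / denominator   # 2.0
    b = y_bar - m * x_bar         # -0.8
    print(m, b)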
Model evaluation metrics help us assess the quality of our regression model. R-squared measures the proportion of variance in the dependent variable that is explained by the model; it ranges from 0 to 1, with higher values indicating a better fit. Mean Squared Error quantifies the average squared difference between the observed and predicted values, with lower values being better. The correlation coefficient measures the strength of the linear relationship between the two variables. In our example, an R-squared of 0.92 means that 92 percent of the variance is explained by the model, indicating an excellent fit. Comparing a well-fitting and a poorly fitting model visually helps build intuition for these metrics.
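The sketch below computes these metrics for the same hypothetical five-point dataset used above; with those stand-in values it happens to reproduce the stated R-squared of about 0.92, but the exact numbers depend on the assumed data.

    import math

    xs = [1, 2, 3, 4, 5]
    ys = [0.2, 3.7, 6.2, 7.7, 8.2]
    m, b = 2.0, -0.8                       # fitted line from the worked example

    predictions = [m * x + b for x in xs]
    y_bar = sum(ys) / len(ys)

    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))   # residual sum of squares: 3.5
    ss_tot = sum((y - y_bar) ** 2 for y in ys)                    # total sum of squares: 43.5
    r_squared = 1 - ss_res / ss_tot                               # about 0.92
    mse = ss_res / len(ys)                                        # 0.7
    correlation = math.sqrt(r_squared)                            # about 0.96 (positive, since the slope is positive)
    print(round(r_squared, 2), round(mse, 2), round(correlation, 2))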