Linear regression is a statistical process for estimating the relationships among variables.
It is widely used for prediction and forecasting.
We will study the least squares estimation method.
Simple linear regression equation: \(Y = \beta_{0} + \beta_{1} X\). \(\beta_{0}\) and \(\beta_{1}\) are also called the model coefficients, parameters, or weights. We will use a dataset to produce estimates of these weights, so that we can predict new values using the linear regression equation.
Since we estimate our weights, there will be some error between the actual and predicted values.
This is our optimization problem (we want to minimize the residuals for all of the data points):
\[ \text{min } \sum_{i=1}^{N} ( y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} x_{i} )^{2} \]
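As a sketch, this objective can also be minimized numerically. The toy dataset below is assumed for illustration only; we treat the RSS as a function of the two weights and hand it to a general-purpose optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Toy dataset (assumed for illustration, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def rss(beta):
    """Sum of squared residuals for candidate weights (beta0, beta1)."""
    beta0, beta1 = beta
    residuals = y - beta0 - beta1 * x
    return np.sum(residuals ** 2)

# Minimize the RSS over (beta0, beta1), starting from (0, 0).
result = minimize(rss, x0=[0.0, 0.0])
beta0_hat, beta1_hat = result.x
```

In practice we rarely need a numerical optimizer for simple linear regression, since the problem has a closed-form solution, but this makes the "minimize the residuals" framing concrete.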
This sum of squared residuals is known as the residual sum of squares (RSS):
\[ RSS = \epsilon_{1}^{2} + \epsilon_{2}^{2} + \cdots + \epsilon_{n}^{2}, \quad \text{where } \epsilon_{i} = y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} x_{i} \]
It turns out that we can derive our weights analytically, by setting the partial derivatives of the RSS with respect to each weight to zero and solving:
\[ \hat{\beta}_{1} = \frac{\sum ( y_{i} - \bar{y} ) ( x_{i} - \bar{x} ) }{\sum ( x_{i} - \bar{x} )^{2}} \]
\[ \hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1} \bar{x} \]
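These two formulas translate directly into code. A minimal sketch with NumPy, using an assumed toy dataset:

```python
import numpy as np

# Toy dataset (assumed for illustration, not from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: covariance-like numerator over the variance-like denominator.
beta1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: forces the fitted line through the point (x_bar, y_bar).
beta0_hat = y_bar - beta1_hat * x_bar
```

With the estimates in hand, a prediction for a new input `x_new` is just `beta0_hat + beta1_hat * x_new`.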
In reality, we need a catch-all error term ( \(\epsilon\) ) for whatever our model misses:
\[ Y = \beta_{0} + \beta_{1} X + \epsilon \]
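One way to see the role of \(\epsilon\) is to simulate data from this model and check that the least squares estimates recover the true weights. The true weights and noise level below are assumptions chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" weights and noise scale for the simulation.
beta0_true, beta1_true = 2.0, 0.5
n = 500
x = rng.uniform(0, 10, size=n)
eps = rng.normal(0, 1.0, size=n)   # the catch-all error term

# Generate responses from Y = beta0 + beta1 * X + eps.
y = beta0_true + beta1_true * x + eps

# Least squares estimates via the closed-form formulas.
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((y - y_bar) * (x - x_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
```

Because of the noise, the estimates will not match the true weights exactly, but with enough data they land close, which is exactly the distinction between the model's parameters and our estimates of them.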
How well does a model fit its data?
The following statistics describe this:
Residual analysis helps tell us whether a model fits its data: