What is a regression in machine learning?

zakir abbas
14 min read · Mar 6, 2022

Regression in machine learning

In the discipline of machine learning, regression analysis is a key concept underlying the Classical Linear Regression Model (CLRM). It is classified as supervised learning because the algorithm is trained on both input features and output labels. By estimating how one variable influences another, regression helps establish the relationship between the variables.

Assume you’re in the market for a car and have decided that gas mileage will be a deciding factor in your purchase. How would you go about predicting the miles per gallon of some prospective rides? Because you know each car’s many characteristics (weight, horsepower, displacement, and so on), regression is a viable option. You can use regression techniques to identify the relationship between MPG and the input data by plotting the average MPG of each automobile given its features. The regression function might be written as $Y = f(X)$, with $Y$ representing the MPG and $X$ the input features such as weight, displacement, and horsepower. The goal is to estimate the function $f$; the fitted curve then lets you predict the MPG of a prospective car and decide whether it is worth buying. This technique is called regression.
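To make this concrete, here is a minimal sketch in R (the language used throughout this article) fitting such a regression on the built-in mtcars dataset; the model name mod is an assumption of this sketch and is reused in later examples.

```r
# Fit a linear regression of fuel economy (mpg) on weight,
# horsepower and displacement using the built-in mtcars data.
data(mtcars)
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Estimated coefficients and overall fit
summary(mod)

# Predicted MPG for a hypothetical car: 3,000 lb, 150 hp, 200 cu in
predict(mod, newdata = data.frame(wt = 3.0, hp = 150, disp = 200))
```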

What is R?

R is a statistical computing and graphics language and environment. It is a GNU project that is similar to the S language and environment established by John Chambers and colleagues at Bell Laboratories (previously AT&T, now Lucent Technologies). R can be thought of as a more advanced version of S. Although there are some significant differences, much of the code built for S works in R without modification.

R is highly extendable and offers a wide range of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical tools. The S programming language is frequently used for statistical methods research, while R provides an Open-Source option for getting involved.

One of R’s advantages is how simple it is to create well-designed publication-quality graphs, complete with mathematical symbols and calculations when needed. The defaults for small design choices in visuals have been carefully chosen, but the user retains complete control.

R is accessible in source code form as Free Software under the provisions of the Free Software Foundation’s GNU General Public License. It compiles and operates on a wide range of UNIX and related systems (including FreeBSD and Linux), as well as Windows and MacOS.

Only half of the job is done once a linear regression model has been built. To be used in practice, the model must conform to the assumptions of linear regression, and there are ten assumptions on which a linear regression model is based. These ten assumptions are:

  1. The regression model is linear in parameters
  2. The mean of residuals is zero
  3. Homoscedasticity of residuals or equal variance
  4. No autocorrelation of residuals
  5. The X variables and residuals are uncorrelated
  6. The number of observations must be greater than the number of Xs
  7. The variability in X values is positive
  8. The regression model is correctly specified
  9. No perfect multicollinearity
  10. Normality of residuals

Assumption 1

The regression model is linear in parameters

According to Assumption 1, the dependent variable must be a linear combination of the explanatory variables and the error term. Assumption 1 requires that the specified model be linear in its parameters, though not necessarily in its variables. Equations 1 and 2 show a model that is linear in terms of both parameters and variables. It is worth noting that Equations 1 and 2 depict the same model in slightly different notation.
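The equations referenced here are not reproduced in the text; as a representative form only (an assumption about their content), a model that is linear in both parameters and variables can be written as

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad \text{or equivalently} \qquad \mathbb{E}[Y_i \mid X_i] = \beta_0 + \beta_1 X_i,$$

the two displays corresponding to the error-term and conditional-expectation notations.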

For OLS to work, the specified model has to be linear in its parameters. Note that if the model is nonlinear in the parameters, it is not possible to estimate the coefficients in any meaningful way. Equation 3 shows an empirical model of exactly this kind, one that is quadratic in a parameter.

The CLRM’s basic assumption is that the model must be linear in its parameters, and OLS is unable to estimate Equation 3 in any useful way. However, Assumption 1 does not require the model to be linear in its variables: in Equation 4, OLS will still produce a meaningful estimate of $\beta_1$.
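Again as representative forms (assumed, since the original Equations 3 and 4 are not shown here):

$$\text{Equation 3 (nonlinear in a parameter, not estimable by OLS):}\quad Y_i = \beta_0 + \beta_1^2 X_i + \varepsilon_i$$

$$\text{Equation 4 (nonlinear in the variable, linear in the parameters):}\quad Y_i = \beta_0 + \beta_1 X_i^2 + \varepsilon_i$$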

The method of ordinary least squares (OLS) lets us estimate models that are linear in their parameters, even if the variables enter nonlinearly. Conversely, models that are nonlinear in their parameters cannot be estimated with OLS, even if the variables enter linearly.

Finally, every OLS model should include all relevant explanatory variables, and every explanatory variable included in the model should be relevant. Leaving out relevant variables causes omitted-variable bias, which is a very important problem in regression analysis.

Assumption 2

The mean of residuals is zero

When it comes to verifying a regression model, residual analysis is crucial. The ith residual is the difference between the observed value of the dependent variable, yi, and the value predicted by the estimated regression equation, ŷi. These residuals, computed from the available data, are treated as estimates of the model error, ε, and statisticians use them to validate the assumptions concerning ε. So let’s check the mean of the residuals: if it is zero (or very close to zero), this assumption holds for that model. It holds by default for any model with an intercept; it can fail only if you explicitly change the specification, for example by forcing the intercept term to zero.

We tested this assumption with our data in R and found that the mean of the residuals is close to zero, indicating that the assumption holds for our model.
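A minimal check in R, assuming the mtcars model mod sketched earlier:

```r
# The mean of the residuals should be numerically zero
# (up to floating-point error) for any model with an intercept.
mean(residuals(mod))
```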

Assumption 3

Homoscedasticity of residuals or equal variance

This assumption states that the variance of the error terms is similar across the values of the independent variables. A plot of standardized residuals versus predicted values can show whether points are equally distributed across all values of the independent variables. (Multiple linear regression requires at least two independent variables, which may be nominal, ordinal, or interval/ratio level variables.)

Once the regression model is built, we arrange four plots in a 2 × 2 grid using par(mfrow = c(2, 2)) in R and then plot the model with plot(lm.mod). This produces four diagnostic plots; the top-left and bottom-left plots show how the residuals vary as the fitted values increase.
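In code, this step might look as follows (a sketch; lm.mod is assumed here to be the same mtcars model used earlier):

```r
lm.mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Arrange the four standard diagnostic plots in a 2 x 2 grid:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage.
par(mfrow = c(2, 2))
plot(lm.mod)
```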

The first graph shows residuals against fitted values. This plot can be used to verify the linearity and homoscedasticity assumptions. If the model did not meet the linear-model assumption, the residuals would take on a definite shape or a recognizable pattern; it is a bad sign if your plot resembles a parabola, for example. Your residual scatterplot should resemble the night sky, with no discernible patterns. If the linearity assumption is met, the red line running through your scatterplot should be straight and horizontal, not curved. To check the homoscedasticity assumption, we see whether the residuals are evenly distributed around the y = 0 line.

What did we come up with? Three data points with high residuals were automatically flagged by R (the Toyota Corolla, Fiat, and Pontiac Firebird observations). Aside from that, our residuals appear to be non-linear, as the curved red line running through them demonstrates (they are roughly quadratic and resemble the shape of a parabola). Our data also appear to be heteroscedastic, as they are not uniformly distributed around the y = 0 line.

The residuals are used to test the normality assumption, which can be done with a QQ-plot by comparing the residuals to “ideal” normal data along the 45-degree line.

What did we come up with? The same three data points with high residuals were automatically flagged by R (the Toyota Corolla, Fiat, and Pontiac Firebird observations). Aside from those three points, the observations in the QQ-plot do not lie well along the 45-degree line, implying a departure from normality.

The third plot is a scale-location plot (the square root of the standardized residuals versus the fitted values). This is useful for checking the assumption of homoscedasticity. In this plot we are checking to see whether there is a pattern in the residuals. If the red line on your plot is flat and horizontal with equally and randomly spread data points (like the night sky), you’re good. If your red line has a positive slope, or if your data points are not randomly spread out, you’ve violated this assumption.

The fourth plot helps us find influential cases, if any are present in the data. Note that outliers may or may not be influential points; influential outliers are of the greatest concern, because the results may change depending on whether they are included in or excluded from the analysis. If you have no influential cases, you will hardly notice the dashed red curve, if at all (Cook’s distance is represented by the red dashed curved line). You’re fine if no red Cook’s distance line is visible, or if one is just barely visible in the corner of your plot but none of your data points fall beyond it. If some of your data points cross the distance line, you have influential data points. In our case there is a clear pattern in the plots, so heteroscedasticity exists. Now let’s have a look at a different model.

The points now appear to be random, and the line appears to be flat, with no upward or downward trend. As a result, the homoscedasticity requirement can be accepted.

Assumption 4

No autocorrelation of residuals

This is especially applicable to time-series data. Autocorrelation is the correlation of a time series with lags of itself. When the residuals are autocorrelated, it means that the current value depends on the previous (historic) values and that there is a definite unexplained pattern in the Y variable that shows up in the disturbances.

Below are three ways you can check for autocorrelation of residuals.

Using acf plot

The X-axis corresponds to the lags of the residuals, increasing in steps of 1. The very first line (to the left) shows the correlation of the residuals with themselves (lag 0), so it is always equal to 1.

If the residuals were not autocorrelated, the correlation (Y-axis) from the immediate next line onwards will drop to a near-zero value below the dashed blue line (significance level). Clearly, this is not the case here. So we can conclude that the residuals are autocorrelated.
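In R the check is a one-liner; since the article does not show the underlying time-series model, the sketch below assumes a model named lmMod fit on the economics data that ships with ggplot2:

```r
# Fit a simple time-series regression (assumed example) and plot the
# autocorrelation function of its residuals. Spikes above the dashed
# significance band beyond lag 0 indicate autocorrelated residuals.
lmMod <- lm(pce ~ pop, data = ggplot2::economics)
acf(lmMod$residuals)
```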

Using runs test

With a p-value < 2.2e-16, we reject the null hypothesis that it is random. This means there is a definite pattern in the residuals.
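One possible implementation (an assumption; the article does not name the package it used) is runs.test() from the lawstat package:

```r
# Runs test for randomness of the residuals.
# A very small p-value rejects the null hypothesis of randomness,
# i.e. there is a pattern in the residuals.
library(lawstat)
runs.test(lmMod$residuals)
```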

Using the Durbin-Watson test.

The Durbin-Watson test examines whether the errors are autocorrelated with themselves. The null hypothesis states that they are not autocorrelated (which is what we want). This test is especially useful in multiple (time-series) regression; for example, it can tell you whether the residuals at time point 1 are correlated with the residuals at time point 2 (they shouldn’t be). In other words, it verifies that we haven’t violated the independence assumption. So the Durbin-Watson test also confirms our finding.
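A standard implementation is dwtest() from the lmtest package, applied here to the assumed lmMod from above:

```r
# Durbin-Watson test: the null hypothesis is that there is
# no first-order autocorrelation in the residuals.
library(lmtest)
dwtest(lmMod)
```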

To rectify this, add lag 1 of the residuals as an X variable to the original model. The slide function in the DataCombine package can be used to accomplish this.
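A sketch of this step, continuing with the assumed economics-based lmMod (the variable names are illustrative):

```r
library(DataCombine)

# Attach the residuals to the data and create their first lag.
econ_data  <- data.frame(ggplot2::economics, resid_mod1 = lmMod$residuals)
econ_data1 <- slide(econ_data, Var = "resid_mod1", NewVar = "lag1", slideBy = -1)
econ_data2 <- na.omit(econ_data1)

# Refit the model with the lagged residual as an extra predictor.
lmMod2 <- lm(pce ~ pop + lag1, data = econ_data2)
```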

Let’s see if this strategy is able to solve the problem of residual autocorrelation.

The correlation values drop below the dashed blue line from lag 1 onwards, unlike in the acf plot of the original model lmMod. As a result, there is no evidence that the residuals are autocorrelated.

The p-value is 0.3362, so we cannot reject the null hypothesis that the residuals are random. We can therefore be confident that the residuals are not autocorrelated.

Likewise, because of the high p-value of 0.667, we cannot reject the null hypothesis that the true autocorrelation is zero. As a result, this model meets the condition that the residuals should not be autocorrelated.

If the assumption of autocorrelation of residuals is not satisfied after adding lag1 as an X variable, you might want to try adding lag2 or be creative in creating relevant derived explanatory variables or interaction terms. This is more like a work of art than a computer algorithm.

Assumption 5

The X variables and residuals are uncorrelated

How to check?

Do a correlation test on the X variable and the residuals.

Regression captures the linear trend of the data; the residuals are the randomness that is “left over” after fitting the model. Because the linear trend has been removed by the regression, the correlation between the explanatory variable(s) and the residuals is zero. When plotting residuals against the explanatory variables, however, you may notice patterns; such patterns indicate that there is more going on than a straight line, such as curvature.
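A minimal version of this check, using one predictor from the assumed mtcars model:

```r
# Correlation test between a predictor and the residuals.
# The null hypothesis is that the true correlation is zero.
cor.test(mtcars$wt, residuals(mod))
```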

Because the p-value is so high, the null hypothesis that the true correlation is 0 cannot be ruled out. As a result, the assumption is correct for this model.

Assumption 6

The number of observations must be greater than the number of Xs

This is straightforward to verify: simply compare the number of observations in the data with the number of X variables in the model.
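For instance, with the assumed mtcars model:

```r
# The number of observations should comfortably exceed
# the number of predictors in the model.
nrow(mtcars)            # 32 observations
length(coef(mod)) - 1   # 3 predictors (excluding the intercept)
```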

Assumption 7

The variability in X values is positive

A statistical measurement of the dispersion between values in a data set is known as variance. Variance expresses how far each number in the set deviates from the mean, and thus from every other number in the set; it is frequently represented by the symbol σ². Analysts and traders use it to gauge market volatility and security. For regression, this assumption implies that the X values in a sample cannot all be the same (or even nearly the same).

How to check?
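A quick check on the assumed mtcars predictors:

```r
# The variance of each predictor must be strictly positive,
# i.e. the X values cannot all be (nearly) identical.
apply(mtcars[, c("wt", "hp", "disp")], 2, var)
```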

The variance in the X variable above is much larger than 0. So, this assumption is satisfied.

Assumption 8

The regression model is correctly specified

If the regression equation contains all of the required predictors, including any necessary transformations and interaction terms, the regression model is correctly specified. That is, the model has no missing, redundant, or superfluous predictors. Of course, this is the best-case scenario, and it is what we hope for. It means, for example, that if the Y and X variables have an inverse relationship, the model equation should be specified accordingly, for instance as $Y = \beta_1 + \beta_2 (1/X)$ rather than as a model linear in $X$.

Assumption 9

No perfect multicollinearity

The variance inflation factor (VIF) is a method that can be used to determine the degree of multicollinearity in a dataset. Multiple regression is used when someone wants to assess the effect of several variables on a particular outcome; the dependent variable is the outcome that the independent variables, the model’s inputs, act on. Multicollinearity exists when one or more of these independent variables have a linear relationship or correlation with one another. This assumption requires that there be no perfect linear relationship between the explanatory variables. If the variance inflation factor of a predictor variable were 5.27 (√5.27 ≈ 2.3), this would mean that the standard error of that predictor’s coefficient is 2.3 times larger than it would be if that predictor had zero correlation with the other predictor variables.

How to check?

Using the Variance Inflation Factor (VIF). But what is VIF?

VIF is a metric calculated for each X variable in a linear model. If a variable’s VIF is high, it means that the information in that variable is already largely explained by the other X variables in the model, i.e. the variable is redundant; so the lower the VIF, the better. VIF is calculated as VIF = 1 / (1 − R²), where R² is the R-squared of the model with the given X as the response and all the other Xs as predictors.
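One convenient way to compute VIFs (an option, not necessarily what the article used) is vif() from the car package, applied to the assumed mtcars model:

```r
# Variance inflation factor for each predictor; by the convention
# described below, values above 4 indicate problematic multicollinearity.
library(car)
vif(mod)
```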

How to rectify this?

  1. Either iteratively remove the X var with the highest VIF or,
  2. See the correlation between all variables and keep only one of all highly correlated pairs.

By convention, the VIF should not exceed 4 for any of the X variables. That is, we do not allow the R² of any X (from the model built with that X as the response variable and the remaining Xs as predictors) to exceed 75%, since 1/(1 − 0.75) = 1/0.25 = 4.

Assumption 10

Normality of residuals

The functions qqnorm and qqplot in R can be used to make Q-Q plots. The qqnorm command generates a normal Q-Q plot: given a vector of data, R plots the data in sorted order against quantiles from a standard normal distribution. Consider the trees data set that ships with R. It gives measurements of the girth, height, and volume of 31 felled black cherry trees; Height is one of the variables. Can we assume that our sample of heights is drawn from a normally distributed population? More to the point for this assumption, the residuals should be normally distributed, and if the estimates are computed using the maximum likelihood method (rather than OLS), the Y and Xs should also be normally distributed.
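For example (a sketch using the built-in trees data and the residuals of the assumed mtcars model):

```r
# Normal Q-Q plot for a raw variable...
data(trees)
qqnorm(trees$Height)
qqline(trees$Height)

# ...and for the residuals of the regression model.
qqnorm(residuals(mod))
qqline(residuals(mod))
```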

This can be checked visually using the qqnorm() plot in the top-right panel of the diagnostic plots. If all the points fall exactly on the line, the residuals are perfectly normally distributed. Some departure is to be expected, especially near the ends, but the deviations should be small.

Check Assumptions Automatically

gvlma stands for Global Validation of Linear Models Assumptions. Linear regression analysis rests on many assumptions, and if we disregard them and they are not met, we cannot accept the regression results. Fortunately, R offers packages that take care of much of the heavy lifting for us, and a single function can be used to test the assumptions of our linear regression. Fit a basic regression model first; then the gvlma() function from the gvlma package can be used to verify the key assumptions of the linear model.
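A sketch of that workflow; the article later refers to a 50-observation dataset with outliers at rows 23, 35, and 49, which matches the built-in cars data, so that dataset is assumed here:

```r
library(gvlma)

# Fit a basic regression model on the built-in cars data
# (50 observations of speed and stopping distance).
mod_cars <- lm(dist ~ speed, data = cars)

# Validate the key assumptions of the linear model in one call.
gvlma(mod_cars)
```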

Three of the assumptions are not satisfied. This is most likely because the dataset contains only 50 data points, and even two or three outliers can degrade the model’s quality. The most direct remedy is therefore to eliminate the outliers and rebuild the model. To reach your own conclusion, look at the diagnostic plots, which flag data points 23, 35, and 49 as outliers. Let’s take them out of the data and rebuild the model from scratch.
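Continuing the sketch above, the flagged rows can be dropped and the model refit:

```r
# Remove the flagged outliers (rows 23, 35 and 49) and refit.
mod_cars2 <- lm(dist ~ speed, data = cars[-c(23, 35, 49), ])
gvlma(mod_cars2)
```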

Despite the fact that the changes appear small, the results are getting closer to meeting the assumptions. There is still one more plot to explain: the one in the lower-right corner, a graph of standardized residuals versus leverage. Leverage measures the amount of influence each data point has on the regression. The plot also depicts Cook’s distance values, which show how much the fitted values would change if a point were removed.

The regression can be dramatically distorted if a point far from the centroid has a large residual. For a reasonable regression model, the red smoothed line should stay close to the mid-line, and no point should have a large Cook’s distance (i.e., no point should have too much influence on the model).

RPubs is a web-based publishing platform for R Markdown documents. If you come across an interesting article there, you may wish to copy the script (or the output) and attempt to reproduce it on your own computer without retyping it manually. An HTML document depicting all the assumptions of the Classical Linear Regression Model in machine learning, as discussed here, is published on RPubs.

Originally published at https://www.ssaazs.com.


zakir abbas

I am a former professor at Bahria University Karachi. My specialties are Finance, Portfolio Management, Investment Banking, Commercial Banking, and Economics.