Assumptions of the Classical Linear Regression Model (CLRM), Based on Secondary Data

zakir abbas
Jul 9, 2022 · 5 min read


A linear regression model is predicated on four assumptions:

1. Linearity: there is a linear relationship between X and the mean of Y.

2. Homoscedasticity: the variance of the residual is the same for all values of X.

3. Independence: observations are independent of one another.

4. Normality: Y is normally distributed for any fixed value of X.

Detection, Illness, and Removals of Multicollinearity

2nd assumption of CLRM: none of the independent variables has a linear relationship with any other independent variable.

Above is the 2nd assumption of the Classical Linear Regression Model (CLRM). We detect violations by taking each independent variable in turn, treating it as the dependent variable, and regressing it on the remaining independent variables to obtain the coefficient of determination R² of that auxiliary regression. If R² is zero, the variance inflation factor equals 1. Formula: VIF = 1 / (1 − R²). Therefore, any value greater than 1 indicates some degree of multicollinearity; some statisticians say a VIF greater than 5 shows severity, while others treat a VIF greater than or equal to 10 as severe. In this report we will follow VIF ≥ 10 as the threshold for severity. The illness of the model is the “inefficiency of the coefficients”: their standard errors become inflated.
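As a quick illustration, here is how this VIF check can be run in Python with statsmodels; the data and variable names (fdi, inflation, trade) are made up for the sketch, not taken from the report:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 100
fdi = rng.normal(size=n)
inflation = 0.9 * fdi + rng.normal(scale=0.3, size=n)  # deliberately collinear
trade = rng.normal(size=n)
X = pd.DataFrame({"fdi": fdi, "inflation": inflation, "trade": trade})

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the others.
X_const = sm.add_constant(X)
for j, col in enumerate(X.columns, start=1):  # skip the constant at index 0
    vif = variance_inflation_factor(X_const.values, j)
    print(f"VIF({col}) = {vif:.2f}")  # VIF >= 10 is treated as severe here
```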

Removals of Multicollinearity

The following methods are used to deal with multicollinearity:

1. If the variance inflation factor (VIF) is less than 10, leave the model alone.

2. If VIF is greater than or equal to 10, then use one of the following methods:

A. Exclude the variable (however, only a control variable can be excluded).

B. Change the measure (e.g., use the FDI growth rate instead of FDI in dollar terms).

C. Increase the sample size.

Detection, Illness, and Removals of Autocorrelation

4th assumption of CLRM: Error-term observations are independent of each other, i.e., they are not correlated with each other.

Above is the 4th assumption of the Classical Linear Regression Model (CLRM): if the error term is correlated with its previous values, the problem of autocorrelation exists. An error is a mistake, and mistakes should be random; no one commits mistakes deliberately or intentionally. Suppose Z was an important variable, but due to a weak literature review or a failure to read and understand the theory, the researcher forgot to include Z in the model; as a result of this omission, Z becomes part of the error term. When the error term shows runs of the same sign (+ve, +ve, +ve … or −ve, −ve, −ve …), we call it positive autocorrelation; when the signs alternate consistently (+ve, −ve, +ve, −ve …), we call it negative autocorrelation. So we have to look at whether such patterns appear in the error terms.

Issues in significance

We use the Durbin-Watson (DW) test to detect the autocorrelation problem. The range of DW values is 0 to 4; if the DW value is 2, then there is no autocorrelation.

If DW is less than 2, then positive autocorrelation exists.

If DW is greater than 2, then negative autocorrelation exists.
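A minimal sketch of the DW check in Python with statsmodels, using simulated data with AR(1) errors so that positive autocorrelation shows up by construction:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
# AR(1) errors induce positive autocorrelation on purpose.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

res = sm.OLS(y, sm.add_constant(x)).fit()
dw = durbin_watson(res.resid)
print(f"DW = {dw:.2f}")  # ~2: none, < 2: positive, > 2: negative autocorrelation
```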

Severity test

We use the serial correlation LM test (the Breusch-Godfrey test) to check the severity of the problem.

The hypothesis is H0: there is no autocorrelation. If the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no autocorrelation.

If the p-value is less than 0.05, we reject the null hypothesis: there is autocorrelation.
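Continuing the same sketch, the serial correlation LM test (Breusch-Godfrey in statsmodels) can be applied to the fitted result `res` from the DW example above:

```python
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# `res` is the fitted OLS result from the Durbin-Watson sketch.
lm_stat, lm_pval, f_stat, f_pval = acorr_breusch_godfrey(res, nlags=2)
print(f"LM test p-value = {lm_pval:.4f}")
# H0: no autocorrelation. p > 0.05 -> fail to reject H0; p < 0.05 -> reject H0.
```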

Detection, Illness, and Removals of Heteroscedasticity

6th assumption of CLRM: Error term has constant variance.

Above is the 6th assumption of the Classical Linear Regression Model (CLRM); according to it, the error variance should be constant across different groups. In this example we have taken data on 31 countries, consisting of low-income, medium-income, and high-income countries. It is cross-sectional data, i.e., the data pertain to the year 1992 and to various entities (31 countries) with their respective GDP and consumption figures. The illness of this model is that when the variance is not constant (i.e., heteroskedasticity exists), significance becomes doubtful.
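As an illustration, here is one standard heteroscedasticity check, the Breusch-Pagan test, run on a simulated 31-country cross-section; the numbers are invented, not the article's 1992 data:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
n = 31  # 31 countries, cross-sectional
gdp = rng.uniform(1, 100, size=n)
# Error variance grows with GDP: heteroscedasticity by construction.
consumption = 5 + 0.8 * gdp + rng.normal(scale=0.1 * gdp)

X = sm.add_constant(gdp)
res = sm.OLS(consumption, X).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan p-value = {lm_pval:.4f}")
# H0: constant variance (homoscedasticity). p < 0.05 -> heteroscedasticity.
```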

Understand the Concept of Functional Form of Regression

As we know, OLS’s first assumption, that the “model should be linear in parameters,” applies to the parameters, not to the variables. For example, Y = α + βX + e is linear in both variables and parameters, while Y = α + β₁X + β₂X² + e is non-linear in the variable X; since it is non-linear in a variable but still linear in parameters, OLS can still estimate it. However, if the true relationship is non-linear, then we need to apply a non-linear functional form, like TP = 6L − 0.4L², where TP is the total productivity of labor and L is the number of laborers used in production.
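A minimal sketch showing that OLS handles this quadratic form because it is linear in parameters; the data are simulated around TP = 6L − 0.4L²:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
L = np.linspace(1, 10, 50)  # number of laborers used in production
TP = 6 * L - 0.4 * L**2 + rng.normal(scale=1.0, size=L.size)

# Regressors are 1, L, and L^2: non-linear in L, linear in parameters.
X = sm.add_constant(np.column_stack([L, L**2]))
res = sm.OLS(TP, X).fit()
print(res.params)  # estimates should land near [0, 6, -0.4]
```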

Various forms of non-linear Forms of Regression

How do we know what is the functional form of the regression equation?

Firstly, theory guides us toward the non-linear form; for example, the Laffer curve says that the relationship between tax revenue and the tax rate is non-linear.

Secondly, there is a test known as the Ramsey RESET test: if we add the square of the estimated “y” (the fitted values) as an additional independent variable and it shows a significant value, then we have to apply a non-linear form of the regression model.
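A minimal sketch of the RESET test using statsmodels’ linear_reset, reusing the simulated TP and L from the previous sketch; fitting only a straight line should make the test flag the omitted curvature:

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

# Deliberately misspecified: a straight line fit to quadratic data.
lin_res = sm.OLS(TP, sm.add_constant(L)).fit()
reset = linear_reset(lin_res, power=2, use_f=True)  # adds yhat^2 as a regressor
print(f"RESET p-value = {reset.pvalue:.4f}")
# A significant p-value suggests the linear functional form is inadequate.
```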

STATIONARY SERIES AND UNIT ROOT TEST

The concept of stationary and non-stationary series includes the ADF test (Augmented Dickey-Fuller test). The typical characteristics of a stationary series are a constant mean over time, a constant variance over time, and an autocovariance that depends only on the lag between observations, not on time itself.
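As an illustration, the ADF test in Python with statsmodels, applied to a simulated stationary series (white noise) and a simulated non-stationary one (a random walk):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
white_noise = rng.normal(size=300)             # stationary by construction
random_walk = np.cumsum(rng.normal(size=300))  # unit root, non-stationary

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    stat, pval = adfuller(series)[:2]
    # H0: the series has a unit root (non-stationary). p < 0.05 -> stationary.
    print(f"{name}: ADF p-value = {pval:.4f}")
```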

Autoregressive Distributed lag model (ARDL) and Granger Causality with EViews

There are two techniques through which we understand the relationship between or among the variables namely,

  1. Interdependence techniques (like Covariance & Correlation)
  2. Dependence techniques (like regression, in which we know which variable depends on which; causation is also a dependence technique). If we regress a variable on its lagged values, the effect that arises from those lag values is considered causation, i.e., in such a case the dependent variable is an effect of some cause. If we have a model in which the dependent variable is regressed on its own lagged values (the “independent” variables are the same series, just lagged), we call it autoregressive. The Autoregressive Distributed Lag model (ARDL) is used quite often by researchers to understand causality and its effect on the dependent variable. The test for causality in EViews is the Granger causality test, sketched below.
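A minimal sketch of a Granger causality test in Python with statsmodels (the article runs it in EViews); x is simulated to lead y by one period, so x should Granger-cause y:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(5)
n = 200
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.6 * x[t - 1] + rng.normal(scale=0.5)  # y depends on lagged x

# Tests whether the second column (x) Granger-causes the first (y).
data = np.column_stack([y, x])
results = grangercausalitytests(data, maxlag=2)
pval = results[1][0]["ssr_ftest"][1]  # p-value of the F-test at lag 1
# H0: x does not Granger-cause y; a small p-value rejects H0.
```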

Understanding the Dummy Variables

In this example, data are taken from France: the daily average salary, area-wise and gender-wise, at different age levels. We have 222 observations, where Salary is the dependent variable and Area and Age are independent variables. According to these data there are two scenarios (see the sketch after this list),

  1. How much do age and area change salary, irrespective of gender?
  2. How much do age and area impact salary when gender (male/female) is considered as a dummy variable?
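A minimal sketch of both scenarios in Python with statsmodels’ formula API; the DataFrame and its column names (salary, age, area, gender) are hypothetical stand-ins for the 222-observation France dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 222
df = pd.DataFrame({
    "age": rng.integers(20, 60, size=n),
    "area": rng.choice(["north", "south", "east"], size=n),
    "gender": rng.choice(["male", "female"], size=n),
})
df["salary"] = (
    100 + 2 * df["age"]
    + 15 * (df["gender"] == "male")  # built-in gender gap, for illustration
    + rng.normal(scale=10, size=n)
)

# Scenario 1: ignore gender. Scenario 2: add gender as a dummy (C() encodes it).
m1 = smf.ols("salary ~ age + C(area)", data=df).fit()
m2 = smf.ols("salary ~ age + C(area) + C(gender)", data=df).fit()
print(m2.params)  # the C(gender) coefficient is the male/female salary shift
```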

Use of Log in Regression Model

Basically, the mathematical log function is used for two reasons: to linearize a non-linear (multiplicative) relationship so that coefficients can be read as elasticities or percentage changes, and to compress the scale of a skewed variable, which also helps stabilize its variance.
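As an illustration of the first reason, a log-log sketch on simulated data, where the slope is read directly as an elasticity:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
gdp = rng.uniform(10, 1000, size=100)
consumption = 3.0 * gdp**0.8 * np.exp(rng.normal(scale=0.1, size=100))

# log(C) = log(3) + 0.8*log(GDP) + e  ->  linear in parameters after the log.
res = sm.OLS(np.log(consumption), sm.add_constant(np.log(gdp))).fit()
print(res.params)  # slope ~ 0.8: a 1% rise in GDP -> ~0.8% rise in consumption
```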

Originally published at https://www.ssaazs.com.


zakir abbas

I am an ex-professor of Bahria University Karachi. My specialties are Finance, Portfolio Management, Investment Banking, Commercial Banking, and Economics.