This article discusses some problems that come up when implementing multivariate linear regression. These problems are:

  1. What to do about the collinearity phenomenon in linear regression?
  2. How to extract feature importance from a linear regression: is it just the coefficients of the model?

Collinearity

What is collinearity

Collinearity is a condition in which some of the independent variables are highly correlated with each other. Collinearity occurs more often when the number of predictors is very large, since there is then a higher chance that some predictors are correlated with others. Collinearity brings a few typical problems.

Problems of collinearity

  1. Large variance of the coefficients. The coefficients become very sensitive to small changes in the data, and their p-values may fail to reach significance.
  2. The credibility of the estimated coefficients is compromised. The statistical power of the regression model is weakened, so p-values that indicate which variables are statistically significant may NOT be trustworthy.

However, there are some aspects that collinearity does not affect, such as the predictive performance of the model, so in some situations there is no need to deal with it at all.

When to Deal with Collinearity?

  1. When the collinearity effect is so severe that it compromises the model’s ability to generalize.
  2. Multicollinearity affects only the specific independent variables that are correlated. Whether to act therefore depends on which variables are of interest: if the important variables are not affected, there is no need to deal with collinearity.
  3. As mentioned before, the model’s predictions, the precision of those predictions, and the goodness-of-fit statistics are not affected. So if those are the only concerns, there is no need to reduce even severe multicollinearity.

Variance Inflation Factor (VIF)

In linear models, the VIF is a quantity computed for each predictor variable that quantifies the severity of multicollinearity.

Definition

In statistics, the variance inflation factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone.

Put simply, the VIF measures how well the other features can linearly represent feature $i$. This is clearer in mathematical form.

Mathematical Definition

The VIF for predictor $X_i$ is defined as $$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$ where $R_i^2$ is the coefficient of determination of the regression of $X_i$ on the other covariates. The model from which $R_i^2$ is calculated looks like this: $$ X_i = \beta_0 + \beta_1X_1 + \dots + \beta_{i-1}X_{i-1} + \beta_{i+1}X_{i+1} + \dots + \beta_mX_m + \epsilon $$

The coefficient of determination of a regression represents how much of the variance in the original data is explained by the model and can be calculated as $$ R^2 = 1 - \frac{\sum_i(y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} $$
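As a quick sanity check, the VIF of a predictor can be computed directly from this definition by regressing it on the remaining predictors. Below is a minimal R sketch with made-up predictors (x1, x2, x3 are illustrative and not part of the later examples):

# Made-up predictors for illustration: x1 is partly explained by x2 and x3
set.seed(1)
x2 <- rnorm(100)
x3 <- rnorm(100)
x1 <- 0.8 * x2 + 0.3 * x3 + 0.5 * rnorm(100)

# Regress x1 on the other predictors and plug its R^2 into the definition
r2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_1 <- 1 / (1 - r2_1)
vif_1   # should match car::vif() for x1 in a full model containing x1, x2, x3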

Observation

NOTE: for the detailed derivations, refer to references 2 and 3.

Consider a linear model with $m$ predictors $$ Y = X\beta + \epsilon $$ Using OLS to estimate the coefficients $\beta$, the normal equation gives $$ \hat \beta = (X'X)^{-1}X'Y $$ The covariance matrix of $\hat \beta$ is then $$ \begin{align} \Sigma(\hat \beta) &= E[(\hat \beta - \beta)(\hat \beta - \beta)'] \\ &= (X'X)^{-1}X' E[\epsilon \epsilon'] X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1} \end{align} $$ with $\Sigma_{ij} = \text{cov}(\hat \beta_i, \hat \beta_j)$ and $\Sigma_{ii} = \text{var}(\hat \beta_i)$.

Thus, $$\hat{\text{var}}(\hat \beta_i) = \sigma^2 [(X'X)^{-1}]_{ii}$$ Working out the $(i,i)$ entry of the matrix (writing $X_{-i}$ for the design matrix with column $i$ removed), we get $$ \begin{align} \hat{\text{var}}(\hat \beta_i) &= \sigma^2[X_i'X_i - X_i'X_{-i}(X_{-i}'X_{-i})^{-1}X_{-i}'X_i]^{-1} \\ &= \sigma^2 \frac{1}{\text{RSS}_i} \\ &= \frac{\sigma^2}{(n-1)\hat{\text{var}}(X_i)} \cdot \frac{1}{1 - R_i^2} \end{align} $$ where $\text{RSS}_i$ is the residual sum of squares from regressing $X_i$ on the other predictors. The last step uses the facts that $R_i^2 = 1 - \frac{\text{RSS}_i}{\text{TSS}_i}$ and $\hat{\text{var}}(X_i) = \frac{\text{TSS}_i}{n-1}$.

This reflects the fact that the variance of a coefficient comes from several parts:

  • $\sigma^2$: the noise in the data; the more scattered the data, the larger the variance;
  • $\frac{1}{n-1}$: the size of the data; the smaller the sample, the larger the variance;
  • $\hat{\text{var}}(X_i)$: the variance of a predictor has an inverse influence on the variance of the corresponding coefficient; the larger the variance of the predictor, the smaller the variance of the coefficient;
  • VIF: the larger the VIF, the larger the variance.

VIF measures the part of the variance that is generated by linear correlation with the other predictors.
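This decomposition can be checked numerically. Here is a hedged sketch that reuses the made-up x1, x2, x3 from the earlier sketch, adds an arbitrary response, and compares the coefficient variance reported by lm() with the product above:

# Fit a model on the made-up predictors with an arbitrary response
set.seed(2)
y   <- 1 + 2 * x1 - x2 + rnorm(100)
fit <- lm(y ~ x1 + x2 + x3)
n   <- length(y)

# VIF of x1 and the estimated noise variance sigma^2
vif_1  <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
sigma2 <- sigma(fit)^2

vcov(fit)["x1", "x1"]                   # var(beta_1) as reported by lm
sigma2 / ((n - 1) * var(x1)) * vif_1    # sigma^2 / ((n-1) * var(x1)) * VIF_1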

Interpretation

VIF, as mentioned before, has several interpretations:

  1. The quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
  2. A measure of how well the other features can linearly represent feature $i$.
  3. The part of the variance of coefficient $\hat \beta_i$ that is generated by linear correlation with the other predictors.

And the rule of thumb for categorizing VIF is as follows:

  1. $\text{VIF} = 1$ : Not correlated;
  2. $1 < \text{VIF} < 5$ : Moderately correlated;
  3. $\text{VIF} > 5$ : Highly correlated.

How to Deal with Collinearity?

There is no simple and universal way, but there are several directions in which we can put effort:

  • Remove some of the highly correlated independent variables.
  • Linearly combine the independent variables, such as adding them together.
  • Perform an analysis designed for highly correlated variables that tries to combine them, such as principal component analysis (PCA) or partial least squares (PLS) regression (a small sketch of this option follows the list).
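As a sketch of the last direction, here is a minimal example on simulated data that replaces two highly correlated predictors with their first principal component before fitting (the variable names are made up for illustration):

# Two highly correlated predictors driven by the same latent signal
set.seed(3)
n  <- 200
z  <- rnorm(n)
x1 <- z + 0.1 * rnorm(n)
x2 <- z + 0.1 * rnorm(n)
y  <- 3 * z + rnorm(n)

# Principal components of the (standardized) predictors
pcs <- prcomp(cbind(x1, x2), scale. = TRUE)

# Regress on the first component only; most of the signal is retained
fit <- lm(y ~ pcs$x[, 1])
summary(fit)$r.squared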

Feature Importance, Pitfalls and the Right Thinking

How about using coefficient values as the ranking of feature importance?

The First Concern

Obviously there is a problem with simply using the magnitude of the coefficients to rank feature importance. For instance, consider a linear model: $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon $$ Now replace $X_2$ with $X_2' = 0.5 X_2$. Naturally, $\beta_2' = 2 \beta_2$, while $\beta_1$ does not change. In practice, $X_2$ and $X_2'$ could just be the same quantity expressed in different units. That means that by rescaling variables we can change the importance of variables at will, which does not make sense.
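This is easy to verify numerically. A small sketch with made-up data shows that halving a predictor simply doubles its coefficient while the rest of the fit is unchanged:

# Made-up data: y depends on x1 and x2 with known coefficients
set.seed(4)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(100)

coef(lm(y ~ x1 + x2))["x2"]         # roughly 3
coef(lm(y ~ x1 + I(0.5 * x2)))[3]   # roughly 6: same variable, half the unit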

Standardized Variables? Collinearity!

What if we fix the scale of the variables by standardizing them all to have mean 0 and standard deviation 1? Although the magnitude of the coefficients then no longer depends on the units of the covariates, there are still concerns. Collinearity, as discussed above, causes large variance in the coefficients if it is present in the model, so the feature importance ranking will still be unstable. Here is a simple example.

Experiment of Collinearity in Linear Regression
> # Generate X and Y
> latent_variable = seq(0,10,length.out=200)
> Y <- 2 * latent_variable
> x1 <- latent_variable + 0.5*rnorm(200)
> x2 <- x1 + 0.2*rnorm(200)

> # standardize the data
> x1s = scale(x1)
> x2s = scale(x2)
> mean(x1s)
[1] 1.160194e-16
> var(x1s)
     [,1]
[1,]    1

We constructed a latent variable (not visible to the model), with the dependent variable $Y$ proportional to it. We then constructed two variables, $x_1$ and $x_2$, which are the latent variable plus some noise, with $x_2$ a slightly noisier version of $x_1$. Both were then standardized.

From this construction, we expect $x_1$ to be slightly more important than $x_2$, since $x_2$ is noisier (note that the extra noise added to $x_2$ is independent of $x_1$’s noise). Single-variable models are constructed to test this.

> # Single variable linear regression
> x1model <- lm(Y ~ x1s)
> summary(x1model)

Call:
lm(formula = Y ~ x1s)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6830 -0.5936 -0.0363  0.5508  3.1349 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07028  142.28   <2e-16 ***
x1s          5.73189    0.07046   81.35   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.994 on 198 degrees of freedom
Multiple R-squared:  0.9709,	Adjusted R-squared:  0.9708 
F-statistic:  6618 on 1 and 198 DF,  p-value: < 2.2e-16

> 
> x2model <- lm(Y ~ x2s)
> summary(x2model)

Call:
lm(formula = Y ~ x2s)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.06413 -0.62072 -0.04847  0.68914  2.80826 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07550  132.46   <2e-16 ***
x2s          5.71868    0.07568   75.56   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.068 on 198 degrees of freedom
Multiple R-squared:  0.9665,	Adjusted R-squared:  0.9663 
F-statistic:  5709 on 1 and 198 DF,  p-value: < 2.2e-16

Both models look promising, and the coefficient for each of the two variables is around $5.7$. But watch what happens when we put them in the same model.

> # multivariate linear regression
> lr <- lm(Y ~ x1s + x2s)
> summary(lr)

Call:
lm(formula = Y ~ x1s + x2s)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.67838 -0.59166 -0.03018  0.55140  3.13235 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07046 141.922  < 2e-16 ***
x1s          5.68375    1.03254   5.505 1.14e-07 ***
x2s          0.04825    1.03254   0.047    0.963    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9965 on 197 degrees of freedom
Multiple R-squared:  0.9709,	Adjusted R-squared:  0.9707 
F-statistic:  3292 on 2 and 197 DF,  p-value: < 2.2e-16

We can see that the model only sees the value in $x_1$ and has almost “thrown away” $x_2$: the coefficient of $x_2$ is nearly 0 and its p-value is not significant. If we look at their VIF values, we can see the severe collinearity effect between them.

# VIF
> library(car)
> car::vif(lr)
     x1s      x2s 
213.6636 213.6636 

Indeed, the standard error of $x_1$’s coefficient grew from about $0.070$ in the single-variable model to about $1.033$ in the joint model, a factor of roughly $14.6 \approx \sqrt{213.7} = \sqrt{\text{VIF}}$, just as the derivation above predicts. Note: there will also be cases where the most important features have large variance under a moderate collinearity effect. It is worth asking what kind of feature should be considered most important: the most stable one, or the most heavily loaded one?

To conclude, using coefficients to rank variable importance can make sense after dealing with scaling and collinearity, but we should be very clear about how feature importance is defined and what it means in a linear model.

Feature Importance of Linear Regression

Here are some potentially feasible approaches, applicable not just to linear regression but to other (black-box) models too. The demos use the R package relaimpo, which calculates relative importance for linear regression, with the Carseats dataset (data for An Introduction to Statistical Learning with Applications in R) from the package ISLR.

Simple relative feature importance

  1. The metric first: the $R^2$ value for each variable from $k$ regression models with one predictor only, e.g. the $R^2$ of the model $Y = \beta_0 + \beta_iX_i + \epsilon$;
  2. The metric last: the increase in $R^2$ from adding $X_i$ when all the other variables are already in the model, i.e. $R^2_{total} - R^2_{-i}$, where $R^2_{total}$ is from the model $Y = \beta_0 + \sum_j \beta_jX_j + \epsilon,\ j = 1,2,\dots, k$ and $R^2_{-i}$ is from the model containing all variables except $X_i$ (a small hand-computed sketch follows this list);
  3. The metric betasq: the squared standardized coefficient (similar to what we discussed above);
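To make the first two metrics concrete, here is a hedged sketch that re-implements them by hand for one predictor (CompPrice) of the numeric Carseats columns; the results should agree with the first and last columns reported by calc.relimp in the demo below.

# Re-implementing the "first" and "last" metrics for CompPrice
library(ISLR)
nums <- unlist(lapply(Carseats, is.numeric))
dat  <- Carseats[, nums]
r2   <- function(fit) summary(fit)$r.squared

# "first": R^2 of the single-predictor model
first_CompPrice <- r2(lm(Sales ~ CompPrice, data = dat))

# "last": increase in R^2 when CompPrice is added to all other predictors
last_CompPrice <- r2(lm(Sales ~ ., data = dat)) -
                  r2(lm(Sales ~ . - CompPrice, data = dat))

c(first = first_CompPrice, last = last_CompPrice)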

Computer-intensive relative importance metrics

  1. The metric lmg: the sequential $R^2$, averaged over every possible ordering. The sequential $R^2$ for a variable is the increase in $R^2$ when that variable is added to the model; this value changes with the order in which variables are added, so the method averages over all possible orderings.
  2. The metric pmvd: an improved lmg, using a weighted average of sequential $R^2$ values. It ensures that a variable whose true coefficient in the model is 0 is also allocated a feature importance of 0.

Demo

> help("Carseats")
> library(ISLR)
> library(relaimpo)
     Sales          CompPrice       Income      
 Min.   : 0.000   Min.   : 77   Min.   : 21.00  
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75  
 Median : 7.490   Median :125   Median : 69.00  
 Mean   : 7.496   Mean   :125   Mean   : 68.66  
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00  
 Max.   :16.270   Max.   :175   Max.   :120.00  
  Advertising       Population        Price      
 Min.   : 0.000   Min.   : 10.0   Min.   : 24.0  
 1st Qu.: 0.000   1st Qu.:139.0   1st Qu.:100.0  
 Median : 5.000   Median :272.0   Median :117.0  
 Mean   : 6.635   Mean   :264.8   Mean   :115.8  
 3rd Qu.:12.000   3rd Qu.:398.5   3rd Qu.:131.0  
 Max.   :29.000   Max.   :509.0   Max.   :191.0  
  ShelveLoc        Age          Education    Urban    
 Bad   : 96   Min.   :25.00   Min.   :10.0   No :118  
 Good  : 85   1st Qu.:39.75   1st Qu.:12.0   Yes:282  
 Medium:219   Median :54.50   Median :14.0            
              Mean   :53.32   Mean   :13.9            
              3rd Qu.:66.00   3rd Qu.:16.0            
              Max.   :80.00   Max.   :18.0            
   US     
 No :142  
 Yes:258  
> # Help page of the dataset Carseats
> help("Carseats")

Sales of Child Car Seats

Description

A simulated data set containing sales of child car seats at 400 different stores.

Usage

Carseats

Format

A data frame with 400 observations on the following 11 variables.

Sales Unit sales (in thousands) at each location

CompPrice Price charged by competitor at each location

Income Community income level (in thousands of dollars)

Advertising Local advertising budget for company at each location (in thousands of dollars)

Population Population size in region (in thousands)

Price Price company charges for car seats at each site

ShelveLoc A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age Average age of the local population

Education Education level at each location

Urban A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US A factor with levels No and Yes to indicate whether the store is in the US or not

> # Select numeric columns
> nums <- unlist(lapply(Carseats, is.numeric)) 
> linmod <- lm(Sales ~ ., data = Carseats[,nums])
> summary(linmod)

Call:
lm(formula = Sales ~ ., data = Carseats[, nums])

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0598 -1.3515 -0.1739  1.1331  4.8304 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.7076934  1.1176260   6.896 2.15e-11 ***
CompPrice    0.0939149  0.0078395  11.980  < 2e-16 ***
Income       0.0128717  0.0034757   3.703 0.000243 ***
Advertising  0.1308637  0.0151219   8.654  < 2e-16 ***
Population  -0.0001239  0.0006877  -0.180 0.857092    
Price       -0.0925226  0.0050521 -18.314  < 2e-16 ***
Age         -0.0449743  0.0060083  -7.485 4.75e-13 ***
Education   -0.0399844  0.0371257  -1.077 0.282142    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.929 on 392 degrees of freedom
Multiple R-squared:  0.5417,	Adjusted R-squared:  0.5335 
F-statistic: 66.18 on 7 and 392 DF,  p-value: < 2.2e-16

> calc.relimp(linmod, type = c("first", "last", "betasq", "lmg"))
Response variable: Sales 
Total response variance: 7.975626 
Analysis based on 400 observations 

7 Regressors: 
CompPrice Income Advertising Population Price Age Education 
Proportion of variance explained by model: 54.17%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

                    lmg         last       first       betasq
CompPrice   0.085463002 1.677996e-01 0.004106084 2.600424e-01
Income      0.019008134 1.603602e-02 0.023089100 1.627012e-02
Advertising 0.078847156 8.756412e-02 0.072633905 9.496515e-02
Population  0.001720914 3.796537e-05 0.002547320 4.182253e-05
Price       0.293822615 3.921527e-01 0.197981150 6.016889e-01
Age         0.060944719 6.551305e-02 0.053738398 6.655961e-02
Education   0.001854091 1.356228e-03 0.002699347 1.376560e-03

References

  1. Multicollinearity in Regression Analysis: Problems, Detection, and Solutions
  2. Variance inflation factor
  3. Variance of Coefficients in a Simple Linear Regression
  4. For linear classifiers, do larger coefficients imply more important features?
  5. Relative Importance for Linear Regression in R: The Package relaimpo
  6. A review of statistical methods for determination of relative importance of correlated predictors and identification of drivers of consumer liking