This article discusses some problems that come up when implementing multivariate linear regression. These problems are:

  1. What should we do about the collinearity phenomenon in linear regression?
  2. How should we extract feature importance from a linear regression? Is it simply the coefficients of the model?

Collinearity

What is collinearity?

Collinearity is a condition in which some of the independent variables are highly correlated. It occurs more often when the number of predictors is large, since there is then a higher chance that some predictors are correlated with others. Collinearity brings a few typical problems.

Problems of collinearity

  1. Large variance of the coefficients. The coefficients become very sensitive to small changes in the data, and their p-values may not be significant.
  2. The credibility of the estimated coefficients is compromised. The statistical power of the regression model is weakened, so p-values that indicate a variable is statistically significant may NOT be trustworthy.

However, there are aspects that collinearity does not affect, such as the predictive performance of the model, so in some situations there is no need to deal with it.

When to Deal with Collinearity?

  1. When the collinearity is so severe that it compromises the model’s ability to generalize.
  2. Multicollinearity affects only the independent variables that are correlated with each other, so it depends on which variables are of interest. If the important variables are not affected, there is no need to deal with collinearity.
  3. As mentioned before, the model’s predictions, the precision of those predictions, and the goodness-of-fit statistics are not affected. If those are what the model is for, there is no need to reduce even severe multicollinearity.

Variance Inflation Factor (VIF)

In linear models, the VIF is a quantity, computed for each predictor variable, that quantifies the severity of multicollinearity.

Definition

In statistics, the variance inflation factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone.

Put simply, the VIF measures how well the other features can linearly represent feature $i$. It becomes clearer in mathematical form.

Mathematical Definition

The VIF for predictor $X_i$ is defined as
$$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$
where $R_i^2$ is the coefficient of determination of the regression of $X_i$ on the other covariates. The model from which $R_i^2$ is calculated looks like this:
$$ X_i = \beta_0 + \beta_1 X_1 + \dots + \beta_{i-1} X_{i-1} + \beta_{i+1} X_{i+1} + \dots + \beta_m X_m + \epsilon $$

The coefficient of determination of a regression represents how much of the variance in the original data is explained by the model, and can be calculated as
$$ R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} $$
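As a quick sanity check of this definition, here is a minimal R sketch (toy data with hypothetical names x1, x2, y) that computes the VIF of x1 by hand from the $R_1^2$ of the auxiliary regression and compares it with car::vif():

# Sketch: compute the VIF of x1 by hand on toy data (hypothetical example)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + 0.3 * rnorm(100)        # x2 is highly correlated with x1
y  <- x1 + x2 + rnorm(100)

aux   <- lm(x1 ~ x2)               # auxiliary regression of x1 on the other predictor(s)
r2_1  <- summary(aux)$r.squared    # R_1^2
vif_1 <- 1 / (1 - r2_1)            # VIF_1 = 1 / (1 - R_1^2)
vif_1

library(car)
car::vif(lm(y ~ x1 + x2))          # should agree with the manual value for x1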

Observation

NOTE: for the detailed derivations, refer to references 2 and 3.

Consider a linear model with $m$ predictors
$$ Y = X\beta + \epsilon $$
Using OLS to estimate the coefficients $\beta$, the normal equation gives
$$ \hat\beta = (X^TX)^{-1}X^TY $$
Thus, the covariance matrix of $\hat\beta$ is
$$ \begin{align} \Sigma(\hat\beta) &= E[(\hat\beta - \beta)(\hat\beta - \beta)^T] \\ &= (X^TX)^{-1}X^T \, E[\epsilon\epsilon^T] \, X(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1} \end{align} $$
with $\Sigma_{ij} = \mathrm{cov}(\hat\beta_i, \hat\beta_j)$ and $\Sigma_{ii} = \mathrm{var}(\hat\beta_i)$.

Thus, $\widehat{\mathrm{var}}(\hat\beta_i) = \sigma^2 \left[(X^TX)^{-1}\right]_{ii}$. By calculating the $ii$ entry of that matrix (writing $X_{-i}$ for the design matrix without $X_i$, and $\hat\gamma$ for the coefficients of the auxiliary regression of $X_i$ on $X_{-i}$), we get
$$ \begin{align} \widehat{\mathrm{var}}(\hat\beta_i) &= \sigma^2 \left[ X_i^T X_i - \hat\gamma^T (X_{-i}^T X_{-i}) \hat\gamma \right]^{-1} \\ &= \sigma^2 \frac{1}{\text{RSS}_i} \\ &= \frac{\sigma^2}{(n-1)\,\widehat{\mathrm{var}}(X_i)} \cdot \frac{1}{1 - R_i^2} \end{align} $$
The last step uses the facts that $R_i^2 = 1 - \frac{\text{RSS}_i}{\text{TSS}_i}$ and that $\widehat{\mathrm{var}}(X_i) = \frac{\text{TSS}_i}{n-1}$, where $\text{RSS}_i$ and $\text{TSS}_i$ are the residual and total sums of squares of the auxiliary regression of $X_i$ on the other predictors.

This reflects the fact that the variance of a coefficient comes from several parts:

  • $\sigma^2$: the noise in the data; the more scattered the data, the larger the variance;
  • $\frac{1}{n-1}$: the size of the data; the smaller the sample, the larger the variance;
  • $\widehat{\mathrm{var}}(X_i)$: the variance of a predictor has an inverse influence on the variance of the corresponding coefficient; the larger the variance of the predictor, the smaller the variance of the coefficient;
  • VIF: the larger the VIF, the larger the variance.

The VIF measures the part of the variance that is generated by $X_i$ being linearly correlated with the other predictors.
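To see this decomposition numerically, the following sketch (reusing the toy x1, x2, y from the VIF sketch above) checks that the coefficient variances reported by lm() equal $\frac{\hat\sigma^2}{(n-1)\,\widehat{\mathrm{var}}(X_i)} \cdot \text{VIF}_i$, with $\hat\sigma^2$ the estimated noise variance:

# Sketch: check var(beta_i) = sigma^2 / ((n-1) * var(X_i)) * VIF_i on the toy data above
fit    <- lm(y ~ x1 + x2)
sigma2 <- summary(fit)$sigma^2                     # estimated noise variance
n      <- length(y)

vifs   <- car::vif(fit)
manual <- sigma2 / ((n - 1) * c(var(x1), var(x2))) * vifs
cbind(manual, from_lm = diag(vcov(fit))[c("x1", "x2")])   # the two columns should match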

Interpretation

The VIF, as mentioned before, has several interpretations:

  1. The quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
  2. How well the other features can linearly represent feature $i$.
  3. The part of the variance of the coefficient $\hat\beta_i$ that is generated by $X_i$ being linearly correlated with the other predictors.

And the rule of thumb for categorizing VIF is as follows:

  1. $\text{VIF} = 1$ : Not correlated;
  2. $1 < \text{VIF} < 5$ : Moderately correlated;
  3. $\text{VIF} > 5$ : Highly correlated.

How to Deal with Collinearity?

There is no simple and universal recipe, but there are some directions we can pursue:

  • Remove some of the highly correlated independent variables (see the sketch after this list).
  • Linearly combine the independent variables, such as adding them together.
  • Perform an analysis designed for highly correlated variables, trying to combine them, such as principal components analysis (PCA) or partial least squares (PLS) regression.
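As a sketch of the first option, a common recipe is to iteratively drop the predictor with the largest VIF until all VIFs fall below a chosen threshold. The helper drop_high_vif below is hypothetical (not from any package), assumes all predictors are numeric, and uses the rule-of-thumb threshold of 5 purely for illustration:

# Sketch: iteratively drop the predictor with the highest VIF (hypothetical helper)
library(car)

drop_high_vif <- function(data, response, threshold = 5) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit  <- lm(reformulate(predictors, response), data = data)
    vifs <- car::vif(fit)
    # stop when no VIF exceeds the threshold, or too few predictors remain for vif()
    if (max(vifs) < threshold || length(predictors) <= 2) break
    predictors <- setdiff(predictors, names(which.max(vifs)))
  }
  predictors
}

# e.g. on the numeric columns of the Carseats data used later in this article:
# drop_high_vif(Carseats[, sapply(Carseats, is.numeric)], "Sales")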

Feature Importance, Pitfalls and the Right Thinking

How about using the coefficient values as the ranking of feature importance?

The First Concern

There is an obvious problem with simply using the magnitude of the coefficients to rank feature importance. For instance, consider a linear model:
$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon $$
Now replace $X_2$ with $X_2' = 0.5 X_2$. The new coefficient becomes $\beta_2' = 2\beta_2$, while $\beta_1$ does not change. In practice, $X_2$ and $X_2'$ could simply be the same quantity in different units. That means that by rescaling variables we can change the “importance” of variables at will, which does not make sense.
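A two-line check makes this concrete. The sketch below (toy data, hypothetical names) halves one predictor and shows its fitted coefficient doubling while the other coefficient stays the same:

# Sketch: rescaling a predictor rescales its coefficient (toy data)
set.seed(2)
X1 <- rnorm(100)
X2 <- rnorm(100)
Y  <- 1 + 2 * X1 + 3 * X2 + rnorm(100)

coef(lm(Y ~ X1 + X2))              # coefficient of X2 is about 3
coef(lm(Y ~ X1 + I(0.5 * X2)))     # coefficient of 0.5 * X2 is about 6; X1 is unchanged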

Standardized Variables? Collinearity!

What if we fix the scale by standardizing all variables to have mean 0 and standard deviation 1? Although the magnitude of the coefficients then no longer depends on the units of the covariates, there are still concerns. As discussed above, collinearity, if present in the model, causes large variance in the coefficients, so the feature importance ranking will still be unstable. Here is a simple example.

Experiment of Collinearity in Linear Regression
> # Generate X and Y
> latent_variable = seq(0,10,length.out=200)
> Y <- 2 * latent_variable
> x1 <- latent_variable + 0.5*rnorm(200)
> x2 <- x1 + 0.2*rnorm(200)

> # standardize the data
> x1s = scale(x1)
> x2s = scale(x2)
> mean(x1s)
[1] 1.160194e-16
> var(x1s)
     [,1]
[1,]    1

We constructed a latent variable (not visible to the model) and made the dependent variable $Y$ proportional to it. Then we constructed two variables $x_1$ and $x_2$, each equal to the latent variable plus some noise, with $x_2$ being a slightly noisier version of $x_1$. Both were then standardized.

From the construction, we expect $x_1$ to be slightly more important than $x_2$, since $x_2$ is noisier (note that the extra noise added to $x_2$ is independent of $x_1$’s noise). Single-variable models are constructed to test this.

> # Single variable linear regression
> x1model <- lm(Y ~ x1s)
> summary(x1model)

Call:
lm(formula = Y ~ x1s)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6830 -0.5936 -0.0363  0.5508  3.1349 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07028  142.28   <2e-16 ***
x1s          5.73189    0.07046   81.35   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.994 on 198 degrees of freedom
Multiple R-squared:  0.9709,	Adjusted R-squared:  0.9708 
F-statistic:  6618 on 1 and 198 DF,  p-value: < 2.2e-16

> 
> x2model <- lm(Y ~ x2s)
> summary(x2model)

Call:
lm(formula = Y ~ x2s)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.06413 -0.62072 -0.04847  0.68914  2.80826 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07550  132.46   <2e-16 ***
x2s          5.71868    0.07568   75.56   <2e-16 ***
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.068 on 198 degrees of freedom
Multiple R-squared:  0.9665,	Adjusted R-squared:  0.9663 
F-statistic:  5709 on 1 and 198 DF,  p-value: < 2.2e-16

Both models look promising, and the coefficients of the two variables are both around $5.7$. But something happens when we put them in the same model.

> # multivariate linear regression
> lr <- lm(Y ~ x1s + x2s)
> summary(lr)

Call:
lm(formula = Y ~ x1s + x2s)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.67838 -0.59166 -0.03018  0.55140  3.13235 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.00000    0.07046 141.922  < 2e-16 ***
x1s          5.68375    1.03254   5.505 1.14e-07 ***
x2s          0.04825    1.03254   0.047    0.963    
---
Signif. codes:  
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9965 on 197 degrees of freedom
Multiple R-squared:  0.9709,	Adjusted R-squared:  0.9707 
F-statistic:  3292 on 2 and 197 DF,  p-value: < 2.2e-16

We can see that the model only credits $x_1$ and almost entirely throws away $x_2$: the coefficient of $x_2$ is nearly 0 and its p-value is not significant. If we look at their VIF values, we can see the severe collinearity between them.

> # VIF
> library(car)
> car::vif(lr)
     x1s      x2s 
213.6636 213.6636 

Note: there will be cases where the most important features also have large coefficient variance because of moderate collinearity. It then becomes a real question which kind of feature should be considered most important: the most stable one, or the most heavily loaded one?

To conclude, using coefficients to rank variable importance can make sense after dealing with scaling and collinearity, but we should be very clear about how feature importance is defined and what it means in a linear model.

Feature Importance of Linear Regression

Here are some feasible approaches, some of which apply not just to linear regression but to other (black-box) models as well. The demos use the R package relaimpo, which calculates relative importance for linear regression, with the Carseats dataset from the ISLR package (Data for an Introduction to Statistical Learning with Applications in R).

Simple relative feature importance

  1. The metric first: the $R^2$ value for each variable from the $k$ one-predictor regression models, e.g. the $R^2$ of the model $Y = \beta_0 + \beta_i X_i + \epsilon$;
  2. The metric last: the increase in $R^2$ from adding $X_i$ when all the other variables are already in the model, i.e. $R^2_{\text{total}} - R^2_{-i}$, where $R^2_{\text{total}}$ comes from the full model $Y = \beta_0 + \sum_{j=1}^{k} \beta_j X_j + \epsilon$ and $R^2_{-i}$ comes from the model containing all variables except $X_i$ (a hand computation of first and last follows this list);
  3. The metric betasq: the squared standardized coefficient (similar to what we discussed above);
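For intuition, first and last can be computed by hand. The sketch below does so for the single predictor Price in the Carseats data (the same numeric columns as in the demo below); the values can be compared with the calc.relimp() output there:

# Sketch: hand computation of the "first" and "last" metrics for Price
library(ISLR)
nums <- unlist(lapply(Carseats, is.numeric))    # numeric columns, as in the demo below
dat  <- Carseats[, nums]

# "first": R^2 of the one-predictor model Sales ~ Price
first_price <- summary(lm(Sales ~ Price, data = dat))$r.squared

# "last": increase in R^2 when Price joins a model with all other predictors already in it
r2_full     <- summary(lm(Sales ~ ., data = dat))$r.squared
r2_no_price <- summary(lm(Sales ~ . - Price, data = dat))$r.squared
last_price  <- r2_full - r2_no_price

c(first = first_price, last = last_price)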

Computer-intensive relative importance metrics

  1. The metric lmg: the sequential $R^2$ averaged over all possible orderings. The sequential $R^2$ for a variable is the increase in $R^2$ when that variable is added to the model; this value depends on the order in which variables enter, so lmg averages it over all possible orderings (a brute-force sketch follows this list).
  2. The metric pmvd: an improved version of lmg that takes a weighted average of the sequential $R^2$. It ensures that a variable whose true coefficient in the model is 0 is also allocated a feature importance of 0.
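To make the lmg idea concrete, here is a brute-force sketch on a 3-predictor subset of Carseats (only three predictors, so that all 3! = 6 orderings can be enumerated by hand; relaimpo does this efficiently for the full model):

# Sketch: brute-force lmg over all orderings of a 3-predictor subset
library(ISLR)
preds <- c("Price", "CompPrice", "Advertising")
dat   <- Carseats[, c("Sales", preds)]

# R^2 of Sales regressed on a subset of predictors (0 for the empty subset)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, "Sales"), data = dat))$r.squared
}

# all 3! = 6 orderings of the three predictors
idx       <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3), c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))
orderings <- lapply(idx, function(i) preds[i])

# lmg: the sequential increase in R^2 due to each variable, averaged over all orderings
lmg <- sapply(preds, function(v) {
  mean(sapply(orderings, function(ord) {
    before <- ord[seq_len(which(ord == v) - 1)]   # variables entered before v
    r2(c(before, v)) - r2(before)
  }))
})
lmg   # compare with calc.relimp(lm(Sales ~ ., data = dat), type = "lmg")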

Demo

> library(ISLR)
> library(relaimpo)
> summary(Carseats)
     Sales          CompPrice       Income      
 Min.   : 0.000   Min.   : 77   Min.   : 21.00  
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75  
 Median : 7.490   Median :125   Median : 69.00  
 Mean   : 7.496   Mean   :125   Mean   : 68.66  
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00  
 Max.   :16.270   Max.   :175   Max.   :120.00  
  Advertising       Population        Price      
 Min.   : 0.000   Min.   : 10.0   Min.   : 24.0  
 1st Qu.: 0.000   1st Qu.:139.0   1st Qu.:100.0  
 Median : 5.000   Median :272.0   Median :117.0  
 Mean   : 6.635   Mean   :264.8   Mean   :115.8  
 3rd Qu.:12.000   3rd Qu.:398.5   3rd Qu.:131.0  
 Max.   :29.000   Max.   :509.0   Max.   :191.0  
  ShelveLoc        Age          Education    Urban    
 Bad   : 96   Min.   :25.00   Min.   :10.0   No :118  
 Good  : 85   1st Qu.:39.75   1st Qu.:12.0   Yes:282  
 Medium:219   Median :54.50   Median :14.0            
              Mean   :53.32   Mean   :13.9            
              3rd Qu.:66.00   3rd Qu.:16.0            
              Max.   :80.00   Max.   :18.0            
   US     
 No :142  
 Yes:258  
> # Help page for dataset Carseats
> help("Carseats")

Sales of Child Car Seats

Description

A simulated data set containing sales of child car seats at 400 different stores.

Usage

Carseats

Format

A data frame with 400 observations on the following 11 variables.

Sales Unit sales (in thousands) at each location

CompPrice Price charged by competitor at each location

Income Community income level (in thousands of dollars)

Advertising Local advertising budget for company at each location (in thousands of dollars)

Population Population size in region (in thousands)

Price Price company charges for car seats at each site

ShelveLoc A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age Average age of the local population

Education Education level at each location

Urban A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US A factor with levels No and Yes to indicate whether the store is in the US or not

> # Select numeric columns
> nums <- unlist(lapply(Carseats, is.numeric)) 
> linmod <- lm(Sales ~ ., data = Carseats[,nums])
> summary(linmod)

Call:
lm(formula = Sales ~ ., data = Carseats[, nums])

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0598 -1.3515 -0.1739  1.1331  4.8304 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.7076934  1.1176260   6.896 2.15e-11 ***
CompPrice    0.0939149  0.0078395  11.980  < 2e-16 ***
Income       0.0128717  0.0034757   3.703 0.000243 ***
Advertising  0.1308637  0.0151219   8.654  < 2e-16 ***
Population  -0.0001239  0.0006877  -0.180 0.857092    
Price       -0.0925226  0.0050521 -18.314  < 2e-16 ***
Age         -0.0449743  0.0060083  -7.485 4.75e-13 ***
Education   -0.0399844  0.0371257  -1.077 0.282142    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.929 on 392 degrees of freedom
Multiple R-squared:  0.5417,	Adjusted R-squared:  0.5335 
F-statistic: 66.18 on 7 and 392 DF,  p-value: < 2.2e-16

> calc.relimp(linmod, type = c("first", "last", "betasq", "lmg"))
Response variable: Sales 
Total response variance: 7.975626 
Analysis based on 400 observations 

7 Regressors: 
CompPrice Income Advertising Population Price Age Education 
Proportion of variance explained by model: 54.17%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

                    lmg         last       first       betasq
CompPrice   0.085463002 1.677996e-01 0.004106084 2.600424e-01
Income      0.019008134 1.603602e-02 0.023089100 1.627012e-02
Advertising 0.078847156 8.756412e-02 0.072633905 9.496515e-02
Population  0.001720914 3.796537e-05 0.002547320 4.182253e-05
Price       0.293822615 3.921527e-01 0.197981150 6.016889e-01
Age         0.060944719 6.551305e-02 0.053738398 6.655961e-02
Education   0.001854091 1.356228e-03 0.002699347 1.376560e-03

References

  1. Multicollinearity in Regression Analysis: Problems, Detection, and Solutions
  2. Variance inflation factor
  3. Variance of Coefficients in a Simple Linear Regression
  4. For linear classifiers, do larger coefficients imply more important features?
  5. Relative Importance for Linear Regression in R: The Package relaimpo
  6. A Review of Statistical Methods for Determination of Relative Importance of Correlated Predictors and Identification of Drivers of Consumer Liking