This article discusses some common problems in implementing multivariate linear regression:
- What should we do about the collinearity phenomenon in linear regression?
- How do we extract feature importance from a linear regression? Is it just the coefficients of the model?
Collinearity
What is Collinearity?
Collinearity is a condition in which some of the independent variables are highly correlated. It occurs more often when the number of predictors is large, since there is then a higher chance that some predictors are correlated with others. Collinearity brings some typical problems.
Problems of collinearity
- Large variance of the coefficients. The estimates become very sensitive to small changes in the data, and p-values may fail to reach significance.
- The credibility of the estimated coefficients is compromised. The statistical power of the regression model is weakened, so p-values that indicate which variables are statistically significant may NOT be trustworthy.
However, there are some aspects that collinearity does not affect, such as the predictive performance of the model, so in some situations there is no need to deal with it.
When to Deal with Collinearity?
- When the collinearity effect is so severe that it compromises the model's ability to generalize.
- Multicollinearity affects only the specific independent variables that are correlated, so it depends on which variables are of interest. If the important variables are not affected, there is no need to deal with collinearity.
- As mentioned before, the model's predictions, the precision of those predictions, and the goodness-of-fit statistics are not affected. If those are the main concern, there is no need to reduce even severe multicollinearity.
Variance Inflation Factor (VIF)
In linear models, the VIF is a quantity computed for each predictor variable that quantifies the severity of multicollinearity.
Definition
In statistics, the variance inflation factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
Put simply, the VIF measures how well the other features can linearly represent feature $i$. This becomes clearer in mathematical form.
Mathematical Definition
The VIF for predictor $X_i$ is defined as $$ \text{VIF}_i = \frac{1}{1 - R_i^2} $$ where $R_i^2$ is the coefficient of determination of the regression of $X_i$ on the other covariates. The model from which $R_i^2$ is calculated looks like this: $$ X_i = \beta_0 + \beta_1X_1 + \dots + \beta_{i-1}X_{i-1} + \beta_{i+1}X_{i+1} + \dots + \beta_mX_m + \epsilon $$
The coefficient of determination of a regression represents how much of the variance in the original data is explained by the model and can be calculated as $$ R^2 = 1 - \frac{\sum_i(y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} $$
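As a concrete illustration, here is a minimal sketch (hypothetical simulated data; the variable names are purely illustrative) that computes the VIF of one predictor directly from this definition and checks it against `car::vif()`, which is also used later in this article.

```r
library(car)  # for vif()

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

# VIF of x1 from the definition: regress x1 on the other covariates
r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2_1)   # manual VIF for x1
car::vif(fit)    # the entry for x1 should match
```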
Observation
NOTE: for the detailed derivation, refer to references 2 and 3.
Consider a linear model with $m$ predictors $$ Y = X\beta + \epsilon $$ Using OLS to estimate the coefficients $\beta$, the normal equations give $$ \hat \beta = (X'X)^{-1}X'Y $$ Thus, the covariance matrix of $\hat \beta$ is $$ \begin{align} \Sigma(\hat \beta) &= E[(\hat \beta - \beta)(\hat \beta - \beta)'] \\ &= (X'X)^{-1}X' \, E[\epsilon\epsilon'] \, X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1} \end{align} $$ with $\Sigma_{ij} = \text{cov}(\hat \beta_i, \hat \beta_j)$ and $\Sigma_{ii} = \text{var}(\hat \beta_i)$.
Thus $$\widehat{\text{var}}(\hat \beta_i) = \sigma^2 [(X'X)^{-1}]_{ii}$$ Computing the $ii$ entry of this matrix gives $$ \begin{align} \widehat{\text{var}}(\hat \beta_i) &= \sigma^2\left[X_i'X_i - X_i'X_{-i}(X_{-i}'X_{-i})^{-1}X_{-i}'X_i\right]^{-1} \\ &= \sigma^2 \frac{1}{\text{RSS}_i} \\ &= \frac{\sigma^2}{(n-1)\widehat{\text{var}}(X_i)} \cdot \frac{1}{1 - R_i^2} \end{align} $$ where $\text{RSS}_i$ is the residual sum of squares from regressing $X_i$ on the other predictors. The last step uses the facts that $R_i^2 = 1 - \frac{\text{RSS}_i}{\text{TSS}_i}$ and that $\widehat{\text{var}}(X_i) = \frac{\text{TSS}_i}{n-1}$.
This reflects the fact that the variance of a coefficient comes from several parts:
- $\sigma^2$: The noise in the data; the more scattered the data, the larger the variance;
- $\frac{1}{n-1}$: The sample size; the smaller the sample, the larger the variance;
- $\widehat{\text{var}}(X_i)$: The variance of a predictor has an inverse influence on the variance of the corresponding coefficient; the larger the variance of the predictor, the smaller the variance of the coefficient;
- VIF: The larger the VIF, the larger the variance.
The VIF measures the part of the variance that is generated by linear correlation with the other predictors.
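This decomposition can be checked numerically. Below is a minimal sketch (hypothetical simulated data, same setup as the VIF sketch above) that reconstructs the standard error of $\hat\beta_1$ from $\hat\sigma^2$, the sample size, the variance of $X_1$, and its VIF, and compares it with the standard error reported by `summary()`.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

fit    <- lm(y ~ x1 + x2 + x3)
sigma2 <- summary(fit)$sigma^2                              # estimated noise variance
vif1   <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)     # VIF of x1

# var(beta_1_hat) = sigma^2 / ((n - 1) * var(x1)) * VIF_1
sqrt(sigma2 / ((n - 1) * var(x1)) * vif1)                   # reconstructed std. error
summary(fit)$coefficients["x1", "Std. Error"]               # reported std. error (same value)
```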
Interpretation
The VIF, as mentioned before, has several interpretations:
- The quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
- How well other features can linearly represent feature $i$.
- The part of the variance of coefficient $\hat \beta_i$ that is generated by linear correlation with other predictors.
And the rule of thumb for categorizing VIF is as follows:
- $\text{VIF} = 1$ : Not correlated;
- $1 < \text{VIF} < 5$ : Moderately correlated;
- $\text{VIF} > 5$ : Highly correlated.
How to Deal with Collinearity?
There is no simple and universal way, but there are some directions we can work on:
- Remove some of the highly correlated independent variables.
- Linearly combine the independent variables, such as adding them together.
- Perform an analysis designed for highly correlated variables, trying to combine them, such as principal components analysis (PCA) or partial least squares (PLS) regression.
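As an illustration of the last point, here is a minimal sketch (hypothetical simulated data; names are illustrative) that replaces two highly correlated predictors with their first principal component; the collinearity disappears while most of the shared information is kept.

```r
library(car)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)      # nearly a copy of x1
y  <- 1 + x1 + x2 + rnorm(n)

car::vif(lm(y ~ x1 + x2))          # very large VIFs for both predictors

# Combine the correlated pair into its first principal component
pc1 <- prcomp(cbind(x1, x2), scale. = TRUE)$x[, 1]
summary(lm(y ~ pc1))               # a single well-behaved predictor replaces the collinear pair
```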
Feature Importance, Pitfalls and the Right Thinking
How about using coefficient values as the ranking of feature importance?
The First Concern
Obviously there is a problem with simply using the magnitude of the coefficients to rank feature importance. For instance, consider a linear model: $$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon $$ Now replace $X_2$ with $X_2' = 0.5 X_2$. As one would expect, the new coefficient is $\beta_2' = 2 \beta_2$, while $\beta_1$ does not change. In practice, $X_2$ and $X_2'$ could just be the same quantity in different units. That means that by scaling variables we can change the importance of variables at will, which does not make sense.
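A quick sketch (hypothetical simulated data) makes this concrete: halving a predictor doubles its coefficient while the other coefficient and the fit itself are unchanged.

```r
set.seed(42)
x1  <- rnorm(100)
x2  <- rnorm(100)
y   <- 1 + 3 * x1 + 3 * x2 + rnorm(100)
x2h <- 0.5 * x2                    # the same quantity in different "units"

coef(lm(y ~ x1 + x2))              # coefficient of x2 is about 3
coef(lm(y ~ x1 + x2h))             # coefficient of x2h is about 6; x1's coefficient is unchanged
```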
Standardized Variables? Collinearity!
What if we fix the scale of the variables by standardizing all of them to have mean 0 and standard deviation 1? Although the magnitude of the coefficients then no longer depends on the units of the covariates, there are still concerns. As discussed above, collinearity, if present in the model, causes large variance in the coefficients, so the feature importance ranking will still be unstable in this scenario. Here is a simple example.
Experiment of Collinearity in Linear Regression
> # Generate X and Y
> latent_variable = seq(0,10,length.out=200)
> Y <- 2 * latent_variable
> x1 <- latent_variable + 0.5*rnorm(200)
> x2 <- x1 + 0.2*rnorm(200)
> # standardize the data
> x1s = scale(x1)
> x2s = scale(x2)
> mean(x1s)
[1] 1.160194e-16
> var(x1s)
[,1]
[1,] 1
We constructed a latent variable (not visible to the model), with the dependent variable $Y$ proportional to it. Then we constructed two variables $x_1$ and $x_2$, which are the latent variable plus some noise, with $x_2$ being a slightly noisier version of $x_1$. Both were then standardized.
From the construction, we expect $x_1$ to be slightly more important than $x_2$, since $x_2$ is noisier (note that the extra noise added to $x_2$ is independent of the noise in $x_1$). Single-variable models are constructed to test this.
> # Single variable linear regression
> x1model <- lm(Y ~ x1s)
> summary(x1model)
Call:
lm(formula = Y ~ x1s)
Residuals:
Min 1Q Median 3Q Max
-2.6830 -0.5936 -0.0363 0.5508 3.1349
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.00000 0.07028 142.28 <2e-16 ***
x1s 5.73189 0.07046 81.35 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.994 on 198 degrees of freedom
Multiple R-squared: 0.9709, Adjusted R-squared: 0.9708
F-statistic: 6618 on 1 and 198 DF, p-value: < 2.2e-16
>
> x2model <- lm(Y ~ x2s)
> summary(x2model)
Call:
lm(formula = Y ~ x2s)
Residuals:
Min 1Q Median 3Q Max
-3.06413 -0.62072 -0.04847 0.68914 2.80826
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.00000 0.07550 132.46 <2e-16 ***
x2s 5.71868 0.07568 75.56 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.068 on 198 degrees of freedom
Multiple R-squared: 0.9665, Adjusted R-squared: 0.9663
F-statistic: 5709 on 1 and 198 DF, p-value: < 2.2e-16
Both models look promising, and we know the coefficients of both variables should be around $5.7$. But watch what happens when we add them to the same model:
> # multivariate linear regression
> lr <- lm(Y ~ x1s + x2s)
> summary(lr)
Call:
lm(formula = Y ~ x1s + x2s)
Residuals:
Min 1Q Median 3Q Max
-2.67838 -0.59166 -0.03018 0.55140 3.13235
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.00000 0.07046 141.922 < 2e-16 ***
x1s 5.68375 1.03254 5.505 1.14e-07 ***
x2s 0.04825 1.03254 0.047 0.963
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9965 on 197 degrees of freedom
Multiple R-squared: 0.9709, Adjusted R-squared: 0.9707
F-statistic: 3292 on 2 and 197 DF, p-value: < 2.2e-16
We can see that the model only sees the value in $x_1$ and almost throws $x_2$ away: the coefficient of $x_2$ is nearly 0 and its p-value is not significant. If we look at their VIF values, we can see the severe collinearity between them.
# VIF
> library(car)
> car::vif(lr)
x1s x2s
213.6636 213.6636
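The instability implied by these huge standard errors can also be seen directly. Here is a minimal sketch reusing the `Y`, `x1s` and `x2s` objects from the experiment above (exact numbers will differ, since the data were generated without a seed): the model is refit on a few bootstrap resamples, and the split of the roughly 5.7 total effect between the two predictors changes from resample to resample.

```r
# Refit on bootstrap resamples of the same data
set.seed(123)
coefs <- t(replicate(5, {
  idx <- sample(200, replace = TRUE)
  coef(lm(Y[idx] ~ x1s[idx] + x2s[idx]))[2:3]
}))
coefs   # individual slopes shift on the order of their ~1.0 std. error; their sum stays near 5.7
```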
Note: there will be cases where the most important features also have large variance when there is a moderate collinearity effect. This raises the question of what kind of feature should be considered most important: the most stable one, or the most heavily loaded one?
To conclude, using coefficients to rank variable importance can make sense after dealing with scaling and collinearity, but we should be very clear about how feature importance is defined and what it means in a linear model.
Feature Importance of Linear Regression
Here are some potentially feasible approaches, applicable not only to linear regression but also to other (black-box) models. The demos use the R package `relaimpo`, which calculates relative importance for linear regression, with the dataset `Carseats` from Data for an Introduction to Statistical Learning with Applications in R in the package `ISLR`.
Simple relative feature importance
- The metric `first`: the $R^2$ value for each variable from the $k$ regression models with one predictor only, e.g. the $R^2$ of the model $Y = \beta_0 + \beta_iX_i + \epsilon$;
- The metric `last`: the increase in $R^2$ from adding $X_i$ when all other variables are already in the model, i.e. $R^2_{total} - R^2_{-i}$, where $R^2_{total}$ is from the model $Y = \beta_0 + \sum_j \beta_jX_j + \epsilon,\ j = 1,2,\dots,k$ and $R^2_{-i}$ is from the model containing all variables except $X_i$;
- The metric `betasq`: the squared standardized coefficient (similar to what we discussed above).
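To make these metrics concrete, here is a minimal sketch that computes `first` and `last` by hand for the `Price` variable, using the numeric columns of the `Carseats` data introduced in the demo below (the choice of `Price` is just for illustration).

```r
library(ISLR)

nums <- unlist(lapply(Carseats, is.numeric))
dat  <- Carseats[, nums]

r2_total <- summary(lm(Sales ~ ., data = dat))$r.squared

# "first": R^2 of the single-predictor model containing only Price
first_price <- summary(lm(Sales ~ Price, data = dat))$r.squared

# "last": increase in R^2 when Price is added to a model already containing the others
last_price <- r2_total - summary(lm(Sales ~ . - Price, data = dat))$r.squared

c(first = first_price, last = last_price)   # should match calc.relimp's output below
```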
Computer-intensive relative importance metrics
- The metric `lmg`: the sequential $R^2$ averaged over all possible orderings. The sequential $R^2$ of a variable is the increase in $R^2$ when that variable is added to the model; since this value depends on the order in which variables are added, the method averages over all possible orderings.
- The metric `pmvd`: an improved `lmg` method that takes a weighted average of the sequential $R^2$ values. It ensures that a variable whose true coefficient in the model is 0 is also allocated a feature importance of 0.
Demo
> library(ISLR)
> library(relaimpo)
> summary(Carseats)
Sales CompPrice Income
Min. : 0.000 Min. : 77 Min. : 21.00
1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75
Median : 7.490 Median :125 Median : 69.00
Mean : 7.496 Mean :125 Mean : 68.66
3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00
Max. :16.270 Max. :175 Max. :120.00
Advertising Population Price
Min. : 0.000 Min. : 10.0 Min. : 24.0
1st Qu.: 0.000 1st Qu.:139.0 1st Qu.:100.0
Median : 5.000 Median :272.0 Median :117.0
Mean : 6.635 Mean :264.8 Mean :115.8
3rd Qu.:12.000 3rd Qu.:398.5 3rd Qu.:131.0
Max. :29.000 Max. :509.0 Max. :191.0
ShelveLoc Age Education Urban
Bad : 96 Min. :25.00 Min. :10.0 No :118
Good : 85 1st Qu.:39.75 1st Qu.:12.0 Yes:282
Medium:219 Median :54.50 Median :14.0
Mean :53.32 Mean :13.9
3rd Qu.:66.00 3rd Qu.:16.0
Max. :80.00 Max. :18.0
US
No :142
Yes:258
> # Help page for the Carseats dataset
> help("Carseats")
Sales of Child Car Seats
Description
A simulated data set containing sales of child car seats at 400 different stores.
Usage
Carseats
Format
A data frame with 400 observations on the following 11 variables.
Sales
Unit sales (in thousands) at each location
CompPrice
Price charged by competitor at each location
Income
Community income level (in thousands of dollars)
Advertising
Local advertising budget for company at each location (in thousands of dollars)
Population
Population size in region (in thousands)
Price
Price company charges for car seats at each site
ShelveLoc
A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age
Average age of the local population
Education
Education level at each location
Urban
A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US
A factor with levels No and Yes to indicate whether the store is in the US or not
> # Select numeric columns
> nums <- unlist(lapply(Carseats, is.numeric))
> linmod <- lm(Sales ~ ., data = Carseats[,nums])
> summary(linmod)
Call:
lm(formula = Sales ~ ., data = Carseats[, nums])
Residuals:
Min 1Q Median 3Q Max
-5.0598 -1.3515 -0.1739 1.1331 4.8304
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.7076934 1.1176260 6.896 2.15e-11 ***
CompPrice 0.0939149 0.0078395 11.980 < 2e-16 ***
Income 0.0128717 0.0034757 3.703 0.000243 ***
Advertising 0.1308637 0.0151219 8.654 < 2e-16 ***
Population -0.0001239 0.0006877 -0.180 0.857092
Price -0.0925226 0.0050521 -18.314 < 2e-16 ***
Age -0.0449743 0.0060083 -7.485 4.75e-13 ***
Education -0.0399844 0.0371257 -1.077 0.282142
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.929 on 392 degrees of freedom
Multiple R-squared: 0.5417, Adjusted R-squared: 0.5335
F-statistic: 66.18 on 7 and 392 DF, p-value: < 2.2e-16
> calc.relimp(linmod, type = c("first", "last", "betasq", "lmg"))
Response variable: Sales
Total response variance: 7.975626
Analysis based on 400 observations
7 Regressors:
CompPrice Income Advertising Population Price Age Education
Proportion of variance explained by model: 54.17%
Metrics are not normalized (rela=FALSE).
Relative importance metrics:
lmg last first betasq
CompPrice 0.085463002 1.677996e-01 0.004106084 2.600424e-01
Income 0.019008134 1.603602e-02 0.023089100 1.627012e-02
Advertising 0.078847156 8.756412e-02 0.072633905 9.496515e-02
Population 0.001720914 3.796537e-05 0.002547320 4.182253e-05
Price 0.293822615 3.921527e-01 0.197981150 6.016889e-01
Age 0.060944719 6.551305e-02 0.053738398 6.655961e-02
Education 0.001854091 1.356228e-03 0.002699347 1.376560e-03
References
- Multicollinearity in Regression Analysis: Problems, Detection, and Solutions
- Variance inflation factor
- Variance of Coefficients in a Simple Linear Regression
- For linear classifiers, do larger coefficients imply more important features?
- Relative Importance for Linear Regression in R: The Package relaimpo
- A Review of Statistical Methods for Determination of Relative Importance of Correlated Predictors and Identification of Drivers of Consumer Liking