Feature Selection Methods for Machine Learning
By Charles Holbert
February 14, 2020
Introduction
Feature selection is a core concept in machine (statistical) learning that can have a significant impact on model performance. Feature selection (also referred to as variable selection) is the process of selecting the features that contribute most to the prediction variable or output. The features used to train a model have a large influence on that model's performance; irrelevant or partially relevant features can negatively impact it. As such, feature selection and data cleaning are the first and most important steps of model design.
One of the primary reasons to measure the strength or relevance of the predictor variables is to filter which variables should be used as inputs in a model. In many data analyses and modeling projects, hundreds or even thousands of features are collected. From a practical perspective, a model with more features is often harder to interpret and more costly to compute. Some models are more resistant to non-informative predictors (e.g., the lasso and tree-based methods) than others. As the number of attributes becomes large, exploratory analysis of all the predictors may be infeasible. Data-driven feature selection can be useful when sifting through large amounts of data, and ranking predictors based on a quantitative relevance score is critical in creating an effective predictive model. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.
Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes in the data without changing them. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data (Guyon and Elisseeff 2003).
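As a quick illustration of this distinction, here is a minimal sketch using the built-in mtcars data (not the ozone data analyzed below): dimensionality reduction creates new combinations of the original columns, whereas feature selection simply keeps a subset of them unchanged.
# Dimensionality reduction: principal components are new linear combinations
# of the original predictors
pcs <- prcomp(mtcars[, -1], scale. = TRUE)$x[, 1:3]
# Feature selection: retain a subset of the original, unchanged predictors
selected <- mtcars[, c("wt", "hp", "disp")]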
This factsheet presents various methods that can be used to identify the most important predictor variables that explain the variance of the response variable. In this context, variable importance is taken to mean an overall quantification of the relationship between the predictor and outcome.
Data
To illustrate the concepts of feature selection, we will use the Ozone data from the mlbench package. The Ozone data contains daily ozone levels and meteorological data for the Los Angeles basin during 1976. It is available as a data frame with 203 observations and 13 variables, where each observation represents one day.
Before jumping into feature selection, let's explore the data using a combination of summary statistics (means and medians, variances, and counts) and graphical presentations. First, let's load and inspect the structure of the data.
# Read data
inputData <- read.csv(
"./data/ozone1.csv",
stringsAsFactors = FALSE,
header = TRUE
)
str(inputData)
## 'data.frame': 203 obs. of 13 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day_of_month : int 5 6 7 8 9 12 13 14 15 16 ...
## $ Day_of_week : int 1 2 3 4 5 1 2 3 4 5 ...
## $ ozone_reading : num 5.34 5.77 3.69 3.89 5.76 6.39 4.73 4.35 3.94 7 ...
## $ pressure_height : int 5760 5720 5790 5790 5700 5720 5760 5780 5830 5870 ...
## $ Wind_speed : int 3 4 6 3 3 3 6 6 3 2 ...
## $ Humidity : int 51 69 19 25 73 44 33 19 19 19 ...
## $ Temperature_Sandburg : int 54 35 45 55 41 51 51 54 58 61 ...
## $ Temperature_ElMonte : num 45.3 49.6 46.4 52.7 48 ...
## $ Inversion_base_height: int 1450 1568 2631 554 2083 111 492 5000 1249 5000 ...
## $ Pressure_gradient : int 25 15 -33 -28 23 9 -44 -44 -53 -67 ...
## $ Inversion_temperature: num 57 53.8 54.1 64.8 52.5 ...
## $ Visibility : int 60 60 100 250 120 150 40 200 250 200 ...
Let's take a closer look at the data by creating summary statistics for each variable. This can be useful for identifying missing data, if any are present.
summary(inputData)
## Month Day_of_month Day_of_week ozone_reading
## Min. : 1.000 Min. : 1.0 Min. :1.000 Min. : 0.72
## 1st Qu.: 3.000 1st Qu.: 9.0 1st Qu.:2.000 1st Qu.: 4.77
## Median : 6.000 Median :15.0 Median :3.000 Median : 8.90
## Mean : 6.522 Mean :15.7 Mean :3.005 Mean :11.37
## 3rd Qu.:10.000 3rd Qu.:23.0 3rd Qu.:4.000 3rd Qu.:16.07
## Max. :12.000 Max. :31.0 Max. :5.000 Max. :37.98
## pressure_height Wind_speed Humidity Temperature_Sandburg
## Min. :5320 Min. : 0.000 Min. :19.00 Min. :25.00
## 1st Qu.:5690 1st Qu.: 3.000 1st Qu.:46.00 1st Qu.:51.50
## Median :5760 Median : 5.000 Median :64.00 Median :61.00
## Mean :5746 Mean : 4.867 Mean :57.61 Mean :61.11
## 3rd Qu.:5830 3rd Qu.: 6.000 3rd Qu.:73.00 3rd Qu.:71.00
## Max. :5950 Max. :11.000 Max. :93.00 Max. :93.00
## Temperature_ElMonte Inversion_base_height Pressure_gradient
## Min. :27.68 Min. : 111 Min. :-69.00
## 1st Qu.:49.64 1st Qu.: 869 1st Qu.:-14.00
## Median :56.48 Median :2083 Median : 18.00
## Mean :56.54 Mean :2602 Mean : 14.43
## 3rd Qu.:66.20 3rd Qu.:5000 3rd Qu.: 43.00
## Max. :82.58 Max. :5000 Max. :107.00
## Inversion_temperature Visibility
## Min. :27.50 Min. : 0.0
## 1st Qu.:51.26 1st Qu.: 60.0
## Median :60.98 Median :100.0
## Mean :60.69 Mean :122.2
## 3rd Qu.:70.88 3rd Qu.:150.0
## Max. :90.68 Max. :350.0
Graphical displays provide added insight into the data that cannot be obtained from summary statistics alone. Although some problems in the data can be identified by reviewing summary statistics, other problems are easier to find visually. Numerical reduction methods do not retain all the information contained in the data.
Let's use the ggpairs function of the GGally package to build a scatterplot matrix. Scatterplots of each pair of numeric variables are drawn in the lower left part of the figure, the Spearman correlation coefficients are displayed in the upper right, and the variable distributions are shown on the diagonal.
# Quick visual analysis of data
library(GGally)
ggpairs(
data = inputData,
upper = list(
continuous = wrap("cor",
method = "spearman",
size = 2,
alignPercent = 1)
),
lower = list(continuous = wrap("points", size = 0.1))
) +
theme(strip.text = element_text(size = 2.5))
We can quickly inspect other aspects of the data using the DataExplorer package. The package is flexible with respect to input data classes and can handle almost any data-frame-like object.
# Plot histograms
DataExplorer::plot_histogram(inputData)
# Plot density
DataExplorer::plot_density(inputData)
# Box plots
DataExplorer::plot_boxplot(inputData, by = 'ozone_reading', ncol = 4)
Let's create a correlation heatmap to visually inspect the correlations between the variables. The colors of the heatmap correspond to the nonparametric Spearman's rank-order correlation, which measures the strength of a monotonic relationship between paired data. Whereas Pearson's correlation evaluates the linear relationship between two continuous variables, Spearman's correlation evaluates the monotonic relationship, where the variables tend to change together but not necessarily at a constant rate.
DataExplorer::plot_correlation(
data = inputData,
cor_args = list(method = 'spearman', use = 'complete.obs')
)
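A small illustration of this distinction (not part of the Ozone analysis): for a monotonic but nonlinear relationship, Spearman's correlation is 1 while Pearson's correlation is noticeably lower.
x <- 1:100
y <- exp(x / 20)
cor(x, y, method = "pearson")   # less than 1; the relationship is not linear
cor(x, y, method = "spearman")  # exactly 1; the relationship is perfectly monotonic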
Now that we have examined the data, let’s move on to evaluating variable importance.
Variable Importance
Near Zero Variance
Zero-variance variables contain only a single unique value and thus provide no useful information to a model. Variables with near-zero variance offer very little, if any, information to a model. Furthermore, these variables can cause problems during resampling because there is a high probability that a given sample will contain only a single unique value (the dominant value) for that feature. A rule of thumb for detecting near-zero variance features is (Boehmke and Greenwell 2019):
- The fraction of unique values over the sample size is low (say \(\leq\) 10%).
- The ratio of the frequency of the most prevalent value to the frequency of the second most prevalent value is large (say \(\geq\) 20).
If both of these criteria are true then it is often advantageous to remove the variable from the model. For the Ozone data, we do not have any zero or near-zero variance variables.
# Find zero and near-zero variance variables using caret
library(caret)
preds <- colnames(inputData[, -grep("ozone_reading", colnames(inputData))])
nearZeroVar(x = inputData[, preds], saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## Month 1.000000 5.911330 FALSE FALSE
## Day_of_month 1.000000 15.270936 FALSE FALSE
## Day_of_week 1.046512 2.463054 FALSE FALSE
## pressure_height 1.000000 23.645320 FALSE FALSE
## Wind_speed 1.257143 5.418719 FALSE FALSE
## Humidity 2.272727 29.556650 FALSE FALSE
## Temperature_Sandburg 1.125000 28.571429 FALSE FALSE
## Temperature_ElMonte 1.000000 67.487685 FALSE FALSE
## Inversion_base_height 15.750000 61.576355 FALSE FALSE
## Pressure_gradient 1.000000 52.216749 FALSE FALSE
## Inversion_temperature 1.000000 71.428571 FALSE FALSE
## Visibility 1.190476 11.330049 FALSE FALSE
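To make the two criteria concrete, here is a minimal sketch of how the freqRatio and percentUnique metrics reported above could be computed by hand for a single predictor (Wind_speed is used purely as an example).
# Frequency ratio: most prevalent value relative to the second most prevalent
tab <- sort(table(inputData$Wind_speed), decreasing = TRUE)
tab[1] / tab[2]
# Percent unique: number of unique values relative to the sample size
100 * length(unique(inputData$Wind_speed)) / nrow(inputData)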
Maximal Information Coefficient, Rank Correlation, and pseudo \(R^2\)
The maximal information coefficient (MIC) is a measure of two-variable dependence designed specifically for rapid exploration of many-dimensional data sets (Reshef et al. 2018). MIC is part of a larger family of maximal information-based nonparametric exploration (MINE) statistics, which can be used to identify important relationships in datasets (Reshef et al. 2016). The nonparametric Spearman’s rank correlation coefficient can be used to measure the relationship between each predictor and the outcome. A third metric uses a pseudo \(R^2\)
statistic derived from the residuals of a locally weighted regression model (LOESS) fit to each predictor (Kuhn and Johnson 2016).
# Use the apply function to estimate correlations between the predictors and the outcome
corr_values <- apply(inputData[, preds],
MARGIN = 2,
FUN = function(x, y) cor(x, y,
method = 'spearman'),
y = inputData[, "ozone_reading"])
# The caret function filterVarImp with the nonpara = TRUE option (for
# nonparametric regression) creates a LOESS model for each predictor
# and quantifies the relationship with the outcome.
library(caret)
loess_results <- filterVarImp(
x = inputData[, preds],
y = inputData[, "ozone_reading"],
nonpara = TRUE
)
# The minerva package can be used to calculate the MIC statistics
# between the predictors and outcomes. The mine function computes
# several quantities including the MIC value.
library(minerva)
mine_values <- mine(
inputData[, preds],
inputData[, "ozone_reading"]
)
# Several statistics are calculated
names(mine_values)
## [1] "MIC" "MAS" "MEV" "MCN" "MICR2" "GMIC" "TIC"
mic_values <- cbind.data.frame(preds, round(mine_values$MIC, 4))
# Merge results
imout <- cbind.data.frame(
mic_values,
data.frame(corr_values),
loess_results
)
colnames(imout) <- c('Variable', 'MIC', 'Corr', 'R2')
imout[order(-imout$MIC), -1]
## MIC Corr R2
## Temperature_Sandburg 0.5972 0.796290240 0.662406921
## Temperature_ElMonte 0.5600 0.753758794 0.668355181
## Inversion_temperature 0.5234 0.720327814 0.568031585
## Visibility 0.4063 -0.499758768 0.226853234
## Inversion_base_height 0.4061 -0.532412005 0.336852964
## pressure_height 0.3898 0.600408565 0.444451196
## Humidity 0.3447 0.478489867 0.242830771
## Month 0.3315 -0.000260642 0.001951453
## Pressure_gradient 0.2658 0.204629227 0.170865783
## Day_of_month 0.2543 0.099574829 0.006995751
## Day_of_week 0.2173 -0.029226969 0.001406995
## Wind_speed 0.1590 0.094935337 0.006691008
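To clarify the idea behind the LOESS-based score, here is a rough sketch for a single predictor; it is not necessarily caret's exact internal computation, but it shows how a pseudo \(R^2\) can be derived from the residuals of a LOESS fit.
# Pseudo R^2 for one predictor: 1 - SSE/SST from a LOESS fit
# (assumes no missing values, consistent with the summary above)
x <- inputData$Temperature_ElMonte
y <- inputData$ozone_reading
loess_fit <- loess(y ~ x)
1 - sum(residuals(loess_fit)^2) / sum((y - mean(y))^2)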
Random Forest
Random forests are a modification of bagged decision trees that build a large collection of de-correlated trees to further improve predictive performance (James et al. 2017). Random forests can be very effective at finding a set of predictors that best explains the variance in the response variable. In the permutation-based importance measure, the out-of-bag (OOB) sample is passed down each tree and the prediction accuracy is recorded. The values of each variable are then randomly permuted (one variable at a time) and the accuracy is computed again. The decrease in accuracy that results from this random shuffling of feature values is averaged over all trees for each predictor. The variables with the largest average decrease in accuracy are considered most important.
library(party)
cf1 <- cforest(
ozone_reading ~ . , data = inputData,
control = cforest_unbiased(mtry = 2, ntree = 50)
)
# get variable importance, based on mean decrease in accuracy
data.frame(varimp = sort(varimp(cf1), decreasing = TRUE))
## varimp
## Temperature_ElMonte 18.4687775
## Inversion_temperature 11.8875902
## Temperature_Sandburg 8.4917985
## pressure_height 6.0274801
## Inversion_base_height 4.2490796
## Humidity 3.9881207
## Visibility 2.6436278
## Pressure_gradient 2.2242490
## Month 1.9831525
## Wind_speed 0.1498457
## Day_of_week -0.1392286
## Day_of_month -0.5737410
# conditional = TRUE adjusts for correlations between predictors
data.frame(
varimp = sort(varimp(cf1, conditional = TRUE),
decreasing = TRUE)
)
## varimp
## Temperature_ElMonte 4.42813373
## Inversion_temperature 2.49159236
## Temperature_Sandburg 1.94808306
## Inversion_base_height 1.24091707
## pressure_height 1.19544476
## Visibility 0.99478895
## Humidity 0.85732274
## Pressure_gradient 0.64274014
## Month 0.26635417
## Wind_speed -0.02340743
## Day_of_week -0.07643532
## Day_of_month -0.09119428
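As a conceptual illustration of the permutation idea, the following sketch computes a permutation-style importance for any fitted model; an ordinary linear model is used purely for simplicity, whereas varimp() performs the analogous calculation on the out-of-bag samples of each tree.
set.seed(1)
fit <- lm(ozone_reading ~ ., data = inputData)
base_mse <- mean((inputData$ozone_reading - predict(fit, inputData))^2)
perm_imp <- sapply(preds, function(p) {
  shuffled <- inputData
  shuffled[[p]] <- sample(shuffled[[p]])  # permute one predictor at a time
  perm_mse <- mean((inputData$ozone_reading - predict(fit, shuffled))^2)
  perm_mse - base_mse                     # increase in error after shuffling
})
sort(perm_imp, decreasing = TRUE)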
Relative Importance
Using calc.relimp in the relaimpo package, the relative importance of variables used in a linear regression model can be determined as a relative percentage. This function implements six different metrics for assessing relative importance of regressors in the model. The recommended metric is lmg, which is the \(R^2\)
contribution averaged over orderings among regressors (Lindeman, Merenda and Gold 1980; Chevan and Sutherland 1991).
library(relaimpo)
lmMod <- lm(ozone_reading ~ . , data = inputData)
# calculate relative importance scaled to 100
relImportance <- calc.relimp(lmMod, type = "lmg", rela = TRUE)
data.frame(relImportance = sort(relImportance$lmg, decreasing = TRUE))
## relImportance
## Temperature_ElMonte 0.2179985762
## Temperature_Sandburg 0.2006324013
## Inversion_temperature 0.1621225687
## pressure_height 0.1120464264
## Humidity 0.1011576835
## Inversion_base_height 0.0864158303
## Visibility 0.0614176817
## Pressure_gradient 0.0279614158
## Month 0.0205933505
## Wind_speed 0.0070146997
## Day_of_month 0.0022105122
## Day_of_week 0.0004288536
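To see what the lmg metric is doing, here is a small worked sketch with only two predictors: the contribution of a predictor is its increase in \(R^2\) averaged over all orderings in which it can enter the model (the variables chosen here are for illustration only).
r2 <- function(form) summary(lm(form, data = inputData))$r.squared
# Ordering 1: Temperature_ElMonte enters the model first
inc1 <- r2(ozone_reading ~ Temperature_ElMonte)
# Ordering 2: Temperature_ElMonte enters after Humidity
inc2 <- r2(ozone_reading ~ Humidity + Temperature_ElMonte) - r2(ozone_reading ~ Humidity)
# lmg contribution of Temperature_ElMonte in this two-predictor model
(inc1 + inc2) / 2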
Multivariate Adaptive Regression Splines
The earth package implements variable importance based on multivariate adaptive regression splines (MARS), which is an algorithm that extends linear models to capture nonlinear relationships (Friedman 1991; Friedman 1993). Variable importance is estimated using three criteria: generalized cross validation (GCV), the number of subset models that include the variable (nsubsets), and residual sum of squares (RSS).
library(earth)
marsModel <- earth(ozone_reading ~ ., data = inputData)
ev <- evimp(marsModel)
plot(ev, cex.axis = 0.8, cex.var = 0.8, cex.legend = 0.7)
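The importance table itself can also be printed; for each predictor retained by the pruning pass it lists, among other columns, the nsubsets, GCV, and RSS criteria described above.
print(ev)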
Stepwise Regression
Stepwise regression consists of iteratively adding and removing predictors from the predictive model in order to find the subset of variables that results in the best performing model (Hastie et al. 2016). The step function in base R uses the Akaike information criterion (AIC) to weigh the choices, which takes proper account of the number of parameters fit; at each step, the add or drop that most reduces the AIC is performed. The AIC rewards models that achieve a high goodness-of-fit score and penalizes them if they become overly complex. By itself, an AIC score is not of much use unless it is compared with the AIC score of a competing model. The model with the lower AIC score is expected to strike a better balance between goodness of fit and complexity, fitting the data well without overfitting it (Hastie et al. 2016).
# First fit the base (intercept only) model
base.mod <- lm(ozone_reading ~ 1 , data = inputData)
# Now run stepwise regression with all predictors using a hybrid
# stepwise-selection strategy that considers both forward and
# backward moves at each step, and select the best of the two.
all.mod <- lm(ozone_reading ~ . , data = inputData)
stepMod <- step(
base.mod, scope = list(lower = base.mod, upper = all.mod),
direction = "both", trace = 0, steps = 1000
)
# Get the shortlisted variables
shortlistedVars <- names(unlist(stepMod[[1]]))
# Remove intercept
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"]
data.frame(shortlistedVars = shortlistedVars)
## shortlistedVars
## 1 Temperature_Sandburg
## 2 Humidity
## 3 Temperature_ElMonte
## 4 Month
## 5 pressure_height
## 6 Inversion_base_height
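Because an AIC score is only meaningful in comparison, it is worth checking the selected model against the intercept-only and full models directly.
AIC(base.mod)  # intercept-only model
AIC(all.mod)   # model with all predictors
AIC(stepMod)   # model chosen by stepwise selection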
Boruta
The Boruta function is a wrapper algorithm capable of working with any classification method that outputs variable importance measures (Kursa and Rudnicki 2010). By default, Boruta uses random forests. The method performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random (estimated using permuted copies of the attributes), progressively eliminating irrelevant features to stabilize the test.
Boruta follows an all-relevant feature selection approach, in that it captures all features that are in some circumstances relevant to the outcome variable. In contrast, traditional feature selection algorithms follow a minimal-optimal approach, relying on a small subset of features that yields a minimal error for a chosen classifier.
library(Boruta)
borout <- Boruta(ozone_reading ~ ., data = na.omit(inputData),
doTrace = 2)
boruta_signif <-
names(borout$finalDecision[borout$finalDecision %in% c("Confirmed", "Tentative")])
data.frame(boruta_signif)
## boruta_signif
## 1 Month
## 2 pressure_height
## 3 Humidity
## 4 Temperature_Sandburg
## 5 Temperature_ElMonte
## 6 Inversion_base_height
## 7 Pressure_gradient
## 8 Inversion_temperature
## 9 Visibility
plot(
borout, cex.axis = 0.7, las = 2,
xlab = "", main = "Variable Importance"
)
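The full importance statistics and the final decision for each attribute can also be inspected in tabular form.
Boruta::attStats(borout)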
Information Value and Weight of Evidence
The InformationValue package provides convenient functions to compute weights of evidence (WOE) and information value (IV) for categorical variables. The WOE and IV approach evolved from logistic regression techniques. WOE provides a method of recoding a categorical x variable to a continuous variable. For each category of a categorical variable, the WOE is calculated as:
$$WOE = ln\left(\frac{percentage\ good\ of\ all\ goods}{percentage\ bad\ of\ all\ bads}\right)$$
IV is a measure of a categorical x variable's ability to accurately predict the goods and bads. For each category of x, the IV is computed as:
$$IV_{category} = \left(percentage\ good\ of\ all\ goods - percentage\ bad\ of\ all\ bads\right) \times WOE$$
The total IV of a variable is the sum of the IVs of its categories.
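The two definitions above can be written directly in base R. The following is a minimal sketch for a categorical predictor x and a binary outcome y coded 1 = good and 0 = bad; the small continuity correction eps is an assumption added to avoid taking the log of zero and is not part of the formulas above.
woe_iv <- function(x, y, eps = 0.5) {
  tab   <- table(x, y)
  goods <- (tab[, "1"] + eps) / sum(y == 1)  # percentage good of all goods
  bads  <- (tab[, "0"] + eps) / sum(y == 0)  # percentage bad of all bads
  woe   <- log(goods / bads)
  iv    <- sum((goods - bads) * woe)         # total IV: sum over the categories
  list(WOE = woe, IV = iv)
}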
As an example, let's use the adult data as imported below. The data has 32,561 observations and 14 features. This dataset is often used in machine learning to predict whether a person's income is higher or lower than $50k per year based on their attributes.
# Load data
library(InformationValue)
inputData <- read.csv("./data/adult.csv", header = TRUE)
str(inputData)
## 'data.frame': 32561 obs. of 14 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr " State-gov" " Self-emp-not-inc" " Private" " Private" ...
## $ education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
## $ education.1 : int 13 13 9 7 13 14 5 9 14 13 ...
## $ maritalstatus: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
## $ occupation : chr " Adm-clerical" " Exec-managerial" " Handlers-cleaners" " Handlers-cleaners" ...
## $ relationship : chr " Not-in-family" " Husband" " Not-in-family" " Husband" ...
## $ race : chr " White" " White" " White" " Black" ...
## $ sex : chr " Male" " Male" " Male" " Male" ...
## $ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ nativecountry: chr " United-States" " United-States" " United-States" " United-States" ...
## $ above50k : int 0 0 0 0 0 0 0 1 1 1 ...
We have a number of predictor variables, several of which are categorical. We will analyze the data using the WOE and IV functions in the InformationValue package.
factor_vars <- c (
"workclass", "education", "maritalstatus",
"occupation", "relationship", "race",
"sex", "nativecountry"
)
all_iv <- data.frame(
VARS = factor_vars,
IV = numeric(length(factor_vars)),
STRENGTH = character(length(factor_vars)),
stringsAsFactors = F
)
for (factor_var in factor_vars) {
all_iv[all_iv$VARS == factor_var, "IV"] <-
InformationValue::IV(
X = inputData[, factor_var],
Y = inputData$above50k
)
all_iv[all_iv$VARS == factor_var, "STRENGTH"] <-
attr(InformationValue::IV(
X = inputData[, factor_var],
Y = inputData$above50k),
"howgood"
)
}
# Sort
all_iv[order(-all_iv$IV), ]
## VARS IV STRENGTH
## 1 workclass 0 Not Predictive
## 2 education 0 Not Predictive
## 3 maritalstatus 0 Not Predictive
## 4 occupation 0 Not Predictive
## 5 relationship 0 Not Predictive
## 6 race 0 Not Predictive
## 7 sex 0 Not Predictive
## 8 nativecountry 0 Not Predictive
Conclusion
Several of the techniques presented here evaluate each predictor individually, without considering the others. Although such univariate methods are common for sifting through large amounts of data to rank predictors, they come with a couple of caveats. First, if two predictors are highly correlated with the response and with each other, then the univariate approach will identify both as important. If the desire is to develop a predictive model, highly correlated predictors can negatively impact some models. Second, the univariate importance approach will fail to identify groups of predictors that together have a strong relationship with the response. For example, two predictors may not be highly correlated with the response individually; however, their interaction may be. Univariate correlations will not capture this predictive relationship.
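A small simulated example makes the second caveat concrete: neither predictor is correlated with the response on its own, yet their interaction explains nearly all of its variance.
set.seed(42)
x1 <- rnorm(500)
x2 <- rnorm(500)
y  <- x1 * x2 + rnorm(500, sd = 0.1)
cor(x1, y)                           # near zero
cor(x2, y)                           # near zero
summary(lm(y ~ x1 * x2))$r.squared   # close to 1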
References
Boehmke, B. and B.M. Greenwell. 2019. Hands-On Machine Learning with R. CRC Press, Boca Raton.
Chevan, A. and M. Sutherland. 1991. Hierarchical partitioning. The American Statistician, 45, 90-96.
Friedman, J.H. 1991. Multivariate adaptive regression splines. The Annals of Statistics, 19, 1-141.
Friedman, J.H. 1993. Fast MARS. Stanford University Department of Statistics, Technical Report 110, May.
Guyon, I. and A. Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
Hastie, T., R. Tibshirani, and J. Friedman. 2016. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer, New York.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2017. An Introduction to Statistical Learning. Springer, New York.
Kuhn, M. and K. Johnson. 2016. Applied Predictive Modeling. Springer, New York.
Kursa, M.B. and W.R. Rudnicki. 2010. Feature selection with the Boruta package. Journal of Statistical Software, 36, 1-13.
Lindeman, R.H., P.F. Merenda, and R.Z. Gold. 1980. Introduction to Bivariate and Multivariate Analysis, Glenview IL: Scott, Foresman.
Reshef, Y.A., D. N. Reshef, H. K. Finucane, P. C. Sabeti, and M. Mitzenmacher. 2016. Measuring Dependence Powerfully and Equitably. Journal of Machine Learning Research, 17, 1-63.
Reshef, D.N., Y.A. Reshef, P.C. Sabeti, and M. Mitzenmacher. 2018. An empirical study of the maximal and total information coefficients and leading measures of dependence. Annals of Applied Statistics, 12, 123-155.