How to perform feature selection (i.e. pick important variables) using Boruta Package in R? (2024)

Introduction

Variable selection is an important aspect of model building which every analyst must learn. After all, it helps in building predictive models free from correlated variables, biases and unwanted noise.

A lot of novice analysts assume that keeping all (or more) variables will result in the best model as you are not losing any information. Sadly, that is not true!

How many times has removing a variable from a model increased your model accuracy?

At least, it has happened to me. Such variables are often found to be correlated and hinder achieving higher model accuracy. Today, we’ll learn one of the ways to get rid of such variables in R. I must say, R has an incredible CRAN repository. Among all its packages, one available for variable selection is the Boruta package.

In this article, we’ll focus on understanding the theory and practical aspects of using the Boruta package. I’ve followed a step-wise approach to help you understand better.

I’ve also drawn a comparison of Boruta with other traditional feature selection algorithms. Using this, you can arrive at a more meaningful set of features which can pave the way for a robust prediction model. The terms “features”, “variables” and “attributes” have been used interchangeably, so don’t get confused!

What is the Boruta algorithm and why such a strange name?

Boruta is a feature selection algorithm. More precisely, it works as a wrapper algorithm around Random Forest. The package derives its name from a demon in Slavic mythology who dwelled in pine forests.

We know that feature selection is a crucial step in predictive modeling. It becomes especially important when the data set given for model building contains many variables.

Boruta can be your algorithm of choice to deal with such data sets, particularly when you are interested in understanding the mechanisms related to the variable of interest, rather than just building a black box predictive model with good prediction accuracy.

How does it work?

Below is the step-wise working of the Boruta algorithm:

Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features).
Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature, where higher means more important.
At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and constantly removes features which are deemed highly unimportant.
Finally, the algorithm stops either when all features get confirmed or rejected, or when it reaches a specified limit of random forest runs.
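
The first step above can be sketched in base R. This is only an illustration of how shadow features can be built, not the package’s internal code; the built-in iris data and the names `features`, `shadows` and `extended` are assumptions chosen for the example.

```r
# Illustrative sketch: build shadow features by shuffling each column.
set.seed(42)                      # make the shuffles reproducible

features <- iris[, 1:4]           # the real features (built-in data)

# Permute every column independently. Each column keeps its own values,
# but any link between the column and the outcome is broken.
shadows <- as.data.frame(lapply(features, sample))
names(shadows) <- paste0("shadow_", names(features))

# The extended data set holds the real features next to their shadows.
extended <- cbind(features, shadows)
dim(extended)
```

Here `extended` has the original 150 rows and 8 columns (4 real features plus 4 shadows). A random forest trained on such a table gives every real feature a set of shuffled rivals to beat.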

What makes it different from traditional feature selection algorithms?

Boruta follows an all-relevant feature selection method where it captures all features which are in some circumstances relevant to the outcome variable. In contrast, most of the traditional feature selection algorithms follow a minimal optimal method where they rely on a small subset of features which yields a minimal error on a chosen classifier.

While fitting a random forest model on a data set, you can recursively get rid of features in each iteration which didn’t perform well in the process. This will eventually lead to a minimal optimal subset of features as the method minimizes the error of the random forest model. This happens by selecting an over-pruned version of the input data set, which in turn throws away some relevant features.

On the other hand, Boruta finds all features which are either strongly or weakly relevant to the decision variable. This makes it well suited for biomedical applications where one might be interested in determining which human genes (features) are connected in some way to a particular medical condition (target variable).

Boruta in Action in R (Practical)

So far, we have covered the theoretical aspects of the Boruta package. But that isn’t enough. The real challenge starts now. Let’s learn to implement this package in R.

First things first. Let’s install and call this package for use.

> install.packages("Boruta")
> library(Boruta)

Now, we’ll load the data set. For this tutorial I’ve taken the data set from the Practice Problem Loan Prediction.

> setwd("../Data/Loan_Prediction")
> traindata <- read.csv("train.csv", header = TRUE, stringsAsFactors = FALSE)

Let’s have a look at the data.

> str(traindata)
> names(traindata) <- gsub("_", "", names(traindata))

The gsub() function replaces every match of a pattern with another string. In this case, I’ve replaced the underscore (_) with an empty string (“”).
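
As a quick illustration of that call, here is gsub() applied to a handful of column names. The three names in `old_names` are illustrative stand-ins rather than the full header of the loan data.

```r
# Remove underscores from a few example column names.
old_names <- c("Loan_ID", "Credit_History", "Loan_Status")

# gsub(pattern, replacement, x) replaces every match in each element of x.
new_names <- gsub("_", "", old_names)
print(new_names)
```

The result holds "LoanID", "CreditHistory" and "LoanStatus", which is why the variable appears as CreditHistory later in this article.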

Let’s check if this data set has missing values.

> summary(traindata)

We find that many variables have missing values. It’s important to treat missing values before running Boruta. Moreover, this data set also has blank values. Let’s clean this data set.
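
One simple way to find and handle both kinds of gaps is sketched below. It assumes that dropping incomplete rows is acceptable, which is a quick fix rather than a recommended imputation strategy. The data frame `toy` and its values are made up for illustration and stand in for `traindata`.

```r
# Hypothetical toy data standing in for traindata.
toy <- data.frame(
  Gender     = c("Male", "", "Female", "Male"),
  LoanAmount = c(120, 100, NA, 141),
  stringsAsFactors = FALSE
)

# Turn blank strings into NA so both kinds of gaps are treated alike.
toy[toy == ""] <- NA

# Count missing values per column.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Keep only the rows that have no missing values.
toy_clean <- toy[complete.cases(toy), ]
nrow(toy_clean)
```

In this toy example each column has one gap, so two of the four rows survive. On real data, check how many rows are lost before settling for this approach.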

Boruta vs Traditional Feature Selection Algorithm

So far, we have learnt about the concept and the steps to implement the Boruta package in R.

What if we used a traditional feature selection algorithm such as recursive feature elimination on the same data set? Do we end up with the same set of important features? Let us find out.

Now, we’ll learn the steps used to implement recursive feature elimination (RFE). In R, the RFE algorithm can be implemented using the caret package.

Let’s start by defining a control function to be used with RFE algorithm. We’ll load the required libraries:

> library(caret)
> library(randomForest)
> set.seed(123)
> control <- rfeControl(functions=rfFuncs, method="cv", number=10)

Here we have specified a random forest selection function through the rfFuncs option (random forest is also the underlying algorithm in Boruta).

Let’s implement the RFE algorithm now.

> rfe.train <- rfe(traindata[,2:12], traindata[,13], sizes=1:12, rfeControl=control)

I’m sure this is self-explanatory. traindata[,2:12] selects all independent variables and leaves out the ID variable. traindata[,13] selects only the dependent variable. It might take some time to run.
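
For readers without the loan data at hand, the same two calls can be tried on R’s built-in iris data. This is a minimal sketch, assuming the caret and randomForest packages are installed. The object names `ctrl` and `rfe_fit`, the choice of iris and the 5-fold setting are assumptions made for illustration and are not part of the original analysis.

```r
library(caret)
library(randomForest)

set.seed(123)                         # resampling is random, so fix the seed

# Random forest based selection functions with 5-fold cross-validation.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

# x holds the predictors, y the outcome; sizes lists the subset sizes to try.
rfe_fit <- rfe(x = iris[, 1:4], y = iris$Species,
               sizes = 1:4, rfeControl = ctrl)

rfe_fit$results                       # resampled performance per subset size
predictors(rfe_fit)                   # names of the selected features
```

The exact subset that gets selected depends on the seed and the resampling, so treat the output as something to inspect rather than a fixed answer.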

We can also check the outcome of this algorithm.

> rfe.train

Recursive feature selection
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:

Variables  Accuracy   Kappa  AccuracySD  KappaSD  Selected
        1    0.8083  0.4702     0.03810   0.1157         *
        2    0.8041  0.4612     0.03575   0.1099
        3    0.8021  0.4569     0.04201   0.1240
        4    0.7896  0.4378     0.03991   0.1249
        5    0.7978  0.4577     0.04557   0.1348
        6    0.7957  0.4471     0.04422   0.1315
        7    0.8061  0.4754     0.04230   0.1297
        8    0.8083  0.4767     0.04055   0.1203
        9    0.7897  0.4362     0.05044   0.1464
       10    0.7918  0.4453     0.05549   0.1564
       11    0.8041  0.4751     0.04419   0.1336

The top 1 variables (out of 1):
CreditHistory

This algorithm gives the highest weightage to Credit History. Now, we’ll plot the result of the RFE algorithm to see how the cross-validated accuracy changes with the number of variables.

> plot(rfe.train, type=c("g", "o"), cex = 1.0, col = 1:11)

Let’s extract the chosen features. I am confident it would result in Credit History.

> predictors(rfe.train)
[1] "CreditHistory"

Hence, we see that the recursive feature elimination algorithm has selected “CreditHistory” as the only important feature among the 11 features in the dataset.

Compared with this traditional feature selection algorithm, Boruta returned a much better result of variable importance, which was easy to interpret as well! I find it awesome to work in R, where one has access to so many amazing packages. I’m sure there are many other packages for feature selection. I’d love to read about them.

End notes

Boruta is an easy-to-use package as there aren’t many parameters to tune or remember. You shouldn’t use a data set with missing values to check important variables using Boruta. It will throw errors. You can use this algorithm on any classification or regression problem at hand to come up with a subset of meaningful features.
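
To give a feel for how little there is to set, here is a minimal sketch of a Boruta run on R’s built-in iris data, assuming the Boruta package is installed. The data set, the object names `boruta_fit`, `selected` and `stats`, and the value given to maxRuns are assumptions made for illustration, not the loan analysis from this article.

```r
library(Boruta)

set.seed(123)                         # Boruta is stochastic, so fix the seed

# Boruta needs complete data; iris has no missing values.
stopifnot(!anyNA(iris))

# maxRuns caps the number of random forest runs; doTrace = 0 keeps it quiet.
boruta_fit <- Boruta(Species ~ ., data = iris, maxRuns = 50, doTrace = 0)

print(boruta_fit)                     # summary of the decisions

# Names of the confirmed features (tentative ones are left out here).
selected <- getSelectedAttributes(boruta_fit, withTentative = FALSE)

# Importance statistics and the final decision for each feature.
stats <- attStats(boruta_fit)
```

If some attributes are still tentative when the run limit is reached, the package’s TentativeRoughFix() function can be applied to the result to force a decision on them.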

In this article, I’ve used a quick method to handle missing values because the scope of this article was to understand Boruta (theory and practice). I’d suggest that you use more advanced methods of missing value imputation. After all, the information available in the data is all we look for! Keep going.

Did you like reading this article? What other methods of variable selection do you use? Do share your suggestions and opinions in the comments section below.

About the Author

Debarati Dutta is an MA Econometrics graduate from the University of Madras. She has more than 3 years of experience in data analytics and predictive modeling across multiple domains. She has worked in companies such as Amazon, Antuit and Netlink. Currently, she’s based out of Montreal, Canada.

Debarati is the first winner of Blogathon. She won an Amazon voucher worth INR 5000.


guest_blog, 26 Aug, 2021