Himani Gadve

Feature selection using Python

Updated: Aug 22, 2021

Datasets are sometimes small, but often they are tremendously large. Very large datasets are challenging to process, and beyond a certain size they cause a processing bottleneck.

What makes these datasets so large? Often, it's the features: the more features, the larger the dataset. Not always, though; some datasets have a very large number of features but relatively few instances. That is not the point of discussion here. So, you might wonder how to process these types of datasets on a commodity computer without beating around the bush.

Often, a high dimensional dataset contains some entirely irrelevant or insignificant features. These features typically contribute less to predictive modeling than the critical features, and they may contribute nothing at all. They cause a number of problems that get in the way of efficient predictive modeling:

  • Unnecessary resource allocation for these features.

  • These features act as noise, which can make the machine learning model perform very poorly.

  • The machine learning model takes more time to train.

So, what's the solution here? The most economical solution is Feature Selection.

Feature Selection is the process of selecting the most significant features from a given dataset. In many cases, Feature Selection can also enhance the performance of a machine learning model.

Sounds interesting, right?

You have now had an informal introduction to Feature Selection and its importance in the world of Data Science and Machine Learning. This post covers:

  • Introduction to feature selection and understanding its importance

  • Difference between feature selection and dimensionality reduction

  • Different types of feature selection methods

  • Implementation of different feature selection methods with scikit-learn

Introduction to feature selection

Feature selection is also known as Variable selection or Attribute selection.

Essentially, it is the process of selecting the most important/relevant features of a dataset.

Understanding the importance of feature selection

The importance of feature selection is best recognized when you are dealing with a dataset that contains a vast number of features. This type of dataset is often referred to as a high dimensional dataset. High dimensionality brings several problems: it significantly increases the training time of your machine learning model, and it can make the model very complicated, which in turn may lead to Overfitting.

Often, a high dimensional feature set contains several redundant features, meaning features that are nothing but extensions of other essential features. These redundant features do not contribute effectively to model training either. So, clearly, there is a need to extract the most important and most relevant features of a dataset in order to get the most effective predictive modeling performance.

"The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."

Let's understand the difference between dimensionality reduction and feature selection.

Sometimes, feature selection is mistaken for dimensionality reduction, but they are different. Both methods reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes (sometimes known as feature transformation), whereas feature selection methods include and exclude attributes present in the data without changing them.

Some examples of dimensionality reduction methods are Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis, etc.
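To make the difference concrete, here is a minimal sketch on synthetic data. The dataset, the choice of SelectKBest with f_regression as the selection method, and the number of columns kept (3) are all assumptions made for illustration; none of them come from the case study later in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): 200 rows, 10 features,
# only 3 of which actually drive the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# Feature selection: keep 3 of the original columns, values unchanged.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Dimensionality reduction: build 3 new columns, each one a linear
# combination of all 10 original features.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print("Indices of the selected original columns:", kept)
print("Selected columns are untouched copies of the originals:",
      np.array_equal(X_selected, X[:, kept]))
print("PCA loadings shape (new components x original features):",
      pca.components_.shape)
```

Both outputs have 3 columns, but only the selected ones can still be read as the original attributes. Each PCA column mixes all 10 inputs, which is why it is harder to interpret.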

Let me summarize the importance of feature selection for you:

  • It enables the machine learning algorithm to train faster.

  • It reduces the complexity of a model and makes it easier to interpret.

  • It can improve the accuracy of a model if the right subset is chosen.

  • It can reduce Overfitting.

Let me explain one general family of feature selection methods: Embedded methods

Embedded methods

Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract the features that contribute the most to the training for that iteration. Regularization methods are the most commonly used embedded methods; they penalize a feature given a coefficient threshold.

This is why Regularization methods are also called penalization methods: they introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients).

Examples of regularization algorithms are the LASSO, Elastic Net, Ridge Regression, etc.
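As a rough illustration of how a penalty leads to "fewer coefficients", the sketch below fits LASSO and Ridge Regression on the same synthetic data and counts the non-zero coefficients. The data and the penalty strength (alpha=5.0) are assumptions picked for the demo, not tuned values. LASSO's L1 penalty can set coefficients exactly to zero, which is what makes it usable for feature selection, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features,
# only 5 of which are informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# Penalties act on coefficient size, so put all features on one scale first.
X = StandardScaler().fit_transform(X)

# Same penalty strength for both models; alpha=5.0 is an arbitrary demo value.
lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Features whose coefficient is non-zero are the ones each model "keeps".
lasso_kept = np.flatnonzero(lasso.coef_)
ridge_kept = np.flatnonzero(ridge.coef_)

print("Features kept by LASSO:", len(lasso_kept), "of", X.shape[1])
print("Features kept by Ridge:", len(ridge_kept), "of", X.shape[1])
```

In practice the penalty strength is usually chosen by cross-validation rather than fixed by hand.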

Now, let's see some traps that you may get into while performing feature selection:

For this case study, we will be analyzing crime rate data from different communities. The original data set can be found on the UCI machine learning data repository. This data set consists of many attributes of different communities, such as household size, percentage of different racial groups, number of police officers, etc. The goal is to use these attributes to predict the total number of non-violent crimes (per 100k population). The original data set consists of 2215 observations, 129 predictors and 18 different response variables. The set has been cleaned up (including removing predictors with missing values and removing the unnecessary response variables). The final data set that you will use (community.csv) consists of 2118 observations and 101 predictors + 1 response (total number of non-violent crimes).

The code and the step by step approach can be found in my project.
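If you just want a feel for the workflow, here is a minimal sketch of embedded feature selection on a table of predictors plus one response. It is not the project code: the helper select_features, the use of LassoCV inside SelectFromModel, and the stand-in data with invented column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_features(df, response):
    """Return the predictor names that a cross-validated LASSO keeps.

    Hypothetical helper for illustration; `response` is the name of
    the target column in `df`.
    """
    X = df.drop(columns=[response])
    y = df[response]
    pipeline = make_pipeline(
        StandardScaler(),                                # same scale for all predictors
        SelectFromModel(LassoCV(cv=5, random_state=0)),  # keep non-zero coefficients
    )
    pipeline.fit(X, y)
    mask = pipeline.named_steps["selectfrommodel"].get_support()
    return list(X.columns[mask])


# Stand-in data, NOT the real community.csv: the column names are invented
# and only predictor_0 and predictor_3 influence the response.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 6)),
                    columns=[f"predictor_{i}" for i in range(6)])
demo["response"] = (4.0 * demo["predictor_0"]
                    - 3.0 * demo["predictor_3"]
                    + rng.normal(scale=0.5, size=500))

selected = select_features(demo, "response")
print("Predictors kept:", selected)
```

With the real file, you would load it with pd.read_csv("community.csv") and pass the actual name of the response column, which this post does not state. A cross-validated LASSO may keep a few weak predictors alongside the strong ones, so treat the result as a shortlist and not as a final answer.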
