Correlation for feature selection

One of the first tasks in building an ML model is to figure out which features in the feature set matter most. This helps cut down the number of features in datasets that have thousands of them. For datasets with fewer than a hundred features, feature selection still helps by reducing the size of the final model, which matters if the model has to be held in memory for real-time use.

There are many approaches to feature selection, and some of them, such as those based on information gain, are fairly involved. A simpler approach is to use correlation: a good feature subset contains features that are highly correlated with the class being predicted, yet uncorrelated with each other.
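As a rough illustration of that idea, here is a minimal sketch of a greedy filter. It assumes Pearson correlation (described below) as the measure, and the function name `select_features`, the two thresholds, and the toy data are all hypothetical choices made for this example, not part of any library.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, relevance_min=0.3, redundancy_max=0.8):
    """Greedy filter: keep features correlated with the target
    but not strongly correlated with an already selected feature.

    features: dict mapping feature name -> list of values
    target:   list of class labels encoded as numbers
    The two thresholds are illustrative assumptions.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    # Most relevant features are considered first.
    ranked = sorted(relevance, key=relevance.get, reverse=True)

    selected = []
    for name in ranked:
        if relevance[name] < relevance_min:
            break  # everything after this is even less relevant
        redundant = any(
            abs(pearson(features[name], features[kept])) > redundancy_max
            for kept in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Toy data: "b" nearly duplicates "a", "c" is mostly noise.
target = [0, 0, 1, 1, 1, 0]
features = {
    "a": [1, 2, 8, 9, 7, 1],
    "b": [1, 3, 8, 8, 7, 2],
    "c": [5, 1, 4, 2, 5, 3],
    "d": [0, 1, 1, 0, 1, 0],
}
selected = select_features(features, target)
print(selected)
```

With this toy data, "b" is dropped because it is almost a copy of "a", and "c" is dropped because it barely tracks the target. The thresholds would need tuning on real data.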

One statistic used to measure correlation is the Pearson correlation coefficient. It takes a value between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is computed as the covariance of the two variables divided by the product of their standard deviations.
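The definition translates almost directly into code. The following is a small sketch using only the standard library. The function name `pearson_r` is made up for this example, and it uses population covariance and standard deviation (the 1/n factors cancel, so the sample versions give the same result).

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)

    if std_x == 0 or std_y == 0:
        # A constant variable has no spread, so the coefficient is undefined.
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (std_x * std_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly increasing together
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # one rises as the other falls
```

The first call gives a value of 1 (up to floating-point rounding) and the second gives −1, matching the two extremes described above.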
