Correlation for feature selection
One of the first tasks in building an ML model is working out which features in the feature set matter most. This helps cut down data sets that have thousands of features. For data sets with fewer than a hundred features, feature selection still helps reduce the final model size, which matters if the model has to be held in memory for real-time use.
There are many approaches to feature selection, and some of them, such as using information gain, are fairly involved. A simpler approach is to use correlation: a good feature subset contains features that are highly correlated with the class label, yet uncorrelated with each other.
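The idea of "relevant to the label, but not redundant with each other" can be sketched as a greedy filter. This is only an illustration, not a standard algorithm from any library: the function name `select_features`, the `max_redundancy` threshold of 0.8, and the synthetic data are all assumptions made for the example, and it uses NumPy's `np.corrcoef` (Pearson correlation) as the correlation measure.

```python
import numpy as np

def select_features(X, y, k, max_redundancy=0.8):
    """Greedily pick up to k columns of X that correlate strongly with y
    but weakly with the columns already chosen (hypothetical helper)."""
    n_features = X.shape[1]
    # Absolute correlation of each feature with the label.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Absolute correlation between every pair of features.
    redundancy = np.abs(np.corrcoef(X, rowvar=False))
    selected = []
    for j in np.argsort(-relevance):  # most relevant feature first
        # Keep the feature only if it is not too similar to one already kept.
        if all(redundancy[j, s] < max_redundancy for s in selected):
            selected.append(int(j))
        if len(selected) == k:
            break
    return selected

# Synthetic data (an assumption for the demo): column 1 is a near copy of
# column 0, column 2 is independent and informative, column 3 is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), b, rng.normal(size=200)])
y = (a + b > 0).astype(float)  # binary class label

selected = select_features(X, y, k=2)
print(selected)
```

On this data the filter should keep only one of the two near-duplicate columns together with the independent informative column. The 0.8 cut-off is arbitrary and would need tuning on real data; constant columns, whose correlation is undefined, are not handled here.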
One statistic used to measure correlation is the Pearson correlation coefficient. Its value lies between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is computed as the covariance of the two variables divided by the product of their standard deviations.
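To make the definition concrete, here is a minimal sketch that computes the coefficient directly from the covariance and the standard deviations. The helper name `pearson` and the toy vectors are assumptions for illustration; in practice NumPy's `np.corrcoef` gives the same value.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors in the covariance and the standard deviations cancel,
    # so sums of products are enough.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson(x, [2.0, 4.0, 6.0, 8.0, 10.0])   # y rises linearly with x
r_neg = pearson(x, [10.0, 8.0, 6.0, 4.0, 2.0])   # y falls linearly with x
r_zero = pearson(x, [2.0, 1.0, 0.0, 1.0, 2.0])   # symmetric V shape, no linear trend

print(r_pos, r_neg, r_zero)
```

The three toy cases correspond to the two extremes and the midpoint of the range. Note that the V-shaped case has a clear relationship between the variables even though its linear correlation is zero, which is why the coefficient is described as a measure of linear correlation only.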