Description
The curse of imbalanced data sets
The curse of imbalanced data set refers to data sets in which the number of samples in one class is less than in others. This issue is often encountered in real world data sets such as medical imaging applications (e.g. cancer detection), fraud detection, etc. In such particular condition, machine learning algorithms learn sub-optimal models which will generally favor the class having the largest number of samples.
In this talk, we will present the 'imbalanced-learn' package which implement some of the state-of-the-art algorithms, tackling the class imbalance problem.
A scikit-learn-contrib project
scikit-learn includes a tremendous set of pre-processing methods (i.e. transformers, standardizers, etc.) to optimally train machine learning algorithms. However, there is currently no estimators to reduce or generate samples. Therefore, the imbalanced-learn provides a new type of estimator, named sampler, aiming at resampling a data set whenever it is desired. The samplers are fully compatible with the current scikit-learn API and are composed of the following main methods inspired from scikit-learn: (i) fit, (ii) sample, and (iii) fit_sample. Additionally, a class Pipeline is inherited from scikit-learn, permitting to incorporate samplers in the usual classification pipeline. During the talk, we will also present the key parameters, shared by all the samplers.
A data science perspective
Regarding the data science aspect of this talk, we will highlight the distinctive characteristics of the different algorithms: (i) over-sampling, (ii) controlled under-sampling, (iii) cleaning under-sampling, (iv) combination of over-sampling and cleaning under-sampling, and (v) ensemble sampler.
Concrete examples
In addition, we will briefly present a couple of examples in which the package has been used on real-world data sets.
Perspectives
Our package is still under heavy development and we are aiming at improving the following points:
- Speed optimization through benchmarking and profiling.
- Quantitative classification performance benchmarking.
- Additional algorithms (categorical features, ...)