A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
Last update: Jul 1, 2022
imbalanced-learn
imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
Documentation
Installation documentation, API documentation, and examples can be found in the documentation.
Installation
Dependencies
imbalanced-learn is tested to work under Python 3.6+. The dependency requirements are based on the latest scikit-learn release:
scipy(>=0.19.1)
numpy(>=1.13.3)
scikit-learn(>=0.23)
joblib(>=0.11)
keras 2 (optional)
tensorflow (optional)
Additionally, to run the examples, you need matplotlib(>=2.0.0) and pandas(>=0.22).
Installation
From PyPI or conda-forge repositories
imbalanced-learn is currently available on PyPI and you can install it via pip:
pip install -U imbalanced-learn
The package is also released on the Anaconda Cloud platform:
conda install -c conda-forge imbalanced-learn
From source available on GitHub
If you prefer, you can clone the repository and run the setup.py file. Use the following commands to get a copy from GitHub and install all dependencies:
git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git
cd imbalanced-learn
pip install .
Be aware that you can install it in developer mode with:
pip install --no-build-isolation --editable .
If you wish to make pull requests on GitHub, we advise you to install pre-commit:
pip install pre-commit
pre-commit install
Testing
After installation, you can run the test suite (which uses pytest) with:
make coverage
Development
The development of this scikit-learn-contrib project is in line with that of the scikit-learn community. Therefore, you can refer to their Development Guide.
About
If you use imbalanced-learn in a scientific publication, we would appreciate citations to the following paper:
@article{JMLR:v18:16-365,
author = {Guillaume Lema{{\^i}}tre and Fernando Nogueira and Christos K. Aridas},
title = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
journal = {Journal of Machine Learning Research},
year = {2017},
volume = {18},
number = {17},
pages = {1-5},
url = {http://jmlr.org/papers/v18/16-365}
}
Most classification algorithms only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more majority classes, have proven to be a challenge while at the same time becoming more and more common.
One way of addressing this issue is by re-sampling the dataset so as to offset the imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise.
Re-sampling techniques are divided into four categories:
Under-sampling the majority class(es).
Over-sampling the minority class.
Combining over- and under-sampling.
Create ensemble balanced sets.
Below are the references for the methods currently implemented in this module.
I. Mani, J. Zhang, "kNN approach to unbalanced data distributions: A case study involving information extraction," In Proceedings of the Workshop on Learning from Imbalanced Data Sets, pp. 1-7, 2003.
M. Kubat, S. Matwin, "Addressing the curse of imbalanced training sets: One-sided selection," In Proceedings of the 14th International Conference on Machine Learning, vol. 97, pp. 179-186, 1997.
J. Laurikkala, "Improving identification of difficult small classes by balancing class distribution," In Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, pp. 63-66, 2001.
D. Wilson, "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data," IEEE Transactions on Systems, Man, and Cybernetics, vol. 2(3), pp. 408-421, 1972.
M. R. Smith, T. Martinez, C. Giraud-Carrier, "An instance level analysis of data complexity," Machine Learning, vol. 95(2), pp. 225-256, 2014.
N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
H. Han, W.-Y. Wang, B.-H. Mao, "Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning," In Proceedings of the 1st International Conference on Intelligent Computing, pp. 878-887, 2005.
H. M. Nguyen, E. W. Cooper, K. Kamei, "Borderline over-sampling for imbalanced data classification," In Proceedings of the 5th International Workshop on Computational Intelligence and Applications, pp. 24-29, 2009.
G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," ACM SIGKDD Explorations Newsletter, vol. 6(1), pp. 20-29, 2004.
G. E. A. P. A. Batista, A. L. C. Bazzan, M. C. Monard, "Balancing training data for automated annotation of keywords: A case study," In Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pp. 10-18, 2003.
X.-Y. Liu, J. Wu, Z.-H. Zhou, "Exploratory undersampling for class-imbalance learning," IEEE Transactions on Systems, Man, and Cybernetics, vol. 39(2), pp. 539-550, 2009.
I. Tomek, "An experiment with the edited nearest-neighbor rule," IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, 1976.
H. He, Y. Bai, E. A. Garcia, S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," In Proceedings of the 5th IEEE International Joint Conference on Neural Networks, pp. 1322-1328, 2008.
C. Seiffert, T. M. Khoshgoftaar, J. Van Hulse, A. Napolitano, "RUSBoost: A hybrid approach to alleviating class imbalance," IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 40(1), pp. 185-197, 2010.
I have a dataset with around 150,000 entries. SMOTE sampling seems to be pretty slow, as only a single core is used to perform the calculations.
Am I missing a configuration property? How else could I improve the speed of SMOTE?
Issues using SMOTE
Hi
First of all, thank you for providing us with this nice library.
I have an imbalanced dataset and I've loaded it using pandas.
When I supply the dataset as input to SMOTE, I get the following error:
[BUG] SMOTEENN and SMOTETomek run for ages on larger datasets on the new update
I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 minutes. After updating, the new version ran for 1.5 hours before I killed the process.
The density estimation function has been changed slightly from the reference paper, as the power term yielded very large numbers. This caused the weighting to favour a single cluster.
[MRG] Address issue #113 - Create toy example for testing
Address issue #113
Over-sampling
[x] ADASYN
[x] SMOTE
[x] ROS
Under-sampling
[x] CC
[x] CNN
[x] ENN
[x] RENN => PR #135 needs to be merged before writing this code
[x] AllKNN => PR #136 needs to be merged before writing this code
[x] IHT
[x] NearMiss
[x] OSS
[x] RUS
[x] Tomek
Combine
[x] SMOTE ENN
[x] SMOTE Tomek
Ensemble
[x] Easy Ensemble => PR #117 needs to be merged before writing this code
[x] Balance Cascade
[MRG+1] Rename all occurrences of size_ngh to n_neighbors for consistency with scikit-learn
For consistency reasons, I think that we should follow scikit-learn's conventions when naming parameters.
I propose changing the size_ngh parameter to n_neighbors. Unfortunately, this change will have an impact on the public API. It is an early modification, but it will break users' code. I don't know if we can merge this change without a deprecation warning.
What does this implement/fix? Explain your changes.
Integrating black into the codebase to keep the code format consistent.
[x] Integrate black
[x] Run black over all files
[x] Add black into precommit hook
Any other comments?
Open questions -
Which requirements file should the black dependency be added to?
black's line-length is currently set to 79. Is that alright?
conda install version 0.3.0
I used
conda install -c glemaitre imbalanced-learn
to install imbalanced-learn. Instead of getting version 0.3.0, I have the older version:
#
imbalanced-learn 0.2.1 py27_0 glemaitre
How do I install version 0.3.0 via conda install?
ValueError: could not convert string to float: 'aaa'
I have imbalanced classes with 10,000 1s and 10 million 0s. I want to undersample before I convert category columns to dummies, to save memory. I expected it to ignore the content of X and randomly select rows based on y. However, I get the above error. What am I not understanding, and how do I do this without converting category features to dummies first?
`ratio` should allow to specify which class to target when resampling
TomekLinks and EditedNearestNeighbours only remove samples from the majority class. However, both methods are often used for data cleaning (removing samples from both classes) rather than for undersampling (removing samples only from the majority class). Thus SMOTETomek and SMOTEENN are not implemented as proposed by Batista, Prati and Monard (2004), who use TomekLinks and ENN to remove samples from both the majority and the minority class.
ENH: implementation of SMOTE-NC for continuous and categorical mixed types
Reference Issue
#401
What does this implement/fix? Explain your changes.
Implements SMOTE-NC as per section 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer.
Any other comments?
Some parts are missing to make it ready to merge, but I would like to get an opinion on implementation first, especially on the part which deals with sparse matrices as I do not have much experience with them.
Points to pay attention to:
working with sparse matrices
2 FIXME points in code
The 'fit' method expects a 'feature_indices' keyword argument and issues a warning if it is not provided, falling back to plain SMOTE. Raising an error would probably be better, but this would break the common estimator tests from sklearn (via imblearn/tests/test_common).
Question: Generation of synthetic samples with SMOTE
Hi,
I have a question regarding the generation of synthetic samples via SMOTE.
The comments in the source code state that a new sample is generated in the following manner:
s_{s} = s_{i} + u(0, 1) * (s_{i} - s_{nn})
After testing it myself, I came to the conclusion that the current implementation uses the same random number for each attribute.
The code I used for testing:
If there aren't any mistakes in my example, I think the implementation contradicts the example shown in the SMOTE anniversary paper on page 6 of the PDF / page 868 of the paper.
Can anyone clarify why the implementation uses the same random number for every attribute instead of different random numbers?
Thanks in advance!
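The geometric difference between the two choices can be illustrated without imblearn at all (a toy sketch; the variable names and points are mine):

```python
import numpy as np

rng = np.random.RandomState(0)
s_i = np.array([0.0, 0.0])   # original minority sample
s_nn = np.array([1.0, 4.0])  # one of its nearest neighbors

# Same random number for every attribute: the synthetic point lies
# exactly on the line segment between s_i and s_nn.
u = rng.uniform()
on_segment = s_i + u * (s_nn - s_i)

# A different random number per attribute: the synthetic point can lie
# anywhere in the axis-aligned box spanned by s_i and s_nn.
u_vec = rng.uniform(size=s_i.shape)
in_box = s_i + u_vec * (s_nn - s_i)

print(on_segment, in_box)
```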
[MRG] FIX Make pipeline.fit_transform behave the same as fit().transform()
Reference Issue
Fixes #904
What does this implement/fix? Explain your changes.
Change pipeline.fit_transform to fit the final estimator with the transformed data, then use the fitted estimator to transform the original data, skipping the samplers in the pipeline.
Add a test checking that samplers in the pipeline are skipped during transform.
Any other comments?
I think fit().transform() should behave the same as fit_transform().
I modified the behavior of pipeline.fit_transform so that it no longer uses fit_transform from the final estimator; it uses fit() on the final estimator followed by pipeline.transform(). The documentation might need to be updated accordingly.
[BUG] The estimator_ in CondensedNearestNeighbour() is incorrect for multiple classes
Describe the bug
The estimator_ object fitted by CondensedNearestNeighbour() (and probably other sampling strategies) is incorrect when y has multiple classes (and possibly also for binary classes). In particular, the estimator is only fit on a subset of 2 of the classes.
Steps/Code to Reproduce
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import CondensedNearestNeighbour
n_clusters = 10
X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)
n_neighbors = 1
condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
X_cond, y_cond = condenser.fit_resample(X, y)
print('condenser.estimator_.classes_', condenser.estimator_.classes_)  # this should have 10 classes, which it does!
print("condenser.estimator_ accuracy", condenser.estimator_.score(X, y))
# I think the estimator we want should look like this
knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes, 10 classes!
print("Manual KNN on condensed data accuracy", knn_cond_manual.score(X, y))  # good accuracy!
knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
Manual KNN on condensed data accuracy 0.996
The issue
The issue is that we set estimator_ in each run of the loop in _fit_resample, e.g. this line. We should really set estimator_ after the loop ends, on the condensed dataset.
This looks like it's also an issue with OneSidedSelection and possibly other samplers.
Fix
I think we should just add the following directly before the return statement in fit_resample
What does this implement/fix? Explain your changes.
Some source code links in the API reference are wrong, for example make_imbalance, fetch_datasets, classification_report_imbalanced, and many more. What these objects have in common is that they are wrapped by a decorator.
A similar problem occurred in scikit-learn in the past; this PR fixed it.
Any other comments?
As stated in the PR, I also dropped the Python 2 related lines because Python 2 support was dropped in 0.5.0.
Also, the source code links now point to the decorator.
[MRG] Fix SmoteNC zero variance resampling
Reference Issue
Fixes #837
What does this implement/fix? Explain your changes.
Fixes the issue as described by @glemaitre here: https://github.com/scikit-learn-contrib/imbalanced-learn/issues/837#issuecomment-1013928249
Any other comments?
Had to add ytype to the base class's generate_samples API in order to know which class we're resampling, so we can use the right subset of _X_categorical_minority_encoded.