Handling imbalanced datasets in machine learning


An article by Baptiste Rocca about handling imbalanced datasets in machine learning. He looks for answers to the question of what should and should not be done when facing an imbalanced classes problem.

The article first gives an overview of different evaluation metrics that can help to detect “naive behaviours”, such as a classifier that always predicts the majority class. It then discusses a range of methods that rework the dataset and shows that these methods can be misleading. Finally, it shows that reworking the problem itself is, most of the time, the best way to proceed.

To detect naive behaviour, the author suggests using one of:

  • Confusion matrix, precision, recall and F1
  • Receiver Operating Characteristic (ROC) and Area Under the ROC curve (AUROC)
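As a rough illustration of why these metrics matter, here is a minimal sketch in plain Python. It is not taken from the article: the toy data (95 negatives, 5 positives) and the helper names are made up for the example. It scores a "naive" classifier that always predicts the majority class, and shows that accuracy looks fine while recall, F1 and AUROC expose the problem.

```python
# Hypothetical toy example: helper names and data are illustrative assumptions.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as half). Assumes both classes are present."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Imbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# Naive classifier: always predicts the majority class with a constant score.
naive_pred = [0] * len(y_true)
naive_scores = [0.0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, naive_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, naive_pred)
area = auroc(y_true, naive_scores)

print("accuracy :", accuracy)   # high, despite the model being useless
print("precision:", precision)
print("recall   :", recall)     # no positive is ever found
print("f1       :", f1)
print("auroc    :", area)       # no better than random ranking
```

In practice a library such as scikit-learn provides these metrics ready-made; the hand-written versions are only meant to make the definitions visible.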

Whenever we use a machine learning algorithm, the evaluation metrics for the model have to be chosen carefully. We must use the metrics that give us the best overview of how well our model is doing with regard to our goals.

A well-balanced article with many charts and supporting mathematical reasoning. Links to further resources are also included. Excellent!

