Much of the literature on machine learning with imbalanced data ignores the elephant in the room – the need for high-quality labelled training and test data. But in the real world, labels can be generated by processes that introduce selection bias. How do you train a model when the labels have been generated by a biased process and also contain a significant amount of noise? In this situation, how do you make sure that your evaluation procedure gives you a reliable estimate of your model's performance in production, so you can make the right decision about when it's ready to deploy?

In this session, we will discuss the issues we encountered when training a machine learning classifier for phishing email detection and how we overcame them. We will explore how the method used to source labelled examples for training the classifier affected our evaluation procedure, and the practical challenges we encountered when evaluating the model on highly imbalanced data.

From our real-world example, you will learn practical ways to deal with this situation in your own projects, and why it's important to consider more than just precision/recall when working with highly imbalanced data.

Technical Level: Technical practitioner