Streamlining Feature Engineering Pipelines with Feature-Engine • Data Science Festival

Machine learning models output predictions based on patterns learned from data. Before we can use the data to train a machine learning algorithm, we perform extensive transformations of the variables, commonly referred to as feature engineering. Feature engineering includes procedures to impute missing data, encode categorical variables, transform or discretise numerical variables, put features on the same scale, combine features into new variables, and extract information from dates, transaction data, time series, text and sometimes even images.

To use our models in production, we need to deploy both the machine learning models and the entire pipeline of data transformations and feature creation. We must also ensure that the deployed model is identical to the model developed in the research environment. Kludging together an ad hoc process for feature engineering is not efficient, debug-friendly or reproducible. Using well-established open source projects takes much of the coding off our hands, which improves team performance, supports reproducibility and shortens model research and deployment timelines.

Feature-engine is an open source Python library for feature engineering that smooths the building and deployment of feature engineering pipelines. Feature-engine supports multiple data transformation techniques, preserves the fit() and transform() functionality, and can be used within a Scikit-learn pipeline, allowing organisations to build and deploy an entire machine learning pipeline by saving one object (.pkl). In this talk, I will give a high-level overview of the main data transformations that we use in industry, bring forward the challenges encountered while deploying machine learning pipelines, and highlight how Feature-engine can mitigate some of these challenges.
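To make the fit() and transform() idea concrete, here is a minimal sketch in plain Python. It is not Feature-engine's or Scikit-learn's API: the classes `ToyMeanImputer`, `ToyOrdinalEncoder` and `ToyPipeline`, the column names and the data are all hypothetical, chosen only to show that parameters are learned once from the training data and then reused unchanged on new data.

```python
from statistics import mean


class ToyMeanImputer:
    """Hypothetical transformer: replaces missing values (None) with the training mean."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        # Learn the parameter from the training data only.
        self.mean_ = mean(r[self.column] for r in rows if r[self.column] is not None)
        return self

    def transform(self, rows):
        # Reuse the stored mean; nothing is re-learned here.
        return [
            {**r, self.column: self.mean_ if r[self.column] is None else r[self.column]}
            for r in rows
        ]


class ToyOrdinalEncoder:
    """Hypothetical transformer: maps each category seen in training to an integer."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        categories = sorted({r[self.column] for r in rows})
        self.mapping_ = {c: i for i, c in enumerate(categories)}
        return self

    def transform(self, rows):
        # Categories not seen during fit map to -1 (an assumption of this sketch).
        return [{**r, self.column: self.mapping_.get(r[self.column], -1)} for r in rows]


class ToyPipeline:
    """Hypothetical pipeline: chains transformers behind a single fit/transform interface."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, rows):
        # Fit each step on the output of the previous one.
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


# Made-up training data and new data, for illustration only.
train = [
    {"age": 20, "city": "London"},
    {"age": None, "city": "Paris"},
    {"age": 40, "city": "London"},
]
new = [{"age": None, "city": "Paris"}, {"age": 35, "city": "Rome"}]

pipe = ToyPipeline([ToyMeanImputer("age"), ToyOrdinalEncoder("city")]).fit(train)
print(pipe.transform(new))
```

Because the fitted pipeline is a single Python object holding every learned parameter, it can be serialised in one call, for example with the standard library's pickle module, which is the idea behind saving one .pkl file.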
