In this session Richard will introduce Tubular, a Python package for feature engineering that was originally developed within LV= and has since been open sourced. We’ll work through building a feature engineering pipeline with tubular, looking at some of the key transformers it offers and how it fits into a data scientist’s workflow. No previous experience with feature engineering is required for this session.
Repository
For the event we will be working from this repo: https://github.com/lvgig/tubular
All the material will be in the https://github.com/lvgig/tubular/tree/main/examples/Data-Science-Festival-Workshop folder, and we will be working with this open dataset: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Python Environment
To set up the Python environment on their own machines, participants should:
Clone the repository (https://github.com/lvgig/tubular) using git (https://git-scm.com/).
Get conda (https://docs.conda.io/en/latest/) by downloading and installing either Anaconda (https://www.anaconda.com/products/individual) or miniconda (https://docs.conda.io/en/latest/miniconda.html), which is a smaller download.
Create the conda environment using the environment file in the repository: https://github.com/lvgig/tubular/blob/main/examples/Data-Science-Festival-Workshop/env.yml
Instructions for creating an environment from a file are in the conda documentation (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file), and we will also cover this at the start of the session.
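The setup steps above can be sketched in Python as a small helper that checks the prerequisites and lists the commands to run. This is only an illustration and not part of the workshop material: the function names and the default clone directory `tubular` are assumptions. The name of the conda environment is defined inside env.yml, so the activation step is not shown.

```python
import shutil

REPO_URL = "https://github.com/lvgig/tubular"
# Path of the environment file inside the cloned repository.
ENV_FILE = "examples/Data-Science-Festival-Workshop/env.yml"


def missing_tools(tools=("git", "conda")):
    """Return the required command-line tools that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def setup_commands(clone_dir="tubular"):
    """Return the setup commands as argument lists, without running them.

    clone_dir is a hypothetical local directory name chosen for this sketch.
    """
    return [
        ["git", "clone", REPO_URL, clone_dir],
        ["conda", "env", "create", "-f", f"{clone_dir}/{ENV_FILE}"],
    ]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these first:", ", ".join(missing))
    for command in setup_commands():
        print(" ".join(command))
```

The helper only prints the commands so that participants can review them before running them in a terminal.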
Alternatively, participants can click the launch binder shield (https://mybinder.readthedocs.io/en/latest/) on the front page of the https://github.com/lvgig/tubular repository to launch a Binder session with the required packages already installed.
Data
The demo notebook in the repository has code to download the dataset we will be using: https://github.com/lvgig/tubular/blob/main/examples/Data-Science-Festival-Workshop/Demo.ipynb
We will also cover this at the start of the session.
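Once downloaded, the Bank Marketing files are semicolon-separated rather than comma-separated, which is worth knowing before loading them. The following is a minimal sketch of reading that layout with the standard library. It is not the notebook's own code: the helper name `read_bank_rows` is hypothetical, and the sample rows are made up for illustration and show only a few columns.

```python
import csv
import io

# Made-up rows in the layout of the Bank Marketing files
# (semicolon-separated, quoted text fields); not real data.
SAMPLE = '''"age";"job";"marital";"y"
30;"unemployed";"married";"no"
33;"services";"married";"no"
'''


def read_bank_rows(handle):
    """Parse semicolon-separated rows and convert the age column to int."""
    reader = csv.DictReader(handle, delimiter=";")
    return [dict(row, age=int(row["age"])) for row in reader]


rows = read_bank_rows(io.StringIO(SAMPLE))
print(rows[0])
```

The same function accepts an open file handle, so it could be pointed at the downloaded CSV instead of the inline sample.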