Machine Learning Reproducibility: A Kaggle Competition Use-Case
Even though reproducibility in Machine Learning is a topic that comes up now and then, people still practice it only to a degree. Even among Kaggle competition winners, we still see a lot of hard-to-reproduce code in Notebooks. Our goal here is to outline some elements of reproducibility and how we tackled them in a recent competition.
First, what does reproducibility mean in Machine Learning? During a Machine Learning project, we have to deal with more than just code. Unlike in Software Engineering, the code is not our only final artifact: we also have a dataset, the transformations applied to it, and a ton of experiments run before we get our final model. The final model itself is composed of its code and its weights (what the model learned), and it also depends on the data transformations being applied correctly. We need reproducibility in all of these steps; otherwise, it is much harder to get the same results and put our model into production.
In this blog post, we'll go over each of the parts you might need to change to improve your projects' reproducibility. We'll also give some examples from our participation in the MoA Challenge (a recent Kaggle competition) at the end.
The first and most basic thing, which you are probably already doing, is seeding everything. When training a Machine Learning model, there are many sources of randomness at play: we split the dataset randomly, initialize the weights randomly, present the batches in a random order, and so on. To make an experiment reproducible, we need to set the seed for every one of these operations. In practice, you should seed NumPy, PyTorch, TensorFlow, Python's own random module, and so on.
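A minimal sketch of what "seeding everything" can look like in Python. The function name `seed_everything` is our own choice; the PyTorch calls are guarded so the sketch still runs when the library is not installed:

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    """Seed every source of randomness the experiment relies on."""
    random.seed(seed)                         # Python's built-in RNG
    np.random.seed(seed)                      # NumPy
    os.environ["PYTHONHASHSEED"] = str(seed)  # hash-based operations
    try:
        import torch
        torch.manual_seed(seed)               # PyTorch, CPU
        torch.cuda.manual_seed_all(seed)      # PyTorch, all GPUs
    except ImportError:
        pass  # PyTorch not installed; skip


# Re-running with the same seed reproduces the same "random" numbers.
seed_everything(123)
first = np.random.rand(3)
seed_everything(123)
second = np.random.rand(3)
assert np.allclose(first, second)
```

Call this function once at the very start of every training script, before any data splitting or weight initialization happens.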
When dealing with a dataset, we will commonly transform it in many ways, including cleaning, selecting features, engineering new features, and so on. We have to create code for all these tasks, and we need to keep track of such code. After all, once your model goes into production, you need to reproduce the same steps to transform the input data, or your model output won't make any sense.
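As a sketch of what "keeping track of such code" means in practice: put the transformations in one tracked function instead of scattered cells. The column names below (`dose`, `time`) are hypothetical, chosen just for illustration:

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """All transformations live in one versioned function,
    so production can apply exactly the same steps."""
    out = df.copy()
    out["dose"] = out["dose"].str.lower()    # normalize a categorical column
    out["time_hours"] = out["time"] / 60.0   # engineer a new feature
    return out.drop(columns=["time"])        # drop the raw column


raw = pd.DataFrame({"dose": ["D1", "D2"], "time": [60, 120]})
clean = preprocess(raw)
```

Because `preprocess` is a plain function in a module under Git, the exact transformation used for any given model can be recovered and reapplied to new input data.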
We commonly see Data Scientists creating a Jupyter Notebook to preprocess their data, running it once, and then working with the output. Sometimes, they make a single Notebook containing both the code to transform the data and the code to train the model. Even though their work might be reproducible in a sense (you can rerun the Notebook and get the same output), it is far from ideal. When experimenting with different data preprocessing, we change the code a lot. What if your best model came from a previous version of the preprocessing pipeline? If you didn't keep track of that code, you may have lost it, and keeping track of Notebook changes is painful.
Another problem with Notebooks arises when you want to reproduce your pipeline in production. People don't usually organize their code very well in Notebooks: they don't create reusable functions or classes. Instead, you get many cells transforming the data, and now you need to reproduce those steps precisely in production. That makes it nearly impossible without a lot of costly refactoring.
So, to increase reproducibility, you are highly encouraged to (i) put your data transformation code inside functions and (ii) keep track of your code changes using Git. You can still use Notebooks for your experiments (even though I don't advise it), but at least your Notebooks will be importing modules and reusing code. To make (i) even better, you should parameterize the data transformation and make it picklable. By parameterizing it, you can run different data transformations during the experiments without changing code. And by making it picklable, you can export your data transformation together with your model at the end, making it much easier to put into production. The well-known scikit-learn library already provides an API for this: Pipeline.
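A small sketch of both ideas with scikit-learn's Pipeline: the builder function is parameterized (here with a hypothetical regularization parameter `C`), and the fitted result can be pickled as a single artifact. The toy data is made up for the example:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline(C: float = 1.0) -> Pipeline:
    """Parameterized pipeline: preprocessing and model travel together."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(C=C)),
    ])


X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

pipe = build_pipeline(C=0.5).fit(X, y)

# Picklable: the fitted transformations ship with the model weights.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
assert (restored.predict(X) == pipe.predict(X)).all()
```

In production, you unpickle one object and call `predict`; there is no way to accidentally apply the wrong preprocessing to the input data.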
When training a model, we run a lot of experiments. During this process, we might change not only hyperparameters but also a lot of code. It might be tempting to do all of that in a Notebook (as many people do), but again, it hurts reproducibility. When exploring new settings, you often discover that a previous experiment was your best one. To go back to it, you need the hyperparameters used and the exact version of the code, which means keeping track of both, and you can't do that with Notebooks alone.
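As a minimal sketch of such tracking (dedicated tools exist, but the idea fits in a few lines): record each experiment's hyperparameters and metrics alongside the current Git commit, so any run can be tied back to the exact code that produced it. The function name and JSON layout here are our own invention:

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def log_experiment(params: dict, metrics: dict, log_dir: str = "experiments") -> Path:
    """Write one JSON record per run: hyperparameters, metrics,
    and the exact code version (Git commit) used to produce them."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # e.g. not inside a Git repository

    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "commit": commit,
        "params": params,
        "metrics": metrics,
    }
    out_dir = Path(log_dir)
    out_dir.mkdir(exist_ok=True)
    path = out_dir / f"run_{record['timestamp'].replace(':', '-')}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


path = log_experiment({"lr": 1e-3, "epochs": 20}, {"val_logloss": 0.016})
```

With the commit hash in hand, `git checkout <commit>` restores the exact code, and the JSON restores the exact hyperparameters.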
The MoA Challenge finished recently and was a three-month-long competition that required a lot of data processing and many experiments. And since it was a Code Competition, participants couldn't generate the submission file on their own machines. Instead, they needed to submit code that, without internet access, could run on the test dataset to generate the submission. That is not nearly as demanding as putting a model into production, but it raises the reproducibility bar compared to regular competitions.
Still, at least in public Notebooks, we have seen the usual spaghetti code processing the data on the fly and training a model. Since Kaggle allows you to submit such a Notebook, which will train and generate the submission, it works. But in most competitions, the winners use ensembles of many different models. Each of these models might require completely different data preprocessing, and when you need to run all of them together, things get harder. Because of that, participants who wanted to compete seriously had to package their models and data processing steps to run on the Kaggle server.
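A toy sketch of why packaging matters for ensembles: each member below carries its own, different preprocessing inside its Pipeline, so on the offline server you only need to unpickle the members and average their predictions. The models and data are made up for illustration:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Each ensemble member uses a completely different preprocessing chain,
# but every chain is packaged inside its own Pipeline.
members = [
    Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())]),
    Pipeline([("scale", MinMaxScaler()), ("clf", LogisticRegression(C=0.1))]),
]
blobs = [pickle.dumps(m.fit(X, y)) for m in members]

# On the offline server: just unpickle and average the probabilities.
loaded = [pickle.loads(b) for b in blobs]
preds = np.mean([m.predict_proba(X)[:, 1] for m in loaded], axis=0)
```

Without the packaging, the submission code would need to re-create every model's distinct preprocessing by hand, which is exactly where hard-to-reproduce bugs creep in.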
Beyond that, as the experiments evolve, we keep changing the code of both the model and the data processing, and each experiment has different hyperparameters too. In such a long competition, having a way to keep track of the experiments is very important.
To tackle all of these problems, we used the Pipeline API from scikit-learn, Software Engineering good practices, and two of our open source projects. First, we organized a Python project, splitting the code into several modules. To manage our experiments, we separated them into modular steps and used our Stripping library to organize the pipeline. We had a parameterized step to process the data, deciding which preprocessing should be applied (and with what parameters). The result is a scikit-learn Pipeline, which is picklable and can run smoothly on the Kaggle server.
Finally, to keep track of the experiments, we used another library of ours: Aurum. Aurum uses Git as its base to keep track of everything: dataset version, code, hyperparameters, and metrics. This way, we can precisely reproduce any previous experiment.
Even though our example here is a Kaggle competition, it is worth noting that these good practices brought our code much closer to production. If we wanted to put such a model into production, we wouldn't need any refactoring, and the whole process would be smooth.