Top 6 Engineering Challenges for Machine Learning
It is already 2020, and on top of a pandemic and civil unrest, our Machine Learning community still has to face challenges when working their way through creating models to solve the world's problems.
Here at Amalgam, we're well aware of these challenges that our community faces. We made a list with the top 6 engineering challenges most common in a typical Machine Learning team and gave you some suggestions on how we cope with them.
Machine Learning is hard enough without any of these issues, so let's get to the point and get these challenges out of our way so we can dedicate more of our time to the work that really matters.
6 - Collaboration
You might be a lone wolf spending days, nights, weekends, and holidays just competing on Kaggle. If that is you, then this might not really be a challenge. But as soon as you add a second engineer to your team, you'll feel us. That's right. For Machine Learning projects, the growing pains are felt on teams with as few as two members.
One of the reasons why collaboration is such a challenge for Machine Learning Engineers is that most of the work is concentrated in a single file (or worse, a single Jupyter Notebook). To complicate things further, the more members you have in a team, the harder it becomes to keep track of all their experiments, respective performance, and datasets in a way that won't cause early onset of hair loss.
We know that pain. Some of us have chosen the lone wolf path because of it, rejecting collaboration and angrily snapping at broken models that are never going to be what they used to be because of that senseless co-worker who made a few tweaks to your model.
But don't worry, we have the cure.
The first thing you'll have to do is ditch Jupyter. Yes, get rid of it, at least for any work that goes beyond strict prototyping with absolutely no expectation of collaboration or versioning. Jupyter is a fantastic tool, and we use it all the time here at Amalgam. But when you get serious about solving a problem, you'll have to use serious tools to get there.
Second, establish sound coding standards and workflow with your team. Now that you've ditched Jupyter and you're allowed to use serious tools, you'll need to get serious about doing it the right way. Adopt git, and have your team follow a workflow such as Gitflow. Resist the urge to commit your dataset to the repository. More on that later.
Third, if Python is your poison, adopt a solution such as Stripping to help you retain some of the benefits of Jupyter notebooks in a plain Python environment. Stripping allows you to separate your code into steps (similar in behavior to Jupyter cells) and cache the results of unchanged code, making your coding iterations go a lot faster by skipping data prep steps and other repetitive overhead activities.
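To make the idea of cached steps concrete, here is a minimal sketch in plain Python. It is not Stripping's API: the `cached_step` decorator, the `prepare_data` step, and the cache layout are hypothetical names invented for this illustration. The sketch assumes a step only needs to rerun when its code or its inputs change.

```python
import functools
import hashlib
import pickle
import tempfile
from pathlib import Path


def _code_fingerprint(code):
    """Hash-friendly summary of a code object, including nested ones."""
    parts = [code.co_code, repr(code.co_names).encode()]
    for const in code.co_consts:
        if hasattr(const, "co_code"):  # nested function or comprehension
            parts.append(_code_fingerprint(const))
        else:
            parts.append(repr(const).encode())
    return b"|".join(parts)


def cached_step(cache_dir):
    """Build a decorator that reruns a step only when code or inputs change."""
    cache_dir = Path(cache_dir)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # The cache key covers the step's bytecode and its pickled inputs.
            inputs = pickle.dumps((args, sorted(kwargs.items())))
            key = hashlib.sha256(_code_fingerprint(func.__code__) + inputs)
            path = cache_dir / f"{func.__name__}-{key.hexdigest()}.pkl"
            if path.exists():
                # Cache hit: skip the step and load the stored result.
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            cache_dir.mkdir(parents=True, exist_ok=True)
            path.write_bytes(pickle.dumps(result))
            return result
        return wrapper
    return decorator


# Demo: a throwaway directory keeps the example self-contained; a real
# project would use a persistent one so the cache survives between runs.
demo_dir = tempfile.mkdtemp(prefix="step_cache_")
calls = []


@cached_step(demo_dir)
def prepare_data(n):
    calls.append(n)  # record every real execution of the step
    return [x * x for x in range(n)]


first = prepare_data(5)
second = prepare_data(5)  # served from the cache, the body does not run
```

The second call returns the stored result without running the step body, which is the behavior that makes repeated iterations cheaper. The sketch only handles picklable inputs and outputs and ignores changes in helper functions the step calls.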
Fourth, again, for Pythonistas, adopt a solution such as Aurum to keep track of all your experiments. More on that later.
Collaboration is a complex topic, and your choice of technology and team preferences will significantly change the dynamics of how these solutions are adopted. Whatever the case may be, don't give up on the goal of using solutions that favor collaboration in your team. It is worth it.
5 - Code Quality
I know I told you to ditch Jupyter for serious work, but I can't help but admit that Jupyter is fun and even addicting. Yes, addicting. If you've tried getting rid of it, you know the withdrawal symptoms all too well. Another serious consequence of getting addicted to Jupyter is that it encourages poor coding practices and lets us get away with it (If we get enough requests, we'll write an article to cover this topic in the future).
And that leniency towards coding standards in our community is one of the first walls most Machine Learning engineers will face when going from solving small Kaggle challenges to other more complex and unexplored problems.
The solution is to realize that before being a Machine Learning Engineer, you must be a Software Engineer. That means that you need to take pride in the quality of your code as much as you take pride in the quality of the models you produce, understanding that both are directly connected. Better coding practices will always yield better machine learning models, whereas poor coding practices will often hinder machine learning models. The sooner you realize and remediate this, the sooner you'll get to the next level.
4 - Latency
Latency is essentially the time that it takes for something to first appear at point B when moving from point A to point B. Often it is assumed that if you have a large enough bandwidth, latency is a non-issue, but such is not the case. Both are equally relevant for Engineering problems, more so in High-Performance Computing.
It is common for unsuspecting engineers to attempt to build a personal Machine Learning rig by spending a lot of money on the top Nvidia GPU while paying little attention to the hardware that goes with it: CPU, Memory, Motherboard, and Storage solutions. If you couple high latency components with your GPU, or worse, write code that doesn't take latency into account, your expensive GPU will only work for a fraction of the time and spend an inordinate amount of time idling while data slowly gets in and out of the GPU's memory for processing.
You can reap improvements of one order of magnitude or two by adjusting simple parameters, anticipating data transfer, and acquiring low latency hardware to couple with your GPU.
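One way to anticipate data transfer is to load the next batch in the background while the current one is being processed. The sketch below uses only the standard library and is an illustration under stated assumptions: `prefetch`, `slow_loader`, and `consume` are hypothetical names, and the `time.sleep` calls merely stand in for storage reads and GPU work.

```python
import queue
import threading
import time


def prefetch(batches, buffer_size=2):
    """Yield items from `batches` while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


def slow_loader(n_batches, load_time):
    """Stand-in for reading batches from slow storage."""
    for i in range(n_batches):
        time.sleep(load_time)
        yield i


def consume(batches, compute_time):
    """Stand-in for the accelerator: 'process' each batch, return them all."""
    seen = []
    for batch in batches:
        time.sleep(compute_time)
        seen.append(batch)
    return seen


start = time.perf_counter()
plain = consume(slow_loader(5, 0.05), 0.05)
plain_seconds = time.perf_counter() - start

start = time.perf_counter()
overlapped = consume(prefetch(slow_loader(5, 0.05)), 0.05)
overlapped_seconds = time.perf_counter() - start
```

Both runs process the same batches in the same order. In the second run, loading overlaps with processing, so `overlapped_seconds` should come out lower than `plain_seconds` on an otherwise idle machine. The sketch does not forward loader errors to the consumer, which a real pipeline would need to handle.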
3 - Dataset versioning
If you know Machine Learning, then you know that data wrangling is one of the most critical and time-consuming pieces of any model you work on. And more often than not, you'll be handed datasets that are not clean or in formats that are not conducive to proper model training. For that reason, your workflow will go through cycles between wrangling data and tweaking your model until you get to the results you want.
You may sometimes do transformations to the dataset that yield unsatisfactory results with your model, or after several transformations, you lose track of why they were done in the first place. Couple that with the need to reproduce those transformations reliably when inferencing, and you're going to pay dearly for not keeping track of your dataset over time. But because you don't have the appropriate tools to keep track of the dataset, there isn't much choice for you, especially when you're using Jupyter.
That is, until now. With the combination of Stripping and Aurum, your problems are over. Stripping will let you break down all transformations into separate, reproducible steps, and Aurum will automatically keep track of all versions of the dataset used for any experiment you ran.
Better yet, you can go back to any specific point in time and reuse the exact same dataset with the exact same model, and achieve the exact same results (assuming a deterministic training process). You can even browse your experiments straight from GitHub if your repository lives there, and download any experiment without issuing a single command in the terminal.
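The core idea behind dataset versioning can be sketched with a content hash: if every experiment records the hash of the data it used, you can later tell whether two runs saw exactly the same bytes. This is only an illustration, not a description of how Aurum works internally. The `dataset_fingerprint` helper and the file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with a tiny throwaway CSV file.
workdir = Path(tempfile.mkdtemp(prefix="dataset_"))
raw = workdir / "raw.csv"
raw.write_text("id,value\n1,10\n2,\n3,30\n")
before = dataset_fingerprint(raw)

# A transformation (dropping the row with a missing value) produces new
# bytes, so the fingerprint changes and the two versions can be told apart.
rows = [line for line in raw.read_text().splitlines() if not line.endswith(",")]
clean = workdir / "clean.csv"
clean.write_text("\n".join(rows) + "\n")
after = dataset_fingerprint(clean)
```

Storing `before` and `after` next to each experiment's results is enough to answer the question "which version of the data produced this number?". Reading in chunks keeps memory use flat even for large files.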
2 - Experiment Versioning/Tracking
Losing track of how a dataset was transformed over time is very frustrating, but losing experiments from several iterations ago that yielded much better performance than your recent ones, and not being able to go back to that implementation, is absolutely maddening. There are legends of Machine Learning Engineers who never recovered their sanity after solving one of humanity's greatest problems and then losing the solution: they changed the model thinking it was for the better and never achieved the same results again.
Lucky for you, we've gone through these pains, and after gallons of tears shed in the process, we learned the lesson. Not only must you keep track of your dataset changes, you must also keep track of all your experiments. And if you made it all the way to this point in the article, then you know what we're going to suggest next: ditch Jupyter and use the Stripping and Aurum combo.
Stripping will help you keep your model code separate from your dataset transformations, and coupled with Aurum, you'll have automatic versioning of every single experiment you run, along with the performance achieved for that specific dataset and the parameters used to get those results. Now you'll be able to focus on the real work, and once you're done you can go back and see every single iteration of your work properly recorded and documented.
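To show what "keeping track of every experiment" means at its simplest, here is a sketch of an append-only experiment log in JSON Lines format. It does not reflect Aurum's storage format or commands. The `log_experiment` and `best_experiment` helpers, the file name, and all values in the demo are made up for illustration.

```python
import json
import tempfile
import time
from pathlib import Path


def log_experiment(log_path, params, metrics, dataset_id):
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        "timestamp": time.time(),
        "dataset_id": dataset_id,  # e.g. a content hash of the training data
        "params": params,
        "metrics": metrics,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_experiment(log_path, metric):
    """Return the logged record with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return max(records, key=lambda record: record["metrics"][metric])


# The values below are invented purely to exercise the functions.
log_file = Path(tempfile.mkdtemp(prefix="experiments_")) / "experiments.jsonl"
log_experiment(log_file, {"lr": 0.1, "depth": 3}, {"accuracy": 0.81}, "data-v1")
log_experiment(log_file, {"lr": 0.01, "depth": 5}, {"accuracy": 0.87}, "data-v1")
log_experiment(log_file, {"lr": 0.001, "depth": 5}, {"accuracy": 0.84}, "data-v2")
best = best_experiment(log_file, "accuracy")
```

Because records are only ever appended, an older and better result can never be overwritten by a later tweak. In this demo, `best` is the second record, together with the parameters and dataset identifier needed to reproduce it.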
This works even better for collaboration. Here is why: you and your teammates can tweak the same file without impacting each other's work, record each respective experiment, and even merge work from different successful experiments back into a single one, all using well-known git commands and procedures.
1 - Model Tuning
This is where all the magic happens, right? Our models are nothing but very complex equations for which we need to find the right weights, the ones that make the equation model our problem as closely as possible. The real problem is finding those weights, and in order to get there we need to set the right parameters.
Most of us are led to believe that those parameters are easy to come up with. We believe this myth comes from the fact that we learn Machine Learning from tutorials with prepared examples and calibrated parameters, and we're never really exposed to the reality of hyperparameter tuning until we're thrown real problems for which there are no precedents.
It is at this point that Machine Learning Engineers hit another growth spurt and begin to devise crafty ways to come up with their initial hyperparameters and tune them from there.
Unfortunately, the search space for hyperparameters is pretty big for most problems, and in order to overcome this challenge you'll need two strategies:
- Tuning Strategy
- Experiment Parallelization
The first one is to avoid brute force search in a high dimensional search space. If you have a strategy that learns from previous parameters, you'll have a much better shot at finding the parameters than shooting at random.
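As a rough illustration of a strategy that learns from previous parameters, the sketch below implements a simple stochastic local search: each new candidate is sampled near the best point found so far, and the sampling radius shrinks over time. This is one basic option among many, not the method used by any particular tool. The `guided_search` function, the `toy_objective`, and the `lr` and `reg` parameter names are assumptions made for the example.

```python
import random


def guided_search(objective, bounds, n_trials=60, seed=0):
    """Maximize `objective` by sampling near the best point found so far."""
    rng = random.Random(seed)
    # Start from a random point inside the bounds.
    best = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
    best_score = objective(best)
    history = [(dict(best), best_score)]
    for trial in range(1, n_trials):
        # Shrink the neighbourhood as trials accumulate: explore, then refine.
        scale = 0.5 * (1 - trial / n_trials)
        candidate = {}
        for name, (low, high) in bounds.items():
            step = rng.gauss(0, scale * (high - low))
            candidate[name] = min(high, max(low, best[name] + step))
        score = objective(candidate)
        history.append((candidate, score))
        if score > best_score:  # learn from the result: move the centre
            best, best_score = candidate, score
    return best, best_score, history


def toy_objective(params):
    """Made-up stand-in for a validation score, peaking at lr=0.3, reg=0.7."""
    return -((params["lr"] - 0.3) ** 2 + (params["reg"] - 0.7) ** 2)


bounds = {"lr": (0.0, 1.0), "reg": (0.0, 1.0)}
best, best_score, history = guided_search(toy_objective, bounds)
```

In real use, `objective` would train a model with the given hyperparameters and return a validation metric. The `history` list keeps every trial, so no evaluated configuration is lost.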
As for the second one, again, you're working with a high dimensional search space. The more experiments you can run, the faster you get to your answer. For that reason, you can shorten the overall search time by up to a factor equal to the number of experiments you can run at the same time.
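Running experiments side by side can be sketched with the standard library's `concurrent.futures`. The example is a minimal sketch under stated assumptions: `run_experiment` and `run_in_parallel` are hypothetical names, the sleep stands in for training, and the score formula is invented.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_experiment(params):
    """Stand-in for one training run; returns the params with a fake score."""
    time.sleep(0.05)  # pretend to train
    score = -abs(params["lr"] - 0.01)  # made-up score, best at lr=0.01
    return {"params": params, "score": score}


def run_in_parallel(param_grid, max_workers=4):
    """Run every experiment concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_experiment, param_grid))


grid = [{"lr": lr} for lr in (0.1, 0.03, 0.01, 0.003)]
results = run_in_parallel(grid)
winner = max(results, key=lambda result: result["score"])
```

Threads fit this sketch because the fake training only sleeps. For CPU-bound training written in pure Python, `ProcessPoolExecutor` from the same module offers the same `map` interface, and separate machines or containers follow the same pattern of one task per parameter set.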
If you get crafty, you can create a tuning strategy that works within a Jupyter notebook. When it comes to parallelizing the work, however, the challenge gets to another level. If you got to this point and decided to experiment with Stripping and Aurum, you'll see that those two tools actually allow you to achieve both without being crafty, while keeping track of every single experiment and all the performance metrics related to it.
By implementing a central tuning strategy, you can launch tasks in parallel (or even containers, if you have that ability) to execute each experiment. Stripping will shortcut the training process by using cached steps to go straight to the training phase, while Aurum automatically keeps track of each experiment and pushes it to your central repository.
At this point you're probably thinking two things: these guys hate Jupyter and this is a shameless plug for their products.
The truth is that we absolutely love Jupyter, but we know that it has its place. And often, it is not in solving real world problems that require collaboration and version control.
As for the shameless plug, you're right... and wrong. Stripping and Aurum were made to address these issues, and we truly believe that they're going to make you a better Machine Learning Engineer. On top of that, these tools are 100% free and Open Source, so there's that.
Do you think differently? Let us know in the comments. We're eager to learn what you think and see the world from different perspectives.