10 “Do It Yourself” Tips for Machine Learning Experiment Tracking and Reproducibility
This post was also published on the IBM Developer Blog.
As machine-learning practitioners, we invest significant time and effort to improve our models. We usually do so iteratively and experimentally: repeatedly changing the model, running an experiment, examining the results, and deciding whether the recent change was an improvement to keep or a regression to discard.
Changes in each iteration may involve, for example, changing a hyper-parameter value, adding a new input feature, swapping the underlying ML model (e.g., using a Gradient Boosting classifier instead of a Random Forest classifier), trying a new heuristic, or trying an entirely new approach.
Experimentation cycles can cause a great deal of confusion. It is easy to get lost, forgetting what you changed in recent experiments and whether the latest results are indeed better than before. A single experiment can take hours or even longer to complete, so you try to optimize your time by executing multiple experiments simultaneously, which makes everything even harder to manage and the confusion even worse.
In this article, I will share lessons and good practices I learned in my recent machine learning projects. Although I call it a “Do it yourself” approach, some may call it “The caveman way”. I am fully aware that nowadays there are many experiment tracking and management platforms, but it is not always possible or convenient to use them. Some platforms require that you execute your experiments on their platform. Sometimes you can’t share any sensitive information outside of your organization — not just the datasets but also results and code. Many platforms require a paid subscription, which can also be a problem in some cases. Sometimes you just want full control of your experiment management approach and data.
The practices described below are easy to implement and do not require additional tooling. They are mostly suitable for small to medium ML projects with a single researcher or a small team. Most of the artifacts are saved locally and adaptations may be required if you want to use shared storage. As a seasoned developer of production systems, I’m aware that a few of the tips below might be considered ‘code-smells’ or bad practices when it comes to the traditional development approach of such systems. However, I believe they have their place and are justified for short-term research projects. I would like to emphasize that the tips below reflect my personal journey and point of view, and not necessarily any official views or practices. So, here I am, waiting for your stones :-)
Tracking what you did
1. Use source control
It goes without saying that your experimentation code should be source-controlled. That said, when using modern interactive environments like Jupyter notebooks, it is easy to be tempted to run quick experiments on the fly without committing changes to Git or any other source control system. Try to avoid that as much as possible. Maybe it is only me, but I prefer using a decent IDE and plain Python scripts to run experiments. I may use a notebook for the initial data exploration, but soon after an initial model skeleton is ready, I switch to a full-fledged Python script, which also allows debugging, refactoring, and so on.
2. Use identifiable experiments
But you know what? Source control isn't enough. Even if everything is source-controlled, it can be tedious to browse the repository's history and work out which source was used to run some experiment 12 days ago. I would like to suggest an additional practice that I call "Copy on Write": duplicate your latest experiment script file/folder before each new experiment and make the changes in the new file. Make your experiments identifiable by adding a sequential number to each experiment's source file name, for example animal_classifier_009.py for experiment #9. And yes, this also works for notebooks: you can create a notebook per experiment. This means you need only a file diff to understand what changed between experiment #9 and #12. Storage is cheap, and the size of all your experiments' source code is probably dwarfed by the size of your data.
3. Automatic source code snapshots
Another tip is to automatically take a snapshot of your experiment code for each run. You can do this easily inside the experiment script itself, by bootstrapping code that copies the source file/folder to a directory with the experiment start timestamp in its name. This keeps your experiment tracking strategy robust even if you give in to the temptation to run on-the-fly experiments without committing or duplicating first (a.k.a. "dirty commits").
For example, when running the experiment animal_classifier_009.py, we create the folder out/animal_classifier_009/2021_11_03-12_34_12/source and store a snapshot of the relevant source code inside.
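A minimal sketch of such a bootstrap, assuming a flat folder of Python sources and the out/<experiment>/<timestamp>/source layout shown above (the helper name and glob pattern are just illustrative):

```python
import shutil
from datetime import datetime
from pathlib import Path

def snapshot_source(experiment_name: str, out_root: str = "out") -> Path:
    """Copy the Python sources next to this script into out/<experiment>/<timestamp>/source."""
    run_dir = Path(out_root) / experiment_name / datetime.now().strftime("%Y_%m_%d-%H_%M_%S")
    source_dir = run_dir / "source"
    source_dir.mkdir(parents=True, exist_ok=True)
    for src in Path(__file__).parent.glob("*.py"):
        shutil.copy2(src, source_dir / src.name)
    return run_dir

# At the top of animal_classifier_009.py:
# run_dir = snapshot_source("animal_classifier_009")
```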
4. Treat configuration parameters the same as source code
Avoid tuning experiment parameters or hyper-parameters via the command line, environment variables, or any other external means that are not part of the source code. Otherwise, you risk losing traceability of changes if you forget to log the parameter values.
To embed experiment configuration, you can use a plain Python dictionary, JSON, YAML, or any other format you find convenient. Just make sure you commit the configuration files together with the experiment code. Does hard-coding values seem like a code smell? Well, not in this case. If you do accept external run-time parameters, be sure to log their values!
Each configuration changeset should be treated as a unique experiment: it should be committed to source control, its configuration files should be included in the experiment code snapshot, and it should get its own experiment ID.
The advantage of embedding configuration in source-controlled code is that you can reproduce the same experiment just by running the program file, with no other moving parts that you might forget to set.
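As an illustration, a hard-coded dictionary committed with the experiment script is enough; if you do accept run-time overrides, merge and log them (the parameter names below are made up):

```python
import logging

# Hard-coded experiment configuration, committed together with the code.
CONFIG = {
    "experiment_id": "animal_classifier_009",
    "parent_experiment_id": "animal_classifier_007",
    "n_estimators": 300,
    "max_depth": 8,
    "test_size": 0.2,
}

def apply_overrides(overrides: dict) -> dict:
    """Merge optional run-time overrides into CONFIG and log them so nothing gets lost."""
    config = {**CONFIG, **overrides}
    for key, value in overrides.items():
        logging.info("Config override: %s=%r", key, value)
    logging.info("Effective config: %r", config)
    return config
```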
5. Track experiment evolution tree
One of the things that helps me a lot is keeping track of the reference experiment, i.e. the predecessor baseline that I am trying to improve upon. This is easy to do if your experiments are identifiable. When you create a new experiment by duplicating an existing one, note the parent experiment ID plus the essence of what this experiment does differently from its parent. This information will help you quickly recall what you did, without relying on code diffs. It also makes it possible to traverse back up the experiment tree and quickly get the full picture. You can track the parent experiment inside the source code itself, as a code comment.
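For example, a short header comment at the top of each experiment script could look like this (the details are, of course, illustrative):

```python
# Experiment: animal_classifier_012
# Parent:     animal_classifier_009
# Change:     replaced the Random Forest classifier with Gradient Boosting, same features
```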
However, this might cause a problem if you forget to update the notes before running the experiment. I suggest also keeping a simple spreadsheet with a row per experiment: its ID, its parent's ID, and a short description of what changed.
In the sheet, you can also capture other information or parameters that you used in this experiment, and of course — experiment results. But we’ll touch on that later.
Tracking what happened
6. Keep console/log output
Be generous with logging statements that track what happened in the experiment. Track many metrics and types of information, like dataset size, label count, date ranges, experiment execution time, and more. These can help you detect issues and mistakes. Be paranoid! Every unexplained change in a metric could be caused by some mistake in the experiment setup, and detailed logs will help you find the root cause.
Any experiment output should be persisted. I recommend using the Python logging module instead of plain console prints, so you can redirect logging messages to both stdout and a file. In addition, you get timestamps for each log event, which can help you find performance bottlenecks. Store the log file in a folder tied to the experiment ID and execution time, for example the same out/<experiment>/<timestamp> folder used for the source snapshots.
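A minimal logging setup along these lines, reusing the run_dir folder convention from the snapshot tip (the format string and file name are arbitrary choices):

```python
import logging
import sys
from pathlib import Path

def setup_logging(run_dir: Path) -> None:
    """Send log records both to stdout and to a file inside the experiment's run folder."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        handlers=[
            logging.StreamHandler(sys.stdout),
            logging.FileHandler(run_dir / "experiment.log"),
        ],
    )

# setup_logging(Path("out/animal_classifier_009/2021_11_03-12_34_12"))
# logging.info("Train set: %d rows, %d features", X_train.shape[0], X_train.shape[1])
```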
7. Track experiment results
You will likely use multiple metrics to quantify the quality of your model: for example, accuracy, precision, recall, F-score, and AUC. Make sure you track these in a separate, structured results file that you can automatically process later to plot charts and so on.
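One simple option, sketched below, is a small JSON file written into the experiment's run folder; the metric names and values in the usage comment are made up:

```python
import json
from pathlib import Path

def save_results(run_dir: Path, metrics: dict) -> None:
    """Write the run's metrics to a structured results file for later automatic processing."""
    (run_dir / "results.json").write_text(json.dumps(metrics, indent=2))

# save_results(run_dir, {"accuracy": 0.91, "precision": 0.88, "recall": 0.86, "f1": 0.87, "auc": 0.94})
```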
It’s also a good idea to track your most important metrics in the experiment spreadsheet so you can get the full picture quickly and decide on future directions. I like using colors to mark results (green=improved, red=got worse, yellow=not sure).
8. Do multiple repeats for stochastic models
You want your results to be reproducible, yet you also want to avoid being misled by chance. The solution lies in repetition with random seeds. Avoid using fixed random seeds if your models are stochastic. The same applies to shuffling, downsampling, or any other operation that contains a random element. If you use scikit-learn, for example, always run your models with random_state=None. Perform multiple repeats in each experiment and average your optimization target metrics across all repeats so you get stable numbers. You can use the Standard Error of the Mean (SEM) to estimate how close your repeats' mean is to the true mean of the population (the mean you would get if you could run an infinite number of repeats). The SEM decreases as you increase the number of repeats. This will help you gain confidence about whether your latest results are indeed better or whether it was just luck and you should increase the repeat count to be sure. In general, as your model gets more mature and stable, your optimizations will probably have a smaller impact, and you might need to increase the repeat count.
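A sketch of this repeat-and-average loop, assuming a scikit-learn style classifier and F-score as the target metric (the model, metric, and repeat count are placeholders):

```python
import numpy as np
from scipy.stats import sem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def run_repeats(X, y, n_repeats: int = 10):
    """Run the experiment several times without a fixed seed and report mean and SEM."""
    scores = []
    for _ in range(n_repeats):
        # random_state=None (the default) keeps every repeat genuinely random
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=None)
        model = RandomForestClassifier(random_state=None)
        model.fit(X_train, y_train)
        scores.append(f1_score(y_test, model.predict(X_test), average="macro"))
    return float(np.mean(scores)), float(sem(scores))

# mean_f1, f1_sem = run_repeats(X, y, n_repeats=20)
# logging.info("F1: %.4f +/- %.4f (SEM over 20 repeats)", mean_f1, f1_sem)
```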
Tracking input and intermediate datasets
9. Track input datasets
Remember to version the datasets that are used as input to your model, and include the version identifier in each dataset's name. Input datasets tend to be big, so I wouldn't recommend duplicating them into each experiment's tracking folder. Just make sure to log the filenames/URIs of the input datasets you used. You can also find these filenames in the source code snapshots of the relevant experiment. You can add another safety layer here by computing and logging a hash/digest of the contents of each input dataset, as in the sketch below. Also log the basic characteristics of the data, such as its dimensions and the sample count for each class.
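Computing such a digest takes a few lines with hashlib; a sketch (the chunk size is arbitrary, and log_dataset_fingerprint is just an illustrative helper name):

```python
import hashlib
import logging
from pathlib import Path

def log_dataset_fingerprint(path: Path) -> str:
    """Log the dataset's path, size, and a SHA-256 digest of its contents as a safety check."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
            digest.update(chunk)
    fingerprint = digest.hexdigest()
    logging.info("Input dataset %s sha256=%s size=%d bytes", path, fingerprint, path.stat().st_size)
    return fingerprint
```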
10. Avoid or track intermediate datasets
Some of your code may carry out heavy preprocessing of datasets. This can take a long time, so you may be tempted to do it once and then reuse the output in later steps. If your preprocessing is stochastic (shuffling, train/test splitting, etc.), try to avoid creating intermediate datasets unless reusing them really saves a lot of experiment time. Otherwise, you may introduce an inherent bias into your data, similar to using a fixed seed. Instead, invest in optimizing the execution time of the preprocessing steps.
If you do generate intermediate datasets, treat the source code you wrote for that purpose just like a normal experiment using the practices described so far. Use version numbers for the source file, track the source code, track the logs, etc. It’s a good idea to save the output intermediate datasets in the out folder of each experiment. This will make the datasets inherently identifiable.
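For example, a small helper (names are illustrative; I use CSV here only to keep the sketch dependency-free) that writes the preprocessed output into the generating experiment's own run folder:

```python
from pathlib import Path
import pandas as pd

def save_intermediate(run_dir: Path, df: pd.DataFrame, name: str) -> Path:
    """Store an intermediate dataset inside the experiment's run folder so its origin stays traceable."""
    data_dir = run_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    out_path = data_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)
    return out_path

# save_intermediate(run_dir, preprocessed_df, "animals_preprocessed")
```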
Summary
In short, experiment management is essential and pretty easy to do if you adopt some simple techniques. No matter whether you do it yourself or use an experiment management platform, just do it!