Keepsake: version control for machine learning

A machine learning model is the combination of code and training data, so knowing what data a model was trained on is essential.

There are two ways to track this:

  1. Store training data in Keepsake. This is recommended if your training data is small (<100MB).
  2. Point at data in another system. This is recommended if your training data is large or you already have somewhere to store it.

Store training data in Keepsake

If your training data is small, then we recommend storing it with Keepsake in each experiment.

For example, if your training data is in a directory training-data/ alongside your training code, then you might write this in your code:

import keepsake

experiment = keepsake.init(
    path="training-data/",
    params={...},
)

If you want to store both your training script and training data, you can just save everything:

experiment = keepsake.init(
    path=".",
    params={...},
)

Then, to copy this data back into the current directory, use keepsake checkout:

$ keepsake checkout <experiment ID>
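
If you don't have the experiment ID to hand, you can list your experiments first. For example, assuming the keepsake CLI is installed:

$ keepsake ls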

The downside of this approach is that Keepsake makes a complete copy of your training data for each experiment, so it only works if your training data is small.

How small "small" is depends on your storage costs and bandwidth, but as a rule of thumb we recommend this approach only if your data is less than 100MB.

Point at data in another system

If your training data is large, or you already have a system for storing your training data, then we recommend putting a pointer to your training data in the params dictionary.

For example, if your training data is on S3, you might put the URL to your training data in params:

training_data_url = "s3://hooli-training-data/hotdogs-2020-05-03.tar.gz"
experiment = keepsake.init(
    path=".",
    params={
        "training_data_url": training_data_url,
    },
)
# ... download training_data_url and run training
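
For illustration, here is one way you might do that download step. This is a minimal sketch using boto3, which is not part of Keepsake; the bucket, key, and local filename come from the example above:

from urllib.parse import urlparse

import boto3

def download_from_s3(s3_url, local_path):
    # Split "s3://bucket/key" into its bucket and key components
    parsed = urlparse(s3_url)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    boto3.client("s3").download_file(bucket, key, local_path)

download_from_s3(training_data_url, "hotdogs-2020-05-03.tar.gz")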

This assumes you are disciplined about versioning your data and that the contents of that URL never change. If the data at the URL might change, you might want to calculate its shasum and record that in params too.

Then, if the data changes, you will see a different shasum in keepsake diff, and you will know an experiment was trained on different data.
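
A minimal sketch of that, assuming the tarball has already been downloaded to a local path (the training_data_sha256 param name here is just an example, not a name Keepsake requires):

import hashlib

import keepsake

def sha256sum(path):
    # Hash the file in chunks so large tarballs don't exhaust memory
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

training_data_url = "s3://hooli-training-data/hotdogs-2020-05-03.tar.gz"

experiment = keepsake.init(
    path=".",
    params={
        "training_data_url": training_data_url,
        "training_data_sha256": sha256sum("hotdogs-2020-05-03.tar.gz"),
    },
)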

Note: This documentation is incomplete. We'd love to hear about the ways you version your data. See this GitHub issue or chat with us in Discord.

Let’s build this together

Everyone uses version control for software, but it’s much less common in machine learning.

This causes all sorts of problems: people manually keep track of things in spreadsheets, model weights are scattered across S3 buckets, and nothing is reproducible. It's hard enough to get your own model from a month ago running again, let alone somebody else's.

So why isn’t everyone using Git? Because Git doesn’t work well with machine learning: it can’t handle large files, it can’t handle key/value metadata like metrics, and it can’t record information automatically from inside a training script. There are some solutions for these things, but they feel like band-aids.

We want to build a small, lightweight, native version control system for ML: something that does one thing well and composes with other tools to produce the system you need.

We need your help to make this a reality. If you’ve built this for yourself, or are just interested in this problem, join us to help build a better system for everyone.

Join our Discord chat or get involved on GitHub.



Keepsake is a project from Replicate.
