Beyond the Random Seed: Achieving Long-Term Model Reproducibility through Time Travel and Data Tracking

Simon Stiebellehner
9 min read · Jun 1, 2023


“I know an MLOps joke, but it’s hard to reproduce.”

In this article, you’ll read about

  • Why reproducibility is needed beyond experimentation
  • Why model reproducibility starts with data
  • How you can make sure that changing data doesn’t jeopardize your model’s reproducibility
  • How Time Travel and Data Tracking can help you achieve full end-to-end reproducibility

Are you interested in curated, high-quality content around MLOps?
Follow me on LinkedIn to stay up-to-date!

> Do you have feedback, suggestions or would like to have a virtual coffee? Message me on LinkedIn!

Thanks to my colleagues at TMNL (specifically Michael Uryukin, Denis Dallinga and Axel Goblet) for the great discussions around this topic, which inspired this blog post.

The Need for Reproducibility goes Beyond Experimentation

“Reproducibility” generally refers to the ability to execute the same (sequence of) actions at different points in time and arrive at the same result. It’s closely related to traceability, but instead of only keeping track of what’s happened, you actually repeat the past. That said, you could swap out “reproducibility” for “traceability” for the larger part of this article and it would still make sense. Great, isn’t it?

Obviously, reproducibility is imperative for effective and efficient experimentation, which is why it also plays an important role in Machine Learning and Data Science. You’d really want to end up with exactly the same model if you trained it using the exact same data, hyperparameters, library versions and so forth.

When Data Science professionals talk about reproducibility, they typically refer to reproducing experiments — training and evaluating various models with the objective of finding the model that best solves the optimization problem at hand. The need for reproducibility is not limited to experiments, however; it may just as well be a non-negotiable requirement for a vetted model running in production and serving millions of users.

Reproducibility may be Non-Negotiable if you’re in a Regulated Industry or working on a High-Risk Use Case

The ability to unambiguously track or even reproduce results of a model running in production at any point in time for several years may be a requirement in specific areas of highly-regulated industries, such as finance. For example, if a financial institution’s model is used to decide upon the success of applications for loans, this process must be auditable… and auditors may want to go back several years (often 5–7) and understand what’s been happening, what results were produced by that model, why and how they were generated.

This requirement per se already poses a challenge for many organisations. By now, thanks to the strong MLOps community, there’re so many excellent tools that help you keep tight track of what’s been happening to build up a bullet-proof audit trail: metadata stores, model registries, experiment trackers, and more. However, there’s one challenge that definitely remains:

Model reproducibility starts with data. After all, the inner workings of your model — its weights — are largely determined by the data it was trained on.

No matter how many random seeds you set, if your data changes irrevocably, your model and its results lose their reproducibility. You might be inclined to say “well, I just snapshot the tables I use!”. Good for you! However, what if you’re training your model on big data — say hundreds of gigabytes or even terabytes? Perhaps you’re also performing batch inference with your model on similar quantities of data on a regular basis. At the same time, your data sources keep evolving over time: tables added, schemas changed, rows added, rows marked for deletion (or actually deleted), rows changed, and more.

How are you going to reproduce results of a model if the underlying data is very large and keeps changing over time? Let’s look at a few options:

  • 🛤️ Track that data!
  • ⛓️ Sync copy it!
  • 📄 Async copy it!
  • ⏰ Time travel!
  • ❓… or …

🛤️ Data Tracking

“Data Tracking” is a fairly general term. In essence, it really just means you keep track of the data you are using for your model. Practically, in the context of MLOps, it usually boils down to inserting the path (local, S3, …) or query string that you used to load data from some data storage into your metadata store, which keeps track of what’s been happening around your model. Popular MLOps solutions such as neptune, comet ML and W&B offer this capability (W&B and comet ML even allow you to go a step further… more on that below).
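To make this concrete, here’s a minimal sketch of what such data tracking typically looks like in code. It uses MLflow’s tracking API as a stand-in (neptune, comet ML and W&B offer equivalent calls), and the query, path and run name are made up:

```python
import mlflow

# Hypothetical query and path used to load the training data -- in practice,
# this is whatever your loading logic runs against the warehouse/lakehouse.
TRAINING_QUERY = "SELECT * FROM transactions WHERE booking_date >= '2023-01-01'"
DATA_PATH = "s3://my-bucket/features/transactions/"

with mlflow.start_run(run_name="model-training"):
    # Store *where* the data came from alongside the usual params and metrics,
    # so the run's metadata points back to its inputs.
    mlflow.log_param("data_query", TRAINING_QUERY)
    mlflow.log_param("data_path", DATA_PATH)
    # ... load data, train the model, log metrics and the model artifact ...
```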

This form of data tracking is great for quick experimentation and limited requirements on the duration of reproducibility. After all, if my underlying data source keeps evolving, the query I used to fetch data on Monday may not return the same data on Tuesday. Which means that just keeping track of the query I used to fetch the data is not going to ensure reproducibility in the mid and long term. In order to achieve long-term reproducibility, we need to look beyond our “model world” and consider a bit more of the data value chain.

The gist: Don’t rely on it alone if you need mid- or long-term reproducibility.

⛓️ Synchronous Copy

If our data source keeps evolving over time, what if we just created our own snapshot of it and tied it to our model (e.g. via some data tracking functionality) whenever we run the model? As part of the data loading logic, we’d simply write the exact same data we initially load to a storage that’s in our control and not changing over time. We’re essentially creating our own snapshot of the data source at the point in time of running our model. Some MLOps tools such as W&B or comet.ml include this functionality on top of the aforementioned “Data Tracking” capability.
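For illustration, a minimal PySpark sketch of what such a synchronous copy boils down to (the query, paths and run ID are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
run_id = "2023-06-01-model-run-42"  # e.g. taken from your experiment tracker

# 1. Load the data exactly as the model will consume it.
df = spark.sql("SELECT * FROM transactions WHERE booking_date >= '2023-01-01'")

# 2. Synchronously write a run-scoped snapshot to storage you control
#    and track the snapshot path against the model run.
snapshot_path = f"s3://ml-snapshots/transactions/{run_id}/"
df.write.mode("overwrite").parquet(snapshot_path)

# 3. Train on the snapshot so model and data are tied together.
train_df = spark.read.parquet(snapshot_path)
```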

Inarguably, it would do the job of ensuring your model can run on the exact same data again at any future point in time. As long as we ensure we manage these snapshots appropriately and tie them to a model run, we’re good from a reproducibility perspective. However, it’s a brute force and naive approach with lots of downsides:

  • Remember, your model is operating on big data. Naively copying the data your model consumes on every model run (even if we do some smart caching) will make your storage and compute costs rise faster than the number of self-proclaimed AI experts since ChatGPT came out.
  • Synchronous copying as part of your model pipeline negatively impacts the performance of every single model run. This won’t make your Data Scientists and MLEs very happy.
  • Especially relevant if you’re using Spark: the full write of your input data may require a different cluster configuration than would be optimal if you didn’t do it. This may come at the expense of performance at other stages of the job.
  • Somebody has to own these snapshots. What if these snapshots contain personal data and a GDPR request comes in that forces you to delete the data of a specific person? Good luck if you are sitting on a mountain of unmanaged, not-cataloged snapshots. On the other hand, keeping your stuff in order — i.e. really owning and managing that data as part of your team’s responsibilities — requires expertise and time.

The gist: Don’t do it if your models are running on big data.

📄 Asynchronous Copy

Similarly to “Synchronous Copy”, you’re still creating a snapshot of the data used in your model run. The key difference is that, obviously, the copy is done asynchronously. Concretely, this means that you wouldn’t load and write the data as a direct part of your model pipeline; instead, you would only load it and trigger a job that performs the copy outside of your model (see the sketch after the list below). This variant has a few advantages over a synchronous copy:

  • No increased runtime of your model pipeline.
  • You can choose the optimum service, tool, infrastructure to handle the copy job, independent of your model pipeline. This means you can optimize for costs a lot better.
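As a rough sketch (assuming an AWS-style setup; the queue URL, message shape and helper name are hypothetical), the model pipeline only records what it read and fires a trigger, while a separate worker performs the actual copy:

```python
import json
import time

import boto3  # assuming AWS; any queue or workflow trigger works the same way

sqs = boto3.client("sqs")
COPY_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/data-copy-jobs"

def request_async_copy(run_id: str, query: str, snapshot_path: str) -> None:
    """Fire-and-forget: ask a separate copy worker to snapshot the data the
    model pipeline just loaded, instead of copying it in-line."""
    sqs.send_message(
        QueueUrl=COPY_QUEUE_URL,
        MessageBody=json.dumps({
            "run_id": run_id,
            "query": query,
            "requested_at": time.time(),
            "snapshot_path": snapshot_path,
        }),
    )

# In the model pipeline: load the data, train, then call request_async_copy().
# The model should only be promoted once the copy worker reports success.
```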

While there are some strong arguments for preferring async over sync in this context, be aware that it requires more expertise and effort to set up such an async copy in a reliable way. Some key challenges are, for example:

  • Is the async job indeed copying exactly what the model pipeline loaded? Even seconds or minutes may make a difference.
  • What if the async job failed but my model pipeline succeeded? You can’t use that model in production. Trust is great, automation is better — you should tie a successful async job run to a promotion of the model to higher environments.

Keep in mind that beyond these two bullets, some of the disadvantages of sync copies apply to async copies as well.

The gist: If you opt for copying and if you possess the appropriate expertise to make it work reliably, give preference to async over sync copies.

⏰ Time Travel

As Doc put it in Back to the Future: “Roads? Where We’re Going, We Don’t Need Roads.”

… or put into our context: “Copies? Where We’re Going, We Don’t Need Copies.”

That’s music to our ears!

A fiercely advertised feature of modern Data Warehouses and Data Lakehouses is the ability to travel in time. Without going into the details of how this works technically (read up on it here [Apache Iceberg] or here [BigQuery], for example), it’s essentially the ability to fetch data from a historical state of your DWH/DLH. Depending on the underlying technology, this may be done by automatically retaining a timeline or log of changes. The consumer perspective of it is fairly smooth — simply use some query element such as AS_OF {DATE} or similar. On a Delta Lake table, for example, a Spark read can specify a time travel date via the "timestampAsOf" option.
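A minimal sketch of such a read, assuming PySpark with Delta Lake and a made-up table path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Delta Lake table as it looked at a specific point in time.
historical_df = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-01-15")   # the "time travel date"
    .load("s3://data-lake/transactions/")    # hypothetical table location
)

# Alternatively, pin an exact table version instead of a timestamp:
# .option("versionAsOf", 412)
```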

Time travel sounds like exactly what we need, right? Yes, but… Obviously, retaining an extensive history of big, quickly changing data will become expensive, not to mention the environmental impact. It is advisable to consider keeping a “hot history” and a “cold history” by using appropriate storage tiering. For example, you could keep the last 6 months in your “primary” DLH/DWH instance for ad-hoc time travel while moving earlier states to cold storage. Your company and the environment will thank you!
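On Delta Lake, for instance, such a retention policy could be expressed via table properties (the durations below are illustrative, not a recommendation): delta.logRetentionDuration bounds how far back time travel can go, and delta.deletedFileRetentionDuration controls how long removed data files stick around before VACUUM may delete them.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keep roughly 6 months of "hot" history on the primary Delta table; older
# states would be archived to cheaper cold storage before they expire here.
spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 180 days',
        'delta.deletedFileRetentionDuration' = 'interval 180 days'
    )
""")
```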

The gist: Yes, time travel is a key piece of the puzzle. However, a puzzle always has more than one piece.

⏰ 🛤️ Time Travel + Data Tracking = Reproducibility

Time travel helps you reconstruct the historical state of your data. But how do you know which state was used by your model? This is where the aforementioned “Data Tracking” functionality comes into play. Most metadata tracking tools allow you to track, e.g., query strings or paths pointing to specific locations on storage systems. Tracking this in combination with the selected “time travel date” tightly couples a specific model run with a reference to the exact state of data it consumed.
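Putting the two together, a minimal sketch (again with MLflow as a stand-in for your metadata store and Delta Lake for time travel; the path and date are made up) could look like this:

```python
import mlflow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

TABLE_PATH = "s3://data-lake/transactions/"  # hypothetical table location
TRAVEL_DATE = "2023-06-01"                   # the state of data this run consumes

with mlflow.start_run(run_name="model-training"):
    df = (
        spark.read.format("delta")
        .option("timestampAsOf", TRAVEL_DATE)
        .load(TABLE_PATH)
    )
    # Couple the run to the exact data state it consumed -- no copy involved.
    mlflow.log_param("data_table_path", TABLE_PATH)
    mlflow.log_param("data_timestamp_as_of", TRAVEL_DATE)
    # ... train, evaluate, log the model ...
```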

Sidenote: For large data sets, I’d strongly discourage using data tracking functionality that copies all your referenced data into some MLOps solution provider’s cloud storage. That’s great for your 10 MB CSV, but it’s terrible from a cost, speed and data management perspective for anything beyond that.

[Diagram: Time Travel + Data Tracking illustrated for three cases: t-0, t-2 and t-n (long-term history).]

What do we gain from this solution?

  • Bullet-proof long-term history making your auditors happy.
  • Minimal overhead for teams developing models.
  • No performance degradation caused by synchronous copies of data on every model run.
  • Time travel is resource- and cost-efficient as it’s based on capturing changes in logs instead of naively duplicating entire tables whenever changes happen.
  • Ability to extend towards hot and cold storage to balance out cost and retrieval effort/speed without compromising on auditability.

The gist: Yes, this is not a quick fix. Good solutions rarely are. However, it’s a solid long-term solution that gives you peace of mind and provides more value than “just” bullet-proof traceability and reproducibility.

Caveat: There may be Limits on History Length

Apart from increasing storage costs (d’uh), popular Data Lakehouse frameworks such as Apache Hudi, Apache Iceberg or Delta Lake do not explicitly mention problems with retaining long histories (let me know if you find sources stating the opposite, please!). In contrast, some Data Warehouses such as Snowflake or BigQuery do explicitly state limits on how far back you can travel: a maximum of 90 days for Snowflake and 7(!) days for BigQuery. Anything beyond this time frame requires taking periodic table snapshots and restoring them when needed. That may be completely fine if your data changes in large batches once a day or less; however, it will impair your ability to reproduce the exact state of data your model consumed further in the past. That might be a real issue, forcing you back to sync or async copy approaches, e.g. creating table snapshots at model run time and storing the reference in your metadata store.
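For BigQuery, such periodic snapshots could look roughly like the sketch below, using its table snapshot feature (dataset and table names are made up); the snapshot reference would then be logged to your metadata store alongside the model run:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a lightweight table snapshot pinned to the moment the model ran,
# since BigQuery's built-in time travel only reaches back 7 days.
client.query("""
    CREATE SNAPSHOT TABLE analytics.transactions_snapshot_20230601
    CLONE analytics.transactions
    FOR SYSTEM_TIME AS OF CURRENT_TIMESTAMP()
""").result()
```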

Happy reproducing!
