How to cure the reproducibility crisis in data notebooks

Reproducibility is the foundation of data work. Learn how the right tools keep data scientists and analysts on the right side of the scientific method.

Reproducibility is the cornerstone of all human knowledge. Without the ability to reproduce the results of research, there’s no foundation for trusting it. And without a foundation, there’s no way to keep building. That’s why it’s a key component of any scientific endeavor.

A static medium, such as a letter, is reproducible by default. It doesn’t have any dependencies. It is self-sufficient. The same letter will work regardless of who’s reading it, whether early in the morning or late at night — today or 100 years from now. But computational mediums are different. They’re fluid. They’re dependent on the environments they operate in and the sources they’re connected to. They’re constantly evolving.

The tools we use to uncover and document knowledge must support reproducibility, not stand in its way. At Deepnote, we’re building a collaborative notebook — a computational medium designed to advance and distribute knowledge. So we think a lot about reproducibility. Let’s look at what causes reproducibility problems in notebooks and how to solve them.

The reproducibility crisis

If you follow science news, you’ve probably come across a startling term in the past several years: “the reproducibility crisis.”

In 2016, the British scientific journal Nature revealed the findings of a survey of 1,576 researchers who were asked about reproducibility in their research. More than 70% of respondents said they had tried and failed to reproduce another scientist’s experiments. Over half had also failed to reproduce their own experiments.

This was fuel for a fire that had been burning for a long time. Some observers trace it back to a 2005 essay from scientist John P. A. Ioannidis: “Why Most Published Research Findings Are False.”

Plenty of experts have pushed back on the idea of there being a “crisis,” chalking up some of the issues to being inherent in experimentation. But multiple other studies have found issues with reproducibility in a wide range of fields, from medicine and psychology to economics.

And there’s yet another scientific field that faces its own reproducibility crisis: data. One study found that only 25% of Jupyter notebooks on GitHub are reproducible.

The problem isn’t that data professionals are unable to reproduce each other’s work — it’s that the tools they use (i.e., traditional data notebooks) make reproducibility difficult to achieve.

Reproducibility issues in data notebooks

Data notebooks are still the ideal medium for data work. They allow data professionals to explore, collaborate on, and share data using a single medium for querying, coding, visualizing, and storytelling.

But there are three issues that make reproducing notebooks problematic:

Datasets:

Access to data: working with local files (e.g., CSV exports stored in your downloads folder) or running into access rights (e.g., using data from a data warehouse that only the original author has access to).
Data versioning: training a model or running a report using data from an S3 bucket, only to find the data or schema has changed the next time the notebook is run.
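
One lightweight way to notice that a dataset has changed between runs is to record a checksum of the file when the analysis is first run and compare it on later runs. The sketch below assumes a local file; the helper names `file_sha256` and `assert_dataset_unchanged` are hypothetical and are not part of any notebook product.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_dataset_unchanged(path, expected_sha256):
    """Raise ValueError if the file no longer matches the recorded digest."""
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} changed: expected {expected_sha256[:12]}, got {actual[:12]}"
        )
    return True


# Demo with a throwaway CSV standing in for a real export.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "export.csv"
    csv_path.write_text("id,value\n1,10\n2,20\n")

    # Record the digest alongside the notebook on the first run.
    recorded = file_sha256(csv_path)
    unchanged_ok = assert_dataset_unchanged(csv_path, recorded)

    # Simulate a schema change made after the digest was recorded.
    csv_path.write_text("id,value,extra\n1,10,a\n")
    try:
        assert_dataset_unchanged(csv_path, recorded)
        change_detected = False
    except ValueError:
        change_detected = True
```

A check like this only detects drift. It does not restore the earlier data, and it does nothing for the access problem.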

Environments:

Replicating the same hardware, GPUs, libraries, etc.
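
To make the software side of this concrete, here is a minimal sketch that records the interpreter version and installed package versions so a later run can be compared against them. It uses only the Python standard library; the function names `environment_snapshot` and `diff_packages` are illustrative assumptions, and hardware details such as GPUs are not captured.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(recorded, current):
    """Map each package whose version differs to (recorded, current)."""
    names = set(recorded) | set(current)
    return {
        name: (recorded.get(name), current.get(name))
        for name in sorted(names)
        if recorded.get(name) != current.get(name)
    }


snapshot = environment_snapshot()
# The snapshot is JSON-serializable, so it can be saved next to the notebook.
snapshot_json = json.dumps(snapshot, indent=2)

# Comparing a recorded package list with a hypothetical later one.
example_diff = diff_packages(
    {"a": "1.0", "b": "2.0"},
    {"a": "1.0", "b": "2.1", "c": "0.1"},
)
```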

States:

Out-of-order execution: Traditional notebooks are executed cell by cell, allowing for a cell to be executed out of order or even repeatedly. This has the potential to create a hidden state.

Hidden states: Every time a notebook is edited but not executed, it becomes stale, as executing it again would create different outputs than the ones already present.

Deleted cells: It’s all too easy to accidentally lose the cells you’re working on: closing a notebook without saving it, hitting the wrong keyboard shortcut and deleting important cells, or opening the same notebook in multiple browser windows and overwriting your own work.
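
The out-of-order execution problem can be illustrated with a small simulation. The sketch below treats three strings as notebook cells that share one namespace, which is an assumption made for illustration rather than how any particular notebook kernel is implemented. Running the same cells in a different sequence produces a different result, even though the code on screen is identical.

```python
# Three "cells" that share state through a common namespace.
cells = {
    "load": "values = [1, 2, 3]",
    "scale": "values = [v * 10 for v in values]",
    "report": "total = sum(values)",
}


def run_cells(order):
    """Execute the named cells in the given order and return the final total."""
    namespace = {}
    for name in order:
        exec(cells[name], namespace)
    return namespace["total"]


# A clean top-to-bottom run.
top_to_bottom = run_cells(["load", "scale", "report"])

# The same notebook, but the "scale" cell was run twice before reporting.
scale_run_twice = run_cells(["load", "scale", "scale", "report"])
```

Here `top_to_bottom` is 60 and `scale_run_twice` is 600. Nothing in the saved notebook records which sequence produced the output a reader sees.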

These issues don’t make reproducibility impossible, but they do make it complicated and slow. That defeats the purpose of data work in the first place: accelerating time to insight and transforming raw data into actionable intelligence for the business.

Data teams aren’t dealing with uncontrollable variables that make their work impossible to reproduce; they’re contending with tools that prevent them from moving at speed.

The reproducible future

Notebooks are the perfect tools for data teams, but they need to be better. What they require is:

Moving to the cloud: A cloud-based notebook allows all team members to automatically share the same hardware, GPUs, libraries, database connections, etc. Fully customizable environments aren’t chained to individual machines — they’re securely shared across teams.

No more time is wasted on sharing files and configuring environments.

Versioning: Modern data notebooks allow you to preview and restore older versions of notebooks. You can track and view everything that has happened in a project.

Version history is built in and done automatically, eliminating the need for using other tools.

Reactivity: Modern data notebooks are always kept up to date. Whenever a notebook’s code is changed or a cell is deleted or moved, its outputs are automatically updated as if the notebook had been executed fresh, from top to bottom.

This also makes iteration loops tighter (e.g., when building charts, updating the code automatically updates the visualizations so there’s an instant feedback loop on the changes in the code and data).
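
The idea behind reactivity can be sketched as a small dependency graph: each cell declares which other cells it reads, and redefining a cell re-runs it along with everything downstream. This is a toy model under stated assumptions, not a description of how Deepnote or any other product implements it. The `ReactiveNotebook` class is hypothetical, and it assumes cells only read cells that were defined before them.

```python
class ReactiveNotebook:
    """Toy reactive notebook: changing a cell re-runs its dependents."""

    def __init__(self):
        self.cells = {}    # cell name -> (input names, function)
        self.values = {}   # cell name -> latest output
        self.run_log = []  # order in which cells were executed

    def define(self, name, inputs, func):
        """Add or replace a cell, then refresh it and its dependents."""
        self.cells[name] = (tuple(inputs), func)
        self._run(name)
        for dependent in self._downstream(name):
            self._run(dependent)

    def _run(self, name):
        inputs, func = self.cells[name]
        self.values[name] = func(*(self.values[i] for i in inputs))
        self.run_log.append(name)

    def _downstream(self, name):
        """Cells that depend on `name`, directly or not, in definition order."""
        affected = {name}
        ordered = []
        for candidate, (inputs, _) in self.cells.items():
            if candidate not in affected and affected.intersection(inputs):
                affected.add(candidate)
                ordered.append(candidate)
        return ordered


nb = ReactiveNotebook()
nb.define("raw", [], lambda: [1, 2, 3])
nb.define("doubled", ["raw"], lambda raw: [x * 2 for x in raw])
nb.define("total", ["doubled"], lambda doubled: sum(doubled))
total_before = nb.values["total"]

# Editing the first cell refreshes every cell that depends on it.
nb.run_log.clear()
nb.define("raw", [], lambda: [10, 20])
total_after = nb.values["total"]
rerun_order = list(nb.run_log)
```

After the edit, `total_after` reflects the new data without any cell being re-run by hand, which is what removes the stale-output problem.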

Conclusion

Reproducibility arguably matters more than anything. Even the most game-changing insight will fail to drive results if it can’t be reproduced. That’s why we’ve made reproducibility a first-class citizen at Deepnote.

Data teams don’t lack details, raw data, and research materials. They lack the ability to quickly share their work with teammates and have it duplicated.

Exploratory programming is messy. It’s supposed to be. And that can make reproducibility a daunting challenge. But teams shouldn’t have to rely on a litany of slow, complicated tools to achieve it. They deserve tools that take the work out of proving their analysis and helping team members build on it.

Simplify notebook reproducibility with Deepnote

Get started for free to see how Deepnote makes it easy to create and share reproducible data notebooks.
