You might be asking whether storage even matters in data science platforms in 2021. Well, whenever you're copying data out of your data warehouse, you're probably wasting time and bandwidth, and maybe even breaching security policies.
Hence, most platforms today focus on integrating with existing databases. You submit a query and only a subset of your data is loaded into memory; that subset is discarded when you close your project.
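The idea is easiest to see with a query pushed down to the database. Here is a minimal sketch using an in-memory SQLite database as a stand-in for a remote warehouse (the table and column names are made up for illustration):

```python
import sqlite3

# In-memory database standing in for a remote data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, country TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "DE", 10.0), (2, "US", 25.0), (3, "DE", 7.5), (4, "FR", 12.0)],
)

# Push the filter down to the database: only the matching subset
# ever reaches the notebook's memory.
subset = conn.execute(
    "SELECT id, amount FROM events WHERE country = ?", ("DE",)
).fetchall()
print(subset)  # [(1, 10.0), (3, 7.5)]
```

The filtering happens on the database side, so the notebook never holds (or transfers) the full table.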
However, there are more aspects to storage than just the data sets themselves. You need a way to store your notebooks, helper scripts and libraries, auxiliary files, and also computation outputs.
We surveyed some popular tools and found that approaches to general storage vary widely across different data science platforms. A particularly interesting aspect is the “on-premise” offering, ideal for security-conscious customers who don’t want their data leaving their network.
This can be implemented as fully on-premise, where the entire platform is deployed into the customer’s cloud, or as the “control plane/data plane” model, where some functionality is still managed on the provider’s servers. Where the line between control plane and data plane falls depends heavily on the provider, but in general, the control plane handles authentication, cluster scaling, and other orchestration tasks, while the data plane is responsible for storing data and executing user code (analytics, modeling, ML, etc.).
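To make the split concrete, here is a toy sketch of the two planes. All class and method names are illustrative (real platforms are far more involved); the point is that user data and user code stay inside the data plane, while the control plane only forwards instructions:

```python
class DataPlane:
    """Runs in the customer's cloud: stores data and executes user code."""
    def __init__(self):
        self._storage = {}

    def put(self, key, value):
        self._storage[key] = value

    def run(self, code, key):
        # User code only ever sees data held inside the data plane.
        return code(self._storage[key])


class ControlPlane:
    """Runs on the provider's servers: orchestration only, no user data."""
    def __init__(self, data_plane):
        self.data_plane = data_plane

    def submit_job(self, code, key):
        # The control plane forwards instructions; the data itself
        # never passes through the provider's servers.
        return self.data_plane.run(code, key)


plane = DataPlane()
plane.put("sales", [1, 2, 3])
control = ControlPlane(plane)
print(control.submit_job(sum, "sales"))  # 6
```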
Let’s dive in and take a look at the storage methods of three popular tools and compare them to Deepnote’s approach.
Databricks

Databricks mounts Databricks File System (DBFS) content as a folder in your root using FUSE.
Apart from the full Databricks platform, the company also offers a “Community Edition.” However, this is mostly intended for trial purposes, so we're going to focus on the full platform. Databricks adopts a control plane/data plane model, where the control plane sits on Databricks’ servers and the data plane is deployed into your cloud (AWS, GCP, or Azure).
How does storage work? Databricks offers its proprietary DBFS, which is backed by an S3 (or equivalent) bucket inside your AWS account (so it's part of the data plane). The platform offers an interface to upload your data into DBFS. Since DBFS is mounted as a folder in the root of your VM, you can access the data as if it was on your local machine.
If you have data stored in object stores (e.g., S3 or Azure Blob Storage), you can also mount it using DBFS.
Any code stored in Git repositories can also be mounted, via a separate UI screen, to the /Repos/ folder and accessed as if it was on your local machine. Databricks also provides functionality to pull upstream changes, either manually or programmatically via API calls. The code is managed by the control plane, so it's copied to the Databricks servers for the lifetime of your project.
The notebooks themselves are also stored by Databricks in their RDS database, not in your VPC. This can be problematic if you don’t want your data leaving your cluster in the form of cell outputs. However, Databricks offers you an option to provide an encryption key for the notebooks, so you can always revoke access and “destroy” the data that left your network. This is especially important for some regulated industries, where companies need to prove that all data is encrypted by keys they manage.
Domino Data Lab
Domino stores your project files in a bucket, and each sync creates a new version you can come back to later.
While Domino’s architecture distinguishes between a control plane and a data plane to some extent, the platform is deployed in its entirety to the customer’s cloud. You can, however, also choose a managed version hosted on Domino’s servers.
Work in Domino happens in projects, which are essentially collections of files. At rest, the files sit in an object store, such as an S3 bucket or another store with a similar API. When you start a project, those files are copied onto the disk of your VM, then copied back when you finish or request a sync manually.
One great feature is versioning — each update essentially creates a new version of your entire project, so you can come back to previous versions at any time, ensuring reproducibility. Notebooks are not treated as special in Domino; they simply sit together with other files.
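A hypothetical sketch of this "each sync creates a new project version" model: every sync copies the whole project tree into a numbered snapshot, so any previous state can be restored. Real platforms are smarter about it (deduplication, object-store versioning), but the mental model is the same. All paths and file contents below are made up:

```python
import shutil
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
project, versions = root / "project", root / "versions"
project.mkdir()

def sync(message):
    # Each sync copies the entire project into a new numbered snapshot.
    n = sum(1 for _ in versions.glob("v*")) + 1 if versions.exists() else 1
    snapshot = versions / f"v{n}"
    shutil.copytree(project, snapshot)
    (snapshot / ".message").write_text(message)
    return n

(project / "notebook.ipynb").write_text("{}")
sync("initial notebook")

(project / "notebook.ipynb").write_text('{"cells": []}')
sync("added cells")

# Earlier versions remain intact and can be restored at any time.
print((versions / "v1" / "notebook.ipynb").read_text())  # {}
```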
However, this approach makes working with very large files, or large numbers of files, slow due to the limitations of the object-storage API. For these reasons, Domino also provides the Domino Datasets feature, which mounts data remotely over NFS. You can use it to start working very quickly even on very large data sets, since you don't need to download them to your VM.
Git repos can be accessed similarly to Databricks: you can mount them into a dedicated folder in your project.
Paperspace Gradient

Paperspace creates two special folders — /notebooks and /storage — in your root that are persistent, so you can store notebooks and code next to each other.
Paperspace Gradient offers two main ways to store data: an NFS mount (called “persistent storage”) or an S3-like object store (called “versioned data”). Persistent storage is mounted to /storage and, as the name suggests, it persists between VM restarts. You can use it to store arbitrary data and code, including Git repositories. The data itself lives in a Ceph cluster managed by Paperspace.
Versioned data lives in an S3 bucket managed by you (and in your own account), but you give the access keys to Paperspace so it can organize the bucket in the supported way and take care of versioning. The data itself might still be cached on Paperspace servers, but only ephemerally. You can also mount other storage solutions, such as Azure Blob Storage and Google Cloud Storage.
Notebooks are stored in a specific /notebooks folder that works exactly the same way as /storage. You can even store other auxiliary files (such as Python scripts or even entire repositories) in /notebooks, so it should be fairly easy to import Python scripts. When your instance is off, you can still access your notebooks, but they are read-only.
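Importing a helper script that lives next to your notebooks can look like the following sketch. A temporary directory stands in for Paperspace's /notebooks folder, and the helper's name and contents are made up:

```python
import sys
import tempfile
from pathlib import Path

# A temp dir stands in for the /notebooks folder on a Gradient instance.
notebooks = Path(tempfile.mkdtemp())
(notebooks / "helpers.py").write_text("def double(x):\n    return 2 * x\n")

# Because the folder sits on disk like any other, a plain sys.path entry
# (or simply running the notebook from that folder) makes it importable.
sys.path.insert(0, str(notebooks))
import helpers

print(helpers.double(21))  # 42
```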
How does Deepnote compare?
Deepnote tries to emulate the experience of a home folder on a local machine by putting all files into a versioned /work folder.

It creates a persistent /work folder where you can store your notebooks, auxiliary files (Python scripts, etc.), and even your data.
We offer GitHub integrations that clone repositories to /work, so they live alongside the rest of the project and can also contain notebooks. The /work folder is stored on an encrypted NFS managed by Deepnote, and it can be versioned in its entirety to guarantee reproducibility.
If you prefer to keep your data separate, you can create a “Shared data set” that lives in a GCS bucket managed by Deepnote, or you can mount any object storage solution, such as S3. Deepnote does not provide versioning for mounted buckets, but it does cache the data locally for faster access.
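The local-caching idea can be sketched in a few lines: the first read "downloads" an object from the bucket, and later reads are served from disk. Here `fetch_from_bucket` is a made-up stand-in for a real object-store download:

```python
import hashlib
import tempfile
from pathlib import Path

cache_dir = Path(tempfile.mkdtemp())
fetch_count = 0

def fetch_from_bucket(key):
    # Stand-in for a slow network download from an object store.
    global fetch_count
    fetch_count += 1
    return f"contents of {key}".encode()

def cached_read(key):
    # Cache each object on local disk under a hash of its key.
    local = cache_dir / hashlib.sha256(key.encode()).hexdigest()
    if not local.exists():
        local.write_bytes(fetch_from_bucket(key))
    return local.read_bytes()

cached_read("data/train.csv")
cached_read("data/train.csv")  # served from disk, no second download
print(fetch_count)  # 1
```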
We're currently building our on-premise offering, where we're going to adopt a control plane/data plane model. However, we're going to store all data, including notebooks and outputs, in the data plane to make sure your data never leaves your VPC, not even by accident.
Data science notebook platforms differ widely in the types of storage they offer. The main differences are in:
- Treatment of notebooks, which can be either separate entities or embedded within the project
- Underlying storage mechanism, where data can be stored in NFS-like filesystems or in object stores (while this distinction is mostly invisible in the UI, it can have a high impact on the performance profile)
- Separation of entities — some platforms want you to have notebooks, code, and data completely separate while others allow you to decide on the organization yourself
If you want to give Deepnote a try, sign up for free.