By Jan on May 4, 2021
You might be asking whether storage even matters in data science platforms in 2021. Well, whenever you copy data out of your data warehouse, you are probably wasting time and bandwidth, and maybe even breaching a security policy. Hence, most platforms today focus on integrating with existing databases: you submit a query, only the matching subset of your data is loaded into memory, and that memory is destroyed when you close your project.
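The pattern described above — push the query down to the database and load only the result — can be sketched with the standard library, using an in-memory SQLite database as a stand-in for the warehouse (the table and connector here are illustrative, not any platform's actual API):

```python
import sqlite3

# Stand-in for a data warehouse connection; in practice this would be
# a Snowflake, BigQuery or Redshift client.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 99.0)])

# Push the filter and aggregation down to the database: only this
# small result set is ever loaded into the notebook's memory.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "WHERE user_id = ? GROUP BY user_id",
    (1,),
).fetchall()
print(rows)  # [(1, 15.0)]
conn.close()
```

The full `events` table never leaves the database; only the aggregated row does.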
However, there are more aspects to storage than just the datasets themselves. You need a way to store your notebooks, helper scripts and libraries, auxiliary files and also computation outputs.
We surveyed some popular tools and found that approaches to general storage vary widely across data science platforms. A particularly interesting aspect is the “on-premise” offering, ideal for security-conscious customers who don’t want their data leaving their network. This can be implemented as full on-premise, where the entire platform is deployed into the customer’s cloud, or as a “control plane/data plane” model, where some functionality is still managed on the provider’s servers. Where the line between control plane and data plane falls depends heavily on the provider, but in general the data plane is responsible for storing data and executing user code (analytics, modelling, ML), while the control plane manages auth, cluster scaling and other miscellaneous tasks.
Let’s dive in and take a look at the storage methods of three popular tools, then compare them to Deepnote’s approach.
Databricks mounts /dbfs content as a folder in your root using FUSE.
Apart from the full Databricks platform, the company also offers a “Community Edition”. However, this is mostly intended for trial purposes, so we are going to focus on the full platform. Databricks adopts control plane/data plane model, where the data plane is deployed into your cloud (AWS, GCP or Azure), while the control plane sits on Databricks’ servers.
How does storage work? Databricks offers its proprietary “Databricks File System (DBFS)”, which is backed by an S3 (or equivalent) bucket inside your AWS account (so it is part of the data plane). The platform offers a seamless interface to upload your data into DBFS, and you can then access the data as if it were on your local machine, since DBFS is mounted as a folder in the root of your VM.
If you have data stored in object stores (e.g. S3, Azure Blob, etc.), you can also mount it using DBFS.
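Because the mount exposes DBFS through ordinary file paths, plain Python file APIs work on it unchanged. A minimal sketch of that access pattern, using a temporary directory as a stand-in for the `/dbfs` mount point (on a real cluster you would use a path like `/dbfs/mnt/data` directly):

```python
import pathlib
import tempfile

# Stand-in for the /dbfs mount point; on a Databricks cluster this
# would simply be pathlib.Path("/dbfs/mnt/data").
mount_point = pathlib.Path(tempfile.mkdtemp())

# A file "uploaded to DBFS" appears as an ordinary file under the
# mount point...
(mount_point / "sales.csv").write_text("region,total\nemea,42\n")

# ...so ordinary local-file code works without any special SDK.
with open(mount_point / "sales.csv") as f:
    header = f.readline().strip()
print(header)  # region,total
```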
Any code stored in Git repositories can also be mounted, via a separate UI screen, to the /Repos/ folder and accessed as if it were on your local machine. Databricks also provides functionality to pull upstream changes, either manually or programmatically via API calls. The code is managed by the control plane, so it is copied onto Databricks’ servers for the lifetime of your project.
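The programmatic pull goes through the Repos REST API (`PATCH /api/2.0/repos/{repo_id}` with a branch name updates the repo to that branch’s latest commit). A sketch that builds, but does not send, such a request — the host, token and repo id here are placeholders:

```python
import json
import urllib.request

def build_repo_pull_request(host: str, token: str,
                            repo_id: int, branch: str):
    """Build a PATCH request that updates a Databricks repo to the
    latest commit of the given branch (i.e. pulls upstream changes)."""
    url = f"{host}/api/2.0/repos/{repo_id}"
    body = json.dumps({"branch": branch}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )

# Placeholder values; actually sending the request would be
# urllib.request.urlopen(req) with a real workspace URL and token.
req = build_repo_pull_request("https://example.cloud.databricks.com",
                              "dapi-example-token", 123, "main")
print(req.get_method(), req.full_url)
```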
The notebooks themselves are stored by Databricks in their RDS database, not in your VPC. This can be problematic if you don’t want your data leaving your cluster in the form of cell outputs. However, Databricks gives you the option to provide an encryption key for the notebooks, so you can always revoke access and hence “destroy” any data that left your network. This is especially important in regulated industries, where companies need to prove that all data is encrypted with keys they manage.
Domino stores your project files in a bucket, and each sync creates a new version you can come back to later.
While Domino’s architecture distinguishes between a control plane and a data plane to some extent, the platform is deployed in its entirety into the customer’s cloud. You can, however, also choose a managed version hosted on Domino’s servers.
Work in Domino happens in projects, which are essentially collections of files. At rest, the files sit in an object store, such as an S3 bucket or a store with a similar API. When you start a project, those files are copied onto the disk of your VM, then copied back when you finish or request a sync manually. Another great feature is versioning: each sync creates a new version of your entire project, so you can return to previous versions at any time, which makes for great reproducibility. Notebooks are not special in Domino; they simply sit alongside the other files.
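The versioned-sync model can be sketched as a content-addressed snapshot: each sync hashes the full set of project files into an immutable version you can return to. This is a conceptual illustration of the idea, not Domino’s actual implementation:

```python
import hashlib

versions = []  # immutable history of whole-project snapshots

def sync(files: dict) -> str:
    """Snapshot the whole project (filename -> contents) and return a
    version id derived from its content."""
    digest = hashlib.sha256(
        "".join(f"{name}\0{body}\0"
                for name, body in sorted(files.items())).encode()
    ).hexdigest()[:12]
    versions.append((digest, dict(files)))
    return digest

v1 = sync({"notebook.ipynb": "{}", "train.py": "print('v1')"})
v2 = sync({"notebook.ipynb": "{}", "train.py": "print('v2')"})

# Every sync is a complete, retrievable version of the project.
assert v1 != v2
assert versions[0][1]["train.py"] == "print('v1')"
```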
However, this approach makes it slow to work with very large files, or with large numbers of files, due to the limitations of the object storage API. For these cases Domino also provides the Domino Datasets feature, which is mounted remotely via NFS. You can use it to start working quickly even on very large data, since you do not need to download it to your VM first.
Git repos can be accessed similarly to Databricks — you can mount them to a dedicated folder in your project.
Paperspace creates two special persistent folders, /notebooks and /storage, in your root, so you can store notebooks and code next to each other.
Gradient offers two main ways to store data: an NFS mount (called “Persistent storage”) or an S3-like object store (called “Versioned data”). Persistent storage is mounted to /storage and, as the name suggests, persists between VM restarts; you can use it to store arbitrary data and code, including Git repositories. The data itself lives in a Ceph cluster managed by Paperspace.
Versioned data lives in an S3 bucket that you manage in your own account, but you give the access keys to Paperspace so it can organise the bucket in the supported layout and take care of versioning. The data itself might still be cached on Paperspace servers, but only ephemerally. You can also mount other storage solutions, such as Azure Blob or Google Cloud Storage.
Notebooks are stored in a dedicated /notebooks folder that works exactly the same way as /storage. You can also store other auxiliary files (such as Python scripts or even entire repositories) in /notebooks/, so it is easy to import Python scripts. When your instance is off, you can still access your notebooks, but they are read-only.
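Because `/notebooks` is an ordinary persistent folder, importing a helper script is just a matter of having the folder on `sys.path`. A runnable sketch, using a temporary directory as a stand-in for `/notebooks` (on Gradient the folder already exists and the notebook’s working directory is usually already importable):

```python
import importlib
import pathlib
import sys
import tempfile

# Stand-in for /notebooks; on Gradient you would use that path directly.
notebooks_dir = pathlib.Path(tempfile.mkdtemp())

# A helper script saved next to your notebooks...
(notebooks_dir / "helpers.py").write_text(
    "def clean(s):\n    return s.strip().lower()\n"
)

# ...becomes importable once its folder is on sys.path.
sys.path.insert(0, str(notebooks_dir))
helpers = importlib.import_module("helpers")
print(helpers.clean("  Hello "))  # hello
```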
Deepnote tries to emulate the experience of a home folder on a local machine by putting all files into a versioned /work folder.
Deepnote tries to emulate the experience of working on your local machine, which is familiar to most users. It creates a persistent /work folder where you can store your notebooks, auxiliary files (Python scripts etc.) and even your data. We offer GitHub integrations that clone repositories to /work, so they live alongside the rest of the project and can also contain notebooks. The /work folder is stored in an encrypted NFS managed by Deepnote, and it can be versioned in its entirety to guarantee reproducibility.
If you prefer to keep your data separate, you can create a “Shared dataset” that lives in a GCS bucket managed by Deepnote, or you can mount any object storage solution, such as S3. Deepnote does not provide versioning for buckets, but it still caches their contents locally for faster access.
We are currently building our on-premise offering, where we are going to adopt a control plane/data plane model. However, we will store all data, including notebooks and outputs, in the data plane, to make sure your data never leaves your VPC, not even by accident.
Data science notebook platforms differ widely in the storage models they offer. The main differences are in where files live at rest, how notebooks and code are versioned, and whether data ever leaves the customer’s network.
Do you have any preferences with respect to storage? What features, metrics or UI elements do you care about most? Let me know in the comments.
If you want to start using Deepnote, you can do so for free on deepnote.com.