Spark
What is Spark?
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Spark in Deepnote
Deepnote is a great place for working with Spark! This combination allows you to leverage:
- Spark's rich ecosystem of tools and its powerful parallelization
- Deepnote's beautiful UI, its generative AI tools, collaborative workspace, and data apps
Connecting to a remote cluster
A strong motivation for using Spark is its ability to process massive amounts of data, often on large clusters at the major cloud providers (AWS EMR, GCP Dataproc, Databricks, or Azure HDInsight) or on clusters managed by your own team. You can use those clusters as the back end for heavy computation while using Deepnote as the client, thanks to Spark Connect, the decoupled client-server architecture introduced in Spark 3.4.0.
Requirements
On your cluster:
- Spark >= 3.4.0
- Secure network connectivity between Deepnote and the cluster
- The Spark server started with Spark Connect enabled (see the Spark Connect documentation and the example below)
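For a plain Spark distribution, starting the server looks like the line below; this is a sketch assuming Spark 3.4.0 built against Scala 2.12, and managed services such as EMR or Dataproc have their own mechanisms for enabling Spark Connect:
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0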
In your Deepnote project:
- PySpark >= 3.4.0
For example, you can use the jupyter/all-spark-notebook image from Docker Hub as a starting point and install PySpark during project initialization, though ideally the dependency is baked into the image itself.
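As a minimal sketch, an initialization step could install the client with the Spark Connect extras; the version pin here is an assumption, so match it to your cluster's Spark version:
!pip3 install --upgrade "pyspark[connect]>=3.4.0"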
General instructions
For AWS EMR, GCP Dataproc, Azure HDInsight, or other clusters, follow the Spark Connect instructions in the Spark documentation.
from pyspark.sql import SparkSession
# This example uses a remote EMR cluster
spark = SparkSession.builder.remote("sc://ec2-1-2-3-4.compute-1.amazonaws.com:15002").getOrCreate()
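Once the session is created, you can confirm the connection with a trivial job; this sketch assumes only the spark session from above, and the EMR hostname is a placeholder to replace with your own endpoint:
# Executed on the remote cluster; only the tiny result is sent back
spark.range(5).show()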
Databricks
For Databricks, you can leverage Databricks Connect.
# Use X.Y.* to match your cluster's Databricks Runtime version
!pip3 install --upgrade "databricks-connect==13.0.*"
from databricks.connect import DatabricksSession

# Replace the placeholders with your workspace URL, a personal
# access token, and the ID of the cluster to attach to
spark = DatabricksSession.builder.remote(
    host="my_host",
    token="my_token",
    cluster_id="my_cluster_id",
).getOrCreate()
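If the session builds successfully, a minimal smoke test (assuming nothing beyond the connected spark session) confirms that queries reach the cluster:
# A trivial query executed on the Databricks cluster
spark.sql("SELECT 1 AS ok").show()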
Interfacing with Deepnote features
Some features, such as Chart blocks, require the data to be in Pandas DataFrames. You can use the .toPandas()
function to collect a remote Spark DataFrame as a local Pandas DataFrame. Make sure the data will fit into the memory of your Deepnote machine: either pick a larger machine, or aggregate or sample the data before the conversion, as in the sketch below.
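As a sketch, assuming a connected spark session and a hypothetical events table with an event_date column (both names are placeholders):
# Hypothetical table; substitute your own data source
df = spark.read.table("events")

# Aggregate on the cluster, then collect only the small result
daily_counts = df.groupBy("event_date").count().toPandas()

# Or sample a small fraction of rows before converting
preview = df.sample(fraction=0.01).toPandas()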