Spark
What is Spark?
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Spark in Deepnote
Deepnote is a great place for working with Spark! This combination allows you to leverage:
- Spark's rich ecosystem of tools and its powerful parallelization
- Deepnote's beautiful UI, its generative AI tools, the collaborative workspace and data apps
Connecting to a remote cluster
A strong motivation for using Spark is its ability to process massive amounts of data, typically on large clusters run by the major cloud providers (AWS EMR, GCP Dataproc, Databricks or Azure HDInsight) or managed in-house. You can use such a cluster as the back end for heavy computation while using Deepnote as the client, thanks to Spark Connect, the decoupled client-server architecture introduced in Spark 3.4.0.
Requirements
On your cluster:
- Spark >= 3.4.0
- Secure network connectivity, set up using one of the options here
- The Spark Connect server started in your cloud provider of choice (example below)
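For illustration, on a self-managed Spark installation the Connect server is started with a script that ships in the Spark distribution (the package coordinates below assume Spark 3.4.0 built against Scala 2.12; managed services like EMR or Dataproc have their own enablement steps):
# Start the Spark Connect server; it listens on port 15002 by default
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0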
In your Deepnote project:
- PySpark >= 3.4.0
For example, you can use the jupyter/all-spark-notebook Docker Hub image as a starting point, since it comes with PySpark pre-installed. Alternatively, you can install PySpark during initialization (see below), but because of the package's size we recommend the Docker image route to keep notebook startup fast. Learn more about custom environments in Deepnote.
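If you do install at initialization, a single pip command is enough; the connect extra (available since PySpark 3.4.0) should pull in the Spark Connect client dependencies such as grpcio and pandas:
!pip install "pyspark[connect]>=3.4.0"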
General instructions
After starting your cluster, you need to connect to it from the notebook. For AWS EMR, GCP Dataproc, Azure HDInsight or other clusters, follow the instructions in the Spark documentation.
from pyspark.sql import SparkSession
# This example uses a remote EMR cluster; 15002 is the default Spark Connect port
spark = SparkSession.builder.remote("sc://ec2-1-2-3-4.compute-1.amazonaws.com:15002").getOrCreate()
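Once the session exists, any small query doubles as a connectivity check; a minimal sketch:
# Executed on the remote cluster; only the results travel back to the notebook
spark.range(10).show()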
Databricks
For Databricks, you can leverage Databricks Connect.
!pip3 install --upgrade "databricks-connect==13.0.*"  # or X.Y.* to match your cluster version
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.remote(
    host="my_host",
    token="my_token",
    cluster_id="my_cluster_id",
).getOrCreate()
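The host, token and cluster_id values are placeholders for your workspace URL, a personal access token, and the target cluster's ID. As with any Spark Connect session, a small read verifies the connection (the table name below is only an example; substitute one you have access to):
# Preview a few rows from a workspace table
df = spark.read.table("samples.nyctaxi.trips")
df.limit(5).show()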
Interfacing with Deepnote features
Deepnote supports displaying PySpark DataFrames as the output of a code block. If the last expression of the block evaluates to a PySpark DataFrame, its content is rendered in our data table component, where you can browse the data, apply filters and sorting, add cell formatting rules, and manage columns.
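For example, ending a block with a bare DataFrame expression is enough to trigger the table rendering (a minimal sketch):
from pyspark.sql import functions as F

df = spark.range(1_000).withColumn("squared", F.col("id") * F.col("id"))
df  # the last expression is a PySpark DataFrame, so Deepnote renders it as a data table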
PySpark DataFrames can also be used in chart blocks directly, with no need to convert them to pandas first. The same 10,000-row limit applies. Learn more about charting in Deepnote.
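If a DataFrame exceeds that limit, cap or sample it before pointing a chart block at it (a sketch; limit takes an arbitrary subset, while sample is more representative):
# Option A: keep at most 10,000 rows
chart_df = df.limit(10_000)
# Option B: an approximate random sample (tune the fraction to your data size)
chart_df = df.sample(fraction=0.01, seed=42)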