Spark
What is Spark?
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Spark in Deepnote
Deepnote is a great place for working with Spark! This combination allows you to leverage:
- Spark's rich ecosystem of tools and its powerful parallelization
- Deepnote's beautiful UI, its generative AI tools, collaborative workspace, and data apps
Connecting to a remote cluster
A strong motivation for using Spark is its ability to process massive amounts of data, often on large clusters at the major cloud providers (AWS EMR, GCP Dataproc, Databricks, or Azure HDInsight) or on clusters managed by your own team. You can use those clusters as the back end for heavy computation while using Deepnote as the client, thanks to Spark Connect, the decoupled client-server architecture introduced in Spark 3.4.0.
Requirements
On your cluster:
- Spark >= 3.4.0
- Secure network connectivity between Deepnote and the cluster
- The Spark server started with Spark Connect enabled (see the Spark Connect documentation and the example below)
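For a plain Spark distribution, starting the server looks like the line below; this is a sketch assuming Spark 3.4.0 built against Scala 2.12, and managed services such as EMR or Dataproc have their own mechanisms for enabling Spark Connect:
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.4.0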
In your Deepnote project:
- PySpark >= 3.4.0
For example, you can use the jupyter/all-spark-notebook image from Docker Hub as a starting point and install PySpark during project initialization, though ideally the dependency is baked into the image itself.
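As a minimal sketch, an initialization step could install the client with the Spark Connect extras; the version pin here is an assumption, so match it to your cluster's Spark version:
!pip3 install --upgrade "pyspark[connect]>=3.4.0"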
General instructions
For AWS EMR, GCP Dataproc, Azure HDInsight, or other clusters, follow the Spark Connect instructions in the Spark documentation.
from pyspark.sql import SparkSession
# This example uses a remote EMR cluster
spark = SparkSession.builder.remote("sc://ec2-1-2-3-4.compute-1.amazonaws.com:15002").getOrCreate()
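Once the session is created, you can confirm the connection with a trivial job; this sketch assumes only the spark session from above, and the EMR hostname is a placeholder to replace with your own endpoint:
# Executed on the remote cluster; only the tiny result is sent back
spark.range(5).show()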
Databricks
For Databricks, you can leverage Databricks Connect.
# Use X.Y.* to match your cluster's Databricks Runtime version
!pip3 install --upgrade "databricks-connect==13.0.*"
from databricks.connect import DatabricksSession

# Replace the placeholders with your workspace URL, a personal
# access token, and the ID of the cluster to attach to
spark = DatabricksSession.builder.remote(
    host="my_host",
    token="my_token",
    cluster_id="my_cluster_id",
).getOrCreate()
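If the session builds successfully, a minimal smoke test (assuming nothing beyond the connected spark session) confirms that queries reach the cluster:
# A trivial query executed on the Databricks cluster
spark.sql("SELECT 1 AS ok").show()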
Interfacing with Deepnote features
Some features, such as Chart blocks, require the data to be in Pandas DataFrames. You can use the .toPandas()
function to collect a remote Spark DataFrame as a local Pandas DataFrame. Make sure the data will fit into the memory of your Deepnote machine: either pick a larger machine, or aggregate or sample the data before the conversion, as in the sketch below.
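As a sketch, assuming a connected spark session and a hypothetical events table with an event_date column (both names are placeholders):
# Hypothetical table; substitute your own data source
df = spark.read.table("events")

# Aggregate on the cluster, then collect only the small result
daily_counts = df.groupBy("event_date").count().toPandas()

# Or sample a small fraction of rows before converting
preview = df.sample(fraction=0.01).toPandas()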