Deepnote
IntegrationsPricing
All posts
product

– by Deepnote team on June 14, 2022

Snowpark for Python: Bring Python to your data warehouse with Deepnote

What makes Snowpark such a valuable asset for data science workflows, and how you can use its superpowers in Deepnote—Snowpark’s Python-ready partner

At the Snowflake Summit this year, Snowflake introduced Snowpark—a library for processing data in your warehouse, with the languages, code, and scientific tools you already use. As hundreds of teams use Snowflake with Deepnote to explore and build data products, we used this opportunity to make the Snowflake + Deepnote experience even more delightful.

In this article, we'll walk you through what makes Snowpark such a valuable asset for data science workflows, and see how you can use its superpowers in Deepnote—Snowpark’s Python-ready partner. Let’s dive in!

The toolbox

We are using Deepnote notebooks as a control plane for our Snowflake warehouse. Deepnote is a collaborative data workspace for data science and analytics—it allows for real-time collaboration, integrates with Snowflake seamlessly with SQL blocks, and enables rapid exploration and prototyping with Python. This way, we benefit from the agility of a notebook's interface and Snowflake’s security, scalability, and compute.

Snowpark + Deepnote 101

The goal of this tutorial is to build an end-to-end data science pipeline—from transformation, data exploration, model training and evaluation, to model deployment. The sample dataset, which we will build ML models with later, contains 100k rows of telecommunications data. The full pipeline looks something like this

pipeline.png
💡 If you want to play with the integration right away, you can find a template and a detailed version of this walkthrough in this notebook.

1. Setting up Snowflake in Deepnote

In this section, we will use Snowpark to connect to and populate our Snowflake warehouse.

First, we establish a data connection with Snowflake from Deeponote. If you don’t have a Deepnote account, you can create one for free here. Your credentials are kept secret with Deepnote’s environment variables integration, and the connection can be shared and re-used across your team’s workspace.

credentials.png

If you are starting this demo from scratch, you will also need to set your Python environment to Python 3.8 (use the ⚙️ in the right side bar to access Deepnote's environments). The rest of the nitty gritty setup and authentication details are best viewed in the full notebook walkthough here.

Once the setup is complete, we then create the session object, which allows us to interact with the Snowflake warehouse via their Pandas-like API. Simply pass your connection details to the config method to connect to your Snowflake instance.

We are using a sample telecommunications dataset, a parquet file containing ~100k rows which is bundled with the notebook. Using the session object from Snowpark, we upload that file as a table in our warehouse as shown below.

Let’s check out our new table by displaying it as an interactive Pandas DataFrame in Deepnote.

Now that we’ve set the stage, it’s time for data transformation as we work towards a dataset we can use for machine learning.

2. Data transformation

We begin by transforming our data with Snowpark’s main abstraction—the DataFrame, which contains methods for operating on the data. Importantly, all transformations are computed in the warehouse itself, not the notebook.

You could pass pure SQL to the session object, but by using the DataFrame in a notebook environment, we’re getting intelligent code autocomplete and error highlights in the notebook, while benefiting from the speed.

Below is an overview of the data pipeline and respective transformations. We start with our raw table (loaded from the parquet file), build up several transformations, and finally land on a table that we can use to build a churn prediction model.

Structure.png

Below is an example of how the TELCO_DATASET was built using the DataFrame’s join method from Snowpark. Snowpark disambiguates duplicate column in the joined table by prepending a random string to the original column name. Chaining a new select statement gives us back a column simply called CUSTOMERID.

Now we have a table the data scientists can use for machine learning. Specifically, we’re going to use scikit-learn to build a churn prediction model based on the TELCO_DATASET table.

3. Exploratory data analysis

Once we’ve completed the transformations, our freshly minted TELCO_DATASET can be used for model training and prediction. We will first do the requisite exploratory data analysis (EDA) to make sure our dataset is suitable for model training. We examine the data for missing values, outliers, or anything else that might affect our subsequent modeling.

In Deepnote, a great deal of the exploratory work happens without writing any code, thanks to some built-in EDA magic, such as rich dataframes, built-in sorting and filters, and no-code charts. This way, we can examine the dataset for data types, missing values, percentage of given categories, distributions and more—shifting seamlessly from code to visualizations to querying, and back.

eda.gif

4. Model training

After doing the requisite EDA in Deepnote, we can train and evaluate the model. We applied a random forest classifier and took a look at the confusion matrix to evaluate the accuracy of the classifications.

After evaluating our model's accuracy, we can now deploy this model directly on the Snowflake warehouse—without changing our code or package environment. Moreover, any new data that ends up in our warehouse can be fed into the model to predict churn, without having to take our data out of the warehouse.

5. Model deployment

In order to deploy models to Snowflake, we use Snowpark's powerful UDFs (user-defined functions). UDFs are functions that are defined in pure Python but run on Snowflake compute. This is a perfect fit for our newly created model. They are decorated with @udf and require a few other details beyond a standard Python function. Read the Snowpark documentation to learn more.

Let’s use a Snowpark UDF to move our model onto Snowflake server where it can be used for predictions with incoming data.

The only thing left to do is use the model to make some predictions. We’ll take a sample from the TELCO_DATASET table to use for predictions. 

The Snowpark select statement above accomplished the following tasks:

  • It selected the CUSTOMER_ID, CHURNVALUE, and the output of the predict_churn function we wrote earlier (aliased to PREDICTED CHURN).
  • It saved the results set to a new table in our warehouse called CHURN_DETECTION.

Now we can query the new CHURN_DETECTION table with Deepnote’s native SQL blocks to check it out.

We have sucessfully deployed and made predictions with our model on Snowflake—without changing our code, package environment, or moving our data out of the warehouse.

Conclusion

With Deepnote and Snowpark, the entire database feels like an in-memory object, similar to the experience of using Pandas, but at a much larger scale. All the transformation and scientific computation happen where the data lives, without having to change tools, libraries, or languages.

Whether you’re a data scientist, analytics engineer, data engineer, developer or an analyst—if you’re using Snowpark, Deepnote will be a great addition to your toolbox. If Python is not your jam, Snowpark now supports a growing number of different programming languages including Scala and Java.

Get Started with Deepnote, Snowflake & Snowpark
Sign up for Deepnote for free.

Share this post

Twitter icon

Join the world's best data teams and get started with Deepnote

No credit card required. Run your first notebook in seconds.

Deepnote
Product
© 2022 Deepnote. All rights reserved.