Tutorial: how to query pandas DataFrames with SQL
Pandas is a go-to tool for tabular data management, processing, and analysis in Python, but sometimes you may want to go from pandas to SQL.
Why? Perhaps you find pandas’ syntax intimidating and less intuitive than SQL, which is more beginner-friendly. Or maybe you miss some of the functionalities SQL offers.
The good news is you can work in Python and still use SQL on a tabular pandas DataFrame.
Let’s look at how to query a pandas DataFrame with SQL using either a Jupyter notebook or Deepnote.
Use cases for using SQL with pandas
Pandas comes with many complex tabular data operations. And, since it exists in a Python environment, it can be coupled with lots of other powerful libraries, such as Requests (for connecting to other APIs), Matplotlib (for plotting data), Keras (for training machine learning models), and many more.
But when it comes to talking to databases, it’s not always as powerful as SQL.
For example, an enterprise company may have a massive database where many tables need to be joined together based on different conditions. Performing conditional joins isn’t possible with pandas.
Additionally, SQL allows you to enforce different data constraints — such as unique, not null, primary key, etc. — to make sure your data is consistent and accurate. You can’t do that with pandas.
And if you need to perform advanced aggregations of data (e.g., calculating running totals, moving averages, and percentiles), it can be challenging to write solutions for these in pandas.
Using SQLAlchemy to query pandas DataFrames in a Jupyter notebook
There are multiple ways to run SQL queries in a Jupyter notebook, but this tutorial will focus on using SQLAlchemy — a Python library that provides an API for connecting to and interacting with different relational databases, including SQLite, MySQL, and PostgreSQL.
Using SQLAlchemy, you can retrieve, manipulate, and analyze large sets of data using SQL syntax directly from a Jupyter notebook.
Other options include the PandaSQL library, but it’s lacking in recent updates and has some significant run-time issues that make it less than ideal. There’s also the .query() method, but this is mostly limited to filtering DataFrames and doesn’t provide the exact SQL syntax you may want.
Compared to these options, SQLAlchemy is more straightforward and intuitive. It’s mostly used in an IPython environment, so all you need to get started is a Jupyter notebook.
Installing SQL Alchemy
To install SQLAlchemy, run the following command:
pip install sqlalchemy
Importing SQL Alchemy
To connect to an in-memory database (SQLite, in this case), follow these steps by running commands inside a Jupyter cell:
Step 1: Import SQLAlchemy
Step 2: Load the SQL extension
Step 3: Create a SQLite connection
engine = sqlalchemy.create_engine('sqlite:///mydatabase.db')
Step 4: Connect to the SQLite database
Running queries with SQLAlchemy
After connecting to an in-memory database, you should store data as tables. To do this, first create a dummy DataFrame:
import pandas as pd df = pd.DataFrame([["A",1,2], ["B",3,4], ["A",5,6], ["C",7,8], ["A",9,10]], columns = ["colA", "colB", "colC"])
Next, store the DataFrame in the SQLite database:
df.to_sql('data', con=engine, if_exists='replace')
In the above statement, we created table data in the SQLite engine. The
if_exists=’replace’ argument deletes the current table if it already exists.
Lastly, we can query the table data using SQL as follows:
results = %sql SELECT * FROM data WHERE colA = "A";
If we view the results, we get:
print(results) +-------+------+------+------+ | index | colA | colB | colC | +-------+------+------+------+ | 0 | A | 1 | 2 | | 2 | A | 5 | 6 | | 4 | A | 9 | 10 | +-------+------+------+------+
We can also convert the results to a pandas DataFrame as follows:
results.DataFrame() index colA colB colC 0 0 A 1 2 1 2 A 5 6 2 4 A 9 10
Using Deepnote to query pandas DataFrames with SQL
Deepnote comes complete with SQL support for pandas DataFrames — no downloading, installing, or importing required.
To start querying a pandas DataFrame using SQL, create a DataFrame as follows:
Then create a SQL block:
You can write any SQL query:
Similar to storing the results in a variable in a Jupyter Notebook, you can store the results in Deepnote as shown:
Since Deepnote uses jinjasql templating, you can pass Python variables, functions, and control structures (e.g., “if” statements and “for” loops) into your SQL queries.
Best practices for using SQL with pandas
Using SQL with pandas can make data analysis easier, but there are a few caveats.
Switching to SQL may mean missing out on different optimization techniques that could have been applied using Python. And if you’re using SQLAlchemy, remember that it creates an in-memory dataset. Therefore, you should only use it when you have sufficient memory.
Moreover, unlike pandas, which infers the data types by itself, SQL requires explicit specification when creating new tables. To make sure your data is stored and retrieved correctly, it’s important to use the appropriate data types for the columns in your DataFrame.
Lastly, keep in mind that SQL is a powerhouse for databases, not tabular DataFrames sitting in a Python environment. If you can complete your tasks using pandas, you may want to avoid leveraging SQL unnecessarily (unless it offers some major run-time benefits).
So there you have it — you’re ready to query pandas DataFrames with SQL.
As this tutorial demonstrated, using SQL with a Jupyter notebook requires multiple steps, which can be quite time-consuming. If you want to integrate SQL with an interactive Python environment minus the extra dependencies and effort, give Deepnote a try.
Combine pandas & SQL with Deepnote
Get started for free to explore, collaborate on, and share your data.
Share this post