
Ultimate guide to DuckDB library in Python

By Katerina Hynkova

Updated on August 20, 2025

DuckDB is an open-source, embeddable analytical database engine designed for high-performance SQL queries on local data.

Illustrative image for blog post

It was created in 2018 by database researchers Mark Raasveldt and Hannes Mühleisen at CWI in Amsterdam, inspired by the need to bring database-quality performance to in-process data science workloads. Often dubbed “the SQLite of analytics,” DuckDB runs entirely within a Python process (no separate server) and uses a columnar, vectorized execution engine to achieve blazing-fast analytical query speeds. DuckDB has rapidly gained popularity in the Python ecosystem for tasks like querying large Parquet files, performing complex aggregations, and seamlessly integrating with data frames. It is actively maintained under the DuckDB Foundation and released under the permissive MIT license, with the latest stable version being DuckDB 1.3.2 as of 2025.

Originally developed as an academic project, DuckDB’s development was motivated by the observation that many data analysts were re-implementing database operations in pandas or R, leading to suboptimal performance. Mühleisen and Raasveldt sought to create a zero-dependency library that could be easily embedded in Python or R, providing SQL query power without heavy infrastructure. The result was DuckDB – a library that retains SQLite’s simplicity of integration but is purpose-built for analytical workloads (OLAP) rather than transactional processing (OLTP). Over the past few years, DuckDB has evolved quickly, adding support for advanced SQL features, a variety of file formats, and extensions for functionalities like full-text search and geospatial analysis. The name “DuckDB” reflects its lighthearted origin: the creators likened their database to a resilient duck that can thrive anywhere, even naming it after Hannes’s pet duck, Wilbur, to emphasize its adaptability.

Within the Python ecosystem, DuckDB serves as a bridge between lightweight data analysis tools (like pandas DataFrames) and heavyweight database systems. It allows Python developers to perform complex SQL operations (joins, aggregations, window functions, etc.) on local data without needing a separate database server. DuckDB can query data stored in flat files (CSV, JSON, Parquet) directly, or even query in-memory pandas DataFrames as if they were tables, making it extremely versatile for data science workflows. This positions DuckDB as a complementary tool alongside pandas, NumPy, and machine learning libraries – it’s not a replacement for those, but rather a powerful engine to offload intensive operations. Given the growing data sizes even in “small” projects, learning DuckDB is becoming important for Python developers who want to leverage SQL’s expressive power and performance on local datasets.

DuckDB’s importance also stems from its strong performance and active community support. It can handle surprisingly large datasets on a single machine by using all available CPU cores and performing out-of-core processing (spilling to disk) when data doesn’t fit in RAM. Many industry case studies have demonstrated DuckDB’s capabilities – from speeding up ETL pipelines by orders of magnitude to handling billions of records in security analytics. The project is under active development with frequent releases (it reached 1.0 in 2024 and is now at 1.3.x) and has attracted contributions from a wide community. For Python developers, DuckDB offers the rare combination of ease-of-use (a simple pip install, then use SQL strings) with state-of-the-art query processing performance, making it a compelling addition to any data toolkit.

What is DuckDB in Python?

DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed to execute analytical queries fast while embedded in another application. In Python, DuckDB comes as a library (duckdb module) that you import and use directly in your code – there is no separate server to set up or administer. Technically, DuckDB represents a columnar relational database engine: data is organized in columns, and it uses vectorized processing (operating on chunks of data at a time) to maximize CPU efficiency. Under the hood, DuckDB’s architecture draws from academic research in query processing. It utilizes vectorized execution (processing data in batches using CPU-friendly operations) and columnar storage (storing columns contiguously) to greatly reduce per-row overhead and exploit modern hardware caches. This approach allows DuckDB to achieve high performance similar to big analytical databases, but in a lightweight form that runs inside your Python process.

From an architectural standpoint, DuckDB is optimized for read-heavy analytical workloads. It implements a custom query optimizer and execution engine that can handle complex SQL queries (joins, aggregations, subqueries, window functions, etc.) efficiently. DuckDB’s core is written in C++ and is carefully designed to minimize dependencies – it doesn’t rely on external systems, so installing it is trivial (just a Python package) and running it doesn’t require any services. The database can be persistent (stored in a single .duckdb file on disk) or purely in-memory. DuckDB ensures ACID transactions via a bulk-optimized Multi-Version Concurrency Control (MVCC) mechanism. This means you can rely on it for correctness and crash safety: transactions either complete fully or not at all, and readers don’t block writers due to MVCC versioning. However, unlike traditional client-server databases, DuckDB is not designed for dozens of concurrent users – it’s usually used by a single process (or a few threads) at a time, focusing on maximizing throughput for that process’s analytical queries.

 A key component of DuckDB’s design is its integration capability with data science tools. DuckDB can work directly with data frames and Arrow tables through a technique called replacement scans, allowing it to query pandas or Polars DataFrames as if they were database tables with zero copy overhead. It also has built-in support for reading and writing various data formats: CSV, Parquet, JSON, and even remote files on cloud storage (like S3) via extensions. The Python API provides multiple ways to interact: you can use the traditional DB-API (con.execute(...) and fetch results) or a higher-level relation API that allows chaining operations in a pandas-like manner. Internally, when you issue a SQL query, DuckDB’s engine will parse and optimize it, then execute using all available cores. It uses vectorized operators (e.g., vectorized joins, aggregates) and can transparently spill to disk if a dataset is larger than memory, thanks to its efficient out-of-core execution strategy.
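
To make the two interaction styles concrete, here is a minimal sketch (using a throwaway in-memory connection) that runs a trivial computation first through the DB-API style and then through the relation API:

import duckdb

con = duckdb.connect()  # in-memory database

# DB-API style: execute a SQL string and fetch the results
rows = con.execute("SELECT 42 AS answer").fetchall()
print(rows)  # [(42,)]

# Relation API style: build a query by chaining operations on a relation
rel = con.sql("SELECT * FROM range(10) AS t(i)")
evens = rel.filter("i % 2 = 0").aggregate("COUNT(*) AS n_even")
print(evens.fetchall())  # [(5,)]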

DuckDB’s performance characteristics make it stand out in local Python analytics. It is highly optimized for columnar operations – for example, scanning 100 million values for a sum or doing a group-by is very fast in DuckDB, often faster than doing the equivalent in pure Python/Pandas. Benchmarks have shown that DuckDB can approach or even exceed the speed of Apache Spark or Dask for many analytics tasks up to moderately large data sizes, all within a single process. This is achieved by eliminating Python-level overhead (the heavy lifting is in C++), using multiple threads, and applying database optimizations like predicate pushdown (reading only relevant parts of files) and late materialization. DuckDB’s vectorized engine processes data in fixed-size chunks (vectors of 2048 values by default), which strikes a balance between the per-row overhead of tuple-at-a-time processing and the memory cost of materializing entire columns. In practical terms, this means DuckDB can handle datasets with tens or hundreds of millions of rows on a laptop, especially when data is stored efficiently (e.g., in Parquet format or in DuckDB’s own file format).

In summary, DuckDB in Python is a self-contained SQL analytics engine. It consists of a library you import, a DuckDBPyConnection object that represents a database connection (either in-memory or to a file), and supports executing SQL queries that can interact with pandas DataFrames, files, or its internal tables. You can think of DuckDB as “pandas meets SQL”: it brings the power of SQL queries (declarative, set-oriented computations, complex joins, etc.) to your local Python environment with minimal friction. Because it’s a true database system internally, you get features like a robust query planner, multiple join algorithms, and efficient parallel execution automatically – all while staying in your familiar Python workflow. DuckDB’s design emphasizes embeddability, performance, and compatibility (with SQL standards and data formats), making it a unique tool for Python users dealing with data analysis.

Why do we use the DuckDB library in Python?

One of the primary reasons to use DuckDB in Python is to solve performance bottlenecks in data analysis that pure Python or pandas struggle with. DuckDB excels at handling large datasets and complex queries efficiently, all within a single process. For example, operations that would be slow or memory-intensive in pandas (such as joining multiple large tables, computing aggregations over tens of millions of rows, or reading gigabyte-sized CSV files) can be executed much faster with DuckDB’s optimized engine. In a real-world benchmark, DuckDB was able to scan and aggregate a 33-million-row dataset in just 2.3 seconds, whereas pandas could not complete the operation due to memory constraints (and Polars took ~3.3 seconds). This dramatic speedup comes from DuckDB’s vectorized, multi-threaded execution and the use of efficient data structures on disk (columnar storage) – you get near database-level performance without leaving your Python script. In short, we use DuckDB when we need database speed on local data.

Another key benefit of DuckDB is convenience and development efficiency. Without DuckDB, Python developers often resort to suboptimal solutions for larger data: writing manual loops, chunking data, or setting up a heavyweight database or Spark cluster. DuckDB provides a middle ground – you can query data using familiar SQL right inside your Python code, which can be more concise and expressive for certain transformations. For instance, if you have a complex grouping and filtering logic, writing it as a single SQL query in DuckDB can be clearer and faster than performing multiple pandas operations. It saves development time by leveraging SQL’s declarative power (you state what you want, not how to do it) and lets DuckDB’s optimizer figure out the most efficient way. This can lead to cleaner code and fewer mistakes, especially for those comfortable with SQL. Moreover, DuckDB’s ability to directly query files and DataFrames means you don’t have to manually load data into memory, which streamlines workflows. You can point DuckDB at a Parquet file or a DataFrame and start joining or aggregating immediately, saving the hassle of intermediate conversions. 

DuckDB also addresses specific problems that arise when using Python for data tasks. Memory management and larger-than-RAM data is one such issue. Pandas requires the entire dataset (or each group in a group-by) to fit in memory, which often becomes a limiting factor. DuckDB, on the other hand, can operate in a streaming fashion and spill to disk as needed, allowing you to work with datasets larger than your RAM. This means if you have a 50 GB Parquet file, you can run SQL queries on it in DuckDB without loading the entire file into memory – DuckDB will read only the columns and rows needed for the query (thanks to predicate pushdown and columnar storage). Furthermore, DuckDB’s lightweight footprint (no separate service, just a library) means you can easily integrate it into applications or notebooks without infrastructure overhead. If you’re working on a small team or project where setting up a data warehouse or Spark job is overkill, DuckDB lets you keep everything local and simple while still handling sizable data.
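
As a rough sketch of what this looks like in practice (the file name, column names, and memory cap below are illustrative), you can cap DuckDB’s memory and still aggregate a Parquet file far larger than that cap, because only the referenced columns are read and intermediate results can spill to disk:

import duckdb

con = duckdb.connect()
con.execute("SET memory_limit='2GB'")  # cap DuckDB's memory; large operations can spill to disk

# Only the referenced columns are read from the (hypothetical) large file,
# and the filter is pushed down into the Parquet scan.
top_users = con.execute("""
    SELECT user_id, SUM(amount) AS total_spent
    FROM 'events.parquet'
    WHERE event_type = 'purchase'
    GROUP BY user_id
    ORDER BY total_spent DESC
    LIMIT 10
""").fetchall()
print(top_users)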

Development efficiency gains are not just about speed, but also about simplifying workflows. With DuckDB, you can use a single tool for multiple steps: exploratory data analysis, data cleaning, and even lightweight ETL. For example, you might use pandas for some initial small-scale data munging, then use DuckDB to join that data with a large fact table stored in Parquet, and finally feed the results into a visualization library – all in one Python script. Without DuckDB, you might resort to exporting data to CSV and loading into SQLite or writing a Spark job, which is far more cumbersome. DuckDB’s integration with the Python data ecosystem (pandas, NumPy, etc.) means you can incrementally adopt it: you don’t have to rewrite everything in SQL, but you can strategically use DuckDB for the parts that benefit the most (like heavy aggregations). This “best of both worlds” approach can significantly accelerate development. It’s also a great learning opportunity – using DuckDB can help Python developers learn SQL and database concepts which are valuable skills. 

Finally, DuckDB is gaining traction in industry and open-source projects, which underscores why it’s important to learn. Companies have successfully used DuckDB to replace slower pipelines or avoid unnecessary cloud costs. For instance, FinQore (formerly SaaSWorks) cut an 8-hour data pipeline down to 8 minutes by switching to DuckDB for their aggregations, and the security team at Okta processed 7.5 trillion log records in six months using thousands of DuckDB instances, avoiding ~$2000/day in cloud warehouse expenses. These examples show that DuckDB can scale surprisingly well and can be deployed in creative ways (even serverlessly) to save time and money. Learning DuckDB enables Python developers to build such efficient pipelines and handle big data locally. In the open-source realm, tools like dbt (data build tool) now have DuckDB adapters, and data science notebooks are leveraging DuckDB for caching and accelerating queries. There is a growing community (DuckDB’s GitHub has over 32k stars) and active discussion on forums, meaning plenty of community support and resources. In summary, we use DuckDB in Python because it provides SQL power with Python ease, solving performance problems and simplifying data workflows – a potent combination for anyone working with data.

Getting started with DuckDB

Installation instructions

Installing the DuckDB library is straightforward and works on all major operating systems (Windows, macOS, Linux) since DuckDB is distributed as a Python wheel with no external dependencies. Below are various methods to install DuckDB in a local development environment:

Install via pip (Python package index)

The easiest way to get DuckDB is using pip. In a terminal or command prompt, run:

pip install duckdb

This will download and install the latest DuckDB Python package from PyPI. Make sure you have Python 3.9 or newer (DuckDB’s Python API requires Python 3.9+ as of version 1.3.2). The pip installation includes the DuckDB engine compiled in C++, so you don’t need any additional system libraries or build steps. After installation, you can verify by opening a Python shell and running import duckdb. If no errors occur, the DuckDB library is successfully installed and ready to use in your Python environment.
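
A quick sanity check after installing might look like this (the exact version string will differ on your machine):

import duckdb

print(duckdb.__version__)  # e.g. 1.3.2
duckdb.sql("SELECT 'hello' AS greeting, 42 AS answer").show()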

 Troubleshooting: If pip install duckdb fails (for example, due to an outdated pip version that can’t find a pre-built wheel for your system), try upgrading pip (pip install --upgrade pip) and run the install again. In rare cases, if a wheel is not available for your platform (e.g. an uncommon architecture), pip might attempt to build from source, which could fail if a C++ compiler is not present. To avoid this, ensure you’re using a supported platform or find a precompiled wheel. Most users on Windows, Mac (Intel or M1/M2), and mainstream Linux distros should get a wheel automatically.

Install via conda (Anaconda/Miniconda)

If you prefer using Anaconda or Miniconda, DuckDB is available on conda-forge. You can install it into a conda environment with:

conda install -c conda-forge python-duckdb

This will install the DuckDB Python package from the conda-forge channel. The package name is python-duckdb on conda. Ensure your conda environment’s Python version is 3.9+ to meet DuckDB’s requirements. Using conda can be advantageous if you’re managing dependencies in an Anaconda Navigator or have other data science packages that you install via conda. Once installed, test it by activating the environment and running python -c "import duckdb; print(duckdb.__version__)" which should print the installed DuckDB version. 

Troubleshooting: If conda fails to find the package, make sure you have added -c conda-forge (the package is not in the default channel). Also, ensure that your conda is updated to a fairly recent version, as older distributions might not have the DuckDB feedstock available.

Installing in visual studio code

Visual Studio Code (VS Code) uses whatever Python interpreter you have selected for your workspace. To install DuckDB in VS Code:

  1. Open the integrated terminal in VS Code (Ctrl+` or via Terminal > New Terminal).

  2. Ensure the correct virtual environment or interpreter is activated (you should see its name in the terminal prompt). For example, if you are using a virtualenv or conda env, activate it.

  3. Run the pip install command: pip install duckdb.

  4. Once installation completes, in VS Code’s Python file or interactive window, you can import the library (import duckdb) and start using it.

There isn’t a VS Code-specific package for DuckDB – it’s the same pip installation under the hood. The key is to install it in the environment that VS Code is using for Python. If you have multiple interpreters, you can also use VS Code’s Command Palette to select the interpreter, then use the built-in Python: Install Package command to search for “duckdb” and install it. After installation, VS Code should auto-complete DuckDB functions and you can run your scripts normally.

Installing in PyCharm

PyCharm IDE makes it convenient to install packages into your project’s interpreter. To install DuckDB in PyCharm:

  1. Open your project settings/preferences in PyCharm.

  2. Navigate to the “Python Interpreter” section.

  3. Click the “+” button to add a package and search for “duckdb”.

  4. Select the duckdb package from PyPI and click “Install Package”.

PyCharm will install DuckDB into the selected interpreter (virtual environment, Conda env, or system Python, depending on your project setup). After this, you can import and use duckdb in your code. Alternatively, you can open a PyCharm terminal and use pip install duckdb as well, which achieves the same result. PyCharm will reflect the installed package in the interpreter’s package list.

Troubleshooting: If PyCharm cannot find the package, ensure you have an internet connection and that PyPI is not blocked. Also verify that your project interpreter is using Python 3.9+ (if it’s using an older Python, upgrade the interpreter or create a new env, since DuckDB wheels require 3.9 or above).

Installation with Anaconda Navigator

If you prefer a GUI approach with Anaconda Navigator:

  1. Open Anaconda Navigator and go to the Environments tab.

  2. Select the environment you want to install DuckDB into (or create a new one with Python≥3.9).

  3. In the search box, type “duckdb”. If it’s not immediately found, select “Channels: conda-forge” or add conda-forge to your channels.

  4. You should see python-duckdb in the list. Mark it for installation.

  5. Click “Apply” to install.

This installs the DuckDB library into your chosen Anaconda environment. After installation, you can launch a terminal or notebook from Navigator in that environment and use DuckDB. Note that Navigator’s package search might show “duckdb” (older feedstock) or “python-duckdb” – choosing either should pull in the correct package from conda-forge.

Installing on Windows, macOS, and Linux

DuckDB supports Windows (x64), macOS (x86_64 and Apple Silicon M1/M2), and Linux (x64, arm64) platforms. The installation via pip or conda covers all these OSes by providing pre-compiled binaries.

On Windows, using pip install duckdb in Command Prompt or PowerShell will grab the appropriate .whl file (no need for Visual C++ build tools or anything – it’s pre-built). After installation, you can use DuckDB in any IDE or even from a plain Python prompt. If you encounter a permission error on Windows, try running the terminal as Administrator or use the --user flag for pip.

On macOS, ensure you have the latest pip so it installs the wheel compatible with your architecture. DuckDB provides universal wheels for macOS, so whether you are on an Intel Mac or Apple Silicon, pip install duckdb will install a working version. On M1/M2 Macs, if you are using Conda, the conda-forge package will also handle the architecture correctly. There is no need to use Rosetta or any translation – native Apple Silicon support is included in recent DuckDB versions.

On Linux, pip will install a manylinux wheel. Most common Linux distros (Ubuntu, Debian, Fedora, etc.) are supported. If you are on an older distribution with a very old glibc (pre-2.28), you might run into compatibility issues with the newest DuckDB binaries. In such cases (e.g., CentOS7), consider upgrading your environment or using a container with a newer base image. Otherwise, you may compile DuckDB from source as a last resort. For the vast majority of Linux users, pip install duckdb just works. After installation, no additional configuration is needed.

Docker installation

DuckDB can be used within Docker containers as part of your Python application image. There isn’t an official DuckDB Docker image solely for the library, but you can easily include DuckDB in your own Dockerfile. For example:

FROM python:3.11-slim
RUN pip install duckdb
COPY your_app.py /app/your_app.py
CMD ["python", "/app/your_app.py"]

This Dockerfile starts from a Python base image and installs DuckDB via pip. Since DuckDB is self-contained, no other system dependencies are required. Build the image with docker build -t your_app_image . and run it. DuckDB will function inside the container just as it does locally. If your application needs the DuckDB CLI or additional extensions, you can include those too (the CLI is distributed as a standalone binary that you can download and copy into the image). The key advantage of DuckDB in Docker is that it remains a lightweight addition—the library adds only a few tens of megabytes to your image. This makes it feasible to deploy applications that use DuckDB in cloud or serverless environments via containers.

Virtual environments

It’s a good practice to install DuckDB in a virtual environment (venv) or similar, to avoid clashing with other project dependencies. You can create a virtual environment using Python’s built-in venv module or use tools like pipenv or poetry. For example, with venv:

python3 -m venv venv
source venv/bin/activate  # (on Windows: venv\Scripts\activate)
pip install duckdb

This creates an isolated environment where DuckDB (and any other packages) are installed. In your code or IDE, make sure to use the Python interpreter from this venv. Using a virtual environment prevents system Python conflicts and makes it easier to manage specific DuckDB versions per project. DuckDB doesn’t have any hard-to-handle dependencies, so it usually plays well with other packages. 

If you are using pipenv, you can do pipenv install duckdb which will add DuckDB to your Pipfile and install it in the pipenv-managed environment. Similarly, with poetry, add duckdb = "*" to your pyproject.toml or run poetry add duckdb. These tools will handle isolating the installation.

Installation in cloud environments

Even in cloud or remote environments (like a cloud VM or a hosted CI/CD pipeline), installing DuckDB is as simple as the pip command. If you’re working on a remote Linux server, just use pip or conda as above. For example, on an AWS EC2 or Azure VM with Python, run pip install duckdb and you’re set. DuckDB’s lack of external dependencies means you don’t have to install PostgreSQL or any heavy service—just the Python package.

 In a generic “cloud notebook” environment or other managed platforms (not mentioning specific ones), the method remains: open a terminal or use the environment’s package manager interface to install the duckdb package. If you have restrictions (like no internet on the machine), you could download the wheel on an internet-enabled machine and transfer it, or include DuckDB in your application’s requirements so that it gets deployed together.

For AWS Lambda or other serverless functions, you can include DuckDB by deploying a Lambda Layer containing the duckdb package. The DuckDB wheel is only a few tens of megabytes, well within Lambda’s layer size limits. By using a custom layer or container image for your function with DuckDB pre-installed, you can run DuckDB queries inside Lambda functions for serverless data processing. This is an advanced use case but highlights DuckDB’s flexibility to run in constrained environments.
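
As a rough, illustrative sketch (the event structure and handler name are assumptions, not a recipe), a Lambda handler embedding DuckDB could look like this:

import json
import duckdb

def handler(event, context):
    # A fresh in-memory database per invocation; nothing to provision or clean up.
    con = duckdb.connect(database=':memory:')

    # Hypothetical payload: a list of {"category": ..., "value": ...} records.
    records = event.get("records", [])
    con.execute("CREATE TABLE r (category VARCHAR, value DOUBLE)")
    if records:
        con.executemany("INSERT INTO r VALUES (?, ?)",
                        [(rec["category"], rec["value"]) for rec in records])

    rows = con.execute(
        "SELECT category, SUM(value) AS total FROM r GROUP BY category"
    ).fetchall()
    con.close()
    return {"statusCode": 200, "body": json.dumps(dict(rows))}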

Troubleshooting common installation errors

  • “No module named duckdb”: This means Python cannot find the package. Ensure the installation succeeded (no errors in pip output) and that you’re using the same Python environment where DuckDB was installed. In notebooks or IDEs, sometimes the kernel is using a different interpreter – double-check the environment. Installing with pip show duckdb should confirm the version and location.

  • Pip installation errors on Linux: If you see errors about GCC or missing Python.h, it suggests pip tried (and failed) to compile from source. This happens if a suitable wheel wasn’t available for your platform. First, upgrade pip. If still no luck, verify your platform (DuckDB provides wheels for most common OS/arch combos). If on an uncommon environment, you might need to install a C++ compiler and use pip install duckdb --no-binary duckdb to force a source build, but this is rarely needed. Usually switching to Python 3.9+ or using conda can resolve this.

  • Conda environment not picking up DuckDB: If you installed via pip inside a conda env but your IDE still says module not found, check if the IDE is pointing to a different interpreter. Also, note that installing via conda (conda install) is preferable inside conda envs for visibility. Using pip inside conda can sometimes lead to packages not showing in Navigator UI, but they still install. Verify by running python -c "import duckdb; print(duckdb.__version__)".

  • Version issues: If you require a specific version of DuckDB (for example, matching a colleague’s environment or a feature only in latest dev), you can specify it: pip install duckdb==0.8.1. For pre-release or nightly builds, you might find them on DuckDB’s GitHub or a special index – check official docs if needed. The PyPI releases track stable versions.

  • Mac M1 issues: Older versions of DuckDB (<0.3.2) didn’t have M1 wheels, but current versions do. If you accidentally got an x86 build via Rosetta, uninstall and reinstall using pip while running a native terminal. This shouldn’t be an issue anymore with modern releases, as the installer detects the architecture correctly.

With DuckDB installed, we can proceed to using it in Python. The next section provides a hands-on example to ensure everything is working and to illustrate basic usage.

Your first DuckDB example

Let’s walk through a simple, complete example of using DuckDB in Python. This example will create an in-memory DuckDB database, define a table, insert some data, run a query, and retrieve results, covering the basic workflow. We’ll include line-by-line explanations and show the expected output.

import duckdb  # Import the DuckDB library

# Use a try/except block for safe error handling
try:
    # Connect to an in-memory DuckDB database (no file, purely in RAM)
    con = duckdb.connect(database=':memory:')

    # Create a table named 'people' with three columns: name, age, city
    con.execute("""
        CREATE TABLE people (
            name VARCHAR,
            age INTEGER,
            city VARCHAR
        )
    """)

    # Insert some sample records into the people table
    con.execute("INSERT INTO people VALUES (?, ?, ?)", ["Alice", 30, "London"])
    con.execute("INSERT INTO people VALUES (?, ?, ?)", ["Bob", 45, "New York"])
    con.execute("INSERT INTO people VALUES (?, ?, ?)", ["Charlie", 25, "London"])

    # Run a SQL query: group people by city, count them, and calculate average age
    result = con.execute("""
        SELECT city, COUNT(*) AS num_people, AVG(age)::DOUBLE AS avg_age
        FROM people
        GROUP BY city
    """).fetchall()

    # Print the query results
    print(result)
except Exception as e:
    print("An error occurred:", e)
finally:
    # Close the connection to free resources
    con.close()

Line-by-line explanation:

  • We first import duckdb to bring the DuckDB library into our script. If DuckDB is installed correctly, this import will succeed with no output (if not, you’ll get a ModuleNotFoundError).

  • We use a try/except/finally structure. This isn’t strictly required for normal use, but it’s good practice to catch exceptions (for example, a SQL error) and to ensure the database connection is closed in the finally block even if an error occurs. DuckDB doesn’t require an explicit close to work, but closing is recommended when you’re done.

  • con = duckdb.connect(database=':memory:'): This creates a new DuckDB connection. We pass ':memory:' to indicate we want an in-memory database (no file on disk). If you wanted a persistent database, you could provide a filename like 'mydb.duckdb' here. The connection object con is our interface to execute SQL commands.

  • con.execute(""" CREATE TABLE people (...) """): We create a table named people with columns for name, age, and city. The SQL is standard – DuckDB supports VARCHAR for string, INTEGER for int. The execute method runs the DDL statement. DuckDB executes immediately; since this is just a schema change, there’s no result to fetch.

  • Next, we insert some data. We call con.execute("INSERT INTO people VALUES (?, ?, ?)", [...]) three times with different values. The ? placeholders and the second argument (a Python list) demonstrate parameter binding – this is how you safely pass Python variables into SQL, preventing SQL injection and handling types. We insert three rows: Alice (30, London), Bob (45, New York), Charlie (25, London). Each INSERT is executed immediately. DuckDB, by default, commits each transaction (you can group inserts in a single transaction or use BEGIN/COMMIT if needed, but not required here).

  • Then we execute a query to get some aggregate insights: we select city, count, and average age from the people table, grouping by city. In the SQL, COUNT(*) counts rows per city, and AVG(age) computes the average age. We cast AVG(age) to DOUBLE so the average is returned as a plain floating-point number that is easy to interpret (the result type of an average otherwise depends on the input column’s type). We alias the outputs as num_people and avg_age respectively.

  • We chain the .fetchall() at the end of execute(...).fetchall() to retrieve all results as a list of tuples in Python. DuckDB also offers fetchone(), fetchdf() (to get a pandas DataFrame) and fetchnumpy() (to get NumPy arrays), but fetchall() is convenient for demonstration.

  • The result is stored in result and then we print it.

Expected output:

 When you run this script, it should print the query result. The order of cities in the output may not be guaranteed without an ORDER BY, but typically it will list London and New York (since those are the two distinct cities). You might see something like:

[('London', 2, 27.5), ('New York', 1, 45.0)]

This output is a Python list of tuples. Each tuple corresponds to a group (city). In this run, “London” has 2 people with an average age of 27.5, and “New York” has 1 person with average age 45.0. The data matches our input: London had Alice (30) and Charlie (25), average 27.5, and New York had Bob (45).

Additional notes: The code above demonstrates basic DuckDB usage – creating a table, inserting data, and querying it. A common beginner mistake is forgetting to fetch results. In DuckDB’s Python API, you must call fetchall() (or a similar fetch method) to retrieve the results of a SELECT query. If you just call con.execute("SELECT ...") without fetching, DuckDB executes the query but the results stay in the connection’s result buffer. Another common mistake is using Python string formatting to inject values into SQL (which can lead to errors or security issues); instead, always use parameter substitution with ? as shown.

We also explicitly closed the connection with con.close(). In short scripts, not closing isn’t catastrophic (the connection will close when the object is destroyed or the program exits), but it’s good practice, especially if you open many connections or are writing a long-running application.
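
If you want the data to survive between runs, the only change is passing a file path instead of ':memory:'. A minimal sketch (the file name is arbitrary), using the connection as a context manager so it is closed automatically:

import duckdb

# A persistent database stored in a single file; created if it doesn't exist yet.
with duckdb.connect('people.duckdb') as con:
    con.execute("CREATE TABLE IF NOT EXISTS people (name VARCHAR, age INTEGER, city VARCHAR)")
    con.execute("INSERT INTO people VALUES (?, ?, ?)", ["Dana", 38, "Berlin"])
    print(con.execute("SELECT COUNT(*) FROM people").fetchone())
# On exit the connection is closed and the data remains in people.duckdb for the next run.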

Now that you have DuckDB up and running with a simple example, you can begin exploring its features. The following sections will dive into core features of DuckDB, from integrating with DataFrames to using advanced SQL capabilities, and how to optimize and apply DuckDB in real-world scenarios.

Core features of DuckDB

DuckDB comes with a rich set of features that make it a powerful tool for data analysis. In this section, we’ll explore several core features in depth. Each feature section will explain what the feature does, why it’s important, show example usage (with syntax and parameters), discuss performance considerations, integration points, and common pitfalls.

Feature 1: querying DataFrames and integration with Python

What it is: One of DuckDB’s standout features is its seamless integration with Python data structures, especially pandas DataFrames (and NumPy arrays, Polars DataFrames, etc.). DuckDB can query pandas DataFrames directly using SQL – you don’t need to load the DataFrame into DuckDB manually. This is enabled by DuckDB’s replacement scan mechanism, which can treat any pandas DataFrame in the Python environment as a virtual table. Conversely, DuckDB can return query results as a pandas DataFrame or a Polars DataFrame with a single call. This two-way integration is important because it allows DuckDB to slot into existing Python workflows: you might have data in a DataFrame already (from web scraping or an API), and you can run a complex SQL query on it without hassle. Or you might use DuckDB to crunch a large dataset and then get the results back into pandas for further manipulation or visualization. Essentially, DuckDB acts as an engine within your Python process, speaking the language of both SQL and DataFrame.

Why it’s important: This integration eliminates the typical friction between using a database and using in-memory data. Without DuckDB, to run SQL on a DataFrame, one might have to export it to CSV and read into SQLite or use an intermediate step. With DuckDB, you can mix and match – for example, join a huge Parquet file (accessed via DuckDB) with a smaller pandas DataFrame (in-memory) all in one SQL query. This opens up possibilities like using SQL for parts of your analysis (which can be more efficient for joins/group-bys) and Python code for others, with minimal overhead moving data around. The integration is zero-copy in many cases: DuckDB will leverage Arrow memory format to avoid copying data when converting between DataFrame and DuckDB’s internal representation. Performance-wise, this means you can offload heavy computations to DuckDB and get results back quickly, without the slow serialization step that usually happens between databases and applications.

 Syntax and usage: DuckDB’s Python API provides multiple ways to interact with DataFrames:

  • You can run queries using duckdb.query() or the connection’s sql() method by directly referencing the DataFrame name in the SQL. For example, if you have a pandas DataFrame df, you can do:

    import duckdb
    con = duckdb.connect()
    import pandas as pd
    df = pd.DataFrame({'x': [1,2,3], 'y': [10,20,30]})
    result_df = con.execute("SELECT x, y*2 AS y2 FROM df WHERE x > 1").df()

    Here "FROM df" in the SQL will automatically find the DataFrame df in the Python scope. DuckDB scans it as if it were a table. The result is fetched as a pandas DataFrame using .df() for convenience.

  • The DuckDBPyConnection object also offers con.register('name', dataframe) to explicitly register a DataFrame under a given name. This isn’t strictly required thanks to automatic discovery, but can be useful for clarity or if you want a different table name. For example, con.register('mytable', df) then use FROM mytable in SQL.

  • Retrieving results: we saw .df() returns a pandas DataFrame. Similarly, .fetchdf() does the same. If you want a Polars DataFrame, DuckDB provides a .pl() method on the result relation, and for NumPy, .fetchnumpy() returns a dict of arrays.

  • There’s also a relation API: instead of writing SQL strings, you can use con.from_df(df) to get a DuckDB relation and then apply .project(), .filter(), etc., which is more programmatic. For example, rel = con.from_df(df).filter("x > 1").project("x, y*2 as y2"). This API is pythonic and chainable, and under the hood, it builds the same query.
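
A small sketch of that relation-style chaining (the DataFrame here is just toy data) might look like:

import duckdb
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40]})

con = duckdb.connect()
result = (con.from_df(df)                # wrap the DataFrame as a DuckDB relation
             .filter("x > 1")            # conditions are SQL expressions as strings
             .project("x, y * 2 AS y2")  # choose / compute columns
             .order("x DESC")
             .df())                      # materialize the result back into pandas
print(result)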

Let’s illustrate with examples:

Example 1: querying a pandas DataFrame with SQL

import pandas as pd
import duckdb

# Sample pandas DataFrame
data = {'name': ['Alice', 'Bob', 'Charlie', 'Dave'],
 'age': [25, 32, 18, 47],
 'city': ['London', 'New York', 'London', 'Paris']}
df = pd.DataFrame(data)

con = duckdb.connect()
# Query the DataFrame directly
query = "SELECT city, COUNT(*) AS count, AVG(age) AS avg_age FROM df GROUP BY city"
result_df = con.execute(query).df()
print(result_df)

Here, df is a pandas DataFrame in our Python code. We connect to DuckDB and run a SQL query grouping by city. DuckDB automatically recognizes df and reads from it. The result (which might look like a small table of cities with count and avg_age) is returned as a pandas DataFrame in result_df. We could just as easily do .fetchall() or .arrow() if we wanted a PyArrow Table. The exact Python type of the AVG(age) values depends on the column type – you may get floats or Decimal objects back; cast to DOUBLE in the SQL if you want to guarantee plain floats. Common error to avoid: if your DataFrame name is not a valid SQL identifier or overlaps with a keyword, you might need to alias it (but usually, DuckDB handles simple names fine). Also, note DuckDB will by default scan all rows of the DataFrame – if it’s huge and you only want a subset, add a WHERE clause so that DuckDB doesn’t pull everything.

Example 2: Converting DuckDB query results to a DataFrame

You might use DuckDB to combine data from multiple sources and then get a pandas DataFrame for subsequent steps:

# Assume we have a DuckDB table or a query joining multiple sources
con.execute("CREATE TABLE t (id INT, value TEXT)")
con.execute("INSERT INTO t VALUES (1, 'A'), (2, 'B')")
# Get a DataFrame directly from a query
df_out = con.execute("SELECT * FROM t").df()
print(type(df_out), df_out)

This will output <class 'pandas.core.frame.DataFrame'> id value ... – verifying that df_out is indeed a pandas DataFrame containing the query results. The ability to directly get a DataFrame (.df()) means you don’t have to manually construct the DataFrame from tuples, which saves time and reduces error. Under the hood, DuckDB leverages Arrow to efficiently transfer the data to pandas (especially if Pandas uses the Arrow backend for DataFrames, it’s very fast).

Example 3: using Polars with DuckDB (optional advanced)

If you use Polars (a Rust-based DataFrame library), integration is similarly smooth. DuckDB can output Polars DataFrames:

import polars as pl
# Use DuckDB to produce a Polars DataFrame
polars_df = con.execute("SELECT id, value FROM t").pl()  # Get result as a Polars DataFrame
print(isinstance(polars_df, pl.DataFrame))  # True

And Polars can hand data to DuckDB via Arrow interchange. This shows DuckDB’s integration is not limited to pandas but extends to the data ecosystem through Arrow memory format.
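
For example, a Polars DataFrame in scope can be queried by name just like a pandas one (assuming polars and pyarrow are installed):

import duckdb
import polars as pl

pl_df = pl.DataFrame({"id": [1, 2, 3], "value": ["A", "B", "C"]})

# DuckDB discovers pl_df by name via a replacement scan, reading it through Arrow.
out = duckdb.sql("SELECT id, value FROM pl_df WHERE id > 1").pl()
print(out)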

Performance considerations: Querying a DataFrame through DuckDB adds minimal overhead. DuckDB will internally use zero-copy or efficient copying via Arrow for numeric types and plain data. If your DataFrame has object dtypes (e.g., Python strings or mixed types), DuckDB will have to convert those to its own types, which can add some overhead – consider converting columns to pandas native types or Arrow types if performance is critical. In general, for large data, it’s often faster to let DuckDB read from a file (like Parquet) than to first load it into pandas and then have DuckDB query it. So, use DataFrame querying when the data is already in DataFrame form from other parts of your code. If you have a 10 million row CSV, it may be better to let DuckDB read it directly (to avoid double handling). Another point: DuckDB’s query planner has no indexes to use on a pandas DataFrame (a pandas index is not a database-style index), so every query scans the DataFrame; a WHERE clause reduces the rows that flow into the rest of the query, but the scan itself still reads all rows. That’s usually fine given the data is in memory, but it is something to be aware of.

Integration examples: This feature shines in integration scenarios. For instance, imagine you’ve done some preprocessing in pandas for a small dataset, and you have a large dataset in a CSV. You can load the CSV through DuckDB and join with the pandas DataFrame in one go:

con.execute("SELECT * FROM 'large.csv' AS big JOIN df AS small ON big.key = small.key").df()

Here 'large.csv' is read by DuckDB (no pandas needed for it) and df is your in-memory data; the join happens in DuckDB’s engine. The result comes out as a DataFrame. This can be much faster than doing the join in pandas (especially if large.csv has millions of rows). Another integration example is using DuckDB within a scikit-learn pipeline: you could use DuckDB to aggregate or filter data, then feed the result DataFrame to scikit-learn for modeling. This way you handle big data reduction with DuckDB and the machine learning with sklearn – all in memory without intermediate files.
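
As an illustrative sketch of that hand-off (toy data and a trivial model, purely to show the shape of the workflow):

import duckdb
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy raw data; in practice this could be a large file queried directly by DuckDB.
raw = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "visitors": [100, 120, 80, 90, 95],
    "sales": [1000, 1150, 700, 820, 880],
})

# Heavy reduction step in DuckDB...
features = duckdb.sql("""
    SELECT store, AVG(visitors) AS avg_visitors, SUM(sales) AS total_sales
    FROM raw
    GROUP BY store
""").df()

# ...then the modeling step in scikit-learn.
model = LinearRegression().fit(features[["avg_visitors"]], features["total_sales"])
print(model.coef_)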

Common errors & solutions: A common confusion is when your DataFrame variable is not accessible by DuckDB. DuckDB discovers DataFrames by name by inspecting the calling Python scope. If you’re using it inside a function and the DataFrame is a local variable, DuckDB might not see it. The most reliable solution is to register the DataFrame explicitly with con.register('mytable', df) and refer to it by that name in the SQL. Another error scenario: if your DataFrame column names have spaces or special characters, the SQL query needs to quote them. E.g., a column named "user id" would require SELECT "user id" FROM df. Best practice is to stick to alphanumeric column names or rename them before querying. If you get an error like “No module named pyarrow” when using .arrow() or certain fetch methods, it means you need to install PyArrow. DuckDB’s core doesn’t require Arrow, but some result fetches use it. Simply pip install pyarrow to resolve that. Lastly, remember that while DataFrame integration is convenient, extremely large DataFrames (hundreds of millions of rows) might be better handled by writing to a parquet and reading via DuckDB or using DuckDB’s internal tables – because keeping that much data in a pandas DataFrame in memory is itself a challenge (DuckDB can handle more data than pandas can, by streaming).

Feature 2: querying data files directly (CSV/Parquet/JSON and more)

What it is: DuckDB has the ability to directly read and write a variety of data file formats as if they were database tables. You can run SQL queries on CSV files, Parquet files, JSON files, etc., without first loading them into a table. This feature is often called external data query and is extremely powerful for data engineering tasks. For example, you can do SELECT * FROM 'data.csv' WHERE ... in DuckDB, and it will treat the CSV file’s contents as table rows. Similarly, SELECT * FROM 'file.parquet' will read from a Parquet file. DuckDB supports auto-detecting CSV formats (via the read_csv_auto function or just by using the file directly) and can even query compressed files (.gz). For Parquet, DuckDB leverages the format’s structure to only read needed columns and row groups (predicate pushdown). In addition, DuckDB can query remote files over HTTP or cloud storage if the appropriate extension (like httpfs for S3/HTTP access) is enabled. Essentially, DuckDB can act as a query engine over raw data files, not just its internal tables.

Why it's important: This feature eliminates an entire class of tedious tasks – you don’t have to write separate code to parse files or load data into memory before querying. If you have a 10 GB Parquet file, you can immediately perform SQL operations on it without using pandas (which might choke on 10 GB) or setting up a database. It brings the convenience of SQL (filtering, joining, aggregating) to files on disk. This can lead to huge time and space savings. For instance, instead of reading a huge CSV into pandas (which could use a lot of memory) and then filtering it, you can let DuckDB stream through the CSV, filtering as it goes, using far less memory in the process. Performance-wise, DuckDB is highly optimized for these operations – reading a Parquet with DuckDB is often as fast as any specialized tool, and often faster than using pyarrow or pandas directly because DuckDB can use multiple cores and apply SQL pushdown. For CSV, DuckDB’s read_csv_auto is quite fast and frees you from specifying schema by automatically inferring column types. The direct file querying also encourages a workflow of keeping data in efficient formats (like Parquet) and using DuckDB to query them, rather than converting everything to intermediate DataFrames.

 Syntax and usage:

  • Querying CSV: There are two primary ways. One is to use the table inference by quoting a file path in the FROM clause:

    SELECT * FROM 'path/to/data.csv';

    DuckDB will automatically attempt to parse the CSV. It assumes the first row as header by default and infers column types (this is equivalent to calling the function read_csv_auto('path/to/data.csv')). If you need to specify options (like delimiter, whether there’s a header, date formats, etc.), you can use the read_csv_auto or read_csv table-producing function with parameters. For example:

    SELECT * FROM read_csv_auto('data_no_header.csv', header=False, columns={'col1': 'INTEGER', 'col2': 'VARCHAR'});

    That explicitly sets column types. In Python, you can call these via con.execute("SELECT ...") as usual.

  • Querying Parquet: Even simpler, you can do:

    SELECT * FROM 'data.parquet' WHERE columnA > 100;

    DuckDB treats the string ending in .parquet as an external table. It will read only the necessary parts of the Parquet file (for example, only columnA and whatever rows match the filter, thanks to Parquet’s metadata). Parquet files have schema, so DuckDB knows column names and types without needing hints.

  • Querying multiple files: DuckDB supports globs and union of files. For example, SELECT * FROM 'data/2023-*.parquet' will read all Parquet files matching that pattern as if one table (the schema must match). Similarly for CSV: SELECT COUNT(*) FROM 'logs/*.csv' to aggregate across multiple files. This is incredibly useful for partitioned data or time-series logs spread across files.

  • Writing to files: You can export query results to files directly using COPY or DuckDB’s functions. For example:

    COPY (SELECT city, avg(age) FROM people GROUP BY city) TO 'result.csv' (HEADER, DELIMITER ',');

    This will write a CSV. Or use COPY ... TO 'result.parquet' (FORMAT PARQUET); to write a Parquet file instead. The same COPY (SELECT ...) TO ... pattern works for exporting the results of any query to either format.

  • JSON: DuckDB can read JSON files via the JSON extension (using functions such as read_json or read_json_auto once the extension is loaded). It can also write JSON. Since JSON often has nested structures, you may need DuckDB’s JSON functions to unpack them.

  • Remote files: If you want to query a remote file, you need to install and load DuckDB’s httpfs extension, then you can do:

    INSTALL httpfs;
    LOAD httpfs;
    SELECT * FROM 'https://example.com/data.parquet';

    Or S3: 's3://my-bucket/path/file.parquet' after setting up credentials via SET s3_access_key... parameters or environment variables. This makes DuckDB a bit of a lightweight query engine for data lakes.

Examples: 

Example 1: query a CSV file directly

Assume you have a CSV file employees.csv:

name,department,salary
Alice,Engineering,80000
Bob,HR,60000
Charlie,Engineering,75000

You can query it:

con = duckdb.connect()
# Query the CSV file to find average salary by department
query = """
SELECT department, COUNT(*) AS num_employees, AVG(salary) AS avg_salary
FROM 'employees.csv'
GROUP BY department
"""
df_result = con.execute(query).df()
print(df_result)

DuckDB will open employees.csv, infer that salary is an integer (or big numeric), and perform the aggregation. The result DataFrame might look like:

    department  num_employees  avg_salary
0  Engineering              2     77500.0
1           HR              1     60000.0

It handled reading and parsing the CSV behind the scenes. Note: if the CSV had no header, DuckDB would name columns like column0, column1,... by default. We could explicitly specify columns using SELECT * FROM read_csv_auto('file', columns={...}) if needed. 

Example 2: query a Parquet file

Parquet files are even easier since they’re self-describing:

# Suppose "data.parquet" contains a big table of sales with columns: product, price, quantity
result = con.execute("SELECT product, SUM(price*quantity) as total_revenue FROM 'data.parquet' GROUP BY product").fetchall()
print(result[:5])

This will read only the necessary columns from data.parquet (product, price, quantity) and compute total revenue per product. DuckDB’s Parquet reader is highly optimized in C++ and can use multiple threads, so it will likely outperform many other methods of analyzing a Parquet in Python. The first 5 results printed might look like:

[('ProductA', 1234567.89), ('ProductB', 234567.00), ...]

with each tuple being a product and the sum.

A great thing about this is if data.parquet is huge (say 10GB), you didn’t have to load it fully – DuckDB streamed through it, perhaps reading one Parquet row group at a time. If you had a WHERE clause on product, DuckDB would use Parquet statistics to skip row groups that don’t contain that product.
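
If you want to see this for yourself, prefix the query with EXPLAIN; the plan shows the Parquet scan with the filter and column projection pushed into it (the file and column names here are illustrative):

import duckdb

con = duckdb.connect()
plan = con.execute("""
    EXPLAIN
    SELECT product, SUM(price * quantity) AS revenue
    FROM 'data.parquet'
    WHERE product = 'ProductA'
    GROUP BY product
""").fetchall()
print(plan[0][1])  # the rendered physical plan as text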

 Example 3: joining a CSV and Parquet

DuckDB can join across file types:

query = """
SELECT o.order_id, o.date, c.customer_name, o.total
FROM 'orders.parquet' AS o
JOIN 'customers.csv' AS c
ON o.customer_id = c.id
WHERE o.date >= '2025-01-01'
"""
join_df = con.execute(query).df()

In this query, 'orders.parquet' is a Parquet dataset (perhaps partitioned by date) and 'customers.csv' is a small CSV of customer info. DuckDB will read customers.csv (likely entirely, since it’s small) and then stream through relevant parts of orders.parquet, performing the join. The result is collected as a DataFrame join_df. Doing this without DuckDB would require manual reading and merging, which is more code and likely slower.

Performance considerations: When querying files, file format matters. Parquet is a columnar, compressed format, and DuckDB can leverage it heavily – reading only the needed columns and skipping chunks using metadata. CSV is text, so it is inherently slower to parse (DuckDB uses a parallel, vectorized CSV reader, but parsing text is still more work than reading a binary format). If performance is critical, prefer binary formats like Parquet or Arrow. You can even use DuckDB to convert CSV to Parquet once and then query the Parquet file repeatedly, much faster (see the sketch below). In one test, reading a large dataset from Parquet in DuckDB can be an order of magnitude faster than reading the same data from CSV, and uses less CPU. Regarding caching: DuckDB caches file metadata and benefits from the OS filesystem cache, and version 1.3.0 introduced an external file cache, so repeated queries on the same remote file can be faster after the first run because DuckDB keeps portions of it in memory.
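
A minimal sketch of that one-time conversion (file names are illustrative):

import duckdb

con = duckdb.connect()

# Stream the CSV through DuckDB once and write a compressed, columnar Parquet file.
con.execute("""
    COPY (SELECT * FROM read_csv_auto('big.csv'))
    TO 'big.parquet' (FORMAT PARQUET)
""")

# Later queries hit the Parquet file instead of re-parsing the CSV.
print(con.execute("SELECT COUNT(*) FROM 'big.parquet'").fetchone())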

Another consideration is parallelism: DuckDB will by default use multiple threads to read files (especially Parquet, where it can read multiple row groups in parallel). For CSV, it can parallelize by chunking the file. If you want to tune this, you can set PRAGMA threads=N to control the number of threads. If you run DuckDB in a Jupyter notebook or other interactive environment, you might see it max out your CPU cores when scanning a big file – that’s expected and is what gives it its speed.
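
For example, to pin DuckDB to a fixed number of threads (and optionally cap its memory) before a big scan, something like this works:

import duckdb

con = duckdb.connect()
con.execute("PRAGMA threads=4")          # use at most 4 threads for this connection
con.execute("SET memory_limit='4GB'")    # optional: also cap memory usage
print(con.execute("SELECT current_setting('threads')").fetchone())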

Integration examples: Querying files directly is extremely useful in data engineering workflows. For instance, if you have data on S3, you can use DuckDB in a local or cloud environment to query it rather than bringing all data down. A concrete example: a data analyst might have a bunch of JSON logs in S3 – they could load DuckDB’s httpfs and JSON extensions and do:

SELECT user_id, COUNT(*)
FROM read_json_auto('s3://mybucket/logs/2025-08/*.json')
GROUP BY user_id;

to get usage counts per user across all those JSON logs. That’s one SQL query replacing what could be many lines of Python looping and parsing. Or, in a local analytics scenario, you might have daily CSV exports – with DuckDB, you can treat the folder of CSVs as one table and run a query spanning all days easily.

Common errors & solutions: When querying CSV, a common issue is type inference. DuckDB’s read_csv_auto might guess a column as INT but actually there’s some non-numeric entries or very large numbers. If it fails to parse or you get an error like “Type mismatch” while reading, you might need to manually specify a column’s type or set sample_size to a larger number (DuckDB by default samples some rows to guess types). You can do read_csv_auto('file.csv', sample_size=-1) to scan entire file for inference, or explicitly give a columns schema as shown earlier. Another issue: file path correctness. If your path or glob is wrong, you’ll get a runtime error like “File not found”. Always ensure the path string is correct. Remember to use forward slashes or proper escaping on Windows paths (or raw strings in Python). 

If querying remote files, you must INSTALL ...; LOAD ...; the correct extension. For S3, it’s also necessary to configure credentials (via SET s3_access_key_id='...', s3_secret_access_key='...' or use AWS env vars). If you forget to load httpfs and try to query an http:// URL, you’ll get an error about unrecognized file or needing extension – the fix is to load the extension. For JSON, you need the JSON extension (as of DuckDB 0.7+, JSON is an extension). If you see “No function matches read_json_auto” that means you haven’t loaded the JSON extension.

Lastly, when writing out, ensure the target directory is writable. The COPY TO or 'file.parquet' output will overwrite existing files by default. If you attempt to write to a protected location, you’ll get a permission error. Choose an output path accordingly.

Feature 3: advanced SQL features (Joins, Window functions, and more)

What it is: DuckDB supports a rich SQL dialect that includes advanced features you might expect from a full-fledged analytic database. This includes a variety of JOIN types (inner, left, right, full outer, as well as cross joins), window functions (also known as analytic functions), common table expressions (CTEs, using WITH), subqueries, set operations (UNION, EXCEPT, INTERSECT), and more. In terms of SQL functionality, DuckDB is on par with many server databases and often ahead of simpler embedded databases like SQLite when it comes to analytical functions. For example, DuckDB supports window functions like ROW_NUMBER(), RANK(), moving averages, lead/lag, etc., which allow complex analytics (cumulative sums, comparisons across rows) to be done in SQL. It also has support for aggregations, CASE expressions, string functions, date/time functions, and even some more niche features like regular expression matching and vectorized operations on lists. Essentially, if you can express a computation in SQL, DuckDB likely can execute it – making it a one-stop solution for writing complex data transformations.

Why it's important: These features allow you to perform complex data analysis tasks directly in DuckDB without needing to pull data into Python for post-processing. For example, suppose you want to calculate a running total or moving average – you can use a window function in DuckDB’s SQL rather than writing a Python loop or using pandas. This not only simplifies the code (SQL is succinct for such operations) but also leverages DuckDB’s optimized engine to do it faster. Joins are fundamental, and DuckDB’s join performance is very good for large datasets (it uses hash join or sort-merge join algorithms under the hood). Having full outer join or right join support is useful when combining datasets without losing any information (something not all small engines support easily). Another scenario: you might have a complex multi-step analysis – using CTEs you can structure the SQL into parts, and DuckDB will optimize it as a whole. This can lead to clearer and faster analysis compared to doing each step in Python and storing intermediate DataFrames. DuckDB’s advanced SQL essentially means you can push more of the heavy lifting into the database layer, which is usually beneficial for performance and scalability.

 Syntax and usage:

  • Joins: The syntax follows standard SQL. For example:

    SELECT * FROM table1 AS t1
    JOIN table2 AS t2
     ON t1.id = t2.id;

    This is an inner join. A left join (LEFT JOIN table2 ...) keeps all t1 rows; a full join uses FULL OUTER JOIN. DuckDB also supports the USING clause (e.g., JOIN table2 USING (id) if both tables have a column named id), and you can of course join more than two tables.
    DuckDB also supports ASOF joins for time-series data (matching each row to the nearest earlier row in the other table – a short sketch appears after this list), as well as semi- and anti-join patterns via EXISTS or NOT EXISTS subqueries.

  • Window functions: These are used with an OVER() clause. For example:

    SELECT
    sales.*,
     SUM(amount) OVER (PARTITION BY region ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
    FROM sales;

    This computes a running total of sales amount per region ordered by date. DuckDB supports common window frame clauses (ROWS or RANGE) and functions like ROW_NUMBER() OVER (ORDER BY col), RANK(), DENSE_RANK(), LAG()/LEAD(value, offset), etc.
    Simpler example: AVG(value) OVER (PARTITION BY category) gives each row the average for its category. Window functions are very handy for moving averages, rankings, running percentiles (e.g., PERCENT_RANK()), and similar calculations.

  • Common table expressions (CTEs):

    WITH recent_sales AS (
     SELECT * FROM sales WHERE date > '2025-01-01'
    )
    SELECT recent_sales.region, COUNT(*)
    FROM recent_sales
    JOIN regions ON recent_sales.region_id = regions.id
    GROUP BY recent_sales.region;

    The WITH clause defines a subquery (recent_sales) that can be used in the main query. This can improve readability, and DuckDB will optimize it like a subquery (it may or may not materialize it depending on complexity).

  • Subqueries: DuckDB allows subqueries in SELECT, FROM, WHERE, etc. For instance, a scalar subquery:

    SELECT name, salary,
    (SELECT AVG(salary) FROM employees) AS avg_salary
    FROM employees;

    Or an EXISTS subquery in a WHERE clause for filtering. It handles correlated subqueries as well.

  • Set operations: Use UNION (or UNION ALL to include duplicates), INTERSECT, EXCEPT to combine query results. E.g., SELECT id FROM A UNION SELECT id FROM B.

  • Other features: DuckDB supports aggregate functions (SUM, AVG, MIN, MAX, etc.), distinct counts (COUNT(DISTINCT col)), HAVING clauses after GROUP BY, etc. It also has more advanced SQL features like JSON functions (if JSON extension loaded, e.g. json_extract), struct and list types for nested data, map types, and user-defined functions (UDFs) in both SQL (via CREATE MACRO or CREATE FUNCTION for simple SQL-defined functions) and Python (you can register a Python function as UDF).
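As referenced in the joins bullet above, here is a minimal ASOF-join sketch. The trades and prices tables are hypothetical; each trade is matched to the most recent price at or before the trade's timestamp:

SELECT t.symbol, t.ts, t.qty, p.price
FROM trades AS t
ASOF JOIN prices AS p
  ON t.symbol = p.symbol AND t.ts >= p.ts;

The inequality on the timestamp column is what makes this an "as of" match rather than an exact equality join.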

Examples: 

Example 1: using a window function for running total

Suppose we have a DuckDB table orders(order_id, order_date, amount, customer_id). We want to calculate the cumulative sales over time (running total) and rank orders by amount within each customer.

query = """
SELECT
order_id,
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total_sales_to_date,
RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank_for_customer
FROM orders
"""
df = con.execute(query).df()
print(df.head())

Here:

  • SUM(amount) OVER (ORDER BY order_date ...) computes a running sum of the amount column across the whole dataset ordered by date (no partition, so it's global). This adds a column total_sales_to_date that for each order gives the sum of all prior order amounts up to that date.

  • RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) ranks each order by amount within its customer group (1 = largest order for that customer, etc.). If two orders have the same amount, they get the same rank and rank numbers can skip (that’s how RANK works; there’s also DENSE_RANK which would not skip).
    DuckDB handles these window calculations efficiently. The output DataFrame will have those additional columns. This would be quite involved to calculate manually in pandas (requiring sort and groupby with cumulative sum, etc.), but in DuckDB it's one query.

Example 2: complex join and aggregation

 Imagine we have sales(region, product, qty, price) and region_info(region, manager). We want to get total revenue per region and list the manager, but only for regions that have total revenue above a threshold.

query = """
WITH revenue_by_region AS (
SELECT region, SUM(qty * price) AS total_revenue
FROM sales
GROUP BY region
)
SELECT r.region, r.total_revenue, info.manager
FROM revenue_by_region AS r
JOIN region_info AS info ON r.region = info.region
WHERE r.total_revenue > 1000000
ORDER BY r.total_revenue DESC;
"""
result = con.execute(query).fetchall()

This uses a CTE revenue_by_region to compute total revenue by region. Then it joins with region_info to get the manager name for each region. We apply a WHERE filter to only keep high-revenue regions, and sort by revenue. The CTE here is mainly for clarity; we could have done it in a subquery or even directly in a join with an aggregation subquery. DuckDB’s optimizer will likely fuse the CTE with the main query (it doesn’t actually materialize it unless necessary). The result (a list of tuples or DataFrame if we used .df()) might look like [('West', 2345678.00, 'Alice'), ('East', 1987654.00, 'Bob'), ...].

Example 3: subquery and conditional logic

DuckDB supports CASE expressions and subqueries, so:

query = """
SELECT customer_id,
CASE
WHEN total > (SELECT AVG(total) FROM orders) THEN 'Above Average'
ELSE 'Below Average'
END AS relative_order_size
FROM orders;
"""
con.execute(query).df()

This query labels each order as 'Above Average' or 'Below Average' compared to the overall average order total. The subquery (SELECT AVG(total) FROM orders) yields a single value (the average total). DuckDB will compute that once and then compare each order’s total to it. The CASE expression is straightforward SQL logic. The result is a DataFrame with customer_id and a classification.

Performance considerations: DuckDB’s advanced SQL features are optimized in the engine. Joins use efficient algorithms, and DuckDB relies on per-segment min/max statistics (maintained automatically for Parquet files and persistent tables) and its cost-based optimizer to pick a good join order. If you join a very large table with a very small one, the optimizer should do the right thing (typically building the hash table on the smaller side); if both sides are huge, it can use a multi-threaded hash join or a sort-merge join as needed. You generally don’t have to hint anything – just make sure your join keys have matching types (joining INT to INT, not INT to VARCHAR) to avoid implicit-conversion slowdowns.

Window functions in DuckDB are implemented in a vectorized way, but they do require sorting or partitioning data as needed. So they are memory-intensive if you use a large window (like partition by a high cardinality field). For very large partitions, the window function will have to buffer that partition’s data. This is typical of any DB. DuckDB’s performance for window functions is very good for moderate data sizes. If you find a window query slow, consider if you can pre-filter data or if the window frame can be limited.

CTEs and subqueries: DuckDB’s optimizer tries to inline CTEs (unless you use the MATERIALIZED hint) and flatten subqueries where possible, so you typically get performance similar to writing one big query. Very complex queries with many subqueries can still be harder to optimize. If performance is an issue, rewriting a subquery as a join sometimes helps, or – for persistent tables – creating an index on a column used in selective lookups. DuckDB supports explicit indexes (ART-based, created with CREATE INDEX), but they mostly benefit selective point lookups; analytical queries usually just scan or join.
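As a small sketch, an explicit index on the orders table used in the examples above – mainly useful if you run many selective lookups like the one shown:

CREATE INDEX idx_orders_customer ON orders (customer_id);

-- A highly selective point lookup that can benefit from the index
SELECT * FROM orders WHERE customer_id = 42;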

Memory-wise, operations like joins and aggregations will use memory proportional to data sizes (e.g., a hash join will allocate a hash table roughly the size of one table, a sort for window functions will allocate memory for sorting keys, etc.). DuckDB can spill to disk if needed for huge operations, but try to ensure you have enough memory or partition tasks if not.

Integration examples: Advanced SQL features can replace a lot of custom Python logic. For instance, if you are using pandas to do multi-step groupbys and then adding a rank column manually, you can instead do it in one DuckDB SQL query. This not only can be faster, but ensures you leverage a single engine for consistency. Another integration example: If you are using a tool like dbt (data build tool) for transformations, the same SQL can run on DuckDB (with the DuckDB adapter). So, you could prototype a data transformation pipeline in DuckDB locally using advanced SQL, then run it in production via dbt + DuckDB or even port it to another warehouse – the SQL skills and logic transfer. In a Python script, using advanced SQL can simplify your code. For example, rather than doing multiple merges and sorts in pandas, a single SQL can join and sort data, which you then consume or visualize.

Common errors & solutions: With complex SQL, the most common errors are syntax mistakes or calling a function that doesn’t exist. DuckDB’s error messages will tell you when a function name isn’t recognized (typically a Binder or Catalog error naming the unknown function). Always check DuckDB’s documentation to confirm that a particular SQL function exists – the list of supported window and aggregate functions is extensive, but not infinite.

Another possible “error” scenario is getting results that don’t match expectations due to not partitioning or ordering correctly in a window function. For instance, forgetting PARTITION BY when you meant to compute something per group will yield a global result. So be careful to specify the correct PARTITION in the OVER clause.

For joins, a classic mistake is not disambiguating columns. If both tables have a column named id and you reference id without qualifying it, DuckDB will complain about an ambiguous column reference; with SELECT * you simply get two identically named id columns in the result, which can confuse downstream code. It’s usually best practice to list columns and qualify them (e.g., SELECT t1.id AS id1, t2.id AS id2, ...), or to use USING, which keeps only one copy of the join column.

One more thing: if you perform a FULL OUTER JOIN, and then filter in a WHERE clause on a column that might be null from one side (because in full join unmatched get nulls), you might accidentally drop those. The correct approach is to put such conditions in the JOIN clause or use OR to include null logic. For example:

SELECT * FROM A FULL JOIN B ON A.key = B.key
WHERE A.country = 'US';

This drops the rows where A is entirely NULL (i.e., where only B had a match), because the filter A.country = 'US' evaluates to NULL for them and they are filtered out. If the intention was to keep those rows as well (for example, treating a missing A as "not US"), move the condition into the JOIN clause or add explicit NULL handling to the WHERE clause.
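Two ways to express that, as a sketch (which one is right depends on the semantics you actually want):

-- Move the filter into the join condition, so unmatched rows from B survive
SELECT * FROM A FULL JOIN B ON A.key = B.key AND A.country = 'US';

-- Or keep the WHERE filter but explicitly allow the rows where A is entirely NULL
SELECT * FROM A FULL JOIN B ON A.key = B.key
WHERE A.country = 'US' OR A.key IS NULL;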

Feature 4: extensibility and extensions (plugins for DuckDB)

What it is: DuckDB is designed to be extensible – it supports a variety of extensions or "plug-ins" that add features beyond the core database engine. These extensions can provide new data types, functions, or integration capabilities. Some notable DuckDB extensions include:

  • HTTPFS: allows DuckDB to read/write files over HTTP and cloud storage protocols (S3, Azure, GCS). We mentioned this earlier; it enables queries on remote data.

  • JSON: adds JSON data type and functions, so you can parse and query JSON documents within SQL.

  • Parquet/arrow: built-in by default to read Parquet and Arrow formats.

  • FTS (full text search): provides functions for text search (e.g., tokenize text, search for terms, etc.).

  • Spatial (Spatial extension): adds support for geospatial data types (like POINT, POLYGON) and functions (similar to PostGIS but limited).

  • Vector similarity (vss): an extension for vector similarity search (useful in AI/ML for embedding nearest neighbor search).

  • MYSQL, POSTGRES scanners: experimental extensions to directly query external MySQL/Postgres databases from DuckDB.

  • MotherDuck: a separate extension connects DuckDB to the MotherDuck cloud service; it is not part of core DuckDB, so we won’t cover it here.

The mechanism to use an extension is to first INSTALL it (which downloads it if not present) and then LOAD it (which activates it for use). For example:

INSTALL 'json';
LOAD 'json';

After this, you can use JSON functions and types in your queries.

Why it's important: Extensions greatly expand what you can do with DuckDB without waiting for everything to be in the core. If you need to analyze nested JSON data, you load the JSON extension rather than writing complicated user-defined functions. If you need to work with spatial data (e.g., points and polygons), the spatial extension saves you from exporting data to another tool. This makes DuckDB a more complete solution for various specialized tasks. It also means DuckDB can integrate into more scenarios: the HTTPFS extension allows connecting to data lake storage; the Postgres scanner means you could join a local DuckDB table with a live Postgres table in one query (though that’s still a bit experimental). For a Python user, using these extensions is relatively easy and it means you can leverage a lot of existing functionality.

 Syntax and usage:

  • Installing and Loading: You typically run con.execute("INSTALL 'extname'") followed by con.execute("LOAD 'extname'"). The first time, DuckDB downloads a precompiled binary for that extension (if one is available for your OS) and caches it, so subsequent loads don’t need internet access. You only need to INSTALL once per machine/environment, but LOAD has to be run in each new connection where you want the features available (a short sketch of this flow, including how to list installed extensions, appears after this list).

  • After loading, new SQL functions/types are available. For example, after LOAD 'json', you can do:

    SELECT json_extract('{"a": 10, "b": 20}', '$.a');

    to extract a value from a JSON string, or you can create a table with a JSON column.

  • Spatial example: After INSTALL spatial; LOAD spatial;, you can do:

    SELECT ST_Point(30, 10);

    which returns a POINT object. You can use spatial functions like ST_Contains(poly, point) etc. The spatial extension basically adds a suite of GIS functions.

  • Defining custom functions (UDFs): Extensibility isn’t only via pre-built extensions. DuckDB also allows you to create your own SQL functions. For simple ones, CREATE MACRO can define a SQL alias. For example:

    CREATE MACRO add(a, b) AS a + b;
    SELECT add(3,4);  -- returns 7

    More powerfully, from Python, you can register a Python function:

    def square(x: int) -> int:
     return x * x

    from duckdb.typing import INTEGER
    con.create_function('square', square, [INTEGER], INTEGER)

    Now in SQL, SELECT square(5); yields 25. This is a form of extensibility (embedding Python logic as UDF). Keep in mind, Python UDFs will be slower than built-in functions because they execute Python code for each tuple (no vectorization), but they are useful for custom computations not in DuckDB.

  • Integration with other languages: Not directly a “feature” used in Python, but DuckDB has client APIs in R, Java, etc., and you can extend it via those or C++ as well. From Python perspective, main extensibility is via these SQL extensions or UDFs.
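As referenced above, a small Python sketch of the install/load flow; duckdb_extensions() is a built-in table function that reports what is installed and loaded:

import duckdb

con = duckdb.connect()
con.execute("INSTALL json")
con.execute("LOAD json")

# List extensions with their installed/loaded status
print(con.execute("""
    SELECT extension_name, installed, loaded
    FROM duckdb_extensions()
    ORDER BY extension_name
""").df())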

Examples:

Example 1: using the JSON extension

Suppose you have a table events(json_data VARCHAR) where each row is a JSON string of an event. You want to query these JSON fields. By default, core DuckDB might treat them as just text. With the JSON extension:

con.execute("INSTALL 'json'")
con.execute("LOAD 'json'")
query = """
SELECT json_extract(json_data, '$.user.id') AS user_id,
json_extract(json_data, '$.action') AS action,
json_extract(json_data, '$.timestamp') AS ts
FROM events
WHERE json_extract(json_data, '$.action') = 'LOGIN';
"""
df = con.execute(query).df()

Here, json_extract is a function from the JSON extension that takes a JSON string and a JSONPath (like $.field.subfield) and returns the value. We’re extracting user.id, action, and timestamp fields from the JSON text, and filtering to only events where action = 'LOGIN'. Without this extension, you’d have to either write a custom Python UDF to parse JSON or extract via string functions, which is painful. With it, it's straightforward. 

DuckDB’s JSON extension also provides a proper JSON type and operator-style access (the -> and ->> operators) for more advanced usage, but the above covers the common case.
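A quick sketch of that operator-style access (the JSON literal is just an inline example):

-- '->' returns JSON, '->>' returns the value as text
SELECT ('{"user": {"id": 7, "name": "Ada"}}'::JSON)->'user'->>'name' AS user_name;
-- user_name: 'Ada'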

 Example 2: HTTPFS for remote data

 If your data resides on S3:

con.execute("INSTALL 'httpfs'")
con.execute("LOAD 'httpfs'")
# Set AWS S3 credentials (assuming you have them in env or specify here)
con.execute("SET s3_region='us-west-2'")
# (You can also set s3_access_key_id, s3_secret_access_key if not using IAM roles or env vars)
df = con.execute("SELECT COUNT(*) FROM 's3://my-bucket/data/2025/august.parquet'").df()

This would directly count records in a Parquet file stored on S3. The httpfs extension handles the S3 protocol. If you forget to LOAD httpfs and try this, you'll get an error like "Object store not available" or similar. After loading, it just works as shown. Similarly, you could query an https://...csv file if you load httpfs.

Example 3: defining a Python UDF 

Perhaps you want to use a complex Python logic in your SQL, maybe using a library like NumPy or a custom algorithm on a value. You can define:

import numpy as np

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    return all(n % i for i in range(2, int(np.sqrt(n)) + 1))

con.create_function("is_prime", is_prime, returns="BOOL",
parameters=[("n", "INT")])
# Now use it:
con.execute("SELECT number, is_prime(number) as prime_flag FROM range(1, 20) AS t(number)").df()

This would list numbers 1 through 19 and indicate if each is prime (True/False). Here range(1,20) is using DuckDB's range table-valued function to generate numbers (like an internal sequence). The Python UDF is_prime checks primality. Note: This will be slower for large sequences because it's calling Python for each number. But it’s a powerful way to extend functionality for small/medium datasets.

Performance considerations: Built-in extensions like JSON and HTTPFS are implemented in C++ and optimized. Using them might carry some overhead (e.g., parsing JSON is CPU-heavy no matter what, but DuckDB’s JSON parser is in C++ so it’s reasonably fast). The HTTPFS extension will incur network I/O latency, but it does do things like read in large chunks and use multiple threads for remote reads to speed it up. When reading remote files, you might want to minimize round-trips (DuckDB tries to do that by reading large blocks). 

The spatial extension is currently somewhat limited and not as optimized as something like PostGIS for very complex operations, but it works fine for basic bounding-box or point-in-polygon queries. If spatial performance is critical, pre-filter by bounding box where possible, or look into spatial indexing (which the extension may not yet fully support).

Python UDFs, as mentioned, do not vectorize – they’ll call Python for each row or each group of rows internally. This can be slow if you use it on millions of rows. If possible, it’s often better to see if what you want can be achieved with built-in SQL or a combination of built-ins, to leverage C++ speed. But Python UDFs are fine for relatively small datasets or when prototyping.

Integration examples: By using extensions, you keep all analysis in one place. For example, if your pipeline needed to fetch JSON data from an API (which you saved to a file or a stage), you can use HTTPFS and JSON extensions together: fetch from a URL and parse JSON, all in DuckDB:

SELECT json_array_length(response)
FROM read_json_auto('https://api.example.com/data.json') as t(response);

Hypothetically, that could count elements in a JSON array from a web API. This is a contrived example, but it shows that instead of doing requests and parsing in Python, you could sometimes delegate to DuckDB.

Another integration: Using the Postgres scanner (if loaded) you could query a live Postgres table from within DuckDB. E.g.:

SELECT * FROM postgres_scan('postgres://user:pass@host/db', 'schema', 'table');

This returns rows from a Postgres table. Then you could join that with local DuckDB data. This kind of integration means DuckDB can act as a query federator.

Common errors & solutions: The two-step process of INSTALL and LOAD trips people up occasionally. INSTALL downloads the extension, but doesn’t activate it. If you forget LOAD, none of the functions will exist and you’ll get errors like “Function json_extract does not exist”. The fix is simply to load after install. In persistent DuckDB usage (say you have a persistent database file), once you install an extension, it’s available to that database. But you still need to load it each session when needed.

Another common error is missing permissions or environment for remote access. For S3, forgetting to set credentials will result in an error attempting to read S3. DuckDB by default might try to use instance credentials if on AWS, or environment variables AWS_ACCESS_KEY_ID. Ensure those are set or use SET commands to configure.

If you use extensions like the Postgres scanner, the DuckDB-provided binaries normally bundle everything needed, so it’s usually seamless – but you do need to get the connection string right (host, credentials, database name) for the external system.

For Python UDFs, a common issue is data type mismatches. If you declare a UDF returns INT but you return a Python float, DuckDB might throw an error or try to cast and possibly lose precision. Always ensure your Python function returns the type you promised (or DuckDB will attempt to cast). Also, Python None returned should map to SQL NULL (DuckDB handles that). But if your function could error or not handle certain inputs, be careful to catch exceptions or they’ll propagate as query errors.

Finally, avoid giving a UDF the same name as a built-in function. DuckDB will likely let the UDF shadow the built-in, which can be confusing to debug, so use distinctive names.

Feature 5: persistent storage and single-file databases

What it is: While DuckDB can run purely in-memory, it also supports persistent storage by saving databases to a single file (with extension .duckdb). This is similar to how SQLite works – you can have a DuckDB database file that contains tables, indexes, and data, and you can reopen that file later to get back your data. DuckDB’s storage format is columnar and optimized for analytics. Persistence means you don’t have to re-import data on each run; you can create tables once and query them repeatedly across sessions. DuckDB ensures ACID properties on this file, meaning transactions are durable and the database won’t be corrupted even in case of crashes (barring severe system issues). The persistent storage also allows features like multiple connections to the same database file (multi-threaded access from one process, or even from multiple processes in read-only mode), though with some limitations (only one process can write at a time). 

Why it's important: Using a DuckDB file as a local analytics database can be very convenient. Instead of keeping data in CSVs or Parquet and reading them each time, you can load data into a DuckDB .duckdb file once, and then queries will be faster (because it’s stored in DuckDB’s format with column statistics and such). It also allows you to update data or accumulate data over time. If you’re using DuckDB in an application, a persistent database file means the app can shut down and later resume without losing state. Additionally, the single-file approach makes it easy to share or move the database – you can send someone the .duckdb file and they can open it on their machine (DuckDB files are usually portable across OS for the same DuckDB version, though it’s recommended to use similar DuckDB version due to storage format evolution). It’s also useful for integrating with other tools; for example, you can point a BI tool that supports ODBC at a DuckDB file to do analysis. For Python workflows, it means you can treat DuckDB as a lightweight local data warehouse.

 Syntax and usage:

  • Creating/opening a database file: When connecting, instead of ':memory:', provide a file path. For example:

    con = duckdb.connect('mydatabase.duckdb')

    If the file doesn’t exist, DuckDB will create a new database file with that name. If it exists, it opens it and you have access to all tables previously created there.

  • Creating tables and inserting data: You can create tables via CREATE TABLE and then insert data via INSERT or use COPY FROM to bulk load from a CSV, or even directly CREATE TABLE AS SELECT ... FROM 'file.csv'. For instance:

    con.execute("CREATE TABLE items AS SELECT * FROM 'items.parquet'")

    This reads items.parquet and stores its content in a new persistent table items inside the DuckDB file. Now it’s there on disk; next time you connect to this file, items table is readily available.

  • Multiple connections: In one process, you can call con.cursor() to get another connection (which shares the same database state but can be used in another thread). DuckDB is threadsafe for separate connections. If you open the same file in two different Python processes, one can read while another writes, but concurrent write attempts from different processes are not allowed (you’d get a locking error). Typically, you’d have one process writing or both reading. This is an improvement over earlier versions – now DuckDB uses file locks to coordinate multi-process access (but it's still not a multi-user server, it’s meant for single-user or single-process heavy usage with perhaps read-only access by others).

  • Transactions: DuckDB auto-commits each query by default, but you can explicitly do transactions:

    BEGIN;
    INSERT INTO table1 VALUES (...);
    INSERT INTO table2 VALUES (...);
    COMMIT;

    If something fails in between, you can ROLLBACK;. This ensures that either both inserts happen or neither does (a Python sketch of this pattern appears after this list).

  • Backup/Export: You can use EXPORT DATABASE 'dir_path' to dump all tables as CSV or Parquet, and IMPORT DATABASE to load them back. Note that, unlike SQLite, DuckDB’s VACUUM statement does not compact the database file; if a file has grown a lot from deletes and updates, the usual way to shrink it is to export the database and re-import it into a fresh file.
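As referenced in the transactions bullet above, a minimal Python sketch of wrapping related writes in one transaction (the table and file names are hypothetical):

con = duckdb.connect('mydatabase.duckdb')
try:
    con.execute("BEGIN TRANSACTION")
    con.execute("INSERT INTO orders_raw SELECT * FROM 'new_orders.parquet'")
    con.execute("INSERT INTO load_log VALUES (current_date, 'new_orders.parquet')")
    con.execute("COMMIT")      # both inserts become visible together
except Exception:
    con.execute("ROLLBACK")    # neither insert is kept if anything failed
    raise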

Examples:

Example 1: Creating and using a persistent database

# Connect to a database file
con = duckdb.connect('analytics.duckdb')

# Suppose we have some data files and we want to set up persistent tables
con.execute("CREATE TABLE sales_data AS SELECT * FROM 'sales_2024.parquet'")
con.execute("CREATE TABLE customers (id INTEGER, name VARCHAR, city VARCHAR)")
con.execute("INSERT INTO customers VALUES (1, 'Alice', 'London'), (2, 'Bob', 'Paris')")

# Now let's query persistently stored data
result = con.execute("""
SELECT c.name, SUM(s.amount) as total_spent
FROM sales_data s
JOIN customers c ON s.cust_id = c.id
GROUP BY c.name
""").fetchall()
print(result)
con.close()

In this example, we created an analytics.duckdb file if it didn’t exist. We then created a table sales_data by selecting from a Parquet file – this loads the Parquet file’s content into the DuckDB file (so it’s a one-time cost, subsequent queries on sales_data will hit the local columns). We also created a customers table and inserted two rows manually. After that, we run a query joining sales_data and customers. This query will benefit from columnar storage of sales_data etc. We then close the connection. If we later do duckdb.connect('analytics.duckdb') again, the tables sales_data and customers are still there with their data.

Example 2: Reopening and verifying persistence

In a new session:

con2 = duckdb.connect('analytics.duckdb')
df = con2.execute("SELECT COUNT(*) FROM sales_data").df()
print(df)
con2.close()

This should show the count of rows in sales_data that we had loaded previously, confirming the data was saved. If we didn’t persist, we’d have to reload or it wouldn’t exist. This persistence is crucial when working with large data: you don’t want to parse GBs of CSV every time – do it once, store in .duckdb and be done.

Example 3: Multi-threaded usage (within one process)

 If you want to parallelize some operations in Python using threads:

con = duckdb.connect('analytics.duckdb')

# Prepare example table
con.execute("CREATE TABLE IF NOT EXISTS numbers (x INTEGER)")
con.execute("INSERT INTO numbers VALUES (generate_series(1,1000000))")  # insert 1..1000000# Now, sum the numbers in two separate threads for demonstrationimport threading

def calc_sum(conn, start, end):
    result = conn.execute(f"SELECT SUM(x) FROM numbers WHERE x BETWEEN {start} AND {end}").fetchall()
    print(f"Sum from {start} to {end}:", result[0][0])

# Create thread-local connections (cursors)
conn_thread1 = con.cursor()  # .cursor() gives a new connection object for same DB
conn_thread2 = con.cursor()
t1 = threading.Thread(target=calc_sum, args=(conn_thread1, 1, 500000))
t2 = threading.Thread(target=calc_sum, args=(conn_thread2, 500001, 1000000))
t1.start(); t2.start()
t1.join(); t2.join()
con.close()

Here we use con.cursor() to get an additional connection that can be used in another thread. Each thread executes a query on a portion of the data. Because DuckDB’s connections manage their own transactions, these reads can happen concurrently. DuckDB will use internal locks as needed but reading is mostly lock-free. The outputs might be:

Sum from 1 to 500000: 125000250000
Sum from 500001 to 1000000: 375000250000

(for example). Adding the two partial sums gives 500,000,500,000, the total for 1 through 1,000,000.

Performance considerations: DuckDB’s persistent storage is optimized for analytics: it stores data in column segments and with column-level indexes (min/max stats per segment) to speed up queries. When you query a persistent table with a WHERE filter, it will skip reading segments that don’t qualify (similar to Parquet pushdown). The file format is also heavily compressed by default (using various compression schemes per column, e.g., dictionary compression, frame-of-reference, RLE as applicable). This means the .duckdb file is often smaller than the raw data and very quick to read from for queries. DuckDB also memory-maps the file or uses buffered IO such that it doesn’t have to load entire columns if not needed.

When modifying data: DuckDB uses MVCC, so updates and deletes don’t immediately reclaim space – new row versions are written and old ones are marked deleted, and freed blocks are reused for later writes. The file itself does not shrink automatically (and DuckDB’s VACUUM does not rewrite it the way SQLite’s does); if it has grown a lot, export and re-import the database. If your usage pattern is mostly append-and-query (typical for analytics logs), you won’t see much fragmentation.

Opening and closing connections is fairly cheap, but not free. If you repeatedly open/close the database file, that overhead might matter (closing ensures flush to disk etc.). It’s fine for interactive, but if doing many small operations, maybe keep one connection open if possible.

Integration examples: You can use a DuckDB file as a local cache of data. For instance, if you have a pipeline that fetches data from an API or another database daily, you could insert new records into DuckDB each day. This way, you can query both historical and new data quickly in one place. It’s like having a mini data warehouse in a file. Because it’s SQL, any tool that can speak SQL (with DuckDB drivers) can query it – for example, you could connect Tableau or Excel (via ODBC) to a DuckDB file to analyze data, no actual server needed.

Another integration: data science teams could use DuckDB files to share processed data. Instead of CSVs, they send a .duckdb file. The receiver can do duckdb.connect('file.duckdb') and immediately query the data, possibly benefitting from the precomputed stats or indexes.

Common errors & solutions: A common situation is forgetting to commit when doing multiple inserts in a script if using manual transactions. Because DuckDB auto-commits, this usually isn’t an issue, but if you explicitly call BEGIN and then forget to COMMIT, your changes might not be saved. Always ensure transactions are closed.

If a DuckDB process crashes or is killed, the database file should still be consistent thanks to MVCC and write-ahead logging. On next open, it should recover. If you see a message about needing recovery or so, DuckDB will handle it automatically normally. But if the file gets truly corrupted (very rare), you might have to restore from a backup. It’s wise to keep backups if the data is important.

Multi-process concurrency: If two processes try to open the database for writing, the second will get an error like “Database file is locked”. Only one process can hold the write lock at a time (within that process you can open multiple connections). If you hit locking issues, make sure you aren’t running two conflicting processes. For read-only workloads, open the connection with read_only=True in duckdb.connect: DuckDB allows either a single read-write process or multiple read-only processes on the same file, but not a mix – a process that has the file open for writing holds an exclusive lock, so other processes can’t attach even to read. In practice, for analytics, usually one script runs at a time.
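A small sketch of the read-only pattern, reusing the analytics.duckdb file from the earlier examples:

# In processes that only need to query, open the file read-only.
# Several read-only connections from different processes can coexist,
# as long as no process holds the file open for writing.
reader = duckdb.connect('analytics.duckdb', read_only=True)
print(reader.execute("SELECT COUNT(*) FROM sales_data").fetchone())
reader.close()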

Another error could be path issues: if you provide a relative path and then change working directory, you might “lose” the file. It’s safer to use absolute paths or keep track of where the .duckdb file is. Also, avoid using a network drive for the DuckDB file if possible; file locking might behave differently. If you do, ensure the network file system supports POSIX file locks. 

Finally, note that the storage format can change between major versions of DuckDB (they try to keep backward compatibility for a while – e.g., 1.x can read back to 0.9 and upgrade it – but not always forward compatibility). So if you upgrade DuckDB to a much newer version, it should usually auto-upgrade the file format. But you typically cannot open a DB file with an older version of DuckDB than created it (just like you can’t easily open a Word doc in a version from 10 years ago if format changed). Keep DuckDB updated for best results, and if sharing files, align on version or export to an interchange format like Parquet if needed.

Advanced usage and optimization

Now that we’ve covered the basics and core features of DuckDB, we move into advanced usage patterns and how to optimize performance for heavy workloads. DuckDB is built to automatically handle many performance considerations, but understanding how to fine-tune it and use best practices can help when you push the limits.

Performance optimization techniques

In most cases, DuckDB will give great performance out-of-the-box, but for very large data or complex queries, the following techniques can help ensure you’re getting the most out of it:

  • Memory management: DuckDB uses memory for caching data and intermediate results, and by default it may claim a sizeable fraction of available RAM. You can adjust this with the memory_limit setting – for example, SET memory_limit='4GB' caps DuckDB at 4 GB. If you anticipate queries that exceed RAM, also point temp_directory at a fast disk (SSD) so DuckDB has somewhere good to spill; spilling for large sorts, joins, and aggregations happens automatically, but you control where. To monitor, you can read the current limit with SELECT current_setting('memory_limit') and inspect storage and memory usage with PRAGMA database_size. A good rule of thumb is to let DuckDB use most of the memory if it’s the only thing running, but not so much that the OS starts swapping. (A short sketch of these settings, together with EXPLAIN ANALYZE, appears after this list.)

  • Parallel execution and threads: DuckDB is fully multi-threaded for many operations (scans, joins, aggregations). By default, it uses all available cores. You can limit this with PRAGMA threads = N; if you want to avoid using all cores (maybe on a shared server). More threads can speed up queries but beyond a point you might hit diminishing returns or saturate I/O. DuckDB’s parallel scheduler handles partitioning tasks. If you notice one query not utilizing all cores, it might be that the data size or operations aren’t large enough to warrant parallelism (DuckDB won’t spawn threads for trivial operations). For consistent performance in scripts, explicitly setting the thread count can ensure reproducibility. Also, if running multiple queries concurrently (e.g., via multiple connections), each will try to use threads – so if you have two heavy queries at once, consider dividing the cores between them or run sequentially for maximum speed.

  • Data format and compression: As mentioned, columnar inputs like Parquet are much faster than CSV. If you have a large CSV, a one-time conversion to Parquet (which DuckDB can do via COPY ... TO ... (FORMAT PARQUET)) can dramatically improve query times on that data. Tables created inside DuckDB are compressed by default, and DuckDB automatically maintains min/max statistics for each column segment, which lets it skip data during scans. You can list tables with PRAGMA show_tables, and inspect a table’s columns and segment-level statistics with PRAGMA table_info('table_name') and PRAGMA storage_info('table_name'). If statistics ever seem stale after a massive load, you can run ANALYZE to refresh them, though DuckDB normally keeps them up to date automatically.

  • Query profiling and analysis: To optimize a slow query, you should inspect its plan. DuckDB offers EXPLAIN to see the logical plan, and EXPLAIN ANALYZE to actually run the query and output timing for each step. For example:

    EXPLAIN ANALYZE SELECT region, SUM(sales) FROM big_table GROUP BY region;

    This will show how DuckDB is executing the query (e.g., if it is doing a parallel hash aggregation, you’ll see that, along with time taken in each phase). By analyzing this, you might identify bottlenecks (like a slow join due to missing filter, etc.). DuckDB’s web site also offers a visual profiler (you can output profiling info and view it). Enabling the progress bar with SET enable_progress_bar=true can at least show you roughly which stage a long query is in. If a query is slow, consider if you can rewrite it: e.g., pushing filters earlier (DuckDB usually does this automatically), avoiding extremely large cross joins or needless DISTINCTs. Sometimes adding an index (DuckDB supports indexes using CREATE INDEX on a table column) can speed up highly selective point queries – but note, indexes in DuckDB are not commonly needed for analytics since scans are fast, they mainly help if you often do WHERE primary_key = constant lookups on a very large table repeatedly. They also incur cost on insert. So use indexes sparingly for specific cases.

  • Chunking and incremental processing: If a job cannot fit in memory even with disk spilling (for example, joining two very large tables can still be I/O-heavy), consider processing in chunks. You can page through a file by wrapping read_csv_auto (or a table) in a query with LIMIT and OFFSET, or use read_csv’s skip option, loading and processing, say, a million rows at a time before moving to the next chunk – a pattern for when you absolutely cannot do it in one SQL statement due to memory constraints. Often, though, it’s simpler to let DuckDB’s operators handle it. Another approach is to break a large operation into pre-aggregations: if you need to join two massive tables, aggregating one of them by the join key first can greatly reduce the amount of data going into the big join.

  • Use of temporary tables or caching: If you have a subquery that is used multiple times, DuckDB might recompute it each time unless it can tell it’s pure (common subexpression). You can manually materialize it into a temporary table. For example:

    CREATE TEMP TABLE recent AS SELECT * FROM large_table WHERE date > '2025-01-01';

    and then use recent in multiple subsequent queries. This way you filter once. Temporary tables live only during the session and are stored in memory (unless they overflow). You could also create an index on a temp table if needed for a lot of point lookups. This technique is useful in iterative analysis where you slice data and then do multiple things with that slice.

  • Specialized functions and SQL tricks: Leverage DuckDB-specific functions that can be faster than generic methods. For example, DuckDB has map() and list() types which can sometimes do things more efficiently if used right (like grouping values into lists rather than doing separate subqueries). It also has efficient UNNEST for lists and STRUCT_PACK for combining columns. If doing a lot of string pattern matching, consider DuckDB’s vectorized regexp_matches or full-text extension rather than pulling into Python. Similarly, for heavy computations like statistical ops, see if DuckDB’s APPROX_COUNT_DISTINCT (HyperLogLog) or other approximate functions meet your needs faster than exact count distinct. It also has quantile_cont for median etc. which might be faster than sorting in Python.
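As referenced in the memory-management and profiling bullets above, a short sketch that pulls a few of these knobs together; the limits, temp path, and big_table name are illustrative:

con = duckdb.connect('analytics.duckdb')

# Cap memory and threads, and point spill files at a fast disk
con.execute("SET memory_limit='4GB'")
con.execute("SET threads TO 4")
con.execute("SET temp_directory='/mnt/fast_ssd/duckdb_tmp'")   # hypothetical path

# EXPLAIN ANALYZE runs the query and reports per-operator timings
plan = con.execute("""
    EXPLAIN ANALYZE
    SELECT region, SUM(sales) FROM big_table GROUP BY region
""").fetchall()
for row in plan:
    print(row[-1])   # the rendered plan / timing text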

In summary, performance tuning DuckDB often comes down to format (use Parquet or native tables vs raw text), resources (give it enough memory and threads), and query structure (let the optimizer do its job, but assist by avoiding pathological patterns). DuckDB aims to be forgiving and fast with minimal tuning, but the above techniques can push it further for edge cases.

Best practices for DuckDB usage

To use DuckDB effectively in Python projects, consider the following best practices which cover code organization, error handling, testing, and deployment:

  • Organize code to separate SQL logic and Python logic: It’s often helpful to keep your SQL queries as either multiline strings or even external .sql files, separate from your Python business logic. This makes the queries easier to read and maintain. For example, define your query strings at the top or in a config, and then pass them to con.execute. This separation also aids testing – you could unit test a query on a small in-memory setup to verify it returns expected results.

  • Parameterize queries properly: Never use Python string concatenation or f-strings to inject variables into SQL (to avoid SQL injection issues and quoting bugs). Use DuckDB’s parameter binding instead: for instance, con.execute("SELECT * FROM table WHERE id = ?", [some_id]). This ensures proper escaping and can also let DuckDB reuse the query plan for repeated calls. DuckDB follows the DB-API parameter style with ? placeholders, which is straightforward. For bulk inserts, use con.executemany with a parameterized statement – this is far more efficient than looping in Python and inserting rows one by one (a small sketch appears after this list).

  • Error handling strategy: When executing SQL, be prepared to catch exceptions (duckdb.Error for example). For instance, if you attempt to create a table that already exists, DuckDB will throw an error. If that’s expected (maybe you don’t want to recreate), you can either use CREATE OR REPLACE TABLE to avoid the error or catch it and move on. Similarly, any constraint violation or type error will raise. Use try/except around critical sections. Logging the query that failed and error message is useful for debugging. DuckDB’s error messages are usually descriptive (e.g., “Binder Error: No column named X”), which helps pinpoint issues.

  • Test with small data: For complex transformations, test your DuckDB queries on a small sample of your data (or contrived small tables) to ensure correctness. DuckDB is deterministic, so if it works on small data, it will on large (aside from performance). By writing tests that feed known small inputs and checking the output DataFrame or list of results, you can validate logic (like grouping, joining logic, edge cases for nulls). DuckDB can easily create in-memory tables for testing using VALUES clause or small DataFrames as we saw. Include those tests in your test suite so if you change a query, you catch unintended changes.

  • Documentation and readability: Write your SQL clearly. Use proper formatting (new lines for clauses, indent subqueries) – treat it like code. DuckDB doesn’t care about formatting, but your future self and colleagues will appreciate it. You can even comment in SQL strings with /* ... */ or -- if needed to clarify tricky parts. This is part of best practices because readable queries are easier to optimize and less error-prone.

  • Resource cleanup: Always close DuckDB connections when done (especially after writing data). DuckDB flushes outstanding work on close, and an open connection keeps a lock on the database file (which can block other processes from using it). Recent DuckDB versions let you use the connection as a context manager (with duckdb.connect('file.duckdb') as con: ...), which closes it for you; otherwise call con.close() explicitly rather than relying on the destructor. For long-running applications, keeping one connection open is fine, but if you open many (say in a multi-threaded pool scenario), close the extras to avoid piling up file handles and memory usage from multiple connections.

  • Concurrent access and transactions: As a best practice, if you have multiple threads or tasks, coordinate so that only one writes at a time to avoid lock contention. If you do need concurrency, rely on DuckDB’s snapshotting – readers can keep reading while a writer is mid-transaction. Generally let DuckDB handle it, but design your system so writes (inserts/updates) happen in batches at controlled points rather than as many tiny interleaved writes, which can fragment data and reduce performance. If multiple processes genuinely need shared write access, consider putting a small service in front of the database or using a managed offering such as MotherDuck; otherwise funnel writes through a single process.

  • Backup your data: If using DuckDB for something important, periodically back up the .duckdb file (or export to Parquet or CSV). While DuckDB is stable, it’s still a newer technology and bugs or corruption (though rare) could theoretically happen. Having a backup ensures you can recover. This is a general data best practice, not specific to DuckDB.

  • Keep DuckDB up-to-date: DuckDB is actively developed and new versions often bring performance improvements and bug fixes. However, note the storage format compatibility – usually backward compatible but check release notes. It’s best to upgrade in a controlled way: test your queries on the new version (maybe on a copy of the database file first). Best practice is to not lag too far behind on versions since each release often has helpful features (like new SQL functions or speed-ups).

  • Production deployment considerations: If you deploy an application that uses DuckDB (say a Flask API that queries a DuckDB file), be mindful of concurrency (Flask with multiple workers could open multiple DuckDB connections). Possibly use a singleton connection or connection pool (though pool not as needed for embedded). Also, ensure the environment has the same DuckDB version that you developed with to avoid any subtle differences. DuckDB doesn’t require separate installation if you package it via pip, but if your environment is containerized, just include pip install duckdb==<version> in requirements.

  • Logging and monitoring: It can be useful to log when heavy DuckDB queries start and end along with their duration. That way you have an audit of performance over time. If a query is slow in production one day, you can see from logs how long it took historically. Also, monitor resource usage: since DuckDB can use a lot of CPU, ensure your monitoring picks up CPU spikes (which may be fine, but you want to know if it saturates system) or memory usage to catch if you need to adjust memory_limit. For long ETL-style queries, use the progress bar or output messages to not appear hung. In a non-interactive environment, you might disable the progress bar (which is default off anyway unless interactive), but you can implement your own progress by breaking work into steps and logging after each step.
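As referenced in the parameterization bullet above, a small sketch of binding parameters instead of formatting strings, reusing the customers table from the persistence example:

some_id = 1  # illustrative value

# Single parameterized query: '?' placeholders, values passed as a list
row = con.execute("SELECT name, city FROM customers WHERE id = ?", [some_id]).fetchone()

# Bulk insert with executemany instead of a Python loop of single-row INSERTs
new_customers = [(3, 'Carol', 'Berlin'), (4, 'Dave', 'Madrid')]
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", new_customers)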

By following these best practices, you ensure that your usage of the DuckDB library in Python is robust, maintainable, and efficient. DuckDB is a powerful tool, and like any, it benefits from thoughtful usage patterns especially as you scale up complexity or deploy to production environments.

Real-world applications of DuckDB

DuckDB’s unique capabilities have led to its adoption in a variety of real-world scenarios across different industries. Let’s explore several detailed case studies and examples where DuckDB has been successfully applied:

1. Accelerating ETL pipelines in SaaS (FinQore/SaaSWorks): FinQore, a SaaS company dealing with customer analytics, had an ETL pipeline that aggregated usage data for their clients. Initially, they used a combination of Python and a traditional database, and one critical pipeline step (calculating customer metrics across millions of rows) was taking ~8 hours overnight. By integrating DuckDB into their pipeline (essentially using DuckDB to perform the heavy aggregations with SQL), they were able to cut this processing time down to about 8 minutes, a 60x speed improvement. They achieved this by reading raw event data (stored in Parquet on cloud storage) directly with DuckDB, letting DuckDB perform group-bys and joins that the prior approach did in Python. The result was not only faster but also simpler – much of the custom Python code was replaced with concise SQL. This demonstrates DuckDB’s strength as an embedded analytical engine: FinQore didn’t need to maintain a separate Spark cluster or data warehouse for this; they plugged DuckDB into their existing Python codebase and gained near-interactive performance for a task that used to consume most of the night. This speed allowed them to provide more timely insights to their clients and freed up engineers from babysitting long batch jobs.

 2. Security log analysis at Okta: Okta, a large identity/security company, deals with enormous volumes of log data (security events, login attempts, etc.). Their defensive cybersecurity team needed to crunch these logs to detect anomalies and generate reports. Using DuckDB in a cloud function context (likely AWS Lambdas or similar ephemeral compute), they managed to process 7.5 trillion log records across 130 million files in six months. They deployed thousands of DuckDB instances concurrently – since DuckDB is embedded and lightweight, they could spin up many parallel jobs, each analyzing a chunk of logs, writing results, and shutting down. This approach replaced or supplemented a cloud data warehouse solution that cost them around $2000/day in processing (as per Snowflake usage). By doing it with DuckDB on inexpensive storage and stateless compute, they dramatically reduced costs while maintaining performance. Key to this success was DuckDB’s ability to read compressed Parquet logs directly and perform filtering/aggregation in situ. Okta’s case shows that DuckDB can scale out (horizontally) – even though DuckDB itself is single-node, nothing stops you from running many DuckDB processes in parallel on partitioned data. The reliability (each DuckDB instance handling part of the data with ACID guarantees in its process) meant they could trust the results. This is a large-scale, enterprise-grade application of DuckDB for log analytics, indicating how well it can handle “big data” when used cleverly.

3. Watershed’s carbon data platform: Watershed is a climate tech company that helps enterprises analyze their carbon footprint. They migrated parts of their analytics from PostgreSQL to DuckDB. The reason was that some of their customers had moderately large datasets (tens of millions of records), and PostgreSQL was struggling with query performance and the maintenance overhead (indexes, vacuuming, etc.) at that scale. By using DuckDB as the compute layer querying Parquet files stored on cloud storage, they achieved about 10x faster performance for their analytic queries and eliminated a lot of maintenance pain. The architecture Watershed used was: data is stored in open format (Parquet on Google Cloud Storage), their app translates user requests into SQL, and DuckDB (running perhaps in a serverless manner) executes the SQL over those Parquets. Because DuckDB can handle larger-than-memory by spilling, and because each query is fast, they could serve analytical results on demand to their users. They also implemented an external cache layer to cache remote file reads (which DuckDB’s 1.3 version now has built-in) to improve repeated query speed. The success here shows an “analytics as a service” scenario: building a SaaS analytics product on top of DuckDB, benefiting from its efficiency and the simplicity of not managing an external DB for each client. They handle about 75k analytical queries daily with DuckDB in the loop, demonstrating its production reliability.

4. Interactive data applications (evidence and GoodData): Evidence is an open-source BI tool (kind of like an alternative to Metabase) that adopted DuckDB as a “universal SQL engine” under the hood. Evidence’s goal is to let users create reports and dashboards using SQL. By embedding DuckDB, they allow analysts to join local CSVs, data from cloud storage, and other sources in one place. DuckDB’s portability means Evidence can be deployed easily without requiring a heavy database backend – everything runs in-process. In a similar vein, GoodData – a well-known analytics platform – evaluated DuckDB (and MotherDuck) as a query engine for their SaaS BI service. They ran over 700 test scenarios and found that DuckDB (and the cloud-backed MotherDuck) outperformed Snowflake in many query types. GoodData was impressed by DuckDB’s low latency on moderate data volumes and its ability to handle concurrent analytical workloads (with many users querying at once). They did note some missing features (at the time) like certain date arithmetic and query cancellation, but those were not blockers and were being improved. The fact that a commercial BI platform is considering replacing a cloud data warehouse with DuckDB for backend query execution speaks volumes – it can drastically reduce cost and complexity while providing sufficient performance for most interactive dashboards. GoodData’s tests showed DuckDB’s concurrency is good for dozens of users and that it can be embedded in a multi-tenant scenario safely.

5. Data science & machine learning pipelines: DuckDB is also used in ML-oriented workflows. An example is an ML feature engineering pipeline: Instead of using pandas for every transformation, some data scientists load raw data into DuckDB, use SQL (potentially via the duckdb.sql() Pandas-like API or just direct SQL) to join tables, filter outliers, and aggregate features, and then pull the result into a pandas DataFrame or directly into numpy for modeling. This has been done in projects like time-series forecasting where you may need to aggregate large log data into features per time window. DuckDB ensures this step is fast and not a bottleneck. After feature computation, they pass the smaller feature set to scikit-learn or XGBoost. Because DuckDB can be installed and run anywhere (local, in Jupyter, in production code), it’s easy to integrate into ML pipelines, even within tools like Airflow or Prefect tasks. The win here is usually performance (DuckDB can be faster and handle bigger data than pandas) and the ability to use SQL which is often clearer for complex aggregations.

6. MotherDuck – DuckDB in the cloud: While not an “application” built by an end-user, it’s worth noting the creation of MotherDuck, a cloud service that combines DuckDB with cloud storage. The existence of MotherDuck shows how DuckDB is being leveraged to provide serverless analytics: MotherDuck uses DuckDB under the hood to run queries (the heavy parts are executed in their cloud, but it presents like a DuckDB connection). This essentially allows extremely large scale-out by separating storage and compute but still relying on DuckDB’s core. It’s mentioned here because some organizations are using MotherDuck in preview to handle collaborative analytics (DuckDB with “multiplayer” capabilities for teams). For instance, an enterprise might use MotherDuck to let multiple analysts query a shared dataset concurrently without each moving data to their machine. The queries run with DuckDB efficiency but on cloud data, and results can be shared. This scenario validates DuckDB’s technology in a cloud multi-user context.

7. Academic and scientific data analysis: Researchers dealing with large datasets (genomics, astronomy, survey data) have started to adopt DuckDB in their workflows. For example, a group at a university took a large dataset of experiment results (which was originally stored in CSVs and was cumbersome to analyze due to size ~50 million rows), and they imported it into DuckDB. They then could run complex statistical SQL queries (like grouping, joining with reference tables, computing percentiles) directly, and export summary results. The ease of installation was critical – they didn’t need a DBA or cluster, each researcher could have their own DuckDB database file. This democratizes handling large data for people who are comfortable with data but not with managing databases. It’s akin to how SQLite empowered using databases in small apps; DuckDB empowers heavy analytics in small teams or single researchers’ projects.
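For case 5 above (feature engineering), a minimal sketch of the pattern might look like the following; the events.parquet file and its columns are placeholders for illustration, not taken from any specific project.

    import duckdb
    import pandas as pd

    con = duckdb.connect()  # in-memory database

    # Heavy lifting in DuckDB: aggregate raw events into per-user, per-day features.
    features = con.execute("""
        SELECT
            user_id,
            date_trunc('day', event_time) AS day,
            COUNT(*)          AS n_events,
            AVG(duration_ms)  AS avg_duration_ms
        FROM 'events.parquet'
        GROUP BY user_id, day
    """).df()

    # The much smaller feature table now fits comfortably in pandas/NumPy
    # and can be fed to scikit-learn, XGBoost, etc.
    X = features[["n_events", "avg_duration_ms"]].to_numpy()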

 Each of these case studies highlights a different facet of DuckDB’s real-world use: performance boost in existing pipelines, cost savings by eliminating heavier infrastructure, interactive analytics enablement, and ease of integration in various domains. The consistent theme is that DuckDB often replaces or augments larger systems (like Spark, data warehouses, or pandas) to provide a sweet spot of performance, simplicity, and flexibility. Companies large and small have validated DuckDB in production, and the community shares many success stories in blog posts and conference talks – reinforcing that DuckDB is not just a novel tool, but a reliable component of modern data stacks.

Alternatives and comparisons

DuckDB is a powerful tool, but it’s not the only solution for data analysis in Python. Let’s compare DuckDB with some alternative Python libraries and databases that are often considered for similar tasks. We’ll use a comparison table to highlight key differences, and then discuss migration considerations.

Detailed comparison table

Below is a table comparing DuckDB with four alternatives: Pandas, Polars, SQLite, and Dask DataFrame. Each of these represents a different approach to data manipulation in Python (Pandas for in-memory DataFrames, Polars as a modern DataFrame library, SQLite as an embedded SQL database, and Dask for parallel/distributed DataFrame operations).

Tools compared: DuckDB (SQL OLAP engine), Pandas (DataFrame library), Polars (DataFrame library), SQLite (embedded SQL database), and Dask DataFrame (parallel pandas).

Primary Purpose

  • DuckDB: In-process SQL analytics (OLAP); handles large data via columnar storage and SQL queries. Ideal for analytical queries, joins, and aggregations within Python.
  • Pandas: In-memory data manipulation using Python code (DataFrame API). Great for small to medium data and general data cleaning/analysis tasks.
  • Polars: In-memory (or single-machine) data manipulation using a Rust-based DataFrame API. Optimized for performance (multi-threading, Arrow memory format).
  • SQLite: In-process relational database for transactional or simple analytical queries. Best for smaller datasets or situations needing a simple persistent DB.
  • Dask DataFrame: Parallelizes pandas operations across cores or a cluster for larger-than-memory datasets. Aims to scale the pandas API to big data, with some limitations on API coverage.

Data Size Handling

  • DuckDB: Can handle datasets beyond RAM by streaming from disk and processing in parallel. Handles billions of rows (limited by disk) and uses a columnar format to optimize memory.
  • Pandas: Limited by memory – all data must fit in RAM (though pandas 2.0 can use Arrow for better memory efficiency). Struggles with >10 million rows without a lot of memory.
  • Polars: Handles large datasets in memory efficiently using lazy evaluation and Rust optimizations. Can work out-of-core via memory mapping for certain operations, but typically data should fit in one machine’s memory.
  • SQLite: Limited by memory for complex queries; the database file can be large, but SQLite will slow down or become I/O-bound on very large tables (queries over 100M+ rows can be slow). Not designed for huge analytic queries.
  • Dask DataFrame: Designed for larger-than-memory work by distributing or chunking data. Can scale to many millions of rows or multi-GB datasets using multiple cores/nodes, at the cost of coordination overhead; not all pandas operations are parallelized.

Performance

  • DuckDB: Excellent for OLAP queries (aggregations, joins) on large data thanks to vectorized C++ execution and parallelism. Often outperforms pandas/Polars for heavy SQL-style operations (e.g., a 30M x 30M join) and reads/writes Parquet extremely fast. SQL parsing overhead is negligible for large jobs, though for tiny operations pure Python may be slightly quicker.
  • Pandas: Good for small to medium data. Optimized in C for vectorized column operations, but single-threaded (except for some numerical libraries). Tends to slow down or run out of memory on very large data: excellent below ~1M rows, okay up to maybe 10M, beyond that performance drops sharply or memory issues arise.
  • Polars: Excellent single-thread performance and uses multi-threading automatically. Often faster than pandas for similar operations thanks to Rust and Arrow optimizations, and lazy execution can optimize the query plan. Polars can beat DuckDB in some cases for pure DataFrame operations (especially when lazy optimization avoids touching all the data), but for very large joins/aggregations DuckDB’s parallel SQL may catch up or win.
  • SQLite: Fast for simple lookups and small joins on modest data, using B-tree indexes for point queries. For analytical queries scanning large portions of a table, however, SQLite is much slower than DuckDB (row-oriented, single-threaded, no parallel execution). Good for analytical queries under ~100k rows; beyond that, performance lags.
  • Dask DataFrame: Can handle large data by using multiple cores or a cluster; performance depends on the cluster and data distribution. There is overhead in splitting tasks and combining results, so on a single machine it is often not as fast as DuckDB or Polars for moderately large data. For truly huge data on a cluster, Dask (or Spark) can outperform single-node solutions by scaling out, but in a single-machine scenario Dask is often slower than DuckDB due to that overhead.

Learning Curve

  • DuckDB: Moderate if you know SQL. For those familiar with SQL (or databases like PostgreSQL), DuckDB feels natural, and the Python integration is straightforward (just use SQL strings). If you are not versed in SQL, there is a learning curve to formulating queries instead of step-by-step Python code, but the concepts of SELECT, JOIN, etc. are well documented and widely understood.
  • Pandas: Low for those used to Python – the pandas API is intuitive (like manipulating spreadsheets in code). Mastering pandas (avoiding chained-operation pitfalls, understanding indices) can take some time, but many resources are available; overall it is very popular with lots of community help.
  • Polars: Moderate – similar to pandas in concept but with different syntax (Polars uses an expression API and lazy frames, which is a shift in thinking). If you know pandas, Polars is not hard to pick up, but you need to learn its idioms (like lazy queries) to fully exploit its performance. Good documentation is available.
  • SQLite: Very low – SQL is a standard language and SQLite keeps it simple (no complex features). Many developers have exposure to basic SQL, and using it in Python via sqlite3 is simple (execute queries, fetch results). The main learning is its limitations (e.g., no full JSON or window function support in older SQLite versions). For those new to SQL, the basics are easy but complex queries can be challenging.
  • Dask DataFrame: Moderate – Dask DataFrame mimics the pandas API, so basic usage is familiar if you know pandas. However, not all pandas operations are supported or behave exactly the same, and optimizing Dask (choosing appropriate partitions, avoiding shuffles) requires some understanding of how it works under the hood. Debugging can be trickier because of lazy execution and the distributed nature.

Community & Support

  • DuckDB: Growing fast. DuckDB has an active open-source community and is backed by DuckDB Labs and the DuckDB Foundation for development, with over 32k GitHub stars and active discussions. Good documentation and an active Slack/Discord community. It is newer than pandas or SQLite, so the community is smaller but enthusiastic and helpful, with many blog posts emerging.
  • Pandas: Huge community – pandas is one of the most used Python libraries. Tons of Stack Overflow questions, tutorials, books, and an extended ecosystem (e.g., integrations with plotting libraries). Support is excellent simply due to the volume of Q&A available.
  • Polars: Rapidly growing community in data science and Rust circles. Still smaller than pandas but quite active on GitHub and user forums; Polars is younger but gaining traction as a pandas alternative, with many Stack Overflow Q&As, an active Discord, and responsive developers.
  • SQLite: Massive installed base (SQLite is everywhere). Many developers know it and resources abound, though there is less of an interactive “community” because it just works for most. For specific help there are forums, and it is well documented; given SQLite’s stability, there is less need for community troubleshooting.
  • Dask DataFrame: Moderate community. Dask has been around a while (especially in PyData circles), with Stack Overflow questions, tutorials, and industry use. Not as ubiquitous as pandas, but a solid user base with overlap with the broader PyData community; the Dask dev team is active and there is a Discourse forum for questions.

Documentation Quality

  • DuckDB: Very good official docs with concepts and an SQL reference (covering usage in different languages, SQL syntax, etc.). Since the concepts align with standard SQL, generic SQL resources also apply. The docs include a performance guide and many examples, though being newer they may lack depth on edge cases compared to older projects.
  • Pandas: Comprehensive documentation (though some find it scattered given the breadth of functions). Many third-party guides and books fill the gaps, and the docs now include examples for most functions. Some behaviors (like SettingWithCopyWarning) feel under-documented officially, but community articles cover them.
  • Polars: An extensive user guide and API reference, well organized (book-style) with clear examples and explanations of Polars concepts (lazy vs. eager, expressions). As a newer project the docs are up to date and clearly written; there are fewer third-party books, but the official docs suffice for most.
  • SQLite: Excellent official docs (thorough documentation, a famous FAQ, etc.), plus plenty of books and tutorials. SQLite’s documentation is considered very reliable – it covers SQL dialect details, limits, the file format, and more, and thanks to its stability it has needed few changes over the years.
  • Dask DataFrame: Decent documentation. Dask’s website has guides for DataFrame usage and best practices (e.g., how to partition data) and clearly states which parts of the pandas API are not supported. Because Dask involves more moving parts (scheduling, cluster setup), the docs include deployment sections that can be complex. Overall quality is good, but new users may need more background reading than with the more plug-and-play options.

License

  • DuckDB: MIT License (permissive open source); free to use in commercial products with no restrictions on distribution.
  • Pandas: BSD 3-Clause (permissive open source); free for all uses.
  • Polars: MIT License (permissive); free to use and integrate.
  • SQLite: Public domain; no license issues at all, usable anywhere freely.
  • Dask DataFrame: BSD 3-Clause (permissive); open source and free, including for commercial use.

When to Use Each

  • DuckDB: Use DuckDB when you have analytical queries on large/local data. Perfect if you need to join large datasets, group and aggregate millions of rows, or query data files (CSV/Parquet) directly. Ideal for data science work where SQL can simplify complex transformations, or as an embedded analytics database in applications. If your workflow is SQL-friendly and performance on big data matters (but you don’t want a separate DB server), DuckDB is a top choice. It shines for read-heavy OLAP workloads and columnar data; it is not the best fit for high-frequency single-row updates (that’s OLTP).
  • Pandas: Use pandas for small to medium datasets where you want maximum flexibility in Python. It’s great for data cleaning, feature engineering, and scenarios requiring custom Python operations that aren’t easy in SQL. If your data fits in memory (say up to a few million rows) and you prefer a step-by-step coding approach or need many Python-specific libraries, pandas remains very convenient, and some manipulations (like complex iterative algorithms) are easier there. If you start hitting performance or memory limits, consider moving to DuckDB or Polars.
  • Polars: Use Polars when you want DataFrame operations faster than pandas and are open to a new API. It’s a good choice if your data is slightly too large for pandas or you want to use multiple cores seamlessly while staying in a DataFrame paradigm rather than SQL. Polars is especially handy for complex transformations thanks to lazy execution, which can optimize the query under the hood. It targets similar tasks as pandas (data analysis, feature engineering) but on bigger data or when performance is key; if you don’t know SQL or prefer not to use it, Polars can be a better fit than DuckDB for purely DataFrame-style work.
  • SQLite: Use SQLite for lightweight storage and simple queries, especially when you need structured on-disk storage with occasional queries: embedding a small database in a mobile or desktop application, caching small lookup tables, or handling small-scale OLTP (inserting a few rows, point lookups). It’s also handy for quick schema prototyping or when a single-file persistent DB is needed and write concurrency is not a big concern. For large-scale analytics, SQLite is not suitable – use DuckDB instead; SQLite lacks parallel execution and many advanced analytical SQL features, so it is best when requirements are simple.
  • Dask DataFrame: Use Dask DataFrame when you need to scale pandas workflows across multiple cores or a cluster while largely keeping pandas-like code. If your data doesn’t fit in one machine’s memory, or you plan to leverage a distributed cluster (a Spark-like environment with a pandas API), Dask is a contender, and it is useful in ETL pipelines where chunks of data (like many CSVs) should be processed in parallel. That said, if your data fits on one machine, DuckDB or Polars may be simpler and faster. Dask is also a good option if you need the wider Dask ecosystem (Dask arrays, etc.) or want to scale existing pandas code with minimal changes, though it is not as turnkey as DuckDB and may require more tuning. For parallel/distributed needs (especially on-prem or custom clusters), it fills the gap.

Table Summary: DuckDB stands out when it comes to performing SQL analytics on large datasets within a Python environment, offering a combination of speed, ease of use (for SQL users), and the ability to handle data beyond memory. Pandas remains the go-to for smaller data and Pythonic data manipulation, Polars is an emerging high-performance DataFrame library gaining traction for single-machine analytics, SQLite is the reliable embedded database for simpler persistent storage and small-scale queries, and Dask extends the pandas paradigm to clusters for scaling out workflows. Each has its niche: DuckDB vs Pandas often comes down to data size and whether SQL is suitable for the task, DuckDB vs Polars is often about preference of SQL vs DataFrame API (both handle large data well), DuckDB vs SQLite is about analytical columnar engine vs transactional row engine, and DuckDB vs Dask is about single-node power vs multi-node scalability.

Migration guide (Moving from/to DuckDB)

Depending on what tool you are currently using, migrating to DuckDB can offer benefits, and vice versa – you might find scenarios where you want to migrate from DuckDB to another system (e.g., if your data outgrows a single machine). Here’s a guide to migrating:

When to migrate to DuckDB:

  • You have been using pandas and hit performance or memory issues. Perhaps your data grew to tens of millions of rows and pandas is slow or can’t load it entirely. This is an ideal time to try DuckDB. You can migrate step by step: start by swapping some group-by or join operations to DuckDB queries to see immediate speedups, or use DuckDB to read big files that pandas can’t (e.g., duckdb.query("SELECT * FROM 'large.parquet'") – a short sketch follows this list). Over time, you might replace whole pandas pipelines with a few SQL queries in DuckDB.

  • You have been using SQLite for analytic queries and find it’s too slow or lacks functionality (like window functions, parallelism). Migrating to DuckDB is straightforward since both are embedded databases. DuckDB supports most SQLite SQL syntax (with more analytical capabilities), so often you can take your .sql scripts and run them in DuckDB with little modification. The storage is different, so you’d export data from SQLite (e.g., to CSV or directly attach the SQLite database in DuckDB using the sqlite_scanner extension to pull data in).

  • You are using Spark or Dask on a single node for convenience but not truly distributing across a cluster. In this case, you might migrate to DuckDB for simplicity and possibly performance. For example, if you set up a local Spark just to handle a 50GB dataset on one machine, DuckDB likely can do it more efficiently without the overhead. The migration would involve rewriting PySpark or Dask DataFrame code into SQL queries or DuckDB’s Python API. But since you’re on one node, you won’t lose capacity, and you’ll gain simplicity (no cluster complexity).
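For the pandas scenario above, a minimal sketch of offloading one heavy step to DuckDB might look like this; the file name and columns are placeholders, not a prescribed layout.

    import duckdb

    # Filter and aggregate a Parquet file that is too large to load into pandas;
    # only the small aggregated result is materialized as a DataFrame.
    top_customers = duckdb.query("""
        SELECT customer_id, SUM(amount) AS total_spent
        FROM 'orders_large.parquet'        -- placeholder path
        WHERE order_date >= DATE '2024-01-01'
        GROUP BY customer_id
        ORDER BY total_spent DESC
        LIMIT 100
    """).df()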

Step-by-step migration process (to DuckDB):

  1. Identify critical data processes – find where your current system is slow or hitting limits. E.g., a certain join that takes hours in pandas.

  2. Replicate in DuckDB on sample data – take a subset of data and write a DuckDB SQL query or procedure to do the same. Test the results match.

  3. Load data into DuckDB – either by querying files directly or importing. If coming from SQLite, you can export to CSV or attach the old database directly (ATTACH 'old.db' (TYPE sqlite)) using DuckDB’s SQLite extension and pull the data in. If coming from pandas, you can bulk-load an existing table with con.append('table', df), or register the DataFrame and run a CREATE TABLE ... AS SELECT over it (see the sketch after this list).

  4. Replace component by component – you don’t have to do an all-or-nothing. For instance, in a pandas pipeline, you could start by doing the heavy aggregations in DuckDB (using df = con.execute("SELECT ...").df() to get result back) while keeping other pandas code. Over time, you might move more steps into one DuckDB query or a series of them.

  5. Optimize and adjust – once in DuckDB, possibly adjust data types (DuckDB might infer some columns differently, e.g., large integers). Ensure that things like date handling and null semantics behave as expected. DuckDB may require slight SQL syntax changes: queries coming from Postgres usually run with little or no modification, while logic coming from pandas needs to account for SQL conventions such as = for equality and three-valued logic for NULLs.

  6. Testing – run your pipeline end-to-end with DuckDB producing output and compare to old outputs to verify correctness. Usually, differences could be due to floating-point rounding or how nulls were handled. Adjust queries if needed (DuckDB has many functions to match logic, e.g., use coalesce to mimic fillna, etc.).

  7. Deployment – include DuckDB as a dependency (pip install or conda install). Ensure the environment has the same or newer version of DuckDB than you developed with. If you’re going to use a persistent DuckDB database file, plan migrations similarly to how you would for any database (though with DuckDB, often you just recreate from source data if needed since it’s often used as an ephemeral analysis DB).
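A minimal sketch of steps 3 and 4 under illustrative assumptions (a legacy old.db SQLite file, a small pandas DataFrame, and made-up table/column names):

    import duckdb
    import pandas as pd

    con = duckdb.connect("analytics.duckdb")   # persistent DuckDB file

    # Step 3a: pull a table in from an existing SQLite database
    # (uses the bundled SQLite extension; INSTALL downloads it on first use).
    con.install_extension("sqlite")
    con.load_extension("sqlite")
    con.execute("ATTACH 'old.db' AS legacy (TYPE sqlite)")
    con.execute("CREATE TABLE sales AS SELECT * FROM legacy.sales")

    # Step 3b: pull a pandas DataFrame in as a DuckDB table.
    df = pd.DataFrame({"category": ["a", "b", "a"], "value": [1.0, 2.5, 3.0]})
    con.register("df", df)
    con.execute("CREATE TABLE staging AS SELECT * FROM df")

    # Step 4: do the heavy aggregation in DuckDB, keep the rest of the pipeline in pandas.
    summary = con.execute("""
        SELECT category, SUM(value) AS total
        FROM staging
        GROUP BY category
    """).df()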

Migrating from DuckDB to others (if needed):
There might be cases to move off DuckDB:

  • If your workload becomes truly distributed (data too large for one node, or many concurrent users). Then you might consider moving to a data warehouse like BigQuery, Snowflake, or a Spark cluster. Migrating from DuckDB to such systems would involve exporting data (DuckDB can export tables to Parquet easily, which you can then load into the other system) and translating SQL (DuckDB uses standard SQL, so most queries will run with minor or no changes on those platforms).

  • If you need to integrate into a production DB for OLTP or more frequent small updates, you might migrate certain tables from DuckDB to PostgreSQL or similar. That means dumping the DuckDB data (again, Parquet or CSV) and importing to Postgres. DuckDB’s SQL is very close to Postgres in many ways (even has some Postgres-specific functions), but you might miss some advanced type support (Postgres has more types, etc.). It should be a straightforward translation for basic tables.

  • If you initially used DuckDB for ease of development but want to deploy on mobile or edge devices, SQLite may make more sense for very small data (DuckDB can run on mobile too, but with a larger footprint than SQLite). In that case, export from DuckDB (to CSV or INSERT statements) and import into SQLite, rewriting queries that rely on DuckDB-specific SQL features into simpler ones if SQLite doesn’t support them (e.g., replace list aggregation or extension-only functions with custom logic, or drop them).

Code conversion examples:

 Example 1: Migrating a pandas groupby to DuckDB SQL.

  • Pandas code:

    result = df.groupby('category')['value'].sum().reset_index()

  • DuckDB code:

    result_df = con.execute("SELECT category, SUM(value) AS total FROM df GROUP BY category").df()

    (This assumes df is visible to DuckDB – either registered with con.register('df', df) or simply a pandas DataFrame in scope, which DuckDB picks up automatically via replacement scans.) The outcome is the same, but it can handle a much larger df, because df could even be a DuckDB-managed table or a file.

Example 2: Migrating a SQL query from SQLite to DuckDB.

  • SQLite query:

    SELECT strftime('%Y-%m', date) AS year_month, COUNT(*)
    FROM sales
    GROUP BY strftime('%Y-%m', date);

  • DuckDB equivalent: DuckDB has a built-in strftime, but the argument order is reversed relative to SQLite – strftime(date, format) rather than strftime(format, date). Alternatively, date_trunc('month', date) groups by calendar month:

    SELECT date_trunc('month', date) AS month_start, COUNT(*)
    FROM sales
    GROUP BY 1;

    which gives the first day of each month and the count. Or keep the formatted-string approach, using DuckDB's strftime with its argument order:

    SELECT strftime(date, '%Y-%m') AS year_month, COUNT(*) FROM sales GROUP BY year_month;

    (Another option is to select EXTRACT(YEAR FROM date) and EXTRACT(MONTH FROM date) and group on both.)
    The main migration difference: DuckDB and SQLite share many function names, but signatures and behavior differ slightly (DuckDB is closer to Postgres for most date functions).
    This example shows the general approach: verify that a function exists in DuckDB and, if needed, replace it with an equivalent.

Common pitfalls to avoid during migration:

  • Data type mismatches: For instance, migrating from Pandas, Pandas may have object dtype for strings, whereas DuckDB will treat them as VARCHAR. Usually fine, but if you had any special values (like Python None vs NaN distinction), note that DuckDB uses SQL NULL for missing. Be careful to handle missing data appropriately (DuckDB’s IS NULL or coalesce correspond to pandas isna() and fillna).

  • Floating point differences: Summing in DuckDB might produce slight float differences vs Pandas (maybe due to parallel reduction order, or 64-bit vs 128-bit intermediate). Usually negligible, but test if you rely on bit-exact floats.

  • Ordering: SQL queries have no guaranteed order without ORDER BY. In pandas, sort order might carry through. If you need a particular order, add ORDER BY in DuckDB or sort the resulting DataFrame.

  • Mutating operations: Pandas often relies on in-place updates or iterative logic. In DuckDB (as in SQL generally), you'd use an UPDATE (for in-place changes to a table) or simply re-select with newly calculated columns. When migrating such logic, it's often better to compute new columns in a SELECT rather than attempting iterative updates – e.g., a running total done with cumsum in pandas becomes a window function in DuckDB (see the sketch after this list).

  • Concurrent writes: If coming from something like Postgres where many clients wrote concurrently, remember DuckDB is not made for high-concurrency multi-user writing. In migration, if you need concurrency, perhaps stick to Postgres or consider splitting workloads (maybe each user gets their own DuckDB file).

  • Dependency on untranslatable libraries: If your current workflow does something highly specialized in Python (e.g., calls a complex scikit-learn function on every group), you might not translate that to pure DuckDB SQL. In such cases, maybe you don’t fully migrate – you use DuckDB to narrow down data or pre-aggregate, then still call that Python function on the smaller result. Migration doesn’t have to cover 100% if not feasible; use the right tool for each part.
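For the running-total example mentioned above, a minimal sketch (the column names are illustrative):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"day": [1, 2, 3], "amount": [10.0, 5.0, 7.5]})

    # pandas: in-place, step-by-step style
    df["running_total"] = df["amount"].cumsum()

    # DuckDB: the same logic expressed as a window function in a single SELECT
    result = duckdb.query("""
        SELECT day, amount,
               SUM(amount) OVER (ORDER BY day) AS running_total
        FROM df
        ORDER BY day
    """).df()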

By planning carefully and testing, migrating to DuckDB can significantly improve performance and simplify your stack. Conversely, migrating from DuckDB usually is straightforward because of its compatibility with standard SQL and ability to export data in common formats.

Resources and further reading

To deepen your understanding of DuckDB and get help or updates, here are some valuable resources:

Official resources

  • DuckDB official documentation: The primary documentation for DuckDB is available on the official website. It includes a thorough explanation of installation, usage, SQL reference, and guides. You can find it at https://duckdb.org/docs. This includes the “Getting Started” guide, which is great for beginners, and detailed sections on DuckDB’s SQL syntax and extensions.

  • GitHub repository: DuckDB’s source code and issue tracker are on GitHub at https://github.com/duckdb/duckdb. It’s worth checking out for the latest releases (see the Releases page), to report issues, or to see ongoing developments. The README also contains useful info and links.

  • PyPI page: The Python package info is on PyPI: https://pypi.org/project/duckdb. This is mainly for installation and seeing release history. You’ll use pip install duckdb to get it, but the PyPI page shows version numbers and release dates which can be useful to verify you have the latest version.

  • DuckDB installation guide: On the official site, there’s an installation page that covers various platforms (Python package, R package, command-line usage). It’s helpful if you encounter any installation snags or want alternative install methods (like conda, or installing specific versions).

  • DuckDB blog (Official): The DuckDB team maintains an official blog on their website with announcements and technical articles. For example, posts like “Announcing DuckDB 1.3.0” detail new features and improvements in each release. The blog is a great way to stay updated on major enhancements (like when they added the JSON extension or new indexing).

  • DuckDB FAQ: The DuckDB FAQ page (on the website) addresses common questions about the project’s goals, design choices, and usage (e.g., thread safety, storage format, etc.). It’s useful for understanding some conceptual aspects and limitations.

Community resources

  • DuckDB community Slack/Discord: DuckDB has an official Slack channel (DuckDB Labs Slack) where users and developers discuss issues and features. There’s also a community Discord server. You can find invite links on the DuckDB community page. Joining these is great for asking quick questions or seeing what others are doing with DuckDB.

  • Stack Overflow (DuckDB tag): There is an active [duckdb] tag on Stack Overflow for questions and answers. Before asking, you can search that tag – many common questions (installation problems, specific function usage, error messages) have been answered. If you run into a problem, posting on Stack Overflow with the duckdb tag can get you help from the community and sometimes the developers.

  • Reddit communities: The subreddit r/DuckDB is a small but growing community discussing DuckDB specifically. Additionally, related subreddits like r/dataengineering or r/datascience often have discussions and posts about DuckDB (for example, comparisons or success stories). Searching those subreddits for "DuckDB" can yield some insightful threads (e.g., “Why do people in data like DuckDB?” on r/dataengineering).

  • GitHub discussions: On the DuckDB GitHub repository, there might be a Discussions section enabled (if not, issues serve a similar role). This is another place to see community Q&A or feature ideas outside of formal issues.

  • MotherDuck discussions: MotherDuck (the cloud service based on DuckDB) has a blog and resources as well. They often post monthly updates called “This Month in DuckDB Ecosystem” which curate interesting tools, blog posts, and uses of DuckDB around the web. Following the MotherDuck blog or Twitter can indirectly keep you in the loop on community happenings.

  • YouTube channels and videos: While there isn’t an official DuckDB YouTube channel, many talks and conference presentations about DuckDB are on YouTube. For instance, the MLOps.community talk "Small Data, Big Impact: The Story Behind DuckDB" by Hannes Mühleisen, or a data engineering meetup talk. Searching YouTube for “DuckDB” yields talks that can be enlightening about use cases and technical underpinnings. These talks often cover how DuckDB works under the hood or showcase use cases in a visual form.

  • Podcasts: DuckDB has been discussed in various data engineering podcasts. For example, the Data Engineering Podcast and others have episodes on DuckDB or interviews with its creators. Listening to those can give context on design philosophy and where the project is headed.

Learning materials

  • DuckDB – the ultimate guide (Udemy course): There are at least a couple of online courses on Udemy related to DuckDB (as seen in search results). For instance, "DuckDB - The Ultimate Guide" is an Udemy course covering DuckDB usage, including projects (the search blurb mentioned 7 practice projects and covering DuckDB in Python, CLI, and Docker). Such a course can provide a structured learning path if you prefer video/tutorial format.

  • Free “DuckDB in action” book (Manning, offered via MotherDuck): As part of community resources, MotherDuck offers a free PDF of the upcoming Manning book DuckDB in Action. You can get it by signing up on the MotherDuck site. This book (by Needham, Hunger, Simons) covers basics to advanced topics in a tutorial style. The chapters cover DuckDB usage in pipelines, with examples, and even some integration with web (WASM, etc.). It’s a comprehensive way to learn from scratch.

  • Tutorials and blog posts: Many blog posts serve as tutorials. For example, “Using DuckDB in Python: A Comprehensive Guide” on Medium by Anshu Bantra walks through using DuckDB step by step. The official DuckDB blog's guides (like “efficient SQL on Pandas with DuckDB” or the data import guides) are excellent for learning specific features. Also, companies like MotherDuck often publish use-case driven articles (like those case studies we mentioned).

  • Interactive tutorials: While there isn’t an official interactive DuckDB tutorial site (like how some technologies have a "playground"), you can sort of use DuckDB’s in-browser demo (there’s a DuckDB Shell WASM demo at shell.duckdb.org) to try queries. Also, Jupyter notebooks with DuckDB are a great way to learn interactively – you can run SQL directly in a notebook using %sql magic if you install jupysql (formerly ipython-sql) and DuckDB, which some tutorials on duckdb.org demonstrate.

  • Courses/workshops from conferences: Check if recent data conferences (Strata, Data Council, etc.) had DuckDB workshops or tutorials. Sometimes the materials (notebooks, slides) are made available. For example, a PyData conference might have a talk or tutorial on DuckDB usage in Python pipelines; those materials can be valuable self-paced learning tools.

  • Code repositories with examples: The DuckDB GitHub might have some example scripts in the tools or test directories. Also, the community has created some example repos: e.g., duckdb-wasm demos on how to run DuckDB in a browser, or example data science projects that utilize DuckDB. Searching GitHub for “duckdb examples” could yield some user repositories that show real use.

  • Related reading: Since DuckDB often is positioned as “SQLite for analytics” and uses columnar execution, reading up on columnar databases or SQL optimization techniques could indirectly help. If you’re interested in internals, the DuckDB team has published academic papers (like their CIDR 2020 paper) which, while heavy on database theory, can deepen understanding of why DuckDB is designed a certain way.

By leveraging these resources, you can get quick help when needed, stay updated on DuckDB’s improvements, and learn best practices to fully utilize the library. The combination of official docs for reference, community forums for Q&A, and tutorials or courses for structured learning provides a well-rounded support system as you work with DuckDB.

FAQs about DuckDB library in Python

Finally, to address common questions, below is an extensive FAQ section. Each question is answered concisely (within 2-3 sentences) to provide quick clarity on frequently searched topics. The FAQs are grouped by category as specified: Installation, Basic Usage, Features, Troubleshooting, Performance, Integration, Best Practices, and Comparisons. This can serve as a quick reference or “cheat sheet” for DuckDB users.

Installation and Setup

  1. Q: How do I install the DuckDB library in Python?
    A: Install DuckDB by running pip install duckdb. This downloads the pre-compiled Python package from PyPI. Alternatively, you can use conda with conda install -c conda-forge python-duckdb.

  2. Q: How to install DuckDB on Windows?
    A: On Windows, use pip in Command Prompt or PowerShell: pip install duckdb. Make sure you have Python 3.9+ and an updated pip; DuckDB provides Windows wheels, so no compiler is needed.

  3. Q: How can I install DuckDB on macOS (Intel or M1)?
    A: Use pip install duckdb on macOS. DuckDB provides universal wheels that work on both Intel and Apple Silicon Macs, so the installation will automatically get the correct binary.

  4. Q: Is DuckDB available via Anaconda/conda?
    A: Yes, you can install DuckDB using conda. Use the command conda install -c conda-forge python-duckdb, which will install the DuckDB Python package from the conda-forge channel.

  5. Q: What Python versions are supported by DuckDB?
    A: DuckDB requires Python 3.9 or newer (as of latest releases). It supports Python 3.9, 3.10, and 3.11 (and likely above), on major OS platforms.

  6. Q: Do I need to install any database server or dependencies for DuckDB?
    A: No, DuckDB is self-contained. Installing the Python package is enough – it includes the DuckDB engine as a library. There’s no separate server or runtime to install, and no external dependencies needed.

  7. Q: How do I upgrade DuckDB to the latest version?
    A: Upgrade using pip: pip install --upgrade duckdb. This fetches the newest version from PyPI. Ensure to restart your Python session to use the updated library.

  8. Q: Can I install DuckDB in a virtual environment?
    A: Yes, DuckDB works fine in virtual environments. Activate your venv or Conda env and run the pip install within it. The library is just a Python package, so it adheres to virtual env boundaries.

  9. Q: How to install DuckDB in Jupyter Notebook?
    A: In a Jupyter cell, you can install via pip by prefixing with !: e.g. !pip install duckdb. Alternatively, install DuckDB in the environment that the Jupyter kernel is using. After installation, import duckdb will work inside notebooks.

  10. Q: Does DuckDB work on Raspberry Pi or ARM devices?
    A: Yes, DuckDB provides wheels for ARM (such as aarch64 architecture). On Raspberry Pi (ARM32/64), you may need to compile from source if a wheel isn’t available. For modern ARM64 systems like M1 Mac or certain Pi OS 64-bit, pip should fetch a compatible wheel.

  11. Q: How do I compile DuckDB from source for Python?
    A: Generally not needed, but you can compile by cloning the DuckDB repo and using pip install . in the tools/pythonpkg directory. Ensure you have a C++17 compiler and CMake. Most users use pre-built binaries, so source build is optional.

  12. Q: Is there a Docker image with DuckDB pre-installed?
    A: There’s no official DuckDB Docker image (since it’s just a library), but you can easily create one. Use a base like python:3.10-slim and RUN pip install duckdb. This way any container has DuckDB ready to use.

  13. Q: Can I use DuckDB in PyPy or other Python implementations?
    A: DuckDB’s Python package is a CPython extension module, so it is primarily for CPython. It may not work on PyPy or Jython since those would need a compatible C extension interface. It’s tested on CPython (standard Python) only.

  14. Q: How to install a specific version of DuckDB?
    A: You can specify a version with pip: for example, pip install duckdb==0.7.1 (replace with desired version). This will install that version if available. You might pin a version if needed for compatibility.

  15. Q: The pip install fails on Linux with a glibc error – what to do?
    A: DuckDB’s Linux wheels require glibc 2.28 or newer (the manylinux_2_28 baseline). If your system is very old (e.g., CentOS 7), pip may fall back to compiling from source and fail. Solutions: upgrade pip and wheel, move to a newer OS, or use conda (conda-forge often has builds for older systems). Alternatively, build from source on that system.

  16. Q: How do I install DuckDB’s development (nightly) version?
    A: Pre-release (development) wheels are published to PyPI and can typically be installed with pip install --pre --upgrade duckdb; you can also compile from the latest source on GitHub. Typically, you stick to stable releases unless you need a cutting-edge fix.

  17. Q: Can I use DuckDB with AWS Lambda or cloud functions?
    A: Yes, DuckDB’s wheel can be included in a Lambda deployment (as part of your package or a layer) since it’s pure C++ and doesn’t depend on system libs beyond standard. Ensure you include the .so/.dll in the deployment package. Many have successfully used DuckDB in Lambda for serverless analytics.

  18. Q: Is DuckDB available for R or other languages?
    A: Outside Python, DuckDB has clients for R, Julia, C++, Java, etc. For Python specifically, just install via pip. If you use R, you’d use install.packages("duckdb"). Each language has its own package but they all use the same DuckDB core.

  19. Q: How to check if DuckDB is installed correctly?
    A: After installation, run import duckdb in Python. Then try duckdb.__version__ to print the version. You can also initiate a simple query: duckdb.query("SELECT 42").fetchall() – if that returns [(42,)], installation is successful.

  20. Q: Does DuckDB require admin rights to install?
    A: No, not if you use standard pip in a user environment. If you face permission issues on system Python, use pip install --user duckdb or install in a virtual environment. DuckDB itself doesn’t require special privileges to run.

  21. Q: How to install DuckDB in PyCharm?
    A: In PyCharm, go to Settings -> Python Interpreter, then add duckdb via the package manager. Or simply open PyCharm’s terminal with the project interpreter and run pip install. It’s the same pip process, PyCharm will reflect it in the interpreter’s installed packages.

  22. Q: Can I run DuckDB in Google Colab?
    A: Yes, Colab supports installing new packages. Just do !pip install duckdb in a cell. After that, import duckdb works. You might need to restart the kernel after install (Colab sometimes requires it, but often it works immediately).

  23. Q: What’s the size of the DuckDB library?
    A: The wheel is relatively small (a few MB). On installation, it might be around 10-15 MB on disk. DuckDB is lightweight compared to big DBMSs; it doesn’t pull hefty dependencies. This small size makes installation quick.

  24. Q: Does installing DuckDB also install SQLite or something?
    A: No, DuckDB is a standalone engine. It doesn’t bundle SQLite or any external DB. It’s just the DuckDB engine (though its philosophy is akin to SQLite for analytics). Installing duckdb won’t conflict with SQLite; they’re separate.

  25. Q: Is DuckDB installation different for JupyterLab vs Notebook?
    A: No, it’s the same. Just ensure you install it in the environment that JupyterLab uses. For JupyterLab’s integrated pip, do !pip install duckdb in a cell or use the terminal.

  26. Q: How do I set up DuckDB CLI or shell?
    A: The DuckDB CLI shell is distributed as a standalone binary that you can download from duckdb.org; depending on the version, the Python package may also place a duckdb command in your Python scripts directory. From Python itself you don’t need the CLI: call duckdb.connect() and run queries interactively in a REPL or notebook. The CLI and the Python package can open the same database file, which is handy for inspection.

  27. Q: Any special setup for using DuckDB on a cluster or remote server?
    A: No special setup, as long as you have Python and can pip install. If internet access is restricted, you might need to download the wheel manually and install offline. But DuckDB doesn’t require a running service – it runs in your script or application’s process. So on a cluster, just ensure each node has DuckDB installed or included in the environment.

  28. Q: Can I install DuckDB without internet (offline)?
    A: Yes, download the appropriate wheel from PyPI (for your OS and Python version) and then use pip install <path_to_wheel>. This allows offline installation. You could also vendor the wheel file in your application’s installer if needed.

  29. Q: How to install DuckDB extensions like httpfs or json?
    A: These extensions are included with DuckDB; you don’t install them via pip separately. You INSTALL and LOAD extensions at runtime via SQL (e.g., con.execute("INSTALL 'httpfs'")). So, after installing DuckDB itself, you use DuckDB’s INSTALL command for extensions, which downloads extension binaries on first use.
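    For example, a minimal sketch of installing and loading the httpfs extension from Python (the remote URL is a placeholder):

     import duckdb

     con = duckdb.connect()
     con.install_extension("httpfs")   # downloads the extension binary on first use
     con.load_extension("httpfs")
     # Remote files can now be queried over HTTP(S):
     con.execute("SELECT COUNT(*) FROM 'https://example.com/data.parquet'").fetchall()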

  30. Q: After installation, how do I import and start using DuckDB in Python?
    A: Simply do import duckdb. Then you can create a connection with duckdb.connect() and execute queries. For example:

    import duckdb
    conn = duckdb.connect(database=':memory:')
    result = conn.execute("SELECT 'Hello, DuckDB!'").fetchall()
    print(result)

    If that prints [('Hello, DuckDB!',)], everything is set up correctly.

Basic usage and syntax

  1. Q: How do I create a database connection in DuckDB?
    A: Use duckdb.connect(). For an in-memory database, pass ':memory:' or no argument (which defaults to in-memory). If you want a persistent database file, provide a filename like duckdb.connect('mydb.duckdb').

  2. Q: How do I execute an SQL query with DuckDB in Python?
    A: First get a connection (e.g. con = duckdb.connect()). Then use con.execute("SQL QUERY"). You can chain .fetchall(), .fetchdf(), or other fetch methods to retrieve results. For example: con.execute("SELECT 1+2").fetchall() returns [(3,)].

  3. Q: How do I fetch query results as a pandas DataFrame?
    A: After executing a query, call .df() on the result: df = con.execute("SELECT * FROM mytable").df(). This converts the result set to a pandas DataFrame. Alternatively, .fetchdf() does the same thing.

  4. Q: Can I parameterize queries in DuckDB (prepared statements)?
    A: Yes. Use ? placeholders and provide a list of parameters in execute(). For example: con.execute("SELECT * FROM table WHERE id = ?", [42]). DuckDB will bind the parameter safely (avoiding SQL injection and handling types).

  5. Q: How do I create a table in DuckDB?
    A: Use standard SQL DDL. For example: con.execute("CREATE TABLE people (name VARCHAR, age INT)"). This defines a table in the connected database. You can then INSERT into it or use it in queries.

  6. Q: How can I insert data into a DuckDB table?
    A: You can use SQL INSERT statements or leverage DuckDB’s ability to insert from pandas. For example:

    con.execute("INSERT INTO people VALUES (?, ?)", ["Alice", 30])

    or multiple inserts via executemany. If you already have a pandas DataFrame df, you can bulk-load it with con.append('people', df) (the table must already exist) or with con.execute("INSERT INTO people SELECT * FROM df").

  7. Q: How do I query a pandas DataFrame directly with DuckDB?
    A: DuckDB can treat DataFrames as tables. Simply refer to the DataFrame variable name in the SQL. For example, if you have df = pandas.DataFrame(...), you can do:

    duckdb.query("SELECT count(*) FROM df").fetchall()

    or register it: con.register('mydf', df) then query mydf. DuckDB will scan the DataFrame in memory.

  8. Q: How do I join two tables in DuckDB?
    A: Use SQL JOIN syntax:

    SELECT * FROM table1 AS t1
    JOIN table2 AS t2
    ON t1.id = t2.id;

    DuckDB supports inner, left, right, and full joins, as well as cross joins. Just like any SQL, ensure to specify the ON condition for non-cross joins.

  9. Q: How do I add a new column or compute a derived column in a query?
    A: Use a SELECT expression. For example:

    SELECT *, price * quantity AS total_price
    FROM sales;

    This will produce all original columns plus a new computed column. If you want to alter a table to add a column (empty), you can do ALTER TABLE table ADD COLUMN newcol TYPE.

  10. Q: What data types does DuckDB support?
    A: DuckDB supports typical SQL types: BOOLEAN, TINYINT/SMALLINT/INTEGER/BIGINT (for various integer sizes), REAL/DOUBLE (floats), DECIMAL, VARCHAR (strings), BLOB, DATE, TIME, TIMESTAMP, and more. It also has structural types like MAP, LIST, and union types, and it can handle UUID, huge integers, etc. It’s quite rich, but if you stick to standard types (int, float, text, date) you’re covered.

  11. Q: How do I use dates and timestamps in DuckDB?
    A: You can create DATE or TIMESTAMP columns. Insert as ISO strings or as Python datetime objects via parameter binding (DuckDB will convert). In queries, you can compare dates, extract parts (e.g., EXTRACT(YEAR FROM date_col)), and format them. DuckDB uses DATE 'YYYY-MM-DD' literal for date and TIMESTAMP 'YYYY-MM-DD hh:mm:ss' for timestamp if writing SQL directly.
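    A small sketch of these patterns (the table and column names are illustrative):

     import duckdb
     from datetime import datetime

     con = duckdb.connect()
     con.execute("CREATE TABLE events (id INTEGER, happened_at TIMESTAMP)")
     # A bound Python datetime is converted to a TIMESTAMP value
     con.execute("INSERT INTO events VALUES (?, ?)", [1, datetime(2024, 3, 15, 10, 30)])
     con.execute("""
         SELECT EXTRACT(YEAR FROM happened_at) AS yr,
                date_trunc('month', happened_at) AS month_start
         FROM events
         WHERE happened_at >= DATE '2024-01-01'
     """).fetchall()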

  12. Q: How do I handle NULL or missing values in DuckDB?
    A: DuckDB uses SQL NULL for missing data. You can filter them with WHERE col IS NULL or IS NOT NULL. Use functions like coalesce(col, fallback) to replace NULLs or CASE expressions. When fetching into pandas, NULLs become NaN or None depending on dtype.

  13. Q: How to write results of a query to a table (make it persistent)?
    A: Use CREATE TABLE AS or INSERT. For example:

    CREATE TABLE newtable AS
    SELECT col1, SUM(col2) AS sum2
    FROM oldtable GROUP BY col1;

    This executes the query and stores the result in a new table. Or if table exists: INSERT INTO existing SELECT ....

  14. Q: Does DuckDB support aggregate functions like SUM, AVG, etc.?
    A: Yes. It supports SUM, AVG, MIN, MAX, COUNT, as well as more advanced ones like STDDEV, approx_count_distinct, etc. You use them in SELECT with GROUP BY as usual. Windowed aggregates are also supported.

  15. Q: How do I use a SQL function or UDF in DuckDB?
    A: DuckDB comes with many built-in functions (math, string, date, etc.). Use them by name, e.g. UPPER(name) or substring(text, 1, 3). For custom Python UDF, define it with con.create_function('name', python_function) then call name() in SQL. SQL macros can be created with CREATE MACRO.
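    A minimal sketch of a Python UDF, assuming a DuckDB version that provides create_function (0.8+); the function name and types are illustrative:

     import duckdb
     from duckdb.typing import VARCHAR

     def shout(s: str) -> str:
         # plain Python logic exposed to SQL
         return s.upper() + "!"

     con = duckdb.connect()
     con.create_function("shout", shout, [VARCHAR], VARCHAR)
     print(con.execute("SELECT shout('hello')").fetchall())  # [('HELLO!',)]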

  16. Q: How can I export query results to a CSV or Parquet file?
    A: Use the COPY command. For example:

    COPY (SELECT * FROM table) TO 'output.csv' (HEADER, DELIMITER ',');

    or for Parquet:

    COPY table TO 'output.parquet' (FORMAT PARQUET);

    This will write the table or query result to a file. You can also do it in Python by fetching and using pandas to_csv, but COPY is efficient and direct.

  17. Q: How do I load a CSV file into DuckDB?
    A: Easiest: directly query it with SELECT (DuckDB treats 'file.csv' as a table). But to load into a table, you can do:

    CREATE TABLE mytable AS SELECT * FROM 'data.csv';

    DuckDB will auto-detect the CSV schema and import it. Or, after creating an empty table, use COPY mytable FROM 'file.csv' (AUTO_DETECT TRUE);.

  18. Q: Can I run multiple SQL commands in one execute call?
    A: Treat execute() as running one statement at a time; don’t rely on a single string containing multiple semicolon-separated statements. For multiple commands, make sequential calls, e.g., con.execute("CREATE ..."); con.execute("INSERT ...");.

  19. Q: How do I commit or rollback transactions in DuckDB?
    A: DuckDB auto-commits each query by default (except if you explicitly start a transaction). If you want manual control, do:

     BEGIN TRANSACTION;
     -- some SQL statements
     COMMIT;

    Use ROLLBACK; to abort. In Python, you can simply call con.commit() or con.rollback() as well. But unless you begin, each execute is committed automatically.

  20. Q: How can I see what tables or schema I have in DuckDB?
    A: You can query the information_schema or use built-in commands. For example, con.execute("SHOW TABLES").fetchall() will list tables in the current schema. DESCRIBE tableName shows columns. DuckDB also supports pragma show_tables; etc. to inspect the schema.

  21. Q: Can DuckDB run parameterized queries or prepared statements?
    A: Yes, parameterization we covered with ? placeholders. DuckDB’s Python API doesn’t explicitly expose a persistent prepared statement object (each execute can reuse parameter binding internally though). But if you want to repeatedly run similar queries, parameterization improves safety and can be more efficient.

  22. Q: How do I close a DuckDB connection?
    A: Use con.close(). DuckDB will flush any pending writes and release resources. It’s good practice to close when done (especially if you had a file-based DB, closing ensures file locks are released).

  23. Q: What is the default storage of DuckDB if I don’t specify a file?
    A: If you don’t specify, duckdb.connect() uses an in-memory database (data not saved to disk). If you want persistence, specify a filename for the database. If you want to explicitly enforce ephemeral usage, use :memory: as the database name.

  24. Q: Can I use DuckDB in a multi-threaded Python program?
    A: Yes. DuckDB releases the GIL when executing queries, so other Python threads can run. You can share a connection across threads if using .cursor() to get thread-local objects. It’s often simpler to use separate connections per thread (you can connect multiple times to the same database file). Ensure thread safety by not issuing conflicting writes simultaneously.
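    A minimal sketch of the per-thread cursor pattern (the workload is illustrative):

     import duckdb
     import threading

     con = duckdb.connect()  # in-memory database shared by all threads
     con.execute("CREATE TABLE IF NOT EXISTS numbers (x INTEGER)")

     def worker(n: int) -> None:
         local = con.cursor()   # thread-local handle to the same database
         local.execute("INSERT INTO numbers VALUES (?)", [n])
         local.close()

     threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
     for t in threads:
         t.start()
     for t in threads:
         t.join()
     print(con.execute("SELECT COUNT(*) FROM numbers").fetchall())  # [(4,)]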

  25. Q: How do I register a DataFrame or relational API usage?
    A: Use con.register('name', dataframe) to make a pandas DataFrame queryable as a table. Or use DuckDB’s Relation API: e.g. rel = duckdb.from_df(df).filter("x > 5").aggregate("sum(y)"). This provides a method-chaining way to build queries in Python instead of SQL strings.

  26. Q: Does DuckDB support user-defined SQL functions?
    A: You can create SQL macros (which are basically alias for expressions) using CREATE MACRO, which is like a simple user-defined function (but it inlines code). Proper persistent UDFs in SQL are not as fully supported as macros. However, you can register Python UDFs for use within that session’s queries.

  27. Q: Can I call DuckDB from SQLAlchemy or an ORM?
    A: Yes, there's a duckdb-engine for SQLAlchemy. You can connect using connection strings and use DuckDB with SQLAlchemy as you would Postgres or SQLite. This allows integration with ORMs that support SQLAlchemy dialects (like you could use it in a Flask app with SQLAlchemy, though not all ORM features may apply cleanly in analytics context).

  28. Q: How do I get the number of rows affected by a query?
    A: DuckDB reports the number of affected rows as the result of a DML statement, so con.execute("DELETE FROM table WHERE x < 0").fetchall() typically returns a single row containing that count. The DB-API rowcount attribute may not be populated for every statement type, so reading the returned count is the more portable approach.

  29. Q: What is the syntax for filtering and sorting in DuckDB?
    A: Filtering uses WHERE clause: e.g., SELECT * FROM table WHERE age > 30. Sorting uses ORDER BY: e.g., SELECT * FROM table ORDER BY name DESC. These work exactly like standard SQL.

  30. Q: How do I use aggregate with GROUP BY in DuckDB?
    A: Same as any SQL: SELECT key, SUM(val) FROM table GROUP BY key. DuckDB supports multiple keys grouping, HAVING clause for filtering groups, etc. This is fundamental SQL and DuckDB behaves normally in that regard.
