Using DuckDB to query beneficial ownership data in Parquet files
This notebook is to demonstrate how to use DuckDB to query beneficial ownership data published in line with version 0.2 of the Beneficial Ownership Data Standard developed by Open Ownership and Open Data Services. This is made possible thanks to DuckDB's features which allow direct querying of Parquet and csv files. Open Ownership's data analysis tools help anyone use public beneficial ownership data published in line with the Beneficial Ownership Data Standard by republishing that data in formats including Parquet, csv, SQLite and JSON.
Installation
To install DuckDB, you need to run `pip install duckdb` on your computer or in a notebook. Some environments have it set up already.
Next you'll need to download the Parquet files from our data analysis tools and unzip them. For this example, we will be using data from Slovakia's Public Sector Partners Register. The tools make use of Flatterer to extract and flatten the JSON files for use.
Create DuckDB connection. This will by default use an in memory version of DuckDB. This is fine as we are directly querying the Parquet files.
Sampling
DuckDB has very good SQL support and is able to do things like random sampling of the data. Here are examples of querying the main BODS tables, getting a random sample of 5 rows for each:
DuckDB supports different methods to sample data which is useful for analysing large datasets.
Querying
DuckDB supports SQL syntax very similar to Postgres and has similar (some more and some less) features. You can query the Parquet files directly by specifying the file name. This query shows each person statement that has some direct beneficial ownership interest in an entity:
Saving and using results
To save to a Pandas 'dataframe' it is as simple as:
However, a dataframe needs to exist in memory and if you are working with larger datasets this may not be possible. It is also possible to copy the results to a new Parquet file that you can then work with. This creates a new Parquet file `all_person_names.parquet` which aggregates all person names and comma separates them if one person has multiple names.
You can then use the results using this new file. The below query is the same as the one above but also joins to the names table getting out the names for the people in the beneficial ownership relationship: