Using DuckDB to query beneficial ownership data in Parquet files

This notebook is to demonstrate how to use DuckDB to query beneficial ownership data published in line with version 0.2 of the Beneficial Ownership Data Standard developed by Open Ownership and Open Data Services. This is made possible thanks to DuckDB's features which allow direct querying of Parquet and csv files. Open Ownership's data analysis tools help anyone use public beneficial ownership data published in line with the Beneficial Ownership Data Standard by republishing that data in formats including Parquet, csv, SQLite and JSON.

Installation

To install DuckDB, you need to run `pip install duckdb` on your computer or in a notebook. Some environments have it set up already.

Next you'll need to download the Parquet files from our data analysis tools and unzip them. For this example, we will be using data from Slovakia's Public Sector Partners Register. The tools make use of Flatterer to extract and flatten the JSON files for use.

!wget https://s3.eu-west-1.amazonaws.com/oo-bodsdata/data/slovakia/parquet.zip !unzip -o parquet.zip

Create DuckDB connection. This will by default use an in memory version of DuckDB. This is fine as we are directly querying the Parquet files.

import duckdb conn = duckdb.connect()

Sampling

DuckDB has very good SQL support and is able to do things like random sampling of the data. Here are examples of querying the main BODS tables, getting a random sample of 5 rows for each:

conn.execute('''select * from 'person_statement.parquet' USING SAMPLE 5 ''').fetchdf()

conn.execute('''select * from 'entity_statement.parquet' USING SAMPLE 5 ''').fetchdf()

conn.execute('''select * from 'ooc_statement.parquet' USING SAMPLE 5 ''').fetchdf()

DuckDB supports different methods to sample data which is useful for analysing large datasets.

Querying

DuckDB supports SQL syntax very similar to Postgres and has similar (some more and some less) features. You can query the Parquet files directly by specifying the file name. This query shows each person statement that has some direct beneficial ownership interest in an entity:

conn.execute(''' SELECT person.statementID as person_id, person.birthdate, entity.statementID as entity_id, entity.name as entity_name, ooc.statementID as ooc_id, FROM 'ooc_statement.parquet' ooc JOIN 'person_statement.parquet' person on person.statementID = ooc.interestedParty_describedByPersonStatement JOIN 'entity_statement.parquet' entity on entity.statementID = ooc.subject_describedByEntityStatement USING SAMPLE 5 ''').fetchdf()

Saving and using results

To save to a Pandas 'dataframe' it is as simple as:

my_dataframe = conn.execute('''select * from 'ooc_statement.parquet' USING SAMPLE 5 ''').fetchdf()

However, a dataframe needs to exist in memory and if you are working with larger datasets this may not be possible. It is also possible to copy the results to a new Parquet file that you can then work with. This creates a new Parquet file `all_person_names.parquet` which aggregates all person names and comma separates them if one person has multiple names.

conn.execute('''copy (select _link_person_statement, string_agg(fullName, ', ') all_names FROM 'person_names.parquet' GROUP BY _link_person_statement) to 'all_person_names.parquet' ''').fetchdf()

You can then use the results using this new file. The below query is the same as the one above but also joins to the names table getting out the names for the people in the beneficial ownership relationship:

conn.execute(''' SELECT person.statementID as person_id, person.birthdate, names.all_names, entity.statementID as entity_id, entity.name as entity_name, ooc.statementID as ooc_id, FROM 'ooc_statement.parquet' ooc JOIN 'person_statement.parquet' person on person.statementID = ooc.interestedParty_describedByPersonStatement JOIN 'entity_statement.parquet' entity on entity.statementID = ooc.subject_describedByEntityStatement JOIN 'all_person_names.parquet' names on _link_person_statement = person._link USING SAMPLE 5 ''').fetchdf()

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Using DuckDB to query beneficial ownership data in Parquet files

Installation

Sampling

Querying

Saving and using results

Using DuckDB to query beneficial ownership data in Parquet files