Using DuckDB to query beneficial ownership data in Parquet files
This notebook is to demonstrate how to use DuckDB to query beneficial ownership data published in line with version 0.2 of the Beneficial Ownership Data Standard developed by Open Ownership and Open Data Services. This is made possible thanks to DuckDB's features which allow direct querying of Parquet and csv files. Open Ownership's data analysis tools help anyone use public beneficial ownership data published in line with the Beneficial Ownership Data Standard by republishing that data in formats including Parquet, csv, SQLite and JSON.
To install DuckDB, you need to run `pip install duckdb` on your computer or in a notebook. Some environments have it set up already.
--2023-05-31 12:15:17-- https://s3.eu-west-1.amazonaws.com/oo-bodsdata/data/slovakia/parquet.zip Resolving s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)... 188.8.131.52, 184.108.40.206, 220.127.116.11, ... Connecting to s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)|18.104.22.168|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 7297193 (7.0M) [application/zip] Saving to: ‘parquet.zip.2’ parquet.zip.2 100%[===================>] 6.96M 11.5MB/s in 0.6s 2023-05-31 12:15:18 (11.5 MB/s) - ‘parquet.zip.2’ saved [7297193/7297193] Archive: parquet.zip inflating: person_statement.parquet inflating: person_names.parquet inflating: person_identifiers.parquet inflating: person_nationalities.parquet inflating: person_addresses.parquet inflating: entity_statement.parquet inflating: entity_identifiers.parquet inflating: entity_addresses.parquet inflating: ooc_statement.parquet
DuckDB supports different methods to sample data which is useful for analysing large datasets.
DuckDB supports SQL syntax very similar to Postgres and has similar (some more and some less) features. You can query the Parquet files directly by specifying the file name. This query shows each person statement that has some direct beneficial ownership interest in an entity:
Saving and using results
To save to a Pandas 'dataframe' it is as simple as:
However, a dataframe needs to exist in memory and if you are working with larger datasets this may not be possible. It is also possible to copy the results to a new Parquet file that you can then work with. This creates a new Parquet file `all_person_names.parquet` which aggregates all person names and comma separates them if one person has multiple names.
You can then use the results using this new file. The below query is the same as the one above but also joins to the names table getting out the names for the people in the beneficial ownership relationship: