Using DuckDB to query beneficial ownership data in Parquet files
This notebook is to demonstrate how to use DuckDB to query beneficial ownership data published in line with version 0.2 of the Beneficial Ownership Data Standard developed by Open Ownership and Open Data Services. This is made possible thanks to DuckDB's features which allow direct querying of Parquet and csv files. Open Ownership's data analysis tools help anyone use public beneficial ownership data published in line with the Beneficial Ownership Data Standard by republishing that data in formats including Parquet, csv, SQLite and JSON.
Installation
To install DuckDB, you need to run `pip install duckdb` on your computer or in a notebook. Some environments have it set up already.
Next you'll need to download the Parquet files from our data analysis tools and unzip them. For this example, we will be using data from Slovakia's Public Sector Partners Register. The tools make use of Flatterer to extract and flatten the JSON files for use.
--2023-05-31 12:15:17-- https://s3.eu-west-1.amazonaws.com/oo-bodsdata/data/slovakia/parquet.zip
Resolving s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)... 52.218.1.171, 52.218.21.162, 52.218.46.48, ...
Connecting to s3.eu-west-1.amazonaws.com (s3.eu-west-1.amazonaws.com)|52.218.1.171|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7297193 (7.0M) [application/zip]
Saving to: ‘parquet.zip.2’
parquet.zip.2 100%[===================>] 6.96M 11.5MB/s in 0.6s
2023-05-31 12:15:18 (11.5 MB/s) - ‘parquet.zip.2’ saved [7297193/7297193]
Archive: parquet.zip
inflating: person_statement.parquet
inflating: person_names.parquet
inflating: person_identifiers.parquet
inflating: person_nationalities.parquet
inflating: person_addresses.parquet
inflating: entity_statement.parquet
inflating: entity_identifiers.parquet
inflating: entity_addresses.parquet
inflating: ooc_statement.parquet
Create DuckDB connection. This will by default use an in memory version of DuckDB. This is fine as we are directly querying the Parquet files.
Sampling
DuckDB has very good SQL support and is able to do things like random sampling of the data. Here are examples of querying the main BODS tables, getting a random sample of 5 rows for each:
0
26489
115642693838180910
1
23658
15965360602030856069
2
21441
5167444844300214596
3
31079
9183680011182080981
4
43784
15581682745602805039
0
2879
3182592789504659857
1
11969
4689106094742225149
2
5289
12595671253740199089
3
2307
4180267835029674540
4
2346
17351123701195437165
0
44773
7606769600253917900
1
18894
3086605360229111783
2
16176
7091464573727041665
3
35758
3915125806182030751
4
28167
4148840848045628261
DuckDB supports different methods to sample data which is useful for analysing large datasets.
Querying
DuckDB supports SQL syntax very similar to Postgres and has similar (some more and some less) features. You can query the Parquet files directly by specifying the file name. This query shows each person statement that has some direct beneficial ownership interest in an entity:
0
466752678574386055
nan
1
12103530539130429375
nan
2
4370500509583261651
1968-03-25
3
3711095586054358057
1992-08-04
4
14034295023186105036
1986-06-01
Saving and using results
To save to a Pandas 'dataframe' it is as simple as:
However, a dataframe needs to exist in memory and if you are working with larger datasets this may not be possible. It is also possible to copy the results to a new Parquet file that you can then work with. This creates a new Parquet file `all_person_names.parquet` which aggregates all person names and comma separates them if one person has multiple names.
0
50085
You can then use the results using this new file. The below query is the same as the one above but also joins to the names table getting out the names for the people in the beneficial ownership relationship:
0
17203585654442440550
nan
1
10826327996973194992
nan
2
7795889732628808593
nan
3
6935552980283077603
nan
4
13240835550346652139
1980-04-09