Introduction

Benchmark

Histogram of individuals belonging to more than one group

Example of a dataset

Parallel coordinates showcasing the variability of the group properties in the simulated dataset

Distribution of the group properties

Predictive performance of models on the dataset

Capacity to follow the correlation matrix

Simulated data for real-world datasets

Adult dataset

Original data

from ucimlrepo import fetch_ucirepo

adult = fetch_ucirepo(id=2)
df1 = adult['data']['original']
df1.drop(columns=['fnlwgt'], inplace=True)
df1

Simulated data

schema, corr_matrix = generate_schema_from_dataframe(
    df1,
    protected_columns=['race', 'sex'],
    outcome_column='income',
    n_samples=50,
)

data = generate_data(
    correlation_matrix=corr_matrix,
    data_schema=schema,
    prop_protected_attr=0.4,
    nb_groups=10,
    max_group_size=400,
    categorical_outcome=True,
    use_cache=False,
    corr_matrix_randomness=0.0,
)
print(f"Generated {len(data.dataframe)} samples in {data.nb_groups} groups")
print(f"Collisions: {data.collisions}")
df2 = decode_dataframe(data.dataframe, schema)
df2

Analysis

import matplotlib.pyplot as plt

fig = plot_distributions_comparison(df1, df2, figsize=(30, 50))
plt.show()
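The visual comparison above can be complemented with a per-column number. The following is a minimal sketch, not part of the notebook's own tooling: it computes the total variation distance between the category frequencies of an original and a simulated column. The helper names `total_variation_distance` and `compare_categorical_columns` and the toy frames are hypothetical; in the notebook the inputs would presumably be `df1` and `df2`. The measure is only meaningful for categorical columns, so continuous columns would need to be binned first.

```python
import pandas as pd


def total_variation_distance(original: pd.Series, simulated: pd.Series) -> float:
    """Half the L1 distance between two empirical category distributions."""
    p = original.value_counts(normalize=True)
    q = simulated.value_counts(normalize=True)
    # Align on the union of categories; a category absent on one side gets probability 0.
    p, q = p.align(q, fill_value=0.0)
    return 0.5 * float((p - q).abs().sum())


def compare_categorical_columns(df_original: pd.DataFrame,
                                df_simulated: pd.DataFrame) -> pd.Series:
    """Distance per shared column, largest mismatch first (hypothetical helper)."""
    shared = [c for c in df_original.columns if c in df_simulated.columns]
    distances = {
        c: total_variation_distance(df_original[c], df_simulated[c]) for c in shared
    }
    return pd.Series(distances, name="tvd").sort_values(ascending=False)


# Toy stand-ins for the original and simulated frames (made-up values).
toy_original = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})
toy_simulated = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female"],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

tvd = compare_categorical_columns(toy_original, toy_simulated)
print(tvd)
```

A value of 0 means the two columns have identical category frequencies and 1 means they share no categories, so sorting in descending order surfaces the columns where the simulation drifts most.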

plot_correlation_matrices(corr_matrix, data)

German credit data

Original data

statlog_german_credit_data = fetch_ucirepo(id=144)
df1 = statlog_german_credit_data['data']['original']
df1

schema, corr_matrix = generate_schema_from_dataframe(
    df1,
    protected_columns=['Attribute8', 'Attribute12'],
    outcome_column='Attribute20',
    n_samples=50,
)

Generated dataset

data = generate_data(
    correlation_matrix=corr_matrix,
    data_schema=schema,
    nb_groups=10,
    max_group_size=400,
    categorical_outcome=True,
    use_cache=False,
    corr_matrix_randomness=0.0,
)
print(f"Generated {len(data.dataframe)} samples in {data.nb_groups} groups")
print(f"Collisions: {data.collisions}")
df2 = decode_dataframe(data.dataframe, schema)
df2

Analysis

fig = plot_distributions_comparison(df1, df2, figsize=(30, 50))
plt.show()

plot_correlation_matrices(corr_matrix, data)
plt.show()
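One way to put a number on how closely generated data follows a target correlation matrix is to compare the target with the empirical Pearson correlations of the numerically encoded samples. The sketch below is an illustration under stated assumptions rather than the notebook's method: `correlation_gap` is a hypothetical helper, the toy target matrix and Gaussian samples are made up, and it assumes the rows and columns of the target matrix are in the same order as the dataframe columns. In the notebook the inputs would presumably be `corr_matrix` and `data.dataframe`, provided that ordering holds.

```python
import numpy as np
import pandas as pd


def correlation_gap(target_corr, encoded_df: pd.DataFrame) -> dict:
    """Absolute error between a target and an empirical correlation matrix.

    Assumes target_corr rows/columns follow the column order of encoded_df.
    """
    target = np.asarray(target_corr, dtype=float)
    empirical = encoded_df.corr().to_numpy()  # Pearson correlation by default
    if target.shape != empirical.shape:
        raise ValueError("target matrix and dataframe have different dimensions")
    # Compare each pair once and skip the diagonal, which is 1 by construction.
    upper = np.triu_indices_from(target, k=1)
    diff = np.abs(empirical[upper] - target[upper])
    return {
        "mean_abs_error": float(diff.mean()),
        "max_abs_error": float(diff.max()),
    }


# Toy check: draw Gaussian samples from a known correlation matrix (made-up values).
rng = np.random.default_rng(0)
target = np.array([
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.3],
    [0.0, 0.3, 1.0],
])
samples = rng.multivariate_normal(np.zeros(3), target, size=5000)
toy_df = pd.DataFrame(samples, columns=["a", "b", "c"])

gap = correlation_gap(target, toy_df)
print(gap)
```

Note that a constant column produces NaN correlations, so such columns would need to be dropped from both inputs before calling the helper.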

Experiment results
