Introduction
Benchmark
Histogram of individuals belonging to more than one group
Example of a dataset
Parallel coordinates showcasing the variability of the groups' properties in the simulated dataset
Distribution of the group properties
Predictive performance of models on the dataset
Capacity to follow the correlation matrix
Simulated data for real world datasets
Adult dataset
Original data
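import matplotlib.pyplot as plt
from ucimlrepo import fetch_ucirepo

# Fetch the UCI Adult dataset (id=2) and keep the original, unsplit table.
adult = fetch_ucirepo(id=2)
# Drop the census sampling weight; drop() returns a copy, so the fetched object is left untouched.
df1 = adult['data']['original'].drop(columns=['fnlwgt'])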
df1
Simulated data
schema, corr_matrix = generate_schema_from_dataframe(
    df1,
    protected_columns=['race', 'sex'],
    outcome_column='income',
    n_samples=50)
data = generate_data(
correlation_matrix=corr_matrix,
data_schema=schema,
prop_protected_attr=0.4,
nb_groups=10,
max_group_size=400,
categorical_outcome=True,
use_cache=False,
corr_matrix_randomness=0.0)
print(f"Generated {len(data.dataframe)} samples in {data.nb_groups} groups")
print(f"Collisions: {data.collisions}")
df2 = decode_dataframe(data.dataframe, schema)
df2
Analysis
fig = plot_distributions_comparison(df1, df2, figsize=(30, 50))
plt.show()
plot_correlation_matrices(corr_matrix, data)
plt.show()
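The correlation plots give a visual check only. As a rough numeric complement, the sketch below computes the mean absolute difference between the off-diagonal Pearson correlations of two dataframes. The helper `correlation_gap` is hypothetical and not part of the generator's API, and the toy frames stand in for `df1` and `df2`; it assumes both frames share at least two numeric columns with non-constant values.

```python
import numpy as np
import pandas as pd


def correlation_gap(original: pd.DataFrame, simulated: pd.DataFrame) -> float:
    """Mean absolute difference between off-diagonal Pearson correlations.

    Hypothetical helper for illustration; 0.0 means identical correlation structure.
    """
    # Only numeric columns present in both frames can be compared.
    cols = [
        c for c in original.select_dtypes("number").columns
        if c in simulated.columns and pd.api.types.is_numeric_dtype(simulated[c])
    ]
    if len(cols) < 2:
        raise ValueError("need at least two shared numeric columns")
    diff = original[cols].corr().to_numpy() - simulated[cols].corr().to_numpy()
    # Ignore the diagonal, which is always 1 in both matrices.
    off_diag = ~np.eye(len(cols), dtype=bool)
    return float(np.abs(diff[off_diag]).mean())


# Toy data standing in for the original and simulated frames.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy_original = pd.DataFrame({"a": x, "b": x + rng.normal(scale=0.5, size=500)})
# Shuffling one column destroys its correlation with the other.
toy_shuffled = toy_original.assign(b=rng.permutation(toy_original["b"].to_numpy()))

gap_same = correlation_gap(toy_original, toy_original.copy())
gap_shuffled = correlation_gap(toy_original, toy_shuffled)
print(f"identical frames: {gap_same:.3f}, shuffled column: {gap_shuffled:.3f}")
```

A value near zero suggests the simulated data follows the target correlation structure; categorical columns would need to be encoded numerically before being included.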
German credit data
Original data
statlog_german_credit_data = fetch_ucirepo(id=144)
df1 = statlog_german_credit_data['data']['original']
df1
schema, corr_matrix = generate_schema_from_dataframe(
    df1,
    protected_columns=['Attribute8', 'Attribute12'],
    outcome_column='Attribute20',
    n_samples=50)
Simulated data
data = generate_data(
correlation_matrix=corr_matrix,
data_schema=schema,
nb_groups=10,
max_group_size=400,
categorical_outcome=True,
use_cache=False,
corr_matrix_randomness=0.0)
print(f"Generated {len(data.dataframe)} samples in {data.nb_groups} groups")
print(f"Collisions: {data.collisions}")
df2 = decode_dataframe(data.dataframe, schema)
df2
Analysis
fig = plot_distributions_comparison(df1, df2, figsize=(30, 50))
plt.show()
plot_correlation_matrices(corr_matrix, data)
plt.show()
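The distribution comparison plots can be complemented with a per-column score. The sketch below uses the total variation distance between the category frequencies of an original and a simulated column; this is a swapped-in, generic measure, not something the notebook's plotting functions compute. The helper `total_variation` and the toy series are hypothetical, and the measure is meant for categorical columns such as the protected attributes.

```python
import pandas as pd


def total_variation(original: pd.Series, simulated: pd.Series) -> float:
    """Total variation distance between two categorical distributions.

    Hypothetical helper for illustration; 0.0 means identical frequencies,
    1.0 means the two columns share no category.
    """
    p = original.value_counts(normalize=True)
    q = simulated.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side has frequency 0.
    p, q = p.align(q, fill_value=0.0)
    return float(0.5 * (p - q).abs().sum())


# Toy columns standing in for one column of the original and simulated frames.
toy_original = pd.Series(["M", "M", "F", "F"])
toy_simulated = pd.Series(["M", "M", "M", "F"])

tv = total_variation(toy_original, toy_simulated)
print(f"total variation distance: {tv:.2f}")  # 0.25 for these toy columns
```

Applied column by column to the original and decoded simulated frames, this gives one number per categorical attribute that can be tracked across generator settings.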