David Alonge

%run "./TESTED AND FUNCTIONAL CODE/INCLASS_STARTPY_SETTINGS.ipynb"

import matplotlib.pyplot as plt
import scipy.stats as stats
from scipy.stats import pearsonr

Understanding COMPAS Dataset

print('Read in compas_raw1-main file for ProPublica.')
cp = pd.read_csv("https://edd2.s3.amazonaws.com/spring2024/inclass_data/compas_raw1.csv")
cp.head()

COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions. It is a popular commercial algorithm used by judges and parole officers to score a criminal defendant's likelihood of reoffending (recidivism). Based on a two-year follow-up study (i.e., who actually committed crimes or violent crimes over the following two years), the algorithm has been shown to be biased in favor of white defendants and against Black defendants. The pattern of mistakes, as measured by precision and sensitivity, is notable.

Quality Assurance

def understand_columns(df):
    # Column index, name, and dtype
    df_dictionary = (
        pd.DataFrame(df.dtypes)
        .reset_index()
        .reset_index()
        .rename(columns={'level_0': 'COLIDX', 'index': 'COLNAME', 0: 'COL_DTYPE'})
    )
    df_dictionary['COLDESCRIPT'] = ' '
    # Share of missing values per column, formatted as a percentage
    dfnullpct = (
        pd.DataFrame(df.isnull().sum() / df.shape[0])
        .reset_index()
        .reset_index()
        .rename(columns={'level_0': 'COLIDX', 'index': 'COLNAME', 0: 'NULLPCT'})
    )
    dfnullpct['NULLPCT'] = dfnullpct.NULLPCT.map("{:.2%}".format)
    df_dictionary1 = df_dictionary.merge(dfnullpct, on=['COLIDX', 'COLNAME'])
    # Reorder to COLIDX, COLNAME, COLDESCRIPT, COL_DTYPE, NULLPCT
    df_dictionary1 = df_dictionary1.iloc[:, [0, 1, 3, 2, 4]]
    return df_dictionary1

understand_columns(cp)

We see from above that 74.32% of the middle names are missing. This makes sense because some people do not use a middle name. Also, 0.07% of the scoreText values are missing. I do not have an explanation for this yet, but I will look into it.

Check for Duplicates

sum(cp.duplicated())

Fix the year

cp[cp["Person_ID"] == 51157]

As we see above, some records have a birth year that needs attention. From our .describe() we see that the maximum year is 2029, which does not make sense. Let's find which values of year are greater than or equal to 2000. We observe that 4 IDs fall in this range, corresponding to 12 records. Since we have 60843 records, I will drop these 12 records.

# fixing the date situation - ugh!
from datetime import datetime
# cp['DateOfBirth'].str.split("/", 3) is the basic command for splitting a dataframe column with a delimiter
print("let's understand the year structure and apply domain knowledge")
cp[['month', 'day', 'year']] = cp['DateOfBirth'].str.split("/", expand=True)
cp

# display years of 2000 or later
years = pd.DataFrame(cp[['Person_ID', 'year']].astype(int))
count_over_2000 = years[years['year'] >= 2000].sort_values(by='year')
print(count_over_2000)

# remove these records (copy so later column assignments do not act on a slice of the original frame)
unique_person_ids = count_over_2000['Person_ID'].unique().tolist()
cp = cp[~cp['Person_ID'].isin(unique_person_ids)].copy()
cp.head()

Exploratory Data Analysis

cp.shape

Understanding Age

cp['year'] = cp['year'].astype(int)

print("Maximum value:", cp['year'].max())
print("Minimum value:", cp['year'].min())

print("Length of cp:", len(cp))
print("Length of year is also:", len(cp['year']))

cp["age"] = 2024 - cp["year"]

print("Maximum age:", cp['age'].max())
print("Minimum age:", cp['age'].min())

plt.figure(figsize=(8, 6))
cp['age'].hist(bins=10)
# Add labels and title
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Histogram of Age with 10 Bins')

Talk about the positive skew in age and talk about why

def create_bins(series, num_bins=10):
    # Equal-width bin edges from the minimum to the maximum value
    min_val = series.min()
    max_val = series.max()
    bin_width = (max_val - min_val) / num_bins
    bins = [min_val + i * bin_width for i in range(num_bins + 1)]
    return pd.cut(series, bins=bins, include_lowest=True)
bins = create_bins(cp['age'])
# Get frequency table
freq_table = bins.value_counts().sort_index().reset_index()
freq_table.columns = ['age_range', 'frequency']
print(freq_table)

use qcut, describe and groupby on the age
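One way this note could be carried out is sketched below. It is only an illustration: the `demo` frame and its synthetic, right-skewed ages are assumptions standing in for `cp['age']`, and the choice of four quantile bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed ages (assumption: a stand-in for cp['age'], not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"age": 18 + rng.gamma(shape=2.0, scale=10.0, size=1000)})

# qcut cuts on quantiles, so each bin holds roughly the same number of rows
demo["age_bin"] = pd.qcut(demo["age"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Summary statistics of age inside each quantile bin
age_summary = demo.groupby("age_bin", observed=True)["age"].describe()
print(age_summary)
```

Unlike the equal-width bins produced by `pd.cut` above, quantile bins have unequal widths, so a wide top bin is another way of seeing the long right tail.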

plt.figure(figsize=(8, 6))
plt.boxplot(cp['age'])
# Add labels and title
plt.xlabel('Age')
plt.ylabel('Value')
plt.title('Boxplot of Age')

Talk about the outliers. Show the mathematics and explain why we have those ages considered outliers
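The mathematics behind the boxplot's outliers can be sketched as follows. By default matplotlib draws the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and any point outside those fences is plotted as an outlier. The `ages` array here is hypothetical and only illustrates the arithmetic.

```python
import numpy as np

# Hypothetical ages (assumption: illustrative values, not taken from the dataset)
ages = np.array([22, 25, 27, 30, 31, 33, 35, 38, 41, 44, 47, 52, 95])

# Quartiles and interquartile range
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the default boxplot whisker rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is drawn as an individual outlier point
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

With a right-skewed variable such as age, the upper fence is the one that gets crossed, which is why only older ages show up as outliers.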

cp["age"].describe([0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.9])

Talk about the distribution, the interquartile range, the min, max.

Understanding Gender

pd.unique(cp.Sex_Code_Text)

cp.Sex_Code_Text.value_counts()

plt.figure(figsize=(8, 6))
sns.countplot(data=cp, x='Sex_Code_Text', palette='Set2')

We can see that there are more men than women in the dataset, by a factor of roughly four.

Explain that this makes sense. There are more men in the legal system

Understanding Marital Status

pd.unique(cp.MaritalStatus)

cp.MaritalStatus.value_counts()

plt.figure(figsize=(8, 6))
sns.countplot(data=cp, x='MaritalStatus', palette='Set2')

Talk about the differences

Question 1: What is the statistical relationship between age, gender, and marital status?

Relationship between Age and Gender

summary_stats = cp.groupby('Sex_Code_Text')['age'].describe()
print(summary_stats)

We see the same skew in each gender group. The two groups have similar distributions, as I will show below.

plt.figure(figsize=(8, 6))
sns.histplot(data=cp, x='age', hue='Sex_Code_Text', kde=True)
plt.title('Histogram of Age by Gender')
plt.show()

# Create a scatter plot
plt.figure(figsize=(8, 2))
sns.scatterplot(data=cp, x='age', y='Sex_Code_Text', hue='Sex_Code_Text')
plt.title('Scatter Plot of Age by Gender')
plt.show()

cp.groupby("Sex_Code_Text")["age"].mean()

Pearson correlation between age and gender

cp.loc[:, 'gender'] = cp['Sex_Code_Text'].map({"Male": 1, "Female": 0})

pearson_corr = cp['age'].corr(cp['gender'], method='pearson')
print(f"Pearson correlation coefficient: {pearson_corr}")

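The Pearson correlation coefficient of approximately 0.01 indicates essentially no linear relationship between "age" and "gender" in the dataset. Note that the 0.05 threshold applies to p-values, not to the coefficient itself; the coefficient is judged by how close it is to 0 or to ±1.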
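To judge whether such a coefficient is statistically distinguishable from zero, the p-value can be read from `scipy.stats.pearsonr`. The sketch below uses synthetic data (the `age` and `gender` arrays are assumptions, generated independently of each other) and also shows that Pearson's r with a 0/1 variable is the same quantity as the point-biserial correlation.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption: gender is a 0/1 code drawn independently of age)
rng = np.random.default_rng(1)
gender = rng.integers(0, 2, size=500)
age = rng.normal(40, 12, size=500)

# Pearson r and its p-value for the null hypothesis of zero correlation
r, p_value = stats.pearsonr(age, gender)

# With a binary variable, the point-biserial correlation gives the same numbers
r_pb, p_pb = stats.pointbiserialr(gender, age)

print(f"r = {r:.4f}, p-value = {p_value:.4f}")
print(f"point-biserial r = {r_pb:.4f}, p-value = {p_pb:.4f}")
```

With a sample as large as the COMPAS file, even a very small r can have a small p-value, so the size of r and its significance are best reported separately.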

Relationship between Age and Marital Status

summary_stats = cp.groupby('MaritalStatus')['age'].describe()
print(summary_stats)

The analysis of marital status across age distributions offers intriguing insights into the demographics of the dataset. Each marital status category presents a distinct age profile, reflecting the life stages and experiences associated with each status.

Single Individuals: The largest group in the dataset comprises singles, with a mean age of approximately 41.8 years. This younger age profile aligns with the typical age range of individuals who have not yet entered into a formal marital partnership. The 25th percentile age for this group is 34 years, indicating a significant portion of younger individuals in this category.

Significant Others: Those categorized as having a "Significant Other" exhibit a slightly higher mean age of around 44.1 years. This can be interpreted as individuals who are likely in committed relationships but have not formalized their status through marriage. The age distribution suggests a progression towards more stable relationships compared to singles.

Married Individuals: The married group has a mean age of 52.1 years, indicating an older demographic than singles and significant others. Marriage is often associated with more established life stages, such as raising a family and achieving career stability. The 25th percentile age of 43 years suggests that many in this group have been married for a considerable period, further reflecting the stability associated with this status.

Separated Individuals: Those who are separated have a mean age close to the married group at approximately 51.6 years. Separation usually occurs later in life, often after significant life events or extended periods of marriage. The age distribution suggests a transition phase between marriage and other statuses, with a 25th percentile age of 43 years, similar to the married group.

Divorced Individuals: The divorced category has a mean age of 57.1 years, making it one of the older demographic groups in the dataset. Divorce often occurs after longer periods of marriage, indicating a higher likelihood of older age among divorced individuals. The 25th percentile age of 49 years and the 75th percentile age of 65 years further highlight the range of ages within this group.

Widowed Individuals: The widowed group is the oldest among the categories, with a mean age of 64.3 years. The loss of a spouse usually occurs later in life, reflecting the advanced age associated with this status. The age distribution suggests a higher concentration of older individuals, with a 25th percentile age of 57 years and a 75th percentile age of 73 years.

In summary, the age distributions across marital statuses offer a nuanced understanding of the life stages and experiences associated with each category. Singles and significant others represent younger demographics, while married and separated individuals fall within the mid-age range. Divorced individuals tend to be older, and the widowed category represents the most advanced age group.

plt.figure(figsize=(8, 6))
sns.histplot(data=cp, x='age', hue='MaritalStatus', kde=True)
plt.title('Histogram of Age by Marital Status')
plt.show()

# Create boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x="MaritalStatus", y="age", data=cp)
plt.title('Relationship between Marital Status and age')
plt.xlabel('Marital Status')
plt.ylabel('age')
plt.show()

The boxplot shows the same ordering: singles and significant others are the youngest groups, married and separated individuals fall in the mid-age range, divorced individuals tend to be older, and the widowed category is the oldest.

# Create a scatter plot
plt.figure(figsize=(12, 2))
sns.scatterplot(data=cp, x='age', y='MaritalStatus', hue='MaritalStatus', legend=False)
plt.title('Scatter Plot of Age by marital status')
plt.show()

Pearson correlation between age and marital status

cp.loc[:, 'marital_status'] = cp['MaritalStatus'].map({"Single": 0, "Married": 1, "Significant Other" : 2, "Divorced": 3, "Separated": 4, "Widowed": 5, "Unknown": 6})

pearson_corr = cp['age'].corr(cp['marital_status'], method='pearson')
print(f"Pearson correlation coefficient: {pearson_corr}")

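The Pearson correlation between age and marital status is 0.35, which is consistent with the pattern we described above: the categories with higher codes tend to be older. Because marital status is a nominal variable and the integer codes are arbitrary, this value depends on the ordering chosen in the mapping and should be read only as a rough indication.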
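The dependence on the integer coding can be seen in a small sketch. The `demo` frame and the two mappings `coding_a` and `coding_b` are hypothetical; the same ages and the same statuses give a very different coefficient once the codes are reordered.

```python
import pandas as pd

# Hypothetical toy data: three marital statuses with clearly different ages
demo = pd.DataFrame({
    "age": [25, 30, 28, 45, 50, 48, 70, 75, 72],
    "status": ["Single"] * 3 + ["Married"] * 3 + ["Widowed"] * 3,
})

# Two equally arbitrary integer codings of the same nominal variable
coding_a = {"Single": 0, "Married": 1, "Widowed": 2}
coding_b = {"Married": 0, "Widowed": 1, "Single": 2}

r_a = demo["age"].corr(demo["status"].map(coding_a))
r_b = demo["age"].corr(demo["status"].map(coding_b))

print(f"coding A: r = {r_a:.3f}")
print(f"coding B: r = {r_b:.3f}")
```

Comparing group means, as the t-tests and ANOVA in the following cells do, does not depend on any numeric coding of the categories.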

T-Test to check differences in mean of ages between different marital status

Null Hypothesis: The means of the numerical variable are equal across the different categories of the categorical variable.

# Age series for each marital status group
ages_by_status = {
    status: cp[cp['MaritalStatus'] == status]['age']
    for status in ['Single', 'Married', 'Significant Other', 'Divorced', 'Separated', 'Widowed']
}
# Pairs of groups to compare, in the order they are reported below
pairs = [
    ('Single', 'Married'),
    ('Single', 'Significant Other'),
    ('Significant Other', 'Married'),
    ('Married', 'Divorced'),
    ('Married', 'Separated'),
    ('Separated', 'Divorced'),
    ('Divorced', 'Widowed'),
]
# Perform Independent Samples T-test for each pair
for first, second in pairs:
    t_stat, p_value = stats.ttest_ind(ages_by_status[first], ages_by_status[second])
    print(f"T-statistic between {first.lower()} and {second.lower()}: {t_stat}, p-value: {p_value}")

The t-statistics and corresponding p-values provide valuable insights into the significance of the age differences across various marital status categories.

Comparing singles and married individuals yields a highly significant t-statistic of -77.66 with a p-value close to zero, indicating a substantial difference in age between these two groups. Similarly, the t-statistics between singles and significant others, and between significant others and married individuals, are -7.28 and -23.44 respectively, both with extremely low p-values. These results indicate that the mean ages of these groups are significantly different from each other.

Interestingly, the t-statistic between married and separated individuals is small (1.52) with a p-value of 0.13, suggesting that there is no statistically significant difference in mean age between these two groups. In contrast, the t-statistics between separated and divorced, divorced and widowed, and married and divorced are -17.86, -12.61, and -22.69 respectively, all with p-values close to zero. These results highlight significant age differences between these marital status categories.

ANOVA Test

Null Hypothesis: The means of the numerical variable are equal across all categories of the categorical variable.

# One age series per marital status (missing statuses are skipped so no group is empty)
groups = [cp[cp['MaritalStatus'] == status]['age'] for status in cp['MaritalStatus'].dropna().unique()]
# Perform ANOVA
f_stat, p_value = stats.f_oneway(*groups)
print(f"F-statistic: {f_stat}, p-value: {p_value}")

Relationship between Gender and Marital Status

contingency_table = pd.crosstab(cp['Sex_Code_Text'], cp['MaritalStatus'])
print(contingency_table)

plt.figure(figsize=(8, 6))
sns.countplot(data=cp, x='Sex_Code_Text', hue='MaritalStatus')
plt.title('Distribution of Marital Status by Gender')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.legend(title='Marital Status')
plt.show()

percentage_df = cp.groupby('Sex_Code_Text')['MaritalStatus'].value_counts(normalize=True).mul(100).reset_index(name='Percentage')
print(percentage_df)

Chi Square Test

chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")

expected_df = pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns)
print(expected_df)
print("--------------------------------------------------------------------")
print(contingency_table)

chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies Table:")
print(expected)
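As a sketch of where the expected frequencies come from: under independence, each expected cell equals its row total times its column total divided by the grand total, and the chi-square statistic sums the squared gaps between observed and expected counts, each divided by the expected count. The `observed` table below is hypothetical, not the gender by marital status table.

```python
import numpy as np
from scipy import stats

# Hypothetical 2 x 3 table of observed counts (assumption: illustrative only)
observed = np.array([[30, 10, 20],
                     [20, 40, 30]])

# Expected count per cell under independence: row total * column total / n
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected_manual = row_totals * col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Same quantities from scipy; dof = (rows - 1) * (columns - 1)
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)
print(chi2_manual, chi2_stat, dof, p_value)
```

The two statistics agree here because scipy applies its continuity correction only when there is a single degree of freedom, that is, for 2 x 2 tables.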

Relationship between Age and Marital Status and Gender

contingency_table_3way = pd.crosstab([cp['age'], cp['Sex_Code_Text']], cp['MaritalStatus'])
contingency_table_3way

# Create separate three-way contingency tables for Male and Female
male = cp[cp['Sex_Code_Text'] == 'Male']
female = cp[cp['Sex_Code_Text'] == 'Female']
contingency_table_male = pd.crosstab([male['age'], male['Sex_Code_Text']], male['MaritalStatus'])
contingency_table_female = pd.crosstab([female['age'], female['Sex_Code_Text']], female['MaritalStatus'])

# Create heatmap for Male
plt.figure(figsize=(12, 12))
sns.heatmap(contingency_table_male, annot=True, cmap="YlGnBu", fmt="d")
plt.title('Three-Way Contingency Table for Male')
plt.xlabel('Marital Status')
plt.ylabel('Age - Male')
plt.show()

# Create heatmap for Female
plt.figure(figsize=(12, 12))
sns.heatmap(contingency_table_female, annot=True, cmap="YlGnBu", fmt="d")
plt.title('Three-Way Contingency Table for Female')
plt.xlabel('Marital Status')
plt.ylabel('Age - Female')
plt.show()

In the two heat maps above, the only clear trend is that the number of single defendants falls with age from about 30 onward. To see more trends, I will remove the Single column and recreate the heat map.
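Since the heat maps show raw counts, one way to read them as percentages is to normalize each age row so that it sums to 1. The sketch below assumes a hypothetical `demo` frame with an `age_band` column; with the notebook's data, the same `normalize='index'` argument could be passed to the `pd.crosstab` calls above.

```python
import pandas as pd

# Hypothetical toy data (assumption: age bands instead of single years of age)
demo = pd.DataFrame({
    "age_band": ["20-39"] * 4 + ["40-59"] * 4,
    "MaritalStatus": ["Single", "Single", "Single", "Married",
                      "Single", "Married", "Married", "Divorced"],
})

# Raw counts, as plotted in the heat maps
counts = pd.crosstab(demo["age_band"], demo["MaritalStatus"])

# Row-normalized shares: each age band sums to 1, so bands of different size are comparable
shares = pd.crosstab(demo["age_band"], demo["MaritalStatus"], normalize="index")

print(counts)
print(shares.round(2))
```

Row shares also make the male and female panels comparable even though the two groups differ greatly in size.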

Heat map after removing "Singles" to see trend better

# Filter out the "Single" column for male
contingency_table_male_filtered = contingency_table_male.drop(columns=["Single"])
# Filter out the "Single" column for female
contingency_table_female_filtered = contingency_table_female.drop(columns=["Single"])
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(24, 12))
# Plot heatmap for male
sns.heatmap(contingency_table_male_filtered, annot=True, cmap="YlGnBu", fmt="d", ax=axes[0])
axes[0].set_title('Three-Way Contingency Table for Male')
axes[0].set_xlabel('Marital Status')
axes[0].set_ylabel('Age - Male')
# Plot heatmap for female
sns.heatmap(contingency_table_female_filtered, annot=True, cmap="YlGnBu", fmt="d", ax=axes[1])
axes[1].set_title('Three-Way Contingency Table for Female')
axes[1].set_xlabel('Marital Status')
axes[1].set_ylabel('Age - Female')
plt.show()

In the heat map, which looks for a joint trend between age, gender, and marital status, the Married column tends to be denser for males than for females: it extends to about age 72 for males but stops around age 65 for females. The Divorced and Separated columns, in contrast, tend to be denser for females than for males. Together this may suggest that female defendants are more likely to have lost their marriages, although the cells are raw counts and males outnumber females in the dataset, so the two panels are not directly comparable. The Significant Other column looks about the same in both panels. Single is not on the heat map because it was removed so that the other columns are easier to read.

Question 2: Is there a statistical relationship between age, gender, marital status and the 3 Compas model scores? Hint: You must analyze each model individually. Risk of a) failure to appear, b) violence and c) recidivism.

Subsetting

cprfa = cp[cp["DisplayText"] == "Risk of Failure to Appear"]
cprov = cp[cp["DisplayText"] == "Risk of Violence"]
cpror = cp[cp["DisplayText"] == "Risk of Recidivism"]

Confirming Gender subset size

# Proportion of males in the whole dataset
print("Ratio of males in the full dataset is", len(cp[cp["gender"] == 1])/len(cp))
# Proportion in risk of failure to appear
print("Ratio of males in the risk of failure to appear dataset is", len(cprfa[cprfa["gender"] == 1])/len(cprfa))
# Proportion in risk of violence
print("Ratio of males in the risk of violence dataset is", len(cprov[cprov["gender"] == 1])/len(cprov))
# Proportion in risk of recidivism
print("Ratio of males in the risk of recidivism dataset is", len(cpror[cpror["gender"] == 1])/len(cpror))

Confirming Age subset size

# Overlay the age histograms of the three subsets and the full dataset
plt.figure(figsize=(8, 6))
sns.histplot(data=cprfa, x='age', kde=True)
sns.histplot(data=cprov, x='age', kde=True)
sns.histplot(data=cpror, x='age', kde=True)
sns.histplot(data=cp, x='age', kde=True)
plt.title('Histogram of Age in the Full Dataset and the Three Subsets')
plt.show()

Confirming Marital Status subset size

import pandas as pd
# Calculate percentages for cprfa
cprfa_percentages = (cprfa.groupby("MaritalStatus")["MaritalStatus"].count() / len(cprfa)) * 100
# Calculate percentages for cpror
cpror_percentages = (cpror.groupby("MaritalStatus")["MaritalStatus"].count() / len(cpror)) * 100
# Calculate percentages for cprov
cprov_percentages = (cprov.groupby("MaritalStatus")["MaritalStatus"].count() / len(cprov)) * 100
# Calculate percentages for cp
cp_percentages = (cp.groupby("MaritalStatus")["MaritalStatus"].count() / len(cp)) * 100
# Combine the percentages into a single DataFrame
combined_df = pd.DataFrame({
    'cprfa (%)': cprfa_percentages,
    'cprov (%)': cprov_percentages,
    'cpror (%)': cpror_percentages,
    'cp (%)': cp_percentages
})
# Display the combined DataFrame
print(combined_df)

Risk of Failure to Appear

Gender

# Decile score counts by gender (1 = Male, 0 = Female)
male_decile_counts = cprfa[cprfa["gender"] == 1]["DecileScore"].value_counts()
female_decile_counts = cprfa[cprfa["gender"] == 0]["DecileScore"].value_counts()
# Plotting
fig, axes = plt.subplots(1, 2, figsize=(18, 8))  # 1 row, 2 columns
# Male Pie Chart
axes[0].pie(male_decile_counts, labels=male_decile_counts.index, autopct='%1.1f%%', startangle=140)
axes[0].set_title('Distribution of Decile Scores for Male')
# Female Pie Chart
axes[1].pie(female_decile_counts, labels=female_decile_counts.index, autopct='%1.1f%%', startangle=140)
axes[1].set_title('Distribution of Decile Scores for Female')
plt.show()

pd.crosstab(cprfa.DecileScore, cprfa.Sex_Code_Text, margins=True, normalize='columns')\
    .sort_index(axis=0, ascending=True)\
    .style.format('{:.2%}')\
    .set_caption("Percentages Calculated Down Columns")

# Compute the proportions
total_by_gender = cprfa.groupby('Sex_Code_Text')['DecileScore'].count()
proportions = cprfa.groupby(['DecileScore', 'Sex_Code_Text']).size() / total_by_gender * 100
proportions = proportions.reset_index(name='Proportion')
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=proportions, x='DecileScore', y='Proportion', hue='Sex_Code_Text', palette='pastel')
plt.title('Proportion of Decile Score by Gender')
plt.xlabel('Decile Score')
plt.ylabel('Proportion (%)')
plt.legend(title='Gender', loc='upper right')
plt.show()

We notice above that, in relative terms, a larger share of females than males have decile scores of 1, 4, and 9. Let's check whether the mean decile scores of the two groups differ.

T-Test between Gender and Decile Score

cprfa.groupby("gender")["DecileScore"].mean()

# Decile scores for males (group1) and females (group2)
group1 = cprfa[cprfa["gender"] == 1]["DecileScore"]
group2 = cprfa[cprfa["gender"] == 0]["DecileScore"]
t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

Our p-value is less than 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between the mean decile scores of the two groups. Based on the group means above, the mean decile score for men is higher than the mean decile score for women.

Correlation between Age and Decile Score

plt.figure(figsize=(12, 6))
sns.regplot(data=cprfa, x='age', y='DecileScore', scatter_kws={'s': 10})
plt.title('Relationship between Age and Decile Score')
plt.xlabel('Age')
plt.ylabel('Decile Score')
plt.show()

We see that the decile score for risk of failure to appear increases with age. This was a surprise to me because I was expecting younger people to be more sneaky and have a higher risk of failure to appear. I will now compute the Pearson correlation coefficient, which measures the strength of this linear relationship (it is not the slope of the fitted line).

# Calculate Pearson correlation coefficient
correlation, _ = stats.pearsonr(cprfa['age'], cprfa['DecileScore'])
# Calculate the degrees of freedom
n = len(cprfa['age'])
df = n - 2
# Calculate t-statistic
t_stat = correlation * np.sqrt(df / (1 - correlation**2))
# Calculate p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
print(f"Pearson correlation coefficient: {correlation}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

Since the p-value is below the usual 0.05 threshold, we can reject the null hypothesis of no correlation, which suggests that there is a statistically significant correlation between age and decile score.
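The manual calculation above uses the relation t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. As a sanity check, sketched here on synthetic data (the arrays `x` and `y` are assumptions), the resulting two-sided p-value should agree with the one `stats.pearsonr` already returns as its second value.

```python
import numpy as np
from scipy import stats

# Synthetic, mildly correlated data (assumption: stand-ins for age and DecileScore)
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

# pearsonr returns both the coefficient and its two-sided p-value
r, p_scipy = stats.pearsonr(x, y)

# Manual route: t-statistic with n - 2 degrees of freedom
dof = len(x) - 2
t_stat = r * np.sqrt(dof / (1 - r**2))

# Survival function (1 - cdf) keeps precision when the p-value is tiny
p_manual = 2 * stats.t.sf(abs(t_stat), dof)

print(f"r = {r:.4f}, t = {t_stat:.4f}")
print(f"p (pearsonr) = {p_scipy:.3e}, p (manual) = {p_manual:.3e}")
```

Using `stats.t.sf` rather than `1 - stats.t.cdf` avoids the result rounding to exactly 0 when the t-statistic is very large, which is likely with tens of thousands of rows.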

Marital Status

pd.crosstab(cprfa.DecileScore, cprfa.MaritalStatus, margins=True, normalize='columns')\
    .sort_index(axis=0, ascending=True)\
    .style.format('{:.2%}')\
    .set_caption("Percentages Calculated Down Columns")

We see that

# Create boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x="MaritalStatus", y="DecileScore", data=cprfa)
plt.title('Relationship between Marital Status and Decile Score')
plt.xlabel('Marital Status')
plt.ylabel('Decile Score')
plt.show()

We notice that the Married and Significant Other groups have lower scores for risk of failure to appear than the Single, Divorced, Separated, and Widowed groups. Could this be because they have a partner who keeps them accountable and reminds them? For Married and Significant Other, high decile scores are even treated as outliers.

Anova Test of Marital Status and Decile Score

# Extract decile scores for each marital status category
single_scores = cprfa[cprfa["MaritalStatus"] == "Single"]["DecileScore"]
sig_other_scores = cprfa[cprfa["MaritalStatus"] == "Significant Other"]["DecileScore"]
married_scores = cprfa[cprfa["MaritalStatus"] == "Married"]["DecileScore"]
separated_scores = cprfa[cprfa["MaritalStatus"] == "Separated"]["DecileScore"]
divorced_scores = cprfa[cprfa["MaritalStatus"] == "Divorced"]["DecileScore"]
widowed_scores = cprfa[cprfa["MaritalStatus"] == "Widowed"]["DecileScore"]
# Perform ANOVA test
f_stat, p_value = stats.f_oneway(single_scores, sig_other_scores, separated_scores, married_scores, divorced_scores, widowed_scores)
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")
if p_value < 0.05:
    print("The differences in mean decile scores are statistically significant.")
else:
    print("There is no statistically significant difference in mean decile scores.")

Risk of Violence

Gender

print(cprov.groupby("gender")["DecileScore"].mean())
# Decile scores for males (group1) and females (group2)
group1 = cprov[cprov["gender"] == 1]["DecileScore"]
group2 = cprov[cprov["gender"] == 0]["DecileScore"]
t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

pd.crosstab(cprov.DecileScore, cprov.Sex_Code_Text, margins=True, normalize='columns')\
    .sort_index(axis=0, ascending=True)\
    .style.format('{:.2%}')\
    .set_caption("Percentages Calculated Down Columns")

# Compute the proportions
total_by_gender = cprov.groupby('Sex_Code_Text')['DecileScore'].count()
proportions = cprov.groupby(['DecileScore', 'Sex_Code_Text']).size() / total_by_gender * 100
proportions = proportions.reset_index(name='Proportion')
# Create a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(data=proportions, x='DecileScore', y='Proportion', hue='Sex_Code_Text', palette='pastel')
plt.title('Proportion of Decile Score by Gender')
plt.xlabel('Decile Score')
plt.ylabel('Proportion (%)')
plt.legend(title='Gender', loc='upper right')
plt.show()

In the above we see a clear distinction between lower and higher decile scores. For decile scores 1, 2, 3, and 4, females have the larger percentages, while males have the larger percentages for 5, 6, 7, 8, 9, and 10. This follows the societal view that males are more violent than females. The gap between the male and female mean scores is correspondingly larger.

Correlation between Age and Decile Score

plt.figure(figsize=(12, 6))
sns.regplot(data=cprov, x='age', y='DecileScore', scatter_kws={'s': 10})
plt.title('Relationship between Age and Decile Score in risk of violence')
plt.xlabel('Age')
plt.ylabel('Decile Score')
plt.show()

plt.figure(figsize=(12, 6))
sns.regplot(data=cprfa, x='age', y='DecileScore', scatter_kws={'s': 10})
plt.title('Relationship between Age and Decile Score in risk of failure to appear')
plt.xlabel('Age')
plt.ylabel('Decile Score')
plt.show()

Here we see that, unlike for risk of failure to appear, the correlation between age and decile score is negative. That is, younger people are more likely than older people to have a high decile score for risk of violence. This makes sense to me because younger people have more energy to be violent. The correlation between age and decile score is also stronger for risk of violence than for risk of failure to appear.

# Calculate Pearson correlation coefficient
correlation, _ = stats.pearsonr(cprov['age'], cprov['DecileScore'])
# Calculate the degrees of freedom
n = len(cprov['age'])
df = n - 2
# Calculate t-statistic
t_stat = correlation * np.sqrt(df / (1 - correlation**2))
# Calculate p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
print(f"Pearson correlation coefficient: {correlation}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

We see above that the p-value is less than 0.05, which shows that the correlation is statistically significant.

Marital Status

pd.crosstab(cprov.DecileScore, cprov.MaritalStatus, margins=True, normalize="columns")\
    .sort_index(axis=0, ascending=True)\
    .style.format('{:.2%}')\
    .set_caption("Percentages Calculated Down Columns")

# Create boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x="MaritalStatus", y="DecileScore", data=cprov)
plt.title('Relationship between Marital Status and Decile Score')
plt.xlabel('Marital Status')
plt.ylabel('Decile Score')
plt.show()

The box plot for risk of violence is more interesting. We see that the median is higher for the Single and Significant Other groups. In the remaining groups, more than half of the population have a score of 1.

Anova Test of Marital Status and Decile Score

# Extract decile scores for each marital status category
single_scores = cprov[cprov["MaritalStatus"] == "Single"]["DecileScore"]
sig_other_scores = cprov[cprov["MaritalStatus"] == "Significant Other"]["DecileScore"]
married_scores = cprov[cprov["MaritalStatus"] == "Married"]["DecileScore"]
separated_scores = cprov[cprov["MaritalStatus"] == "Separated"]["DecileScore"]
divorced_scores = cprov[cprov["MaritalStatus"] == "Divorced"]["DecileScore"]
widowed_scores = cprov[cprov["MaritalStatus"] == "Widowed"]["DecileScore"]
# Perform ANOVA test
f_stat, p_value = stats.f_oneway(single_scores, sig_other_scores, married_scores, separated_scores, divorced_scores, widowed_scores)
print(f"F-statistic: {f_stat}")
print(f"P-value: {p_value}")
if p_value < 0.05:
    print("The differences in mean decile scores are statistically significant.")
else:
    print("There is no statistically significant difference in mean decile scores.")

Here we see that the p-value is greater than 0.05, which suggests that there is no significant difference between the mean decile scores of the different marital status groups.

Risk of Recidivism

Gender

print(cpror.groupby("gender")["DecileScore"].mean())
# Decile scores for males (group1) and females (group2)
group1 = cpror[cpror["gender"] == 1]["DecileScore"]
group2 = cpror[cpror["gender"] == 0]["DecileScore"]
t_stat, p_val = stats.ttest_ind(group1, group2)
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_val}")

# Compute the proportions total_by_gender = cpror.groupby('Sex_Code_Text')['DecileScore'].count() proportions = cpror.groupby(['DecileScore', 'Sex_Code_Text']).size() / total_by_gender * 100 proportions = proportions.reset_index(name='Proportion') # Create a bar plot plt.figure(figsize=(12, 6)) sns.barplot(data=proportions, x='DecileScore', y='Proportion', hue='Sex_Code_Text', palette='pastel') plt.title('Proportion of Decile Score by Gender') plt.xlabel('Decile Score') plt.ylabel('Proportion (%)') plt.legend(title='Gender', loc='upper right') plt.show()


Similarly, males are more heavily represented in the higher recidivism decile scores (7, 8, 9, 10).
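The t-test in this section uses the default stats.ttest_ind, which assumes both groups have equal variance. When group sizes and spreads differ, Welch's variant (equal_var=False) is a common alternative. The sketch below compares the two on synthetic data; the arrays group_a and group_b are made-up stand-ins for the two gender groups, not COMPAS values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two gender groups (not COMPAS data)
group_a = rng.integers(1, 11, size=400)  # larger group, scores 1..10
group_b = rng.integers(1, 8, size=100)   # smaller group, narrower spread

student = stats.ttest_ind(group_a, group_b)                 # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4g}")
print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4g}")
```

If the two versions agree closely on the real data, the equal-variance assumption is unlikely to be driving the conclusion.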

Correlation between Age and Decile Score

plt.figure(figsize=(12, 6))
sns.regplot(data=cpror, x='age', y='DecileScore', scatter_kws={'s': 10})
# Title corrected: cpror is the risk of recidivism subset, not risk of violence
plt.title('Relationship between Age and Decile Score in risk of recidivism')
plt.xlabel('Age')
plt.ylabel('Decile Score')
plt.show()


# Calculate Pearson correlation coefficient
correlation, _ = stats.pearsonr(cpror['age'], cpror['DecileScore'])

# Calculate the degrees of freedom
n = len(cpror['age'])
df = n - 2

# Calculate t-statistic
t_stat = correlation * np.sqrt(df / (1 - correlation**2))

# Calculate p-value
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))

print(f"Pearson correlation coefficient: {correlation}")
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")


As seen, age and decile score are again negatively correlated. That is, older individuals tend to receive lower decile scores for risk of recidivism.
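The correlation cell discards the second value returned by stats.pearsonr and rebuilds the p-value by hand from the t-statistic. The two routes should agree, since the second return value is already the two-sided p-value. The sketch below checks this on synthetic data; the age and score arrays are made up for illustration and are not COMPAS values. It also uses stats.t.sf rather than 1 - cdf, which avoids the result rounding to exactly 0 when the t-statistic is large.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Synthetic stand-ins for age and DecileScore (not COMPAS data)
age = rng.integers(18, 70, size=500).astype(float)
score = np.clip(np.round(6 - 0.04 * age + rng.normal(0, 2.5, size=500)), 1, 10)

r, p_scipy = stats.pearsonr(age, score)

# Manual route: t = r * sqrt(dof / (1 - r^2)), two-sided p from Student's t
dof = len(age) - 2
t_stat = r * np.sqrt(dof / (1 - r**2))
p_manual = 2 * stats.t.sf(abs(t_stat), dof)  # sf is the survival function, 1 - cdf

print(f"r={r:.4f}, t={t_stat:.3f}")
print(f"p (pearsonr)={p_scipy:.3e}, p (manual)={p_manual:.3e}")
```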

Marital Status

# Create boxplot
plt.figure(figsize=(12, 8))
sns.boxplot(x="MaritalStatus", y="DecileScore", data=cpror)
plt.title('Relationship between Marital Status and Decile Score')
plt.xlabel('Marital Status')
plt.ylabel('Decile Score')
plt.show()


Here too, the Single and Significant Other groups have the highest median decile scores. It is worth noting that the Married group has the lowest decile scores in all three models (risk of failure to appear, risk of violence, and risk of recidivism).
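Medians read off a boxplot can be confirmed numerically with a small groupby summary. The sketch below runs on synthetic data standing in for cpror (the column names follow the notebook, the values are random) and sorts the groups by median so the highest and lowest are easy to spot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic stand-in for cpror; values are random, not COMPAS data
demo = pd.DataFrame({
    "MaritalStatus": rng.choice(["Single", "Married", "Widowed"], size=300),
    "DecileScore": rng.integers(1, 11, size=300),
})

# Count, median, and mean decile score per marital status, highest median first
summary = (
    demo.groupby("MaritalStatus")["DecileScore"]
        .agg(["count", "median", "mean"])
        .sort_values("median", ascending=False)
)
print(summary)
```

Including the count column matters here, because small groups can have unstable medians.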

Anova Test of Marital Status and Decile Score


Conclusion

This research examined the relationships between demographic factors (age, gender, and marital status) and COMPAS risk assessment scores. Our analysis revealed several noteworthy findings:

1. Age exhibited varied correlations with different risk models, with older individuals generally showing higher risk of failure to appear but lower risk of violence and recidivism.
2. Gender differences were evident, particularly in the risk of failure to appear and violence models, where males tended to have higher risk scores compared to females.
3. Marital status played a nuanced role in risk assessment scores. Interestingly, married individuals exhibited lower risk scores across multiple models, suggesting a potential stabilizing influence of marital partnership.

These findings challenge some conventional assumptions about risk assessment and highlight the need for more nuanced, data-driven approaches in criminal justice decision-making. They underscore the importance of considering multiple demographic factors in predictive modeling to ensure fair and accurate outcomes.

Further research could explore additional variables, such as socioeconomic status or educational background, to gain a more comprehensive understanding of risk assessment determinants. Additionally, examining the potential biases and limitations of the COMPAS algorithm itself could offer insights into improving its predictive accuracy and fairness.

