David Alonge
Understanding COMPAS Dataset
COMPAS stands for Correctional Offender Management Profiling for Alternative Sanctions. It is a popular commercial algorithm used by judges and parole officers to score a criminal defendant's likelihood of reoffending (recidivism). Based on a two-year follow-up study (i.e., who actually committed crimes or violent crimes in the two years that followed), the algorithm has been shown to be biased in favor of white defendants and against black defendants. The pattern of mistakes, as measured by precision/sensitivity, is notable.
Quality Assurance
From the output above, 74.32% of the middle names are missing. This makes sense because many people do not go by a middle name. In addition, 0.07 of the scoreText values are missing. I do not have an explanation for this yet, but I will look into it.
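A minimal sketch of how per-column missing percentages like the ones above can be computed. The toy table and the column names (`MiddleName`, `ScoreText`, `DecileScore`) are assumptions for illustration, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the COMPAS table; column names are assumed.
df = pd.DataFrame({
    "MiddleName": ["A", np.nan, np.nan, np.nan],
    "ScoreText": ["Low", "High", np.nan, "Medium"],
    "DecileScore": [1, 9, 3, 5],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
missing_pct = (df.isna().mean() * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

Multiplying by 100 keeps every column on the same percentage scale, which avoids mixing fractions and percentages when reading the result.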
Check for Duplicates
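One possible way to run this check, sketched on a toy table. The column names (`Person_ID`, `DisplayText`, `DecileScore`) are assumptions, and treating only fully identical rows as duplicates is one choice among several.

```python
import pandas as pd

# Toy table with one fully repeated row; column names are assumed.
df = pd.DataFrame({
    "Person_ID": [1, 1, 2, 3],
    "DisplayText": ["Risk of Violence", "Risk of Violence",
                    "Risk of Violence", "Risk of Recidivism"],
    "DecileScore": [4, 4, 2, 7],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first copy of each repeated row.
deduplicated = df.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```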
Fix the year
From .describe() we see that the maximum year is 2029, which does not make sense. Let's find which values of year are greater than or equal to 2000. Four IDs fall in this range, corresponding to 12 records. Since we have 60843 records in total, I will drop these 12 records.
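The filtering step described above could look roughly like this. The toy data and the column names `Person_ID` and `year` are assumptions; only the threshold of 2000 comes from the text.

```python
import pandas as pd

# Toy data: one person with an implausible year, repeated over 3 records.
df = pd.DataFrame({
    "Person_ID": [10, 10, 10, 11, 12, 12],
    "year": [2029, 2029, 2029, 1985, 1990, 1990],
})

# Rows whose year falls in the implausible range described above.
bad = df["year"] >= 2000
n_bad_ids = df.loc[bad, "Person_ID"].nunique()
n_bad_rows = int(bad.sum())

# Keep only the plausible rows.
cleaned = df.loc[~bad].reset_index(drop=True)
print(n_bad_ids, n_bad_rows, len(cleaned))
```

Counting both distinct IDs and rows makes it clear how many people, and how many records, the drop affects.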
Exploratory Data Analysis
Understanding Age
Use qcut, describe, and groupby on age.
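A small sketch of the three calls named above, applied to made-up ages. The column name `age` and the choice of four quantile bins are assumptions.

```python
import pandas as pd

# Made-up ages, for illustration only.
df = pd.DataFrame({"age": [19, 22, 25, 31, 34, 40, 47, 52, 60, 71, 38, 29]})

# Overall summary statistics (count, mean, std, min, quartiles, max).
summary = df["age"].describe()

# qcut splits on quantiles, so each of the 4 bins holds about the same number of rows.
df["age_group"] = pd.qcut(df["age"], q=4)

# Per-bin statistics.
per_group = df.groupby("age_group", observed=True)["age"].agg(["count", "min", "max", "mean"])
print(per_group)
```

Unlike `pd.cut`, which uses equal-width intervals, `pd.qcut` produces equal-frequency bins, so skewed age data does not leave some bins nearly empty.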
Understanding Gender
We can see that there are more men than women in the dataset, roughly four times as many.
Understanding Marital Status
Question 1: What is the statistical relationship between age, gender, and marital status?
Relationship between Age and Gender
We see the same skew in each gender group. The two groups have similar distributions, as I will show below.
Pearson correlation between age and gender
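The Pearson correlation coefficient of approximately 0.01 indicates a very weak, practically negligible, positive correlation between "age" and "gender" in this dataset. Note that 0.05 is a threshold for p-values, not for correlation coefficients, so the coefficient should be judged by how close it is to 0 rather than compared with 0.05.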
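Since gender is categorical, a Pearson correlation with age requires coding it as numbers first. The sketch below assumes a 0/1 coding and toy data; with a binary variable the result is the point-biserial correlation. The column names are hypothetical.

```python
import pandas as pd
from scipy import stats

# Toy data, for illustration only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 38, 47, 31],
    "gender": ["Male", "Female", "Male", "Male", "Female", "Male", "Female", "Male"],
})

# Encode gender as 0/1 (an assumed coding; flipping it flips the sign of r).
is_male = (df["gender"] == "Male").astype(int)

# r measures the strength of the linear association; p is the separate significance test.
r, p_value = stats.pearsonr(df["age"], is_male)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

Keeping `r` and `p_value` as separate names helps avoid reading the coefficient against a significance threshold.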
Relationship between Age and Marital Status
The analysis of marital status across age distributions offers intriguing insights into the demographics of the dataset. Each marital status category presents distinct age profiles, reflecting the life stages and experiences associated with each status.

Single Individuals: The largest group in the dataset comprises singles, with a mean age of approximately 41.8 years. This younger age profile aligns with the typical age range of individuals who have not yet entered into a formal marital partnership. The 25th percentile age for this group is 34 years, indicating a significant portion of younger individuals in this category.

Significant Others: Those categorized as having a "Significant Other" exhibit a slightly higher mean age of around 44.1 years. This can be interpreted as individuals who are likely in committed relationships but have not formalized their status through marriage. The age distribution suggests a progression towards more stable relationships compared to singles.

Married Individuals: The married group has a mean age of 52.1 years, indicating a higher age demographic than singles and significant others. Marriage is often associated with more established life stages, such as raising a family and achieving career stability. The 25th percentile age of 43 years suggests that many in this group have been married for a considerable period, further reflecting the stability associated with this status.

Separated Individuals: Those who are separated have a mean age close to the married group at approximately 51.6 years. Separation usually occurs later in life, often after significant life events or extended periods of marriage. The age distribution suggests a transition phase between marriage and other statuses, with a 25th percentile age of 43 years, similar to the married group.

Divorced Individuals: The divorced category has a mean age of 57.1 years, making it one of the older demographic groups in the dataset. Divorce often occurs after longer periods of marriage, indicating a higher likelihood of older age among divorced individuals. The 25th percentile age of 49 years and the 75th percentile age of 65 years further highlight the range of ages within this group.

Widowed Individuals: The widowed group is the oldest among the categories, with a mean age of 64.3 years. The loss of a spouse usually occurs later in life, reflecting the advanced age associated with this status. The age distribution suggests a higher concentration of older individuals, with a 25th percentile age of 57 years and a 75th percentile age of 73 years.

In summary, the age distributions across marital statuses offer a nuanced understanding of the life stages and experiences associated with each category. Singles and significant others represent younger demographics, while married and separated individuals fall within the mid-age range. Divorced individuals tend to be older, and the widowed category represents the most advanced age group.
Pearson correlation between age and marital status
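The Pearson correlation between age and marital status is 0.35, which signifies that marital status tends to follow the pattern described above as age increases. Because marital status is a categorical variable, this value depends on how the categories are coded as numbers.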
T-Test to check differences in mean age between marital statuses
Null Hypothesis: The means of the numerical variable are equal across the different categories of the categorical variable.
The t-statistics and corresponding p-values provide valuable insights into the significance of the age differences across various marital status categories.
Comparing singles and married individuals yields a highly significant t-statistic of -77.66 with a p-value close to zero, indicating a substantial difference in age between these two groups. Similarly, the t-statistics between singles and significant others, and between married and significant others, are -7.28 and -23.44 respectively, both with extremely low p-values. These results signify that the age distributions of these groups are significantly different from each other.
Interestingly, the t-statistic between married and separated individuals is close to zero (1.52) with a p-value of 0.13, suggesting that there is no statistically significant difference in age between these two groups. In contrast, the t-statistics between separated and divorced, divorced and widowed, and married and divorced are -17.86, -12.61, and -22.69 respectively, all with p-values close to zero. These results highlight significant age differences between these marital status categories.
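Pairwise comparisons like the ones above can be sketched as follows. The data is made up, the column names are hypothetical, and the use of Welch's t-test (`equal_var=False`) is an assumption; the original analysis may have used the pooled-variance version.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Made-up ages for three marital status groups.
df = pd.DataFrame({
    "marital_status": ["Single"] * 5 + ["Married"] * 5 + ["Widowed"] * 5,
    "age": [22, 25, 31, 28, 35, 45, 52, 49, 58, 50, 61, 66, 70, 64, 73],
})

ages = {status: group["age"] for status, group in df.groupby("marital_status")}

results = {}
for a, b in combinations(sorted(ages), 2):
    # equal_var=False selects Welch's t-test (an assumption, not necessarily the original choice).
    t_stat, p_value = stats.ttest_ind(ages[a], ages[b], equal_var=False)
    results[(a, b)] = (t_stat, p_value)
    print(f"{a} vs {b}: t = {t_stat:.2f}, p = {p_value:.4f}")
```

A negative t-statistic means the first group in the pair has the lower mean age, which is how values such as -77.66 for singles versus married should be read. With many pairwise tests, a multiple-comparison correction is worth considering.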
ANOVA Test
Null Hypothesis: The means of the numerical variable are equal across all categories of the categorical variable.
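A minimal sketch of a one-way ANOVA for this null hypothesis, using `scipy.stats.f_oneway` on made-up ages. The column names are assumptions.

```python
import pandas as pd
from scipy import stats

# Made-up ages for three marital status groups.
df = pd.DataFrame({
    "marital_status": ["Single"] * 5 + ["Married"] * 5 + ["Widowed"] * 5,
    "age": [22, 25, 31, 28, 35, 45, 52, 49, 58, 50, 61, 66, 70, 64, 73],
})

# One array of ages per category; f_oneway takes them as separate arguments.
groups = [group["age"].to_numpy() for _, group in df.groupby("marital_status")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
```

A small p-value only says that at least one group mean differs; it does not say which one, which is why the pairwise t-tests remain useful.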
Relationship between Gender and Marital Status
Chi Square Test
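For two categorical variables, a chi-square test of independence works on the contingency table of counts. The sketch below uses toy data and assumed column names; with counts this small the chi-square approximation is not reliable, so it only illustrates the mechanics.

```python
import pandas as pd
from scipy import stats

# Toy data, for illustration only.
df = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 6,
    "marital_status": ["Single", "Single", "Single", "Married", "Married", "Divorced",
                       "Single", "Single", "Married", "Divorced", "Divorced", "Divorced"],
})

# Contingency table of observed counts (rows: gender, columns: marital status).
observed = pd.crosstab(df["gender"], df["marital_status"])

# Null hypothesis: gender and marital status are independent.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```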
Relationship between Age and Marital Status and Gender
In the two heat maps above, the only visible trend is that the share of singles decreases with age from about age 30. To reveal more trends, I will remove the Single column and recreate the heat map.
Heat map after removing "Singles" to see trend better
In this heat map, which looks for joint trends among age, gender, and marital status, the Married column tends to be denser for males than for females; it extends to about age 72 for males but stops around age 65 for females. This suggests that female defendants are more likely to lose their marriages. This is consistent with the Divorced and Separated columns, which tend to be denser for females than for males. The Significant Other column looks about the same for both genders. The Single column does not appear on this heat map because it was removed to make the other trends easier to see.
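The table behind such a heat map can be sketched as a row-normalized cross-tabulation. The toy data, the column names, and the ten-year age bands are assumptions; the resulting table could then be passed to a plotting function such as seaborn's `heatmap`.

```python
import pandas as pd

# Toy data, for illustration only.
df = pd.DataFrame({
    "age": [25, 27, 33, 36, 44, 48, 55, 58, 29, 41],
    "gender": ["Male"] * 5 + ["Female"] * 5,
    "marital_status": ["Single", "Single", "Married", "Single", "Married",
                       "Divorced", "Married", "Widowed", "Single", "Separated"],
})

# Ten-year age bands (an assumed binning), stored as plain strings.
df["age_band"] = pd.cut(
    df["age"], bins=[20, 30, 40, 50, 60],
    labels=["21-30", "31-40", "41-50", "51-60"],
).astype(str)

# Row-normalized shares: each (gender, age band) row sums to 1.
shares = pd.crosstab([df["gender"], df["age_band"]], df["marital_status"], normalize="index")

# Dropping "Single" lets the smaller categories use the full color range.
without_single = shares.drop(columns="Single")
print(without_single)
```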
Question 2: Is there a statistical relationship between age, gender, marital status and the 3 Compas model scores? Hint: You must analyze each model individually. Risk of a) failure to appear, b) violence and c) recidivism.
Subsetting
Confirming Gender subset size
Confirming Age subset size
Confirming Marital Status subset size
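The subsetting and size checks above can be sketched as follows. The column name `DisplayText` for the model name, and the toy rows, are assumptions about how the three scores are stored.

```python
import pandas as pd

# Toy data: 4 rows per model, for illustration only.
df = pd.DataFrame({
    "DisplayText": ["Risk of Failure to Appear", "Risk of Violence", "Risk of Recidivism"] * 4,
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"] * 2,
    "DecileScore": [3, 5, 6, 1, 2, 4, 7, 8, 9, 2, 1, 3],
})

# One sub-table per model, so each model can be analyzed individually.
subsets = {name: group.reset_index(drop=True) for name, group in df.groupby("DisplayText")}

# The subset sizes should add up to the size of the full table.
sizes = {name: len(group) for name, group in subsets.items()}

# Per-model gender counts, to confirm no category was lost when splitting.
gender_counts = {name: group["gender"].value_counts().to_dict() for name, group in subsets.items()}
print(sizes)
```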
Risk of Failure to Appear
Gender
We notice above that, in relative terms, more females than males have decile scores of 1, 4, and 9. Let's check the difference with a t-test.
T-Test between Gender and Decile Score
Our p-value is less than 0.05, so we can reject the null hypothesis and conclude that there is a significant difference between the groups. Comparing the group means, the mean decile score for men is higher than the mean decile score for women.
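The p-value alone does not say which group scores higher; that comes from the group means or the sign of the t-statistic. A small sketch on made-up scores, with assumed column names and Welch's t-test as an assumed variant:

```python
import pandas as pd
from scipy import stats

# Made-up decile scores, for illustration only.
df = pd.DataFrame({
    "gender": ["Male"] * 6 + ["Female"] * 6,
    "DecileScore": [5, 7, 6, 8, 4, 9, 2, 3, 1, 4, 2, 3],
})

male = df.loc[df["gender"] == "Male", "DecileScore"]
female = df.loc[df["gender"] == "Female", "DecileScore"]

# Welch's t-test (an assumed choice) for a difference in mean decile score.
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)

# The p-value says whether the means differ; the sign of the difference says which is larger.
mean_diff = male.mean() - female.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, male - female = {mean_diff:.2f}")
```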
Correlation between Age and Decile Score
We see that the decile score for risk of failure to appear increases with age. This was a surprise to me because I was expecting younger people to be more sneaky and have a higher risk of failure to appear. I will now compute the Pearson correlation coefficient.
Since the p-value is low (the usual threshold is 0.05), we can reject the null hypothesis of no correlation, which suggests that there is a statistically significant correlation between age and decile score.
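The correlation coefficient and the slope of a fitted line are related but different quantities. As a sketch on made-up data (column names assumed), `scipy.stats.linregress` returns both, together with the p-value for the null hypothesis of no linear relationship:

```python
import pandas as pd
from scipy import stats

# Made-up ages and decile scores, for illustration only.
df = pd.DataFrame({
    "age": [21, 25, 30, 34, 40, 45, 51, 58, 63, 70],
    "DecileScore": [2, 1, 3, 3, 4, 6, 5, 7, 8, 9],
})

fit = stats.linregress(df["age"], df["DecileScore"])

# rvalue is the Pearson correlation (unitless, between -1 and 1);
# slope is the change in decile score per additional year of age.
print(f"r = {fit.rvalue:.3f}, slope = {fit.slope:.3f}, p = {fit.pvalue:.5f}")
```

The coefficient describes how tightly the points follow a line, while the slope describes how steep that line is, so a strong correlation can come with a small slope.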
Marital Status
We notice that the Married and Significant Other groups have lower scores for risk of failure to appear than the Single, Divorced, Separated, and Widowed groups. Could this be because they have a partner who keeps them accountable and reminds them? For the Married and Significant Other groups, high decile scores are even treated as outliers.
Anova Test of Marital Status and Decile Score
Risk of Violence
Gender
In the plot above there is a clear distinction between lower and higher decile scores. For decile scores 1, 2, 3, and 4, females have the higher percentages, while males have the higher percentages for 5, 6, 7, 8, 9, and 10. This follows the societal view that males are more violent than females. The difference between the mean violence scores of males and females is also higher here.
Correlation between Age and Decile Score
Here we see that, unlike for risk of failure to appear, the correlation between age and decile score is negative. That is, younger people are more likely than older people to have a high decile score for risk of violence. One possible explanation is that younger people have more energy to be violent. Also, the correlation between age and decile score is stronger for risk of violence than for risk of failure to appear.
We see above that the p-value is less than 0.05, which shows that the correlation is significant.
Marital Status
The box plot for risk of violence is more interesting. We see that the median is higher for the Single and Significant Other groups. In the remaining groups, more than half of the population has a score of 1.
Anova Test of Marital Status and Decile Score
Here we see that the p-value is greater than 0.05, which suggests that there is no significant difference between the means of the different groups.
Risk of Recidivism
Gender
Similarly, males are more heavily represented in the higher recidivism scores (7, 8, 9, 10).
Correlation between Age and Decile Score
As seen above, we also have a negative correlation between age and decile score. That is, as people age, their decile score for risk of recidivism tends to decrease.
Marital Status
We also see here that the Single and Significant Other groups have the highest median decile scores. It is worth noting that the Married group has the lowest decile scores in all three models (risk of failure to appear, risk of violence, and risk of recidivism).
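A comparison of medians across models and marital status groups can be sketched with a pivot table. The data below is made up and covers only two models and two groups; the column names are assumptions.

```python
import pandas as pd

# Made-up scores: 3 rows per (model, marital status) pair.
df = pd.DataFrame({
    "DisplayText": ["Risk of Violence"] * 6 + ["Risk of Recidivism"] * 6,
    "marital_status": (["Single"] * 3 + ["Married"] * 3) * 2,
    "DecileScore": [5, 6, 7, 1, 2, 3, 4, 6, 8, 2, 3, 4],
})

# Median decile score per marital status (rows) and model (columns).
medians = df.pivot_table(index="marital_status", columns="DisplayText",
                         values="DecileScore", aggfunc="median")

# For each model, the marital status with the lowest median score.
lowest = medians.idxmin()
print(medians)
print(lowest)
```

Putting the models side by side in one table makes a claim such as "lowest in all three models" directly checkable.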
Anova Test of Marital Status and Decile Score
Conclusion
This research delved into the intricate relationships between demographic factors such as age, gender, and marital status and their impact on COMPAS risk assessment scores. Our analysis revealed several noteworthy findings:

1. Age exhibited varied correlations with different risk models, with older individuals generally showing higher risk of failure to appear but lower risk of violence and recidivism.
2. Gender differences were evident, particularly in the risk of failure to appear and violence models, where males tended to have higher risk scores compared to females.
3. Marital status played a nuanced role in risk assessment scores. Interestingly, married individuals exhibited lower risk scores across multiple models, suggesting a potential stabilizing influence of marital partnership.

These findings challenge some conventional assumptions about risk assessment and highlight the need for more nuanced, data-driven approaches in criminal justice decision-making. They underscore the importance of considering multiple demographic factors in predictive modeling to ensure fair and accurate outcomes.

Further research could explore additional variables, such as socioeconomic status or educational background, to gain a more comprehensive understanding of risk assessment determinants. Additionally, examining the potential biases and limitations of the COMPAS algorithm itself could offer insights into improving its predictive accuracy and fairness.