Project 2: Prison Population Rate and National Happiness
Is prison population rate associated with happiness scores across countries?
Adriana Lasso & Ashlin Enright, MA705, 4/29/2026
The goal of this project is to examine whether countries with higher prison population rates tend to report different happiness scores. To answer this question, we merge two country-year datasets: prison population rates from Our World in Data and happiness scores from the World Happiness Report.
The happiness data comes from the World Happiness Report and uses life evaluation (3-year average) as its main measure of happiness. This score reflects how people evaluate the overall quality of their lives, typically on a scale from 0 to 10, where higher values indicate greater life satisfaction.
Because the two datasets come from different sources, they do not always use the same country names and do not have the same coverage in every year. For that reason, this project first cleans the datasets, standardizes country names, and restricts both datasets to the years with the strongest overlap before performing the merge.
In this project, prison population rate means the number of incarcerated people per 100,000 people in a country’s population. It is not a percentage. For example, a prison population rate of 700 means that about 700 out of every 100,000 people in that country are incarcerated.
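The per-100,000 definition above can be checked with a little arithmetic. The sketch below uses a made-up country (the prisoner and population counts are illustrative, not taken from either dataset), and the helper names are hypothetical.

```python
def prison_rate_per_100k(prisoners: int, population: int) -> float:
    """Return incarcerated people per 100,000 residents."""
    return prisoners * 100_000 / population


def rate_as_percent(rate_per_100k: float) -> float:
    """Convert a per-100,000 rate into a percentage of the population."""
    return rate_per_100k / 1_000


# Hypothetical country: 70,000 people incarcerated out of 10 million residents.
rate = prison_rate_per_100k(prisoners=70_000, population=10_000_000)
percent = rate_as_percent(rate)
print(f"{rate:.0f} per 100,000 = {percent:.1f}% of the population")
```

A rate of 700 therefore corresponds to 0.7% of the population, which is why the rate should not be read as a percentage.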
After merging the data, we use a scatterplot, correlation, and simple linear regression to study the relationship between prison population rate and happiness score.
Summary of findings:
This project finds a weak positive association between prison population rate and national happiness score across the matched country-year observations. The Pearson correlation is about 0.116, and the simple regression coefficient is statistically significant, but the model explains only about 1.3% of the variation in happiness scores. After controlling for year, the relationship remains positive and statistically significant, but still very small. Overall, prison population rate does not appear to be a strong predictor of national happiness.
Data Sources
This project uses two real-world datasets from different sources.
The first dataset is the Prison Population Rate dataset from Our World in Data: https://ourworldindata.org/grapher/prison-population-rate
This dataset reports the number of incarcerated people per 100,000 people in each country. In this project, this variable is used as the explanatory variable.
The second dataset is from the World Happiness Report data-sharing page: https://www.worldhappiness.report/data-sharing/
This dataset reports national happiness scores based on life evaluation averages. In this project, the happiness score is used as the response variable.
These two datasets fit the project because both include country-level data across multiple years. After cleaning country names and keeping the years with the strongest overlap, the datasets were merged by Country and Year. This made it possible to compare prison population rates and happiness scores for the same country in the same year.
Load the two datasets
We begin by loading the prison population rate dataset and the happiness dataset into Python so that we can inspect, clean, and merge them.
The prison dataset contains country-year prison population rates, while the happiness dataset contains country-year happiness scores. Since the two datasets come from different sources, the next step is to keep only the columns needed for the merge and standardize the variable names.
Prepare the datasets for merging
To prepare the data for merging, we keep only the columns needed for this project. We also rename the country and happiness columns so that the two datasets use consistent variable names.
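A minimal sketch of this step is shown below on toy data. The raw column names (`Entity`, `Code`, `Prison population rate`, `Country name`, `Life evaluation (3-year average)`, `Rank`) and the cleaned names (`prison_rate`, `happiness_score`) are assumptions for illustration, and the values are invented; the real files may be labelled differently.

```python
import pandas as pd

# Toy stand-ins for the two raw files (column names and values are assumed).
prison_raw = pd.DataFrame({
    "Entity": ["Norway", "Brazil"],
    "Code": ["NOR", "BRA"],
    "Year": [2018, 2018],
    "Prison population rate": [63.0, 324.0],
})
happy_raw = pd.DataFrame({
    "Country name": ["Norway", "Brazil"],
    "Year": [2018, 2018],
    "Life evaluation (3-year average)": [7.5, 6.3],
    "Rank": [2, 28],
})

# Keep only the merge keys and the analysis variable, then align the names.
prison = prison_raw[["Entity", "Year", "Prison population rate"]].rename(
    columns={"Entity": "Country", "Prison population rate": "prison_rate"}
)
happy = happy_raw[["Country name", "Year", "Life evaluation (3-year average)"]].rename(
    columns={
        "Country name": "Country",
        "Life evaluation (3-year average)": "happiness_score",
    }
)

print(prison.columns.tolist())
print(happy.columns.tolist())
```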
Check for missing values and clean the year variable
Before merging the datasets, we check for missing values in the main variables and ensure that the year variable is stored numerically in both datasets.
These steps remove observations missing key merge information and ensure that both datasets store year consistently as an integer variable.
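As an illustration of these checks, the sketch below counts missing values, drops rows that lack a merge key or the analysis variable, and converts year to an integer. The data frame and its column names are toy assumptions, not the project's actual data.

```python
import numpy as np
import pandas as pd

# Toy data with a missing country, a missing rate, and years stored as text.
prison = pd.DataFrame({
    "Country": ["Norway", "Brazil", None],
    "Year": ["2018", "2019", "2018"],
    "prison_rate": [63.0, np.nan, 100.0],
})

# Count missing values per column before cleaning.
missing_before = prison.isna().sum()
print(missing_before)

# Drop rows missing any key variable, then store year as an integer.
key_cols = ["Country", "Year", "prison_rate"]
prison_clean = prison.dropna(subset=key_cols).copy()
prison_clean["Year"] = pd.to_numeric(prison_clean["Year"], errors="coerce")
prison_clean = prison_clean.dropna(subset=["Year"])
prison_clean["Year"] = prison_clean["Year"].astype(int)

print(prison_clean.dtypes)
```

Using `errors="coerce"` turns unparseable year strings into missing values so they can be dropped explicitly rather than raising an error partway through.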
Identify the years with the strongest overlap
Because not every country appears in both datasets in every year, we next inspect how many countries are available by year in each dataset. This helps us choose years that maximize the overlap between the two sources.
Based on the overlap counts, we keep the years with the strongest country coverage in both datasets. This helps maximize the number of matched country-year observations in the merged dataset.
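One way to compute overlap counts like these is to collect the set of countries per year in each dataset and intersect them. The sketch below uses placeholder country labels, and the cutoff `min_shared` is a hypothetical threshold; the project's actual selection rule is not stated in the text.

```python
import pandas as pd

# Toy country-year coverage for each source.
prison = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "A"],
    "Year": [2018, 2018, 2018, 2019, 2019, 2020],
})
happy = pd.DataFrame({
    "Country": ["A", "B", "A", "B", "C", "B"],
    "Year": [2018, 2018, 2019, 2019, 2019, 2020],
})


def countries_by_year(df: pd.DataFrame) -> dict:
    """Map each year to the set of countries observed in that year."""
    return {int(year): set(group["Country"]) for year, group in df.groupby("Year")}


prison_sets = countries_by_year(prison)
happy_sets = countries_by_year(happy)

# Number of countries present in both datasets, for each shared year.
shared_years = sorted(prison_sets.keys() & happy_sets.keys())
overlap = {year: len(prison_sets[year] & happy_sets[year]) for year in shared_years}

min_shared = 2  # hypothetical cutoff for "strong" overlap
keep_years = [year for year, n in overlap.items() if n >= min_shared]

prison_kept = prison[prison["Year"].isin(keep_years)]
happy_kept = happy[happy["Year"].isin(keep_years)]
print(overlap, keep_years)
```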
Standardize country names
Since the two datasets come from different sources, they do not always use the same country names. Before merging, we check which country names appear in one dataset but not the other. This helps us identify true naming mismatches that need to be standardized.
The unmatched country lists help us identify differences in naming conventions across the two datasets. Some mismatches are caused by different spellings or naming styles for the same country, while others are due to territories, subdivisions, or countries that appear in only one dataset for the selected years.
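The unmatched lists can be produced with two set differences. In the sketch below the country spellings are illustrative assumptions chosen to show both kinds of mismatch (a spelling difference and a territory present in only one source).

```python
import pandas as pd

# Toy country columns; the spellings are assumed for illustration.
prison = pd.DataFrame({"Country": ["United States", "Cote d'Ivoire", "Norway", "Guam"]})
happy = pd.DataFrame({"Country": ["United States", "Côte d’Ivoire", "Norway"]})

prison_countries = set(prison["Country"])
happy_countries = set(happy["Country"])

# Names that appear in one dataset but not the other.
only_in_prison = sorted(prison_countries - happy_countries)
only_in_happy = sorted(happy_countries - prison_countries)

print("Only in prison data:", only_in_prison)
print("Only in happiness data:", only_in_happy)
```

Here the two spellings of Côte d’Ivoire are a true naming mismatch to standardize, while the territory appears in only one source and would remain unmatched after cleaning.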
Fix country-name mismatches
We reviewed the unmatched country lists and standardized country names that clearly referred to the same place. This step improves the quality of the merge by making sure that country-year observations can match correctly across the two datasets.
This first round of replacements addresses several of the biggest naming differences, including alternate spellings, official country names, and regional labels. One especially important case was Côte d’Ivoire, which appeared in different forms across the two datasets.
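A replacement step of this kind is usually written as a dictionary from the nonstandard spelling to the standard one. The mapping below is a hypothetical example; the exact spellings found in the two source files are not listed in the text.

```python
import pandas as pd

# Toy column with nonstandard spellings and stray whitespace.
prison = pd.DataFrame({"Country": ["Cote d'Ivoire", "Swaziland", " Norway "]})

# Hypothetical mapping: source spelling -> standardized name.
name_map = {
    "Cote d'Ivoire": "Côte d’Ivoire",
    "Swaziland": "Eswatini",
}

# Strip whitespace first so that exact-match replacement can work.
prison["Country"] = prison["Country"].str.strip().replace(name_map)
print(prison["Country"].tolist())
```

Names that are not in the mapping pass through unchanged, so the same dictionary can be applied to both datasets.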
Recheck unmatched countries after cleaning
After the first round of name standardization, we check the unmatched country lists again. This helps confirm whether any true naming mismatches still remain.
After the first round of cleaning, a few true naming mismatches remained, so we fixed them in a second round of replacements.
Across both rounds of cleaning, we standardized every true naming mismatch we found, including Côte d’Ivoire, South Korea, Moldova, and Eswatini/Swaziland. The unmatched entries that remained after cleaning were mostly territories, subdivisions, or countries that appeared in only one dataset for the selected years, rather than unresolved naming inconsistencies.
This suggests that the merge keys were cleaned appropriately and that the remaining unmatched observations are mainly due to differences in dataset coverage.
Verify that country-year observations are unique
Because each row in both datasets is supposed to represent one country in one year, we check whether either dataset contains duplicate country-year observations. If the merge keys are unique, then a one-to-one merge is appropriate.
If both duplicate counts are zero, then each dataset contains unique country-year observations and a one-to-one merge is appropriate.
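A duplicate check of this kind can be sketched with `DataFrame.duplicated` on the two key columns. The toy data below deliberately contains one repeated country-year so that the count is nonzero; the column names are assumptions.

```python
import pandas as pd

# Toy data with one repeated country-year observation.
happy = pd.DataFrame({
    "Country": ["Norway", "Norway", "Brazil"],
    "Year": [2018, 2018, 2018],
    "happiness_score": [7.5, 7.5, 6.3],
})

keys = ["Country", "Year"]

# Count rows whose key repeats an earlier row.
dup_count = int(happy.duplicated(subset=keys).sum())

# keep=False flags every row involved in a duplicate, which helps inspection.
dup_rows = happy[happy.duplicated(subset=keys, keep=False)]

print(dup_count)
print(dup_rows)
```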
Merge the datasets
We now merge the prison and happiness datasets using an inner join on country and year. An inner join is appropriate because our goal is to analyze only those country-year observations that appear in both datasets.
We also pass `validate="one_to_one"`, which makes pandas raise an error if any country-year key is duplicated in either dataset, so each observation in one dataset can match at most one observation in the other.
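The sketch below shows an inner merge with `validate="one_to_one"` on toy data, and then what happens when one side contains a duplicated key. The data frames, column names, and values are illustrative assumptions.

```python
import pandas as pd

prison = pd.DataFrame({
    "Country": ["Norway", "Brazil", "Guam"],
    "Year": [2018, 2018, 2018],
    "prison_rate": [63.0, 324.0, 300.0],
})
happy = pd.DataFrame({
    "Country": ["Norway", "Brazil", "Norway"],
    "Year": [2018, 2018, 2019],
    "happiness_score": [7.5, 6.3, 7.4],
})

keys = ["Country", "Year"]

# Inner join: only country-years present in both datasets survive.
merged = prison.merge(happy, on=keys, how="inner", validate="one_to_one")
print(merged)

# With a duplicated key on one side, the validation raises MergeError.
happy_dup = pd.concat([happy, happy.iloc[[0]]], ignore_index=True)
try:
    prison.merge(happy_dup, on=keys, how="inner", validate="one_to_one")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True
print("Duplicate key rejected:", duplicate_caught)
```

The unmatched rows (the territory and the extra year) are dropped silently by the inner join, which is why the unmatched-name review is done before merging.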
Verify the merged dataset
To confirm that the merge worked as expected, we inspect the size of the merged dataset, review its columns, check for missing values in the key analysis variables, and preview the first several rows.
These checks show that the merged dataset contains country-year observations with both a prison population rate and a happiness score. Since the merge keys were unique before merging and the merged variables do not contain missing values, the merge appears to have worked as intended.
Export the merged dataset
After verifying that the merge worked correctly, we export the full merged dataset so it can be reused for later analysis and submitted as a clean combined file.
Load the merged CSV for analysis
To keep the data-preparation stage separate from the analysis stage, we now re-read the merged CSV file and create a smaller analysis dataset containing only the two variables needed for the statistical analysis.
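The export and reload steps can be sketched as a CSV round trip. The file name `merged_prison_happiness.csv` and the column names are hypothetical, and a temporary directory is used here only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy merged dataset standing in for the real one.
merged = pd.DataFrame({
    "Country": ["Norway", "Brazil"],
    "Year": [2018, 2018],
    "prison_rate": [63.0, 324.0],
    "happiness_score": [7.5, 6.3],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "merged_prison_happiness.csv"  # hypothetical name
    merged.to_csv(csv_path, index=False)  # index=False avoids an extra column
    reloaded = pd.read_csv(csv_path)

# Smaller dataset with only the two analysis variables.
analysis = reloaded[["prison_rate", "happiness_score"]].dropna()
print(analysis)
```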
Measure the linear relationship
We now compute the Pearson correlation coefficient and its p-value using the analysis dataset. The correlation coefficient measures the direction and strength of the linear relationship between prison population rate and happiness score, while the p-value helps us assess whether the observed relationship is statistically distinguishable from zero.
The Pearson correlation coefficient is approximately 0.116, which indicates a weak positive linear relationship between prison population rate and happiness score. This means that countries with higher prison population rates tend to have slightly higher happiness scores in this merged dataset, but the relationship is very small.
The p-value is approximately 0.007, which is below 0.05. This means the relationship is statistically significant at conventional significance levels. However, statistical significance does not mean the relationship is strong. In this case, the correlation is still weak, so prison population rate and happiness score are only weakly associated.
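For reference, a Pearson correlation and its p-value can be computed with `scipy.stats.pearsonr`, as in the sketch below. The six data points are invented, so the output will not reproduce the values reported above. The last lines use the reported correlation to show why a simple regression on the same data explains only about 1.3% of the variation.

```python
import numpy as np
from scipy.stats import pearsonr

# Invented values for illustration only (not the merged dataset).
prison_rate = np.array([60.0, 110.0, 150.0, 210.0, 330.0, 640.0])
happiness = np.array([7.4, 6.1, 5.2, 6.6, 6.3, 6.9])

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = pearsonr(prison_rate, happiness)
print(f"r = {r:.3f}, p = {p_value:.3f}")

# In a simple linear regression, R-squared equals the squared correlation.
reported_r = 0.116
reported_r_squared = reported_r ** 2
print(f"r^2 for the reported correlation: {reported_r_squared:.4f}")
```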
Fit a simple linear regression
We next fit a simple linear regression with happiness score as the response variable and prison population rate as the explanatory variable. This helps us estimate the direction of the relationship and see how much variation in happiness score is explained by prison population rate alone.
The linear regression shows a positive slope for prison population rate, which means that countries with higher prison population rates tend to have slightly higher happiness scores in this sample. The slope is statistically significant, which is consistent with the Pearson correlation result.
However, the model’s R-squared is very low, which means prison population rate explains only a small share of the variation in happiness score. In practical terms, this suggests that prison population rate alone is not a strong predictor of happiness.
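A simple linear regression like the one described here can be sketched with `scipy.stats.linregress`, which returns the slope, its p-value, and the correlation from which R-squared follows. This is one possible tool rather than necessarily the one used for the reported results, and the data are invented.

```python
import numpy as np
from scipy.stats import linregress

# Invented values for illustration only (not the merged dataset).
prison_rate = np.array([60.0, 110.0, 150.0, 210.0, 330.0, 640.0])
happiness = np.array([7.4, 6.1, 5.2, 6.6, 6.3, 6.9])

# Response: happiness score. Explanatory variable: prison population rate.
fit = linregress(prison_rate, happiness)

r_squared = fit.rvalue ** 2           # share of variation explained
change_per_100 = fit.slope * 100      # predicted change for +100 per 100,000

print(f"slope = {fit.slope:.5f}, intercept = {fit.intercept:.3f}")
print(f"p-value for slope = {fit.pvalue:.3f}, R-squared = {r_squared:.3f}")
print(f"Predicted change per +100 in rate: {change_per_100:.3f}")
```

Reporting the predicted change per 100-point increase makes a small slope easier to interpret than the raw per-unit coefficient.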
Visualize the fitted regression line
To make the linear relationship easier to interpret, we add a fitted regression line to the scatterplot.
The fitted regression line slopes slightly upward, which matches the positive correlation and regression coefficient. However, the points are widely scattered around the line, which suggests that the relationship is weak even though it is statistically significant.
Improved model: controlling for year
Because the merged dataset includes multiple years, it is possible that average happiness levels differ across years for reasons unrelated to prison population rate. To account for that, we fit a second model that includes year fixed effects.
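Year fixed effects amount to adding one indicator column per year, with one year left out as the baseline. The sketch below builds those indicators with `pd.get_dummies` and solves the least-squares problem with NumPy on synthetic data in which a slope of 0.001 and the year shifts are planted by construction. It estimates coefficients only; p-values would need a statistics package, for example a statsmodels formula such as `happiness_score ~ prison_rate + C(Year)`.

```python
import numpy as np
import pandas as pd

# Synthetic data: happiness = 5 + 0.001 * rate + a year-specific shift.
rates = [100.0, 200.0, 300.0, 400.0]
year_effect = {2018: 0.0, 2019: 0.2, 2020: -0.1}  # planted, not estimated
rows = [
    {"Year": year, "prison_rate": rate, "happiness_score": 5.0 + 0.001 * rate + shift}
    for year, shift in year_effect.items()
    for rate in rates
]
df = pd.DataFrame(rows)

# One indicator per year, dropping the first year (2018) as the baseline.
year_dummies = pd.get_dummies(df["Year"], prefix="year", drop_first=True, dtype=float)

# Design matrix: intercept, prison rate, year indicators.
X = np.column_stack([
    np.ones(len(df)),
    df["prison_rate"].to_numpy(),
    year_dummies.to_numpy(),
])
y = df["happiness_score"].to_numpy()

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
slope = coefs[1]
print(f"prison_rate coefficient holding year constant: {slope:.4f}")
print(dict(zip(year_dummies.columns, np.round(coefs[2:], 3))))
```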
This regression model predicts happiness score using prison population rate while also controlling for year. The coefficient for prison population rate is positive and statistically significant (p ≈ 0.007), which suggests that, holding year constant, countries with higher prison population rates tend to have slightly higher happiness scores in this sample.
However, the size of the effect is small. The estimated coefficient is about 0.0011, which means that even a 100-point increase in prison population rate would correspond to only about a 0.11-point increase in happiness score. In addition, the model’s R-squared is only about 0.015, meaning that the model explains just 1.5% of the variation in happiness score.
The year indicators are not statistically significant, which suggests that average happiness does not differ much across the selected years once prison population rate is included.
Taken together, prison population rate remains statistically significant after controlling for year, but the small coefficient and low R-squared show that it is only a weak predictor of happiness.
This result should not be interpreted as evidence that higher incarceration causes higher happiness. The relationship is weak, and many other social, political, and economic factors likely play a much larger role in shaping national happiness.
Limitations and possible improvements
This project uses prison population rate as the main explanatory variable, but happiness is influenced by many other factors that are not included in the model. A stronger model could include additional variables such as GDP per capita, life expectancy, social support, freedom, or corruption perceptions. Another useful extension would be to add country fixed effects, which would allow the analysis to compare changes within countries over time rather than mainly comparing countries to one another.
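To illustrate the country fixed effects extension, the sketch below applies the within transformation: each variable is demeaned by country, so only changes within a country over time identify the slope. The data are synthetic, with country-specific baseline levels and a planted within-country slope of 0.002. This is not an analysis performed in this project.

```python
import pandas as pd

# Synthetic panel: each country has its own baseline happiness level.
country_level = {"A": 4.0, "B": 6.0, "C": 7.0}
country_rates = {
    "A": [100.0, 150.0, 250.0],
    "B": [300.0, 320.0, 400.0],
    "C": [50.0, 60.0, 90.0],
}
rows = [
    {
        "Country": country,
        "Year": year,
        "prison_rate": rate,
        "happiness_score": country_level[country] + 0.002 * rate,
    }
    for country in country_level
    for year, rate in zip([2018, 2019, 2020], country_rates[country])
]
df = pd.DataFrame(rows)

# Within transformation: subtract each country's own mean.
by_country = df.groupby("Country")
x_dm = df["prison_rate"] - by_country["prison_rate"].transform("mean")
y_dm = df["happiness_score"] - by_country["happiness_score"].transform("mean")

# Slope of the demeaned regression (no intercept needed after demeaning).
within_slope = (x_dm * y_dm).sum() / (x_dm ** 2).sum()
print(f"Within-country slope: {within_slope:.4f}")
```

Because the baseline levels are removed by demeaning, differences between countries that are fixed over time no longer influence the estimated slope.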
Conclusion
This project examined whether countries with higher prison population rates tend to report different happiness scores. To answer this question, we merged prison population rate data with happiness data by country and year, after cleaning country-name differences and restricting the analysis to years with the strongest overlap.
The results showed a weak positive relationship between prison population rate and happiness score. The Pearson correlation was approximately 0.116, with a p-value of about 0.007, indicating that the relationship was statistically significant but very small. A simple regression and a regression controlling for year produced a similar conclusion: prison population rate has only a weak association with happiness and explains very little of the variation in happiness scores.
Overall, prison population rate does not appear to be a strong predictor of national happiness. These findings suggest that happiness is shaped much more strongly by other social, economic, and political factors than by incarceration rates alone.