Is there a correlation between Covid-19 and World Happiness Parameters?
-Krisha Ameet Panchal
The COVID-19 dataset contains the cumulative number of confirmed cases, per day, in each country. The World Happiness dataset contains scores for various life factors, as rated by the people living in each country around the globe. By merging these two datasets, I study the relationship between GDP per capita, social support, life expectancy, and the mean/max infection rate.
Through this project, I found no evident relationship between these variables. However, the data may not be sufficient, and various other variables may be affecting the relationship between them. A more in-depth study is needed to produce more accurate results.
I imported the Covid-19 dataset from GitHub (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) and the World Happiness dataset from worldhappiness.red
Dataset 1
The first dataset was the Covid-19 dataset. After importing the dataset, I did some data cleaning and conducted some statistical tests:
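The cleaning steps are not spelled out above, so the following is only a minimal sketch of one plausible version. It assumes the column layout of the JHU CSSE time series file (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date) and assumes that the "infection rate" is the day-over-day change in cumulative cases. The toy numbers are invented for illustration and stand in for the real CSV.

```python
import pandas as pd

# Toy stand-in for the JHU CSSE time series: one row per region,
# one column per date holding the cumulative confirmed count.
# (Invented numbers; the real file would be loaded with pd.read_csv.)
raw = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["Italy", "China", "China"],
    "Lat": [41.9, 30.9, 40.2],
    "Long": [12.6, 112.3, 116.4],
    "1/22/20": [0, 10, 2],
    "1/23/20": [2, 30, 4],
    "1/24/20": [5, 60, 4],
})

# Drop columns that are not needed for a per-country analysis.
covid = raw.drop(columns=["Province/State", "Lat", "Long"])

# Sum the provinces so each country has a single cumulative series.
covid = covid.groupby("Country/Region").sum()

# Assumption: daily new cases = difference between consecutive days.
# The first date has no previous day, so its (NaN) column is dropped.
daily = covid.diff(axis=1).iloc[:, 1:]

# One row per country with the mean and max of the daily new cases.
covid_summary = pd.DataFrame({
    "Mean infection rate": daily.mean(axis=1),
    "Max infection rate": daily.max(axis=1),
}).rename_axis("Country").reset_index()

print(covid_summary)
```

Naming the key column `Country` here is a choice made for the sketch so that the later merge has a common column.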
Dataset 2
With this, the data cleaning for dataset 1 is complete. Next, I imported dataset 2: the World Happiness data.
To clean the data, I dropped all the columns that were irrelevant to my analysis. The columns that remained were Country name, Logged GDP per capita, Social Support and Life Expectancy. I also renamed the column 'Country name' to 'Country' so I could merge the two datasets on a common variable with ease.
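As a rough sketch of this step, the snippet below keeps the four columns of interest and renames the country column. The exact column labels (`Logged GDP per capita`, `Social support`, `Healthy life expectancy`) and the extra columns being dropped are assumptions about the file, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the World Happiness data (illustrative values only).
happy = pd.DataFrame({
    "Country name": ["Finland", "Italy", "China"],
    "Regional indicator": ["Western Europe", "Western Europe", "East Asia"],
    "Logged GDP per capita": [10.8, 10.6, 9.7],
    "Social support": [0.95, 0.88, 0.81],
    "Healthy life expectancy": [72.0, 73.8, 69.6],
    "Generosity": [-0.10, -0.08, -0.15],
})

# Keep only the columns used in the analysis.
keep = [
    "Country name",
    "Logged GDP per capita",
    "Social support",
    "Healthy life expectancy",
]
happy = happy[keep]

# Rename the key column so both datasets share the name 'Country'.
happy = happy.rename(columns={"Country name": "Country"})

print(happy.columns.tolist())
```

Selecting the columns to keep, rather than listing every column to drop, keeps the step short and makes it obvious which variables enter the analysis.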
Merged Data
I then did an inner join to merge the data on 'Country'. The resulting DataFrame contains only the rows whose 'Country' value is present in both df1 and df2, and it includes the columns from both DataFrames for those matching rows. The resulting output is as follows:
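A minimal sketch of the join, using `df1` and `df2` as named above but with tiny invented frames. It also shows one way to see which countries an inner join silently drops, which matters when the two sources spell a country name differently.

```python
import pandas as pd

# Invented stand-ins for the cleaned Covid-19 (df1) and happiness (df2) data.
df1 = pd.DataFrame({
    "Country": ["China", "Italy", "Atlantis"],
    "Max infection rate": [30.0, 3.0, 1.0],
})
df2 = pd.DataFrame({
    "Country": ["Finland", "Italy", "China"],
    "Social support": [0.95, 0.88, 0.81],
})

# Inner join: keep only countries that appear in both frames.
merged = df1.merge(df2, on="Country", how="inner")

# Diagnostic: an outer join with indicator=True labels each row as
# 'both', 'left_only' or 'right_only', exposing the unmatched countries.
check = df1.merge(df2, on="Country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "Country"].tolist()

print(merged)
print("Dropped by the inner join:", sorted(unmatched))
```

Reviewing the unmatched list before analysing the merged data is a cheap way to catch countries lost to naming differences rather than to missing data.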
I created a correlation matrix to provide a concise and visual summary of the relationships between variables, aiding in data exploration, understanding the data structure, and generating insights for further analysis.
The correlation matrix is a square matrix where each cell represents the correlation coefficient between two variables. The correlation coefficient measures the linear relationship between two variables and ranges from -1 to 1. A positive value indicates a positive correlation, a negative value indicates a negative correlation, and a value close to zero suggests a weak or no correlation.
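For illustration, a correlation matrix like the one described can be computed with `DataFrame.corr`. The frame below is a made-up stand-in for the merged data; in it, social support is an exact linear function of GDP, so that pair correlates at 1.

```python
import pandas as pd

# Invented stand-in for the merged data (numeric columns only).
merged = pd.DataFrame({
    "Logged GDP per capita": [9.0, 10.0, 11.0, 10.5],
    "Social support": [0.70, 0.80, 0.90, 0.85],
    "Mean infection rate": [40.0, 10.0, 30.0, 20.0],
})

# Pairwise Pearson correlation coefficients; the result is a square,
# symmetric DataFrame with ones on the diagonal.
corr = merged.corr(method="pearson")

print(corr.round(2))
```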
This led me to conduct hypothesis testing:
H0 = There is no correlation between Logged GDP per capita and Mean infection rate
H1 = There is a significant correlation between Logged GDP per capita and Mean infection rate
Since the points in the scatter plot show no discernible pattern, there is no evident relationship between Logged GDP per capita and the Mean infection rate, and hence we fail to reject the null hypothesis.
My next thought was to study the relationship between social support and the max infection in a country in 2021. Hence, the next hypothesis test is:
H0 = There is no correlation between social support and max covid infection
H1 = There is a significant correlation between social support and max covid infection
The above scatter plot shows an extremely weak positive correlation between social support and max infection rate. This is not enough to tell whether developed countries were worse affected during the pandemic. Thus, we fail to reject the null hypothesis.
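The conclusion above is drawn by eye from the scatter plot. As a hedged sketch of how it could be backed by a number, the snippet below runs a permutation test on the Pearson coefficient. This technique is swapped in for illustration and is not what the analysis above used. The function name, the data values and the 0.05 threshold are all assumptions.

```python
import numpy as np

def pearson_permutation_test(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation test for H0: no linear correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r_observed = np.corrcoef(x, y)[0, 1]
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling y breaks any real pairing between x and y.
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_observed):
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (exceed + 1) / (n_permutations + 1)
    return r_observed, p_value

# Invented values standing in for the merged columns.
social_support = [0.95, 0.88, 0.81, 0.70, 0.92, 0.65, 0.77, 0.85]
max_infection = [120.0, 300.0, 80.0, 150.0, 60.0, 200.0, 310.0, 90.0]

r, p = pearson_permutation_test(social_support, max_infection)
alpha = 0.05  # assumed significance level
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"r = {r:.3f}, p = {p:.3f}: {decision}")
```

The p-value estimates how often a correlation at least as strong as the observed one would arise if the two columns were unrelated. A large p-value means the data give no grounds to reject H0, which is the same kind of conclusion reached above.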
The graph below shows the top 20 countries with the highest social support and their corresponding mean infection rates.
This graph is also consistent with our earlier observation that we cannot tell whether developed countries were infected more than underdeveloped or developing countries during the pandemic.
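A small sketch of how such a ranking could be selected before plotting. It uses an invented five-country frame and picks the top 3 instead of the top 20, and the column names are assumptions about the merged data.

```python
import pandas as pd

# Invented stand-in for the merged data.
merged = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Social support": [0.60, 0.95, 0.80, 0.90, 0.70],
    "Mean infection rate": [12.0, 30.0, 8.0, 15.0, 22.0],
})

# nlargest returns the n rows with the highest values in the given
# column, already sorted in descending order.
top = merged.nlargest(3, "Social support")[
    ["Country", "Social support", "Mean infection rate"]
]

print(top)
# With matplotlib installed, a bar chart could then be drawn with:
# top.plot.bar(x="Country", y="Mean infection rate")
```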
Next, I studied two other variables: healthy life expectancy and the mean/max infection rate. I created scatter plots, which show no correlation between the two variables.
In order to confirm my analysis, I created a heat map and formed the following hypothesis:
H0 = There is no correlation between healthy life expectancy and mean/max infections
H1 = There is a significant correlation between the variables
Interpreting the heatmaps involves understanding the correlation between variables. The correlation values in the heatmap range from -1 to 1, where:
A value of 1 indicates a perfect positive correlation, meaning the variables increase or decrease together. A value of -1 indicates a perfect negative correlation, meaning one variable increases while the other decreases. A value close to 0 indicates a weak or no correlation, meaning the variables are not strongly related.
The correlation between life expectancy and mean infections is not strong, indicating that other factors may also influence the mean infection rate. The correlation between max infection rates and life expectancy is not strong either, indicating that other factors may also contribute to the maximum infection rates.
Thus, we fail to reject the null hypothesis.
Conclusion
Keep in mind that correlation does not imply causation. While these heatmaps and scatter plots show the relationship between variables, they do not establish a cause-and-effect relationship. Other factors and confounding variables may be influencing the relationships observed in the plots.