Is there a correlation between Covid-19 and World Happiness Parameters?

-Krisha Ameet Panchal

The COVID-19 dataset contains the cumulative number of confirmed cases per day in each country. The World Happiness dataset contains scores for various life factors, as reported by people living in each country around the globe. By merging these two datasets, I study the relationship between GDP per capita, social support, healthy life expectancy and the mean/max infection counts.

Through this project, I found no clear relationship between these variables. However, the data may not be sufficient: various other variables may be affecting the relationship, and a more in-depth study is needed to reach more reliable results.

I imported the Covid-19 dataset from GitHub (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) and the World Happiness dataset from worldhappiness.red

Dataset 1

The first dataset was the Covid-19 dataset. After importing the dataset, I did some data cleaning and conducted some statistical tests:

import pandas as pd
# Load the cumulative confirmed-case counts and preview the first rows
corona_df = pd.read_csv('Covid-19.csv')
corona_df.head(5)

The data ranged from 2020 to 2023, but my main focus was on 2021, so I dropped the columns for 2020, 2022 and 2023.
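The original post does not show the code for this step. A minimal sketch, assuming the date columns follow the month/day/two-digit-year labels used in the JHU file (for example '1/22/20'), is to keep only the labels that end in '/21'. The tiny DataFrame and the name `corona_2021` are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the JHU file (made-up countries and counts).
# The real CSV has one column per day, labelled month/day/two-digit-year.
corona_df = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Aland", "Borduria"],
    "Lat": [0.0, 1.0],
    "Long": [0.0, 1.0],
    "12/31/20": [10, 20],
    "1/1/21": [12, 25],
    "12/31/21": [90, 140],
    "1/1/22": [95, 150],
})

# Identifier columns that are not dates
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]

# Keep only the date columns whose label ends with the year 21
date_columns_2021 = [c for c in corona_df.columns if c.endswith("/21")]
corona_2021 = corona_df[id_columns + date_columns_2021]
```

Selecting the columns to keep, rather than listing the ones to drop, avoids having to enumerate every day of 2020, 2022 and 2023.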

corona_df.describe()

corona_df.info()

I dropped the columns that were unnecessary for my analysis and grouped the data by country.
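The grouping step could look roughly like the sketch below. It assumes the unnecessary columns are 'Province/State', 'Lat' and 'Long', and that provinces are summed into one row per country; the post does not state either choice, and the sample rows are invented.

```python
import pandas as pd

# Made-up sample: one country split into two provinces, one reported as a whole
corona_2021 = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["Aland", "Aland", "Borduria"],
    "Lat": [0.0, 0.5, 1.0],
    "Long": [0.0, 0.5, 1.0],
    "1/1/21": [5, 7, 25],
    "1/2/21": [6, 9, 30],
})

by_country = (
    corona_2021
    .drop(columns=["Province/State", "Lat", "Long"])  # not needed for the analysis
    .groupby("Country/Region")                        # one group per country
    .sum()                                            # add up the provinces
    .rename_axis("Country")                           # match the happiness data's key
)
```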

Calculating the mean allows me to understand the average number of infections per day over a specific time period. This helps in assessing the overall impact of the virus and provides a measure of the disease's prevalence. Finding the maximum number of infections helps to identify the peak of the outbreak. This information is crucial for understanding the scale of the virus's impact and assessing the healthcare system's capacity to handle a surge in cases.
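Because the source data is cumulative, one way to get per-day figures is to take day-to-day differences first. The sketch below is one reading of this step, not necessarily the calculation used in the project; the column names 'Mean infection rate' and 'Max infection rate' and the sample numbers are assumptions.

```python
import pandas as pd

# Made-up cumulative counts, one row per country, one column per day
by_country = pd.DataFrame(
    {"1/1/21": [12, 25], "1/2/21": [15, 30], "1/3/21": [21, 32]},
    index=pd.Index(["Aland", "Borduria"], name="Country"),
)

# Daily new cases = difference between consecutive days.
# The first day has no previous day, so its column is dropped.
daily_new = by_country.diff(axis=1).iloc[:, 1:]

new_corona = pd.DataFrame({
    "Mean infection rate": daily_new.mean(axis=1),  # average new cases per day
    "Max infection rate": daily_new.max(axis=1),    # peak new cases in a single day
}).reset_index()
```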

Dataset 2

With this, the data cleaning for dataset 1 is complete. Next, I imported dataset 2, the World Happiness data:

happiness_df = pd.read_csv('World Happiness Data 2021.csv')
happiness_df.head(5)

To clean the data, I dropped all the columns that were irrelevant for my analysis. The columns that remained were Country name, Logged GDP per capita, Social support and Healthy life expectancy. I also renamed the column 'Country name' to 'Country' so that I could easily merge the two datasets on a common variable.

columns_to_keep = ['Country name','Logged GDP per capita','Social support','Healthy life expectancy']
# Drop every column that is not in columns_to_keep
happiness_df = happiness_df.drop(columns=happiness_df.columns.difference(columns_to_keep))
# Rename the key column so it matches the Covid-19 data
happiness_df = happiness_df.rename(columns={'Country name': 'Country'})
happiness_df

Merged Data

I then did an inner join to merge the data on 'Country'. The resulting DataFrame contains only the rows whose 'Country' value is present in both DataFrames, and it includes the columns from both for those matching rows. The resulting output is as follows:

data = pd.merge(new_corona, happiness_df, on='Country', how='inner')
data
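An inner join silently drops any country whose name is spelled differently in the two files. As a hedged sketch, an outer merge with `indicator=True` can list the rows that would be lost; the sample rows are illustrative, and `check` and `unmatched` are names introduced here.

```python
import pandas as pd

# Illustrative rows: the same country spelled two different ways
new_corona = pd.DataFrame({
    "Country": ["US", "Aland"],
    "Mean infection rate": [100.0, 4.5],
})
happiness_df = pd.DataFrame({
    "Country": ["United States", "Aland"],
    "Social support": [0.92, 0.80],
})

# An outer merge keeps every row; the _merge column records where it came from
check = pd.merge(new_corona, happiness_df, on="Country", how="outer", indicator=True)

# Rows found in only one dataset would be dropped by the inner join
unmatched = check.loc[check["_merge"] != "both", ["Country", "_merge"]]
print(unmatched)
```

Names that show up here can be harmonised with `replace` before the inner join, if keeping those countries matters for the analysis.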

I created a correlation matrix to provide a concise and visual summary of the relationships between variables, aiding in data exploration, understanding the data structure, and generating insights for further analysis.

The correlation matrix is a square matrix where each cell represents the correlation coefficient between two variables. The correlation coefficient measures the linear relationship between two variables and ranges from -1 to 1. A positive value indicates a positive correlation, a negative value indicates a negative correlation, and a value close to zero suggests a weak or no correlation.

# 'Country' is text, so restrict the correlation to the numeric columns
data.corr(numeric_only=True)
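To make one cell of the matrix concrete, the Pearson coefficient can be computed by hand and compared with what pandas returns. This is a sketch on invented numbers; `numeric_only=True` assumes a reasonably recent pandas (1.5 or later).

```python
import numpy as np
import pandas as pd

# Made-up values purely for illustration
data = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia", "Dorne"],
    "Social support": [0.95, 0.60, 0.88, 0.91],
    "Max infection rate": [500.0, 300.0, 800.0, 200.0],
})

x = data["Social support"]
y = data["Max infection rate"]

# Pearson r = sum of products of deviations / sqrt(product of summed squared deviations)
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same number read off the correlation matrix
r_pandas = data.corr(numeric_only=True).loc["Social support", "Max infection rate"]
```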

This led me to conduct hypothesis testing:

H0 = There is no correlation between Logged GDP per capita and Mean infection rate

H1 = There is a significant correlation between Logged GDP per capita and Mean infection rate

Since the points in the scatter plot are randomly distributed, there is no evident relationship between Logged GDP per capita and the mean infection rates, and hence we fail to reject the null hypothesis.
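A scatter plot gives a visual impression only. One way to back it with a number is a Pearson test, sketched here with `scipy.stats.pearsonr`, which takes "no linear correlation" as its null hypothesis. The values and the 0.05 significance level are assumptions for illustration, not results from the project.

```python
import pandas as pd
from scipy import stats

# Invented values; in the project these would come from the merged DataFrame
data = pd.DataFrame({
    "Logged GDP per capita": [7.5, 8.1, 8.9, 9.4, 10.0, 10.6, 11.0, 11.4],
    "Mean infection rate": [120.0, 95.0, 300.0, 80.0, 410.0, 150.0, 60.0, 220.0],
})

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(
    data["Logged GDP per capita"], data["Mean infection rate"]
)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"r = {r:.3f}, p = {p_value:.3f}: {decision}")
```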

My next thought was to study the relationship between social support and the max infection in a country in 2021. Hence, the next hypothesis test is:

H0 = There is no correlation between social support and max covid infection

H1 = There is a significant correlation between social support and max covid infection

The above scatter plot shows an extremely weak positive correlation between social support and max infection rate. This is not enough to tell whether developed countries were more severely affected during the pandemic. Thus, we fail to reject the null hypothesis.

The graph below shows the 20 countries with the highest social support and their corresponding mean infection rates.

This graph is also consistent with the earlier finding: we cannot tell whether developed countries were infected more than underdeveloped or developing countries during the pandemic.
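The selection behind such a graph can be sketched with `nlargest`. The helper `top_by_social_support` is a hypothetical name, the column 'Mean infection rate' is assumed, and the five sample rows are invented.

```python
import pandas as pd

def top_by_social_support(data, n=20):
    """Return the n rows with the highest social support score."""
    columns = ["Country", "Social support", "Mean infection rate"]
    return data.nlargest(n, "Social support")[columns].reset_index(drop=True)

# Made-up sample; the project would pass the merged DataFrame with n=20
data = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia", "Dorne", "Elbonia"],
    "Social support": [0.95, 0.60, 0.88, 0.91, 0.72],
    "Mean infection rate": [4.5, 3.5, 9.0, 1.2, 6.1],
})

top = top_by_social_support(data, n=3)
print(top)
```

The result can then be drawn as a bar chart, for instance with `top.plot.bar(x="Country", y="Mean infection rate")`.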

Next, I studied the relationship between healthy life expectancy and the mean/max infection rate. I created scatter plots, which show no correlation between the variables.

In order to confirm my analysis, I created a heat map and formed the following hypothesis:

H0 = There is no correlation between healthy life expectancy and mean/max infections

H1 = There is a significant correlation between the variables

Interpreting the heatmaps involves understanding the correlation between variables. The correlation values in the heatmap range from -1 to 1, where:

A value of 1 indicates a perfect positive correlation, meaning the variables increase or decrease together.

A value of -1 indicates a perfect negative correlation, meaning one variable increases while the other decreases.

A value close to 0 indicates a weak or no correlation, meaning the variables are not strongly related.

The correlation between life expectancy and mean infections is not especially strong, indicating that other factors may also influence the mean infection rate. The correlation between max infection rates and life expectancy is not especially strong either, indicating that other factors may also contribute to the maximum infection rates.

Thus, we fail to reject the null hypothesis.
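The plotting code for the heatmap is not shown in the post. As a rough sketch using plain matplotlib (the project may well have used another library), a correlation matrix can be drawn with a colour scale fixed to the -1 to 1 range described above. All numbers and the infection column names are invented.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only
data = pd.DataFrame({
    "Healthy life expectancy": [72.0, 55.3, 68.1, 74.9, 61.0],
    "Mean infection rate": [4.5, 3.5, 9.0, 1.2, 6.1],
    "Max infection rate": [40.0, 22.0, 95.0, 10.0, 51.0],
})

corr = data.corr()

fig, ax = plt.subplots()
# Fix the colour scale to [-1, 1] so colours are comparable between heatmaps
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)

# Write the coefficient inside each cell
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson r")
fig.tight_layout()
```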

Conclusion

Keep in mind that correlation does not imply causation. While these heatmaps and scatter plots show the relationship between variables, they do not establish a cause-and-effect relationship. Other factors and confounding variables may be influencing the relationships observed in the plots.
