Effect of GDP and Population on Olympics Performance
Due: Dec 9, 2024, 11:59pm
Hypothesis
Null Hypothesis: There is no correlation between a country's GDP and/or population and their performance at the Olympics.
Alternate Hypothesis: There is a positive correlation between a country's GDP and/or population and their performance at the Olympics.
Data Sets
Three data sets were used to test the hypothesis above. First, we utilized a GDP data set created by the World Bank that compiles historical GDP data using today's USD value, a population data set created by Our World in Data, and an Olympics data set found on data.world which utilized data from the IOC Research and Reference Service.
Importing Data Sets
Goal: Load each of the data sets we will be utilizing in order for us to understand the additional data cleaning that will need to be done to perform merges and further data analysis.
Run to view results
Result: Successfully loaded all 3 data sets into data frames. Identified the need to change the "Medal" column in df_athletes to numeric for statistical analysis to be performed.
Run to view results
Result: Successfully changed "Medal" column to integer, allowing us to sum the column while grouping by NOC and Year. NOC is the three digit code that the IOC uses to identify participating countries.
Merging the data sets
Goal: Fix country names such as "Bahamas, The" to how the country name appears in the "Entity" column in df_pop "Bahamas". This required manual review through the use of Alteryx to identify countries that did not have have a match in both data sets.
Run to view results
Run to view results
Run to view results
Result: Changed various "Country" names to match the desired format. Successfully joined the population and GDP datasets.
Run to view results
Goal: Similar as above, we realized that many of the three digit codes did not match between each dataset. Again, we performed a manual review of the data sets in Alteryx to identify each code that did not properly match.
Run to view results
Run to view results
Result: After manual review, the code above matches the country names for countries that we could find that were in both data sets, but did not share the same name, whether it was due to language differences or formalities. Note that many countries and regions were not included and a center join was performed because we only want data for countries and years where we have both GDP and Population data. Additionally, we had to consider how to evaluate countries that no longer exist, such as West and East Germany, the USSR, Yugoslavia, etc. that either joined to become a bigger country or split into smaller countries. We felt that it would be best to not include most of these countries, especially since GDP and population data are not readily available to compare.
Perform Statistical Analysis
Goal: Test the correlation between the independent variables (GDP & Population) and the dependent variable (Medal count).
Run to view results
Result: Regression shows that GDP is statistically significant, while Population is not. Upon review, we notice that GDP is listed in billions, while population is shown at its true amount. To make the regression more comparable, we feel that applying a logarithmic scale to both independent variables is appropriate.
Goal: Apply a logarithmic scale to the independent variables to make the regression analysis more meaningful.
Run to view results
Result: Applied a logarithmic scale to both the Population and GDP columns, which we believe makes the regression model more representative of the data set. Population still is not statistically significant with a p-value of 0.160 while GDP is statistically significant with a p-value below 0.000.
Excel File for Merged Data Set & Final Report
See the "Homework 4.xlsx" file for the merged data set which will allow you to do your own statistical analysis. Additionally, please see "MA 705 Final Project Report.docx" for our full write up.
Run to view results