MA346 - Project 1 | Mathaus Silva
First, we import the required libraries and read the US states vaccination dataset into the DataFrame 'df_vacc'.
No output from the cell above indicates that it ran without errors. However, the 'date' column was read in with the generic 'object' data type, so we need to convert it to 'datetime'.
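A minimal sketch of this conversion, assuming pandas' `to_datetime` is used. The toy rows and the 'location' column name are placeholders for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for 'df_vacc'; the values are placeholders, not real data.
df_vacc = pd.DataFrame({
    "date": ["2021-01-12", "2021-01-13", "2021-01-14"],
    "location": ["Massachusetts"] * 3,  # assumed column name
    "people_vaccinated_per_hundred": [1.0, 1.2, 1.5],
})

# The 'date' column starts out holding plain strings.
# Parse those strings into datetime values in place.
df_vacc["date"] = pd.to_datetime(df_vacc["date"])

print(df_vacc.dtypes)
```

Once the column holds datetime values, it can serve as a time index for plotting and date arithmetic.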
The 'date' column now holds datetime values rather than plain strings. Next, we write a function that takes a state name and a column of the DataFrame 'df_vacc' as input and returns a pandas Series.
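One possible shape for such a function is sketched below. The name `extract_data_sketch`, the explicit `df` argument, and the 'location' column name are assumptions for illustration; the notebook's own 'extract_data' may differ.

```python
import pandas as pd

def extract_data_sketch(df, state, column, state_col="location", date_col="date"):
    """Return `column` for one state as a Series indexed by date."""
    rows = df[df[state_col] == state]        # keep only the chosen state
    return rows.set_index(date_col)[column]  # index the chosen column by date

# Toy data with placeholder values, not real vaccination figures.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-12", "2021-01-13", "2021-01-12"]),
    "location": ["Massachusetts", "Massachusetts", "Alabama"],
    "people_vaccinated_per_hundred": [1.0, 1.2, 0.8],
})

mass_toy = extract_data_sketch(toy, "Massachusetts", "people_vaccinated_per_hundred")
print(mass_toy)
```

Passing the DataFrame explicitly keeps the helper independent of any global variable, which makes it easier to test on small inputs.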
Defining the function produces no output, which indicates that it ran without errors. To check that 'extract_data' works, we test it on the 'people_vaccinated_per_hundred' column for the state of Massachusetts.
The pandas Series 'mass' contains the chosen column indexed by date. The number of people vaccinated per hundred clearly changes from one date to the next. Next, we plot the series to better visualize how the number of people vaccinated per hundred changes over time.
Shaped like an 'S', the line graph shows the percent of the population vaccinated rising steadily from 2021-01-12 to 2021-05-21. Following the Chapter 9 course notes on SciPy's curve_fit, we want a function that takes any pandas Series and fits a one-variable model to it. The function takes the series as input and returns three 'β' values; if the fit fails, the except branch returns three np.nan values instead. We define our initial guesses as follows: 'β0' = the maximum number of vaccinations achieved so far, 'β1' = 1, and 'β2' = the time that is halfway from the start of the data to the end of the data (the length of the data divided by two).
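The fitting step described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's own code: the names `logistic_sketch` and `fit_betas` are hypothetical, the logistic form is assumed, and x is assumed to count observations from the start of the series.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def logistic_sketch(x, b0, b1, b2):
    # Assumed logistic form: ceiling b0, growth rate b1, midpoint b2.
    return b0 / (1 + np.exp(b1 * (-x + b2)))

def fit_betas(series, model=logistic_sketch):
    """Fit `model` to a Series; return three betas, or three NaNs on failure."""
    y = series.dropna().to_numpy(dtype=float)
    x = np.arange(len(y), dtype=float)  # 0, 1, 2, ... since the first observation
    try:
        # Initial guesses as described in the text: max so far, 1, half the length.
        guess = [y.max(), 1.0, len(y) / 2]
        betas, _ = curve_fit(model, x, y, p0=guess)
        return tuple(betas)
    except Exception:
        return (np.nan, np.nan, np.nan)

# Synthetic, noise-free series generated from known parameters (not real data).
synthetic = pd.Series(logistic_sketch(np.arange(60), 50.0, 0.1, 30.0))
betas = fit_betas(synthetic)

# An empty series cannot be fitted, so the except branch returns NaNs.
failed = fit_betas(pd.Series([], dtype=float))
print(betas, failed)
```

Returning NaN values instead of raising lets a loop over many states continue even when one fit does not converge.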
Defining the fitting function produces no output, which indicates no errors. With β0, β1, and β2 available, we next want a function that takes the three betas and 'x' and returns the value of a logistic curve.
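As a rough illustration of what the three betas mean, the sketch below assumes the common logistic form β0 / (1 + exp(β1(β2 - x))); the notebook's exact formula is not shown, so treat this form and the name `logistic_sketch` as assumptions.

```python
import numpy as np

def logistic_sketch(x, b0, b1, b2):
    """Assumed logistic curve: b0 is the ceiling, b1 the steepness, b2 the midpoint."""
    return b0 / (1 + np.exp(b1 * (-x + b2)))

# Illustrative parameter values only.
half = logistic_sketch(30.0, 50.0, 0.1, 30.0)   # at x == b2 the curve is at half its ceiling
late = logistic_sketch(300.0, 50.0, 0.1, 30.0)  # long after the midpoint it approaches b0
print(half, late)
```

Under this form, β0 is the level at which the curve flattens out, β1 controls how quickly it rises, and β2 is the time at which half of β0 is reached.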
Again, no output indicates no errors. We then write a function that plots the logistic model for Massachusetts vaccinations against the actual data.
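A plotting helper along these lines might look like the sketch below. It assumes matplotlib and the logistic form β0 / (1 + exp(β1(β2 - x))); the name `plot_fit` and the synthetic series are hypothetical, and the notebook's own 'curve' function may be written differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_fit(series, b0, b1, b2):
    """Plot the observed series and the assumed logistic model on one set of axes."""
    y = series.dropna()
    x = np.arange(len(y))                      # observation number since the start
    model = b0 / (1 + np.exp(b1 * (-x + b2)))  # assumed logistic form
    fig, ax = plt.subplots()
    ax.plot(y.index, y.to_numpy(), label="actual")
    ax.plot(y.index, model, label="logistic model")
    ax.set_xlabel("date")
    ax.set_ylabel(series.name or "")
    ax.legend()
    return ax

# Synthetic series for illustration only (not real vaccination data).
dates = pd.date_range("2021-01-12", periods=60, freq="D")
values = 50.0 / (1 + np.exp(0.1 * (-np.arange(60) + 30.0)))
toy = pd.Series(values, index=dates, name="people_vaccinated_per_hundred")

ax = plot_fit(toy, 50.0, 0.1, 30.0)
```

Drawing both lines on the same axes makes any gap between the model and the observations visible at a glance.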
Again, no output indicates no errors. We now have the 'curve' function to plot the logistic model, but before plotting we must drop all NaN values from the series and call the beta function to obtain β0, β1, and β2. We can then pass the clean Massachusetts series and the three betas to the function 'curve'.
The plot above compares Massachusetts' actual vaccination data with the logistic model, and the model follows the actual data closely. Using the beta function, we now want to build a table with one row for each US state, containing the state's name and its three β values. First, we need lists of all the state names and abbreviations for the new DataFrame.
['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California']
['AL', 'AK', 'AZ', 'AR', 'CA']
To check the lists, we printed the first five state names and abbreviations. Next, we need a list of each of the three betas for every state in the states list.
[38.19129854399601, 46.67701590793483, 47.68527115448869, 42.49499469720811, 63.78999988469429]
[0.03912295477481297, 0.03609232128873301, 0.03984067061206364, 0.03684363636911797, 0.03720972881751967]
[60.50876153031706, 45.68956802408799, 60.52793622230523, 60.76325094896763, 74.55222353700192]
To check the results, we printed the first five values of each β list. From those five lists, we can finally build our table with one row for each US state.
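Building a table from parallel lists can be sketched as below, using the first five abbreviations and β values printed above (rounded). The column names and the variable name `df_beta_sketch` are assumptions, and the sketch assumes the printed values line up with the printed abbreviations in order.

```python
import pandas as pd

# First five entries from the printed lists above, rounded for readability.
abbrevs = ["AL", "AK", "AZ", "AR", "CA"]
beta0 = [38.19, 46.68, 47.69, 42.49, 63.79]
beta1 = [0.0391, 0.0361, 0.0398, 0.0368, 0.0372]
beta2 = [60.51, 45.69, 60.53, 60.76, 74.55]

# One row per state, one column per list (column names are assumed).
df_beta_sketch = pd.DataFrame({
    "state": abbrevs,
    "b0": beta0,
    "b1": beta1,
    "b2": beta2,
})
print(df_beta_sketch)
```

A dictionary of equal-length lists maps directly onto columns, so the lists must stay in the same state order for the rows to be correct.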
The DataFrame 'df_beta' holds the abbreviation, β0, β1, and β2 for each state. It is not finished yet: we still need to merge in each state's political alignment and rename columns for consistency.
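A merge and rename of this kind might look like the sketch below. The state codes, vote percentages, and all column names are hypothetical placeholders, not real election data, and the notebook's own merge keys may differ.

```python
import pandas as pd

# Placeholder tables with made-up codes and values, for illustration only.
df_beta_toy = pd.DataFrame({"abbrev": ["AA", "BB", "CC"], "b0": [40.0, 50.0, 60.0]})
df_votes_toy = pd.DataFrame({
    "state_code": ["AA", "BB"],  # "CC" is deliberately missing
    "trump": [60.0, 50.0],
    "clinton": [35.0, 45.0],
})

df_final_toy = (
    df_beta_toy
    .merge(df_votes_toy, left_on="abbrev", right_on="state_code", how="left")
    .drop(columns="state_code")  # the key appears twice after the merge
    .rename(columns={"abbrev": "state", "trump": "trump_pct", "clinton": "clinton_pct"})
)
print(df_final_toy)
```

With a left merge, a state that has no match in the vote table keeps its row but receives NaN in the vote columns, so those rows need to be handled before any statistical test.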
'df_final' now holds the abbreviation, β0, β1, β2, and the 2016 election vote percentages for each state. From this DataFrame, we build a correlation coefficient heat map to visualize the results and to identify which pairs of variables correlate most strongly, and to what degree.
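The numbers behind such a heat map come from a correlation matrix, which can be sketched with pandas' `DataFrame.corr`. The toy values and column names below are placeholders, not the notebook's data.

```python
import pandas as pd

# Placeholder values for illustration only.
toy = pd.DataFrame({
    "b0": [40.0, 50.0, 60.0, 55.0],
    "trump_pct": [60.0, 50.0, 40.0, 48.0],
    "clinton_pct": [35.0, 45.0, 55.0, 47.0],
})

# Pairwise Pearson correlation between every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))

# A single pair can be read off by row and column label.
b0_vs_trump = corr.loc["b0", "trump_pct"]
```

A heat map simply colors each cell of this matrix, so the diagonal is always 1 and the matrix is symmetric.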
The correlation coefficient heat map shows the correlation between every pair of variables, on a scale from -1 to 1. The strongest correlation (-0.85) is between the Donald Trump vote percentage and β0, the maximum number of vaccinations achieved so far. In other words, states with a higher maximum number of vaccinations tend to have a lower Trump vote percentage. Because this is the strongest correlation, we want to test the null hypothesis that β0 is the same for states with high and low Trump vote percentages. First, we define the high and low Trump vote percentage groups, making sure there are no NaN values.
Again, no output indicates no errors. Using the high and low groups, we can now conduct our t-test.
6.02236572415464e-08
The 't_test' result of True indicates that our p-value (about 6.02e-08) is less than alpha at 5%. As the result is statistically significant, we reject the null hypothesis and conclude that the maximum number of vaccinations achieved so far differs between states with high and low Trump vote percentages.
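The test above can be sketched with SciPy's `ttest_ind`. The cutoff of 50, the choice of Welch's unequal-variance version, the column names, and the toy values are all assumptions for illustration; the notebook's own grouping rule is not shown.

```python
import pandas as pd
from scipy import stats

# Placeholder values, not real data; one row has a missing b0 and is dropped.
toy = pd.DataFrame({
    "b0": [60.0, 62.0, 61.0, 63.0, 40.0, 42.0, 41.0, 43.0, None],
    "trump_pct": [35.0, 38.0, 40.0, 42.0, 55.0, 58.0, 60.0, 62.0, 50.0],
}).dropna()

cutoff = 50.0  # hypothetical threshold separating high and low Trump vote share
high = toy.loc[toy["trump_pct"] > cutoff, "b0"]
low = toy.loc[toy["trump_pct"] <= cutoff, "b0"]

alpha = 0.05
# Two-sample t-test on b0; equal_var=False selects Welch's version (an assumption).
result = stats.ttest_ind(high, low, equal_var=False)
reject_null = result.pvalue < alpha
print(result.pvalue, reject_null)
```

Comparing the p-value with alpha yields a single True or False, which is a convenient way to report whether the null hypothesis is rejected.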