# MA346 - Project 1 | Mathaus Silva

First, we begin by importing the appropriate libraries and reading the US states vaccination dataset into 'df_vacc'.

```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
from scipy.optimize import curve_fit

df_vacc = pd.read_csv('us_state_vaccinations.csv')
df_vacc
```

The cell above displays the DataFrame, confirming that the file loaded without errors.
However, the 'date' column is typed as a generic object. We need to convert it to a 'datetime'.

```python
df_vacc['date'] = pd.to_datetime(df_vacc['date'])
df_vacc
```

The 'date' column is now stored as datetimes rather than plain strings.
Next, we write a function that takes a state name and a column name from DataFrame 'df_vacc' and returns that column as a pandas Series indexed by date.

```python
def extract_data(state, column):
    return df_vacc[df_vacc['location'] == state].drop(columns='location').set_index('date')[column]
```

Defining a function produces no output; the absence of an error message indicates success.
To ensure 'extract_data' works, we test it by choosing 'people_vaccinated_per_hundred' for the state of Massachusetts.

```python
mass = extract_data('Massachusetts', 'people_vaccinated_per_hundred')
mass
```

Pandas Series 'mass' contains the chosen column indexed by date; even in raw form, the values clearly show the number of people vaccinated per hundred changing over time.
Next, we plot the series to visualize that change more clearly.

```python
mass.plot()
plt.xlabel('Date')
plt.ylabel('Percent of Vaccinated Population')
plt.title('Percentage of People Vaccinated in Massachusetts')
```

Shaped like an 'S', the line graph shows the percent of the population vaccinated rising steadily from 2021-01-12 to 2021-05-21.
Following Chapter 9's course notes on SciPy's curve_fit, we want to fit a one-variable model to any pandas Series. First, we write a function that takes the series and returns three 'β' values; if the fit fails, the except branch returns three np.nan values instead. We define our initial guesses as follows: 'β0' = the maximum vaccination level reached so far, 'β1' = 1, and 'β2' = the time halfway from the start of the data to the end (the length of the data divided by two).
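For reference, the one-variable model we fit (implemented below as 'logistic_curve') is the three-parameter logistic curve:

$$f(x) = \frac{\beta_0}{1 + e^{\beta_1(\beta_2 - x)}}$$

so β0 is the plateau (maximum level), β2 is the inflection point (the day of fastest growth), and β1 controls the growth rate.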

```python
def beta(series):
    try:
        guesses = [series.values.max(), 1, len(series.values) / 2]
        xdata = np.arange(len(series))
        ydata = series.values
        found_betas, covariance = curve_fit(logistic_curve, xdata, ydata, p0=guesses)
        β0, β1, β2 = found_betas
        return β0, β1, β2
    except (RuntimeError, ValueError):
        # curve_fit raises RuntimeError if it fails to converge,
        # ValueError on invalid input (e.g. an empty series)
        return np.nan, np.nan, np.nan
```

Again, no output indicates the definition succeeded. Note that 'beta' refers to 'logistic_curve', which we define in the next cell; Python only looks the name up when 'beta' is called, so running the cells in this order is fine.
The 'logistic_curve' function takes all three betas and 'x' and returns the value of the logistic model.

```python
def logistic_curve(x, β0, β1, β2):
    return β0 / (1 + np.exp(β1 * (-x + β2)))
```

No output indicates the definition succeeded.
We then write a function that plots the fitted logistic model against the actual Massachusetts vaccination data.

```python
def curve(series, betas):
    fit_model = lambda x: logistic_curve(x, betas[0], betas[1], betas[2])
    xdata = np.arange(len(series))
    series = series.copy()   # avoid mutating the caller's Series
    series.index = xdata     # re-index by day number so data and model share an axis
    series.plot(label='data')
    xdata2 = np.linspace(0, len(series) + 1)
    plt.plot(xdata2, fit_model(xdata2), label='model')
    plt.xlabel('Days from 2021-01-12 to 2021-05-21')
    plt.ylabel('Percent of the population vaccinated')
    plt.title('Massachusetts Vaccinations and Logistic Model')
    plt.legend()
    plt.show()
```

Defining 'curve' produces no output, indicating success. Before we can plot, we must drop all NaN values from the Massachusetts series and call the beta function for β0, β1, and β2. Having done that, we pass the clean series and the three betas to 'curve'.

```python
mass_clean = mass.dropna()
betas = beta(mass_clean)
curve(mass_clean, betas)
```

The plot above compares Massachusetts' actual vaccination data with the fitted logistic model; the model tracks the data closely.
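To back "tracks the data closely" with a number, one could compute the root-mean-square error of the fit. A minimal sketch on synthetic logistic data (a stand-in for the real series, since the CSV is not bundled here; the true parameter values and initial guesses below are illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic_curve(x, b0, b1, b2):
    return b0 / (1 + np.exp(b1 * (-x + b2)))

# Synthetic stand-in for the Massachusetts series: a logistic trend plus noise.
rng = np.random.default_rng(0)
xdata = np.arange(130)
ydata = logistic_curve(xdata, 55.0, 0.08, 65.0) + rng.normal(0, 0.5, size=xdata.size)

# Fit, then measure the typical distance between data and model.
betas, _ = curve_fit(logistic_curve, xdata, ydata, p0=[ydata.max(), 0.1, len(xdata) / 2])
residuals = ydata - logistic_curve(xdata, *betas)
rmse = np.sqrt(np.mean(residuals ** 2))
print(round(float(rmse), 2))
```

An RMSE near the noise level of the data (here, a standard deviation of 0.5) indicates the model explains essentially all of the systematic trend.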
Using the beta function, we want to build a table with one row for each US state: its name, abbreviation, and three β values. But first, we need lists of all the state names and abbreviations for the new DataFrame.

```python
df_abb = pd.read_csv('us_state_abbreviation.csv')
df_pol = pd.read_csv('us_state_political_alignment.csv')
states = df_abb['us state'].tolist()
abb = df_abb['abbreviation'].tolist()
print(states[:5])
print(abb[:5])
```

To make sure it's correct, we printed the first five state names and abbreviations inside each list.
Next, we need a list of all three betas from every state inside the states list.

```python
list_b0 = []
list_b1 = []
list_b2 = []
for state in states:
    series = extract_data(state, 'people_vaccinated_per_hundred')
    series = series.dropna()
    betas = beta(series)
    list_b0.append(betas[0])
    list_b1.append(betas[1])
    list_b2.append(betas[2])
print(list_b0[:5])
print(list_b1[:5])
print(list_b2[:5])
```

To make sure it's correct, we printed the first five values of each β list.
From those five lists, we can finally build our table with one row for each US state.

```python
dict_beta = {'State': states, 'Abbreviation': abb, 'Beta0': list_b0, 'Beta1': list_b1, 'Beta2': list_b2}
df_beta = pd.DataFrame(dict_beta)
df_beta
```

DataFrame 'df_beta' lists each state's name, abbreviation, β0, β1, and β2.
However, we're not done yet. We still need to merge in each state's political alignment and rename columns for consistency.
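As a design note, the explicit loop over states could also be written with pandas' groupby/apply. A minimal sketch on a tiny synthetic frame, where the hypothetical helper 'fake_beta' stands in for the real curve fit:

```python
import pandas as pd

# Tiny synthetic stand-in for df_vacc: two "states" with five observations each.
df = pd.DataFrame({
    'location': ['A'] * 5 + ['B'] * 5,
    'people_vaccinated_per_hundred': [0, 1, 2, 3, 4, 0, 2, 4, 6, 8],
})

def fake_beta(series):
    # Stand-in for the real fit: returns the same three "initial guess" quantities.
    return pd.Series({'Beta0': series.max(), 'Beta1': 1.0, 'Beta2': len(series) / 2})

# apply() returns one Series per group; unstack() turns the labels into columns.
df_beta = (df.groupby('location')['people_vaccinated_per_hundred']
             .apply(fake_beta)
             .unstack())
print(df_beta)
```

The loop version is arguably easier to follow for a report; the groupby version avoids managing three parallel lists by hand.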

```python
df = df_beta.merge(df_pol, left_on='Abbreviation', right_on='abbreviation', how='left')
df = df.drop(['us state', 'abbreviation', 'precincts'], axis=1)
df_final = df.rename(columns={'hillary clinton': 'Hillary Clinton', 'donald trump': 'Donald Trump'})
df_final
```

'df_final' now contains each state's name, abbreviation, β0, β1, β2, and the percentage of votes each candidate received in the 2016 election.
From this DataFrame, we build a correlation-coefficient heat map to visualize which pairs of variables correlate most strongly, and to what degree.

```python
def heatmap(df):
    # numeric_only=True: recent pandas versions require this when the
    # frame contains string columns such as 'State' and 'Abbreviation'
    corr = df.corr(numeric_only=True)
    ax = sns.heatmap(corr, annot=True, fmt='.3g', vmin=-1, vmax=1, center=0, cmap='coolwarm')
    ax.set_title('Correlation Heatmap')
    plt.xticks(rotation=45)
    plt.show()

heatmap(df_final)
```

The heat map shows the pairwise correlation of every variable on a scale from -1 to 1. The strongest correlation in absolute value is between the Donald Trump vote share and β0, the maximum vaccination level achieved so far (-0.85). In other words, the higher a state's Trump vote share, the lower its maximum vaccination level tends to be.
Since this is the most significant correlation, we conduct a null hypothesis test of whether β0 is the same for majority-Trump and majority-Clinton states. First, we split the states into high and low Trump vote percentages, making sure there are no NaN values.

```python
high_trump_percentage = df_final[df_final['Donald Trump'] > df_final['Hillary Clinton']]
high_trump_percentage = high_trump_percentage.dropna()
low_trump_percentage = df_final[df_final['Donald Trump'] < df_final['Hillary Clinton']]
low_trump_percentage = low_trump_percentage.dropna()
```

The cell produces no output, indicating success.
Now, using the two groups, we can finally conduct our t-test (Welch's two-sample test, since we do not assume equal variances).

```python
def t_test(high, low, alpha):
    # equal_var=False selects Welch's t-test (no equal-variance assumption)
    statistic, pvalue = stats.ttest_ind(high['Beta0'], low['Beta0'], equal_var=False)
    print(pvalue)
    return pvalue < alpha  # reject H_0?

t_test(high_trump_percentage, low_trump_percentage, 0.05)
```

The 't_test' output of True indicates that the p-value is below the 5% significance level. Since the result is statistically significant, we reject the null hypothesis and conclude that the maximum vaccination level achieved so far differs between majority-Trump and majority-Clinton states.
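For reference, the statistic computed by `stats.ttest_ind` with `equal_var=False` is Welch's t:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

where x̄ᵢ, sᵢ², and nᵢ are the sample mean, variance, and size of each group; the p-value is taken from a t distribution whose degrees of freedom are adjusted by the Welch–Satterthwaite approximation.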