Example of testing for a difference in medians using bootstrapping
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
previewers = pd.read_csv('2021-04-01 image preset previewed.csv')
previewers.head()
Let's see if there's a difference in presets applied by platform.
previewers.groupby('os').mean(numeric_only=True)
previewers.groupby('os').median(numeric_only=True)
Observationally, the means and medians differ somewhat between platforms, but it's not clear whether the difference is statistically significant. We can check, though.
Testing the means (Welch's t-test)
ios = previewers[previewers['os'] == 'ios']['row_count']
android = previewers[previewers['os'] == 'android']['row_count']
stats.ttest_ind(ios, android, equal_var=False)
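To make it clearer what `equal_var=False` does, here is a minimal sketch that computes Welch's t statistic and its Welch-Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind`. The arrays `a` and `b` are made-up Poisson data, not the notebook's `row_count` values, and the names `t_manual`, `df_manual`, and `p_manual` are just for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (NOT the notebook's data)
rng = np.random.default_rng(0)
a = rng.poisson(lam=9, size=300).astype(float)
b = rng.poisson(lam=10, size=500).astype(float)

# Variance of each group's mean, using the sample variance (ddof=1)
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: no pooled variance, each group keeps its own
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

res = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

The two printed pairs should agree up to floating point error, which is a quick way to convince yourself what the library call is doing.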
Generating the data for the median test
# set some parameters for the bootstrapping
# in this example, I want to obtain 2000 samples per os, drawing 200 observations each time
np.random.seed(48602)
samples = 2000
draws = 200
This generates a data frame of bootstrapped medians for each os, using the parameters defined above.
# ios: resample with replacement and keep the median of each resample
ios_sample = []
for i in range(samples):
    ios_sample.append(ios.sample(draws, replace=True).median())
ios_sample = pd.DataFrame(ios_sample)
# android: same procedure
android_sample = []
for i in range(samples):
    android_sample.append(android.sample(draws, replace=True).median())
android_sample = pd.DataFrame(android_sample)
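Since the same loop is written twice, it could be folded into one helper. Below is a rough sketch under a few assumptions: `bootstrap_medians` is a hypothetical name, it uses NumPy's `Generator` instead of `Series.sample`, and `toy` is made-up data. Because the random number source is different, it will not reproduce the exact medians from the seeded loops above.

```python
import numpy as np

def bootstrap_medians(values, n_samples, n_draws, rng):
    """Return n_samples medians, each from n_draws values drawn with replacement."""
    values = np.asarray(values)
    # One row of random indices per bootstrap sample (drawn with replacement)
    idx = rng.integers(0, values.size, size=(n_samples, n_draws))
    # Median of each row gives one bootstrapped median per sample
    return np.median(values[idx], axis=1)

rng = np.random.default_rng(48602)
toy = np.array([3, 5, 8, 8, 9, 12, 15, 40])  # made-up data for illustration
medians = bootstrap_medians(toy, n_samples=2000, n_draws=200, rng=rng)
print(medians.shape)
print(medians.mean())
```

With the notebook's data this would be called as `bootstrap_medians(ios, samples, draws, rng)` and likewise for `android`.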
The distributions look roughly normal (play around with the number of samples and/or the number of draws to see how the histograms change).
This is also a great example of how the CLT works.
fig,axs = plt.subplots(1,2)
axs[0].hist(android_sample, bins=range(5,15))
axs[0].set_title('android')
axs[1].hist(ios_sample, bins=range(5,15))
axs[1].set_title('ios')
plt.show()
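To see the effect of the number of draws without eyeballing histograms, here is a small sketch that measures how the spread of the bootstrapped medians changes as the draw size grows. The `population` array is made-up right-skewed data standing in for one platform's `row_count`, and the draw sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up right-skewed data (NOT the notebook's data)
population = rng.exponential(scale=10.0, size=5000)

spread = {}
for n_draws in (20, 200, 2000):
    # 2000 bootstrap samples of n_draws observations each, with replacement
    idx = rng.integers(0, population.size, size=(2000, n_draws))
    medians = np.median(population[idx], axis=1)
    # Standard deviation of the bootstrapped medians for this draw size
    spread[n_draws] = medians.std()
    print(n_draws, round(spread[n_draws], 3))
```

The spread should shrink as the number of draws grows, which is why the histograms get narrower when `draws` is increased.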
From here we can just run a t-test (Welch's again) comparing the average bootstrapped median for each os.
print(android_sample.mean())
print(ios_sample.mean())
0 9.67525
dtype: float64
0 8.472
dtype: float64
stats.ttest_ind(ios_sample, android_sample, equal_var=False)
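One thing to keep in mind is that this t-test treats the 2000 bootstrapped medians as independent observations, so its p-value depends on the value chosen for `samples`. A common complementary check is a percentile bootstrap confidence interval for the difference in medians. The sketch below is one way to do that, not the method used above: `group_a` and `group_b` are made-up count data, each group is resampled at its own size, and the 95% level is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up count data standing in for the two platforms (NOT the notebook's data)
group_a = rng.poisson(lam=8, size=400)
group_b = rng.poisson(lam=10, size=600)

n_samples = 2000
diffs = np.empty(n_samples)
for i in range(n_samples):
    # Resample each group with replacement at its original size
    resample_a = rng.choice(group_a, size=group_a.size, replace=True)
    resample_b = rng.choice(group_b, size=group_b.size, replace=True)
    # Difference in medians for this pair of resamples
    diffs[i] = np.median(resample_b) - np.median(resample_a)

# 95% percentile interval for the difference in medians
low, high = np.percentile(diffs, [2.5, 97.5])
print(low, high)
```

If the interval excludes zero, that points to a difference in medians at roughly the 5% level, and the conclusion does not change just because more bootstrap samples were drawn.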