The exercise in this notebook was taken from the book Computational and Inferential Thinking: The Foundations of Data Science, by Ani Adhikari and John DeNero.
Licensed for free consumption under the following license: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
Quotations refer to the original text.
Statistics on employee compensation in the City of San Francisco in 2015
In this notebook, we explore employee compensation in the City of San Francisco in 2015 using various statistics, such as the mean, standard deviation, median, and other percentiles.
The dataset mentioned above contains all employees of the City. Most often, however, we don't have access to an entire population but only to a sample of it. Data scientists are interested in making estimates about the population from a sample. Because a statistic depends on the random sample, a question arises: how much does the statistic depend on the particular sample? How different would it be if we had drawn a different sample?
To answer this question, in the second part of this notebook we learn about a technique called bootstrapping, which is used to quantify the error, or confidence, of an estimate.
The dataset
SF OpenData is a website where the City and County of San Francisco make some of their data publicly available. One of the datasets contains compensation data for employees of the City. These include medical professionals at City-run hospitals, police officers, firefighters, transportation workers, elected officials, and all other employees of the City.
We're going to start by importing some modules for loading, plotting and manipulating the data.
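A minimal import cell might look like this (assuming pandas, NumPy, and Matplotlib, which the hints below rely on):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```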
We load a pre-cleaned version of the dataset from Berkeley's Data 8 course material.
Print dataset size and column names
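A sketch of the loading step, reading the data into a DataFrame we'll call df; the filename below is a placeholder, so adjust it to wherever the pre-cleaned Data 8 file lives:

```python
# Load the pre-cleaned dataset; the filename is a placeholder.
df = pd.read_csv('san_francisco_2015.csv')

print(df.shape)             # (number of rows, number of columns)
print(df.columns.tolist())  # column names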
We have one row for each of the 42,989 employees and 22 columns. A full description of the dataset can be found here. In this notebook we're interested in the 'Total Compensation', which is the employee's salary plus the City's contributions such as retirement and benefit plans (Employee Compensation | DataSF).
Top three highest compensations
Let's have a look at the rows with the three highest 'Total Compensation' values.
Hint: sort the rows using the method sort_values() and slice the dataframe to keep the top three rows ([:3]).
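One way to do this, assuming the DataFrame is called df as above:

```python
# Sort by 'Total Compensation' in descending order and keep the first three rows.
df.sort_values('Total Compensation', ascending=False)[:3]
```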
Three lowest compensations
Now, let's have a look at the other end of the distribution: the rows with the three lowest 'Total Compensation' values.
Hint: same as above, but setting the keyword parameter ascending to True.
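For example:

```python
# Same as above, but with ascending=True so the smallest values come first.
df.sort_values('Total Compensation', ascending=True)[:3]
```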
Negative salaries can occur due to compensation adjustments, for example when an employee was mistakenly overpaid the previous year (Employee Compensation FAQ).
Keep rows with salaries equivalent to a half-time job at a minimum wage
For clarity of comparison, we will focus our attention on those who had at least the equivalent of a half-time job for the whole year. At a minimum wage of about $\$ 10$ per hour, and $20$ hours per week for 52 weeks, that's a salary of about $\$10,000$.
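A possible filtering step; the column name 'Salaries' is an assumption about the pre-cleaned file:

```python
# Keep only employees who earned at least the half-time minimum-wage equivalent.
df = df[df['Salaries'] > 10000]
print(len(df), 'employees kept')
```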
Part 1: Statistics on the Total Compensation
Let's read the column 'Total Compensation' into an array (TC) and scale it to have units of thousands of dollars.
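For instance:

```python
# Total compensation in thousands of dollars (K$).
TC = df['Total Compensation'].values / 1000
```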
Mean and Standard Deviation
The mean, often denoted by the Greek letter $\mu$, is a measure of the center of a collection. It's defined as the sum of the elements, divided by the number of elements, $$ \mu = \sum \limits_{i=1}^{N} \frac{x_i}{N}. $$
The standard deviation, often abbreviated as SD or denoted by the Greek letter $\sigma$, tells us how close the points of the collection are to the mean. It's defined as the root mean square of the deviations from the mean, $$ \sigma = \sqrt{\sum \limits_{i=1}^{N} \frac{(x_i - \mu)^2}{N}}. $$
Because the mean tells us about the center of the collection and the standard deviation about its spread, means are often reported with the SD as the uncertainty.
Calculate the mean 'Total Compensation' and report it using one standard deviation as the uncertainty. Also calculate the range of the distribution, i.e. the minimum and maximum of the total compensation.
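One way to compute and report these statistics:

```python
mean = np.mean(TC)
sd = np.std(TC)
print(f'Mean total compensation: {mean:.0f} +/- {sd:.0f} K$')
print(f'Range: from {TC.min():.0f} K$ to {TC.max():.0f} K$')
```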
Reporting the mean with such an uncertainty makes sense for symmetric distributions. However, from the range of the distribution, we can tell that the 'Total Compensation' is not symmetrically distributed. So let's have a look at the distribution!
Distribution of total compensation
Plot a histogram of the 'Total Compensation' in the range 0 to 700 K dollars.
Hint: use plt.hist()
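A minimal sketch:

```python
plt.hist(TC, bins=70, range=(0, 700))
plt.xlabel('Total Compensation (K$)')
plt.ylabel('Number of employees')
plt.show()
```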
As expected, the distribution is not symmetric, but right-skewed due to the few high-income points.
One thing to ask now is: what is the middle point of the distribution, the point that divides the distribution into two halves with the same number of items each? This is what the median tells us.
Median
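NumPy computes it directly:

```python
median = np.median(TC)
print(f'Median total compensation: {median:.1f} K$')
```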
Percentiles
In addition to the midpoint of the distribution, one can split the distribution at any fraction; this is what the percentiles measure.
What are the 10th, 50th, and 90th percentiles?
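Using np.percentile():

```python
# The q-th percentiles for q = 10, 50, 90.
print(np.percentile(TC, [10, 50, 90]))
```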
Notice that the 50th percentile is the same as the median, as is natural from the definition of the median.
How would you describe the employee compensation in the City of San Francisco?
Describe with words referring to the distribution and the statistics from above.
In 2015, employees in San Francisco earned $\$115,000$ a year on average, where $\$11,000$ was the lowest and $\$648,000$ the highest yearly income. The top 10% earned more than $\$196,000$ yearly, whereas the lowest 10% earned less than $\$32,000$ a year.
Part 2: Estimating the Median from a Random Sample
Normally we don't have access to the entire population, and we want to estimate its parameters, for example the median, from a random sample. In this part, we simulate this situation by drawing a random sample of 500 points from the population and calculating the sample median as an estimate of the population median (the parameter). Then we use bootstrapping to calculate the confidence of the estimate.
Random sample
Draw a random sample from the population TC.
Hint: use np.random.choice() and set the keyword argument replace to False.
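A sketch of the sampling step; the drawn values are kept in a one-column DataFrame (a convenience, not a requirement) so that the bootstrap function described later can refer to the variable by its column label, and the names our_sample and sample_median are just for illustration:

```python
# 500 draws WITHOUT replacement: each employee can appear at most once.
our_sample = pd.DataFrame(
    {'Total Compensation': np.random.choice(TC, size=500, replace=False)})
sample_median = our_sample['Total Compensation'].median()
print(f'Sample median: {sample_median:.1f} K$')
```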
To see how different the estimate would be if the sample had come out differently, we could just draw another sample from the population, but that would be cheating. We are trying to mimic real life, in which we won't have all the population data at hand.
Somehow, we have to get another random sample without sampling from the population.
The Bootstrap: Resampling from the Sample
WARNING: the following paragraph is quite loaded with jargon and may sound like a tongue twister.
The idea behind bootstrapping is to simulate possible samples by randomly resampling from our sample. For each of these samples we compute the statistic, in our case the median, thus obtaining a distribution of bootstrap sample medians. The 95% confidence interval is given by the 2.5th and 97.5th percentiles of this distribution.
Read more about the Bootstrap method here.
Resample
Because we're drawing a sample from the original sample, we want to draw the elements maintaining the probabilities of the original sample; thus, we sample with replacement.
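A single resample, drawn with replacement and of the same size as the original sample:

```python
values = our_sample['Total Compensation'].values
# Same size as the sample, but WITH replacement.
resample = np.random.choice(values, size=len(values), replace=True)
print(f'Median of this resample: {np.median(resample):.1f} K$')
```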
Bootstrap Empirical Distribution of the Sample Median
Let us define a function bootstrap_median that takes our original sample, the label of the column containing the variable, and the number of bootstrap samples we want to take, and returns an array of the corresponding resampled medians.
Each time we resample and find the median, we replicate the bootstrap process. So the number of bootstrap samples will be called the number of replications.
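A sketch of such a function, following the description above:

```python
def bootstrap_median(original_sample, label, replications):
    """Return an array of `replications` bootstrap sample medians.

    original_sample -- DataFrame containing the original sample
    label           -- name of the column holding the variable
    replications    -- number of bootstrap resamples to draw
    """
    values = original_sample[label].values
    medians = np.empty(replications)
    for i in range(replications):
        # Resample with replacement, same size as the original sample.
        resample = np.random.choice(values, size=len(values), replace=True)
        medians[i] = np.median(resample)
    return medians
```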
Let's use the function bootstrap_median
to get the distribution of medians with 5000 replications.
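For example:

```python
bstrap_medians = bootstrap_median(our_sample, 'Total Compensation', 5000)
```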
Now let's plot the histogram of the 5000 bootstrap medians, together with the sample median (sample_median) that we calculated from the original sample and the middle 95% interval.
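A possible plot, marking the sample median and the middle 95% of the bootstrap medians:

```python
# The middle 95% lies between the 2.5th and 97.5th percentiles.
left, right = np.percentile(bstrap_medians, [2.5, 97.5])

plt.hist(bstrap_medians, bins=30)
plt.axvline(sample_median, color='red', label='sample median')
plt.plot([left, right], [0, 0], color='gold', linewidth=8,
         label='middle 95% interval')
plt.xlabel('Bootstrap sample median (K$)')
plt.ylabel('Count')
plt.legend()
plt.show()

print(f'Approximate 95% confidence interval: [{left:.1f}, {right:.1f}] K$')
```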
To summarize what the simulation shows, suppose you are estimating the population median by the following process:
- Draw a large random sample from the population.
- Bootstrap your random sample and get an estimate from the new random sample.
- Repeat the above step thousands of times, and get thousands of estimates.
- Pick off the "middle 95%" interval of all the estimates.
That gives you one interval of estimates. Now if you repeat the entire process 100 times, ending up with 100 intervals, then about 95 of those 100 intervals will contain the population parameter. In other words, this process of estimation captures the parameter about 95% of the time. You can replace 95% by a different value, as long as it's not 100. Suppose you replace 95% by 80% and keep the sample size fixed at 500. Then your intervals of estimates will be shorter than those we simulated here, because the "middle 80%" is a smaller range than the "middle 95%". Only about 80% of your intervals will contain the parameter.