How to analyze San Francisco city employee salary data with Pandas

Analyze San Francisco city employee salary data using the Pandas library in Python.

Welcome to a tutorial designed to help you practice your pandas skills! In this blog post, we will be using the SF Salaries Dataset from Kaggle. The tasks will guide you step-by-step through the data analysis process using Pandas, starting with basic operations and advancing to more complex queries.

Let's get started!

Import pandas

The first step in any data analysis task involving Pandas is to import the library. We give it an alias 'pd' for convenience.

import pandas as pd

Read Salaries.csv as a dataframe called `sal`

Pandas provides an easy way to read data from a CSV file and store it in a DataFrame object. Here's how you do it:

sal = pd.read_csv('Salaries.csv')

Check the head of the DataFrame

Once the data is loaded into the DataFrame, it's often useful to check the first few rows of the DataFrame to get an understanding of the data and its structure.

sal.head()

Id	EmployeeName	JobTitle	BasePay	OvertimePay	OtherPay	Benefits	TotalPay	TotalPayBenefits	Year	Notes	Agency	Status
0	1	NATHANIEL FORD	GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY	167411.18	0.00	400184.25	NaN	567595.43	567595.43	2011	NaN	San Francisco	NaN
1	2	GARY JIMENEZ	CAPTAIN III (POLICE DEPARTMENT)	155966.02	245131.88	137811.38	NaN	538909.28	538909.28	2011	NaN	San Francisco	NaN
2	3	ALBERT PARDINI	CAPTAIN III (POLICE DEPARTMENT)	212739.13	106088.18	16452.60	NaN	335279.91	335279.91	2011	NaN	San Francisco	NaN
3	4	CHRISTOPHER CHONG	WIRE ROPE CABLE MAINTENANCE MECHANIC	77916.00	56120.71	198306.90	NaN	332343.61	332343.61	2011	NaN	San Francisco	NaN
4	5	PATRICK GARDNER	DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)	134401.60	9737.00	182234.59	NaN	326373.19	326373.19	2011	NaN	San Francisco	NaN

Use the .info() method to find out how many entries there are

The .info() method is a quick way to get an overview of the DataFrame, such as the number of entries, the data types of the columns, and which columns have missing values.

sal.info()  # Expected output: 148654 Entries

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id                  148654 non-null int64
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

What is the average base pay?

To find out the average BasePay, you can use the .mean() method on the 'BasePay' column.

sal['BasePay'].mean()

66325.44884050643

What is the highest amount of OvertimePay in the dataset?

Similarly, to find out the maximum overtime pay, use the .max() method.

sal['OvertimePay'].max()

245131.88

What is the job title of Joseph Driscoll?

We can filter the data for 'Joseph Driscoll' and then select the 'JobTitle' column.

sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']

24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

How much does Joseph Driscoll make (including benefits)?

Similarly, to find out how much Joseph Driscoll makes, we need to include the benefits, so we select the 'TotalPayBenefits' column.

sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']

24    270324.91
Name: TotalPayBenefits, dtype: float64

What is the name of the highest-paid person (including benefits)?

To find the highest-paid person, we can sort by the 'TotalPayBenefits' column and then get the first entry.

sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]['EmployeeName']

What is the name of the lowest-paid person (including benefits)?

Similar to the previous query, except here we look for the minimum.

sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].min()]['EmployeeName']

You may find something strange about this entry as it can have negative pay!

What was the average (mean) BasePay of all employees per year? (2011-2014)

You can group by the 'Year' and then calculate the mean of the BasePay for each group.

sal.groupby('Year').mean()['BasePay']

How many unique job titles are there?

The .nunique() method gives us the number of unique entries in the 'JobTitle' column.

sal['JobTitle'].nunique()

What are the top 5 most common jobs?

Using .value_counts() will give us the frequency of each job title. We then take the top 5 with .head(5).

sal['JobTitle'].value_counts().head(5)

How many job titles were represented by only one person in 2013?

We filter the data to include only the year 2013 and then apply a similar technique as before, but this time we check for job titles that appear exactly once.

sum(sal[sal['Year'] == 2013]['JobTitle'].value_counts() == 1)

How many people have the word Chief in their job title?

This requires a custom function that checks if 'chief' is in the job title (case insensitive), and we then apply this function to each job title and sum up the True values.

def chief_string(title):
    if 'chief' in title.lower():
        return True
    else:
        return False

sum(sal['JobTitle'].apply(lambda x: chief_string(x)))

Bonus: is there a correlation between length of the job title string and salary?

First, we calculate the length of each job title, then we check the correlation with TotalPayBenefits.

sal['title_len'] = sal['JobTitle'].apply(len)
sal[['title_len', 'TotalPayBenefits']].corr()  # No significant correlation

title_len	TotalPayBenefits
title_len	1.000000	-0.036878
TotalPayBenefits	-0.036878	1.000000

Summary

In this blog post, we went through a series of tasks involving the San Francisco Salaries dataset to practice data manipulation and analysis using Pandas. We covered fundamental techniques such as reading data, selecting columns, filtering, grouping, and applying functions. These operations are essential for any data analyst or data scientist working with tabular data in Python.