1. Loading data from CSV and Excel files
You just got hired as the first and only data practitioner at a business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files, SQL databases.
The Head of People Operations wants to have a general view gathering all available information about a specific employee. Your job is to gather it all in a file that will serve as the reference moving forward. You will merge all of this data in a pandas DataFrame before exporting to CSV.
Data management at your company is not the best, but you need to start somewhere. You decide to tackle the most straightforward tasks first, and to begin by loading the company office addresses. They are currently saved into a CSV file, office_addresses.csv, which the Office Manager sent over to you. Additionally, an HR manager you remember interviewing with gave you access to the Excel file, employee_information.xlsx, where the employee addresses are saved. You need to load these datasets in two separate DataFrames.
office office_country office_city office_street \
0 Leuven Office BE Leuven Martelarenlaan
1 ESB Office US New York City Fifth Avenue
2 WeWork Office GB London Old Street
office_street_number
0 38
1 350
2 207
employee_id employee_last_name employee_first_name employee_country \
0 A2R5H9 Hunman Jax BE
1 H8K0L6 Siff Tara GB
2 G4R7V0 Sagal Gemma US
3 M1Z7U9 Coates Tig FR
employee_city employee_street employee_street_number
0 Leuven Grote Markt 9
1 London Baker Street 221
2 New-York Perry Street 66
3 Paris Rue de l'Université 7
2. Loading employee data from Excel sheets
It turns out the employee_information.xlsx file also holds information about emergency contacts for each employee in a second sheet titled emergency_contacts. However, this sheet was edited at some point, and the header was removed!
Looking at the data, you were able to figure out what the header should be, and you confirmed that they were appropriate with the HR manager: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, relationship.
0
A2R5H9
Hunman
1
H8K0L6
Siff
2
G4R7V0
Sagal
3
M1Z7U9
Coates
3. Loading role data from JSON files
Now the next step is to gather information about employee roles, teams, and salaries. This information usually lives in a human resources management system, but the Head of People Operations exported the data for you into a JSON file titled employee_roles.json.
Looking at the JSON file, you see entries are structured in a specific way. It is built as a Python dictionary: the keys are employee IDs, and each employee ID has a corresponding dictionary value holding role, salary, and team information. Here are the first few lines of the file:
{"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
Load the JSON file to a variable df_employee_roles, choosing the appropriate orientation.
A2R5H9
$4500
Leadership
H8K0L6
$4500
Leadership
G4R7V0
$3000
Sales
M1Z7U9
$2000
People Operations
4. Merging several DataFrames into one
You now have all the data required! All that's left is bringing it all in a unique DataFrame. This unique DataFrame will enable the Head of People Operations to access all employee data at once.
In this step, you will merge all DataFrames.
In the next step, you will remove duplicates and reorganize the columns - don't worry about this for now.
employee_id employee_last_name employee_first_name employee_country \
0 A2R5H9 Hunman Jax BE
1 H8K0L6 Siff Tara GB
2 G4R7V0 Sagal Gemma US
3 M1Z7U9 Coates Tig FR
employee_city employee_street employee_street_number last_name \
0 Leuven Grote Markt 9 Hunman
1 London Baker Street 221 Siff
2 New-York Perry Street 66 Sagal
3 Paris Rue de l'Université 7 Coates
first_name emergency_contact emergency_contact_number relationship \
0 Jax Opie Hurst +32-456-5556-84 Brother
1 Tara Wendy de Matteo +44-020-5554-333 Sister
2 Gemma John Newmark +1-202-555-194 Husband
3 Tig Venus Noone +1-202-555-0130 Wife
monthly_salary team title office \
0 $4500 Leadership CEO Leuven Office
1 $4500 Leadership CFO WeWork Office
2 $3000 Sales Business Developer ESB Office
3 $2000 People Operations Office Manager NaN
office_country office_city office_street office_street_number
0 BE Leuven Martelarenlaan 38.0
1 GB London Old Street 207.0
2 US New York City Fifth Avenue 350.0
3 NaN NaN NaN NaN
Index(['employee_id', 'employee_last_name', 'employee_first_name',
'employee_country', 'employee_city', 'employee_street',
'employee_street_number', 'last_name', 'first_name',
'emergency_contact', 'emergency_contact_number', 'relationship',
'monthly_salary', 'team', 'title', 'office', 'office_country',
'office_city', 'office_street', 'office_street_number'],
dtype='object')
5. Editing column names
Now that you merged all of your DataFrames into one let's make sure you have the information required by People Ops.
Currently, your df_employees DataFrame has the following column titles: employee_id, employee_last_name, employee_first_name, employee_country, employee_city, employee_street, employee_street_number, last_name, first_name, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.
The columns employee_last_name and last_name are duplicates. The columns employee_first_name and first_name are duplicates as well. On top of this, People Ops wants to rename some of the columns:
employee_id should be id
employee_country should be country
employee_city should be city
employee_street should be street
employee_street_number should be street_number
emergency_contact_number should be emergency_number
relationship should be emergency_relationship
So your header should look like this in the end: id, country, city, street, street_number, last_name, first_name, emergency_contact, emergency_number, emergency_relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.
0
A2R5H9
BE
1
H8K0L6
GB
2
G4R7V0
US
3
M1Z7U9
FR
6. Changing column order
Now that you have the appropriate column names, you can reorder the columns.
0
A2R5H9
Hunman
1
H8K0L6
Siff
2
G4R7V0
Sagal
3
M1Z7U9
Coates
7. The last minute request
Last touches! You were ready to let People Ops know that the DataFrame was ready, but the department head just went over to your desk after lunch, asking about some last-minute requirements.
Let's polish the DataFrame before exporting the data, sending it over to People Ops, and deploying the pipeline:
All street numbers should be integers
The index should be the actual employee ID rather than the row number
If the value for office is NaN then the employee is remote: add a column named "status", right after monthly_salary indicating whether the employee is "On-site" or "Remote."
A2R5H9
Hunman
Jax
H8K0L6
Siff
Tara
G4R7V0
Sagal
Gemma
M1Z7U9
Coates
Tig
8. Saving your work
You now have everything People Ops requested. The different people responsible for these various files can currently keep working on these files if they want. As long as they save it in the datasets folder, People Ops will have to execute this unique script to obtain just one file from the ones scattered across different teams.
You bumped into the Head of People Ops and shared a few caveats and areas of improvement. She booked a meeting with you so you can explain:
How the current structure isn't robust to role changes: what if an existing employee takes on a new role?
How the current structure doesn't fit best practices in terms of database schema:
Having data all over the place like it's the case right now is a no-go
But gathering everything in a single table is inefficient: you have to query all information even if all you want is a phone number
There should be a single SQL database for employee data, with several tables that can be joined
Views can be built on top of the database to simplify non-data practitioners access.
In any case, you still need to show up with what was requested - so let's export your DataFrame to a CSV file.