1. Loading data from CSV and Excel files
You just got hired as the first and only data practitioner at a business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files, SQL databases.
The Head of People Operations wants a general view gathering all available information about any given employee. Your job is to bring it all together in one file that will serve as the reference moving forward. You will merge all of this data into a pandas DataFrame before exporting it to CSV.
Data management at your company is not the best, but you need to start somewhere. You decide to tackle the most straightforward tasks first, beginning with the company office addresses. They are currently saved in a CSV file, which the Office Manager sent over to you. Additionally, an HR manager you remember interviewing with gave you access to an Excel file where the employee addresses are saved. You need to load these datasets into two separate DataFrames.
          office office_country    office_city   office_street  \
0  Leuven Office             BE         Leuven  Martelarenlaan
1     ESB Office             US  New York City    Fifth Avenue
2  WeWork Office             GB         London      Old Street

   office_street_number
0                    38
1                   350
2                   207

  employee_id employee_last_name employee_first_name employee_country  \
0      A2R5H9             Hunman                 Jax               BE
1      H8K0L6               Siff                Tara               GB
2      G4R7V0              Sagal               Gemma               US
3      M1Z7U9             Coates                 Tig               FR

  employee_city      employee_street  employee_street_number
0        Leuven          Grote Markt                       9
1        London         Baker Street                     221
2      New-York         Perry Street                      66
3         Paris  Rue de l'Université                       7
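The loading step above can be sketched as follows. The brief leaves the actual filenames blank, so the paths in the comments are placeholders, and an inline CSV stand-in is used so the sketch runs on its own:

```python
import pandas as pd
from io import StringIO

# The real calls would look like this (both paths are placeholders,
# since the brief does not spell out the filenames):
# offices = pd.read_csv("datasets/office_addresses.csv")
# addresses = pd.read_excel("datasets/employee_information.xlsx")

# Inline stand-in for the office CSV so the sketch runs without the file:
office_csv = StringIO(
    "office,office_country,office_city,office_street,office_street_number\n"
    "Leuven Office,BE,Leuven,Martelarenlaan,38\n"
    "ESB Office,US,New York City,Fifth Avenue,350\n"
    "WeWork Office,GB,London,Old Street,207\n"
)
offices = pd.read_csv(office_csv)
print(offices)
```

`pd.read_csv` infers the header from the first row and the dtypes from the values, which is all that is needed here.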
2. Loading employee data from Excel sheets
It turns out the Excel file also holds information about emergency contacts for each employee, in a second sheet. However, this sheet was edited at some point, and the header row was removed!
Looking at the data, you were able to figure out what the header should be, and you confirmed with the HR manager that these names are appropriate: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, relationship.
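Since the sheet lost its header, pandas must be told not to treat the first data row as one: `header=None` plus an explicit `names=` list does exactly that. The file and sheet names in the comment are assumptions (the brief leaves them blank), and a one-row CSV stand-in keeps the sketch self-contained:

```python
import pandas as pd
from io import StringIO

# Header confirmed with the HR manager:
names = ["employee_id", "last_name", "first_name",
         "emergency_contact", "emergency_contact_number", "relationship"]

# The real call would be (file and sheet names are assumptions):
# emergency = pd.read_excel("datasets/employee_information.xlsx",
#                           sheet_name="emergency_contacts",
#                           header=None, names=names)

# Headerless one-row stand-in so the sketch runs without the Excel file;
# the header=None / names= mechanics are identical for read_csv:
raw = StringIO("A2R5H9,Hunman,Jax,Opie Hurst,+32-456-5556-84,Brother\n")
emergency = pd.read_csv(raw, header=None, names=names)
print(emergency)
```

Without `header=None`, the first employee's row would silently become the column labels.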
3. Loading role data from JSON files
Now the next step is to gather information about employee roles, teams, and salaries. This information usually lives in a human resources management system, but the Head of People Operations exported the data for you into a JSON file.
Looking at the JSON file, you see the entries are structured in a specific way: the file is built like a Python dictionary, where the keys are employee IDs and each employee ID maps to a dictionary holding that employee's role, salary, and team information.
Load the JSON file into a DataFrame, choosing the appropriate orientation.
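Because the top-level keys are employee IDs, each key corresponds to one row, which is what `orient="index"` expects. A minimal sketch with illustrative values, writing the file first so it runs end to end (the filename is an assumption):

```python
import json
import pandas as pd

# Illustrative records shaped like the brief describes: employee IDs as
# keys, each mapping to role, salary, and team details.
records = {
    "A2R5H9": {"title": "CEO", "monthly_salary": "$4500", "team": "Leadership"},
    "H8K0L6": {"title": "CFO", "monthly_salary": "$4500", "team": "Leadership"},
}
with open("employee_roles.json", "w") as f:
    json.dump(records, f)

# Keys become the index, inner dictionaries become the columns.
roles = pd.read_json("employee_roles.json", orient="index")
print(roles)
```

With the default orientation, pandas would try to read the keys as column labels instead of row labels, so choosing `orient` correctly matters here.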
4. Merging several DataFrames into one
You now have all the data required! All that's left is bringing it together into a single DataFrame. This single DataFrame will enable the Head of People Operations to access all employee data at once.
In this step, you will merge all DataFrames.
In the next step, you will remove duplicates and reorganize the columns - don't worry about this for now.
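The merge chain might look like the sketch below, using one-row stand-ins for the four DataFrames (values illustrative, names matching the sample output). Left joins keep every employee even when a lookup fails, which is why the Office Manager's row carries NaN office fields in the sample output:

```python
import pandas as pd

# Minimal stand-ins for the four loaded DataFrames:
addresses = pd.DataFrame({"employee_id": ["A2R5H9"],
                          "employee_last_name": ["Hunman"]})
emergency = pd.DataFrame({"employee_id": ["A2R5H9"],
                          "emergency_contact": ["Opie Hurst"]})
# The roles DataFrame is keyed by employee ID in its index:
roles = pd.DataFrame({"title": ["CEO"], "office": ["Leuven Office"]},
                     index=["A2R5H9"])
offices = pd.DataFrame({"office": ["Leuven Office"],
                        "office_country": ["BE"]})

# Chain of left merges: emergency contacts join on employee_id, roles
# join on their index, and offices join on the office name.
employees = (addresses
             .merge(emergency, on="employee_id", how="left")
             .merge(roles, left_on="employee_id", right_index=True, how="left")
             .merge(offices, on="office", how="left"))
print(employees)
```

An employee with no office would simply pick up NaN in the office columns rather than being dropped, thanks to `how="left"`.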
  employee_id employee_last_name employee_first_name employee_country  \
0      A2R5H9             Hunman                 Jax               BE
1      H8K0L6               Siff                Tara               GB
2      G4R7V0              Sagal               Gemma               US
3      M1Z7U9             Coates                 Tig               FR

  employee_city      employee_street employee_street_number last_name  \
0        Leuven          Grote Markt                      9    Hunman
1        London         Baker Street                    221      Siff
2      New-York         Perry Street                     66     Sagal
3         Paris  Rue de l'Université                      7    Coates

  first_name emergency_contact emergency_contact_number relationship  \
0        Jax        Opie Hurst          +32-456-5556-84      Brother
1       Tara   Wendy de Matteo         +44-020-5554-333       Sister
2      Gemma      John Newmark           +1-202-555-194      Husband
3        Tig       Venus Noone          +1-202-555-0130         Wife

  monthly_salary               team               title         office  \
0          $4500         Leadership                 CEO  Leuven Office
1          $4500         Leadership                 CFO  WeWork Office
2          $3000              Sales  Business Developer     ESB Office
3          $2000  People Operations      Office Manager            NaN

  office_country    office_city   office_street  office_street_number
0             BE         Leuven  Martelarenlaan                  38.0
1             GB         London      Old Street                 207.0
2             US  New York City    Fifth Avenue                 350.0
3            NaN            NaN             NaN                   NaN

Index(['employee_id', 'employee_last_name', 'employee_first_name',
       'employee_country', 'employee_city', 'employee_street',
       'employee_street_number', 'last_name', 'first_name',
       'emergency_contact', 'emergency_contact_number', 'relationship',
       'monthly_salary', 'team', 'title', 'office', 'office_country',
       'office_city', 'office_street', 'office_street_number'],
      dtype='object')
5. Editing column names
Now that you have merged all of your DataFrames into one, let's make sure you have the information required by People Ops.
Currently, your DataFrame has the following column titles: employee_id, employee_last_name, employee_first_name, employee_country, employee_city, employee_street, employee_street_number, last_name, first_name, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.
The columns employee_last_name and last_name are duplicates. The columns employee_first_name and first_name are duplicates as well. On top of this, People Ops wants to rename some of the columns.
So in the end, your header should hold eighteen columns: the original twenty, minus one column from each duplicate pair, with People Ops' renames applied.
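Dropping one column of each duplicate pair and renaming can be sketched as below. Which member of each pair survives, and the `employee_id` to `id` rename, are assumptions, since the brief's rename list is not reproduced here:

```python
import pandas as pd

# Stand-in with the duplicated name columns from the merge:
df = pd.DataFrame({"employee_id": ["A2R5H9"],
                   "employee_last_name": ["Hunman"],
                   "employee_first_name": ["Jax"],
                   "last_name": ["Hunman"],
                   "first_name": ["Jax"]})

# Drop the employee_-prefixed duplicates (assumption: the short names
# are the ones People Ops wants to keep)...
df = df.drop(columns=["employee_last_name", "employee_first_name"])
# ...then apply the renames (employee_id -> id is an assumed example).
df = df.rename(columns={"employee_id": "id"})
print(df.columns.tolist())
```

`drop(columns=...)` and `rename(columns=...)` both return new DataFrames, so the same steps also chain cleanly if you prefer one expression.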
6. Changing column order
Now that you have the appropriate column names, you can reorder the columns.
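Selecting with an explicit column list both reorders the columns and documents the final layout. The particular order below is illustrative, not the one People Ops specified:

```python
import pandas as pd

# Stand-in DataFrame with columns in an arbitrary order:
df = pd.DataFrame({"monthly_salary": ["$4500"],
                   "id": ["A2R5H9"],
                   "last_name": ["Hunman"]})

# Indexing with a list of column names returns the columns in that
# order (the order itself is an assumption here):
column_order = ["id", "last_name", "monthly_salary"]
df = df[column_order]
print(df.columns.tolist())
```

Keeping the order in a named list also gives later readers of the script one obvious place to see, and change, the expected layout.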
7. The last-minute request
Last touches! You were ready to let People Ops know that the DataFrame was ready, but the department head came over to your desk after lunch, asking about some last-minute requirements.
Let's polish the DataFrame before exporting the data, sending it over to People Ops, and deploying the pipeline.
8. Saving your work
You now have everything People Ops requested. The people responsible for the various source files can keep working on them if they want. As long as they save their files in the datasets folder, People Ops only has to run this single script to obtain one consolidated file in place of the ones scattered across different teams.
You bumped into the Head of People Ops and shared a few caveats and areas for improvement. She booked a meeting with you so you can explain them in detail.
In any case, you still need to show up with what was requested - so let's export your DataFrame to a CSV file.
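Exporting is a one-liner with `DataFrame.to_csv`; `index=False` keeps the positional row index out of the file. The filename is an assumption, and the round-trip read is just a sanity check:

```python
import pandas as pd

# Stand-in for the finished DataFrame (filename is an assumption):
employees_final = pd.DataFrame({"id": ["A2R5H9"],
                                "last_name": ["Hunman"]})
employees_final.to_csv("employees_final.csv", index=False)

# Sanity check: the saved file reloads with the same shape and columns.
check = pd.read_csv("employees_final.csv")
print(check.shape)
```

Without `index=False`, the exported CSV would gain an unnamed leading column of row numbers, which People Ops would then have to ignore or clean up.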