1. Loading data from CSV and Excel files
You just got hired as the first and only data practitioner at a business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files, SQL databases.
The Head of People Operations wants a general view gathering all available information about a specific employee. Your job is to gather it all in a file that will serve as the reference moving forward. You will merge all of this data in a pandas DataFrame before exporting it to CSV.
Data management at your company is not the best, but you need to start somewhere. You decide to tackle the most straightforward tasks first, and to begin by loading the company office addresses. They are currently saved into a CSV file, office_addresses.csv, which the Office Manager sent over to you. Additionally, an HR manager you remember interviewing with gave you access to the Excel file, employee_information.xlsx, where the employee addresses are saved. You need to load these datasets in two separate DataFrames.
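The two loading calls can be sketched as follows. This is a minimal sketch, assuming pandas is available; an in-memory `StringIO` with made-up columns stands in for the actual `office_addresses.csv`, and the calls against the real files are shown as comments.

```python
from io import StringIO

import pandas as pd

# StringIO stands in for office_addresses.csv; columns are illustrative.
csv_data = StringIO(
    "office,office_country,office_city\n"
    "Leuven Office,BE,Leuven\n"
    "WeWork Office,GB,London\n"
)
df_office_addresses = pd.read_csv(csv_data)

# Against the real files, the calls would look like:
# df_office_addresses = pd.read_csv("office_addresses.csv")
# df_employee_addresses = pd.read_excel("employee_information.xlsx")

print(df_office_addresses.head())
```

By default, `pd.read_excel` reads the first sheet of the workbook, which is what you want here for the employee addresses.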
2. Loading employee data from Excel sheets
It turns out the employee_information.xlsx file also holds information about emergency contacts for each employee in a second sheet titled emergency_contacts. However, this sheet was edited at some point, and the header was removed!
Looking at the data, you were able to figure out what the header should be, and you confirmed the column names with the HR manager: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, relationship.
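`pd.read_excel` takes `sheet_name` to pick the sheet, `header=None` to tell pandas the first row is data rather than a header, and `names` to supply the column names yourself. A sketch of the idea, with the real workbook call commented out and a headerless CSV stand-in (which shares the same `header`/`names` mechanics and a made-up sample row) used so the snippet runs on its own:

```python
from io import StringIO

import pandas as pd

# The column names confirmed with the HR manager:
emergency_cols = ["employee_id", "last_name", "first_name",
                  "emergency_contact", "emergency_contact_number",
                  "relationship"]

# Against the real workbook, the call would be:
# df_emergency_contacts = pd.read_excel(
#     "employee_information.xlsx",
#     sheet_name="emergency_contacts",
#     header=None,           # the sheet's header row was removed
#     names=emergency_cols,  # so supply the names explicitly
# )

# Headerless stand-in with an illustrative row, same mechanics:
raw = StringIO("A2R5H9,Hunman,Jax,Opie Hurst,+32-456-5556-84,Brother\n")
df_emergency_contacts = pd.read_csv(raw, header=None, names=emergency_cols)
print(df_emergency_contacts)
```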
3. Loading role data from JSON files
Now the next step is to gather information about employee roles, teams, and salaries. This information usually lives in a human resources management system, but the Head of People Operations exported the data for you into a JSON file titled employee_roles.json.
Looking at the JSON file, you see entries are structured in a specific way. It is built as a Python dictionary: the keys are employee IDs, and each employee ID has a corresponding dictionary value holding role, salary, and team information. Here are the first few lines of the file:
{"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
Load the JSON file to a variable df_employee_roles, choosing the appropriate orientation.
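Since the top-level keys are employee IDs and each value is a dictionary of attributes, the keys should become the DataFrame index: that is `orient="index"`. A runnable sketch using the snippet above wrapped in a `StringIO` (the real call would pass the path `employee_roles.json` instead):

```python
from io import StringIO

import pandas as pd

# The JSON maps employee IDs to attribute dicts, so the keys
# should become the index: orient="index".
json_data = StringIO(
    '{"A2R5H9": {"title": "CEO", "monthly_salary": "$4500", '
    '"team": "Leadership"}}'
)
df_employee_roles = pd.read_json(json_data, orient="index")
print(df_employee_roles)
```

Note that `monthly_salary` is loaded as a string (it contains a `$` sign), which is fine for now.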
4. Merging several DataFrames into one
You now have all the data required! All that's left is bringing it all together in a single DataFrame. This single DataFrame will enable the Head of People Operations to access all employee data at once.
In this step, you will merge all DataFrames.
In the next step, you will remove duplicates and reorganize the columns - don't worry about this for now.
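The merging step can be sketched with toy stand-ins for the real DataFrames (the keys and columns below are illustrative). The only wrinkle is that the roles DataFrame keeps employee IDs in its index, so that merge matches a column on one side against the index on the other:

```python
import pandas as pd

# Toy stand-ins for the real DataFrames; values are illustrative.
df_employees = pd.DataFrame({"employee_id": ["A2R5H9"],
                             "employee_country": ["BE"]})
df_emergency = pd.DataFrame({"employee_id": ["A2R5H9"],
                             "emergency_contact": ["Opie Hurst"]})
df_roles = pd.DataFrame({"title": ["CEO"]}, index=["A2R5H9"])

# Merge on the shared employee ID column...
df_all = df_employees.merge(df_emergency, on="employee_id", how="left")
# ...then match the ID column against df_roles' index.
df_all = df_all.merge(df_roles, left_on="employee_id",
                      right_index=True, how="left")
print(df_all)
```

Left merges keep every employee from the base DataFrame even when a lookup has no match for them.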
5. Editing column names
Now that you merged all of your DataFrames into one, let's make sure you have the information required by People Ops.
Currently, your df_employees DataFrame has the following column titles: employee_id, employee_last_name, employee_first_name, employee_country, employee_city, employee_street, employee_street_number, last_name, first_name, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.
The columns employee_last_name and last_name are duplicates. The columns employee_first_name and first_name are duplicates as well. On top of this, People Ops wants to rename some of the columns:
employee_id should be id
employee_country should be country
employee_city should be city
employee_street should be street
employee_street_number should be street_number
emergency_contact_number should be emergency_number
relationship should be emergency_relationship
So your header should look like this in the end: id, country, city, street, street_number, last_name, first_name, emergency_contact, emergency_number, emergency_relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.
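Dropping the duplicates and applying the renames can be sketched on a toy DataFrame carrying a subset of the columns; the full script would simply extend the `columns` list and rename mapping to cover the whole header above:

```python
import pandas as pd

# Toy frame with a subset of the columns; same mechanics for the full set.
df_employees = pd.DataFrame({
    "employee_id": ["A2R5H9"],
    "employee_last_name": ["Hunman"],
    "last_name": ["Hunman"],
    "employee_country": ["BE"],
})

# Drop the duplicated columns (keep last_name / first_name)...
df_employees = df_employees.drop(columns=["employee_last_name"])
# ...then apply the renames People Ops asked for.
df_employees = df_employees.rename(columns={
    "employee_id": "id",
    "employee_country": "country",
})
print(df_employees.columns.tolist())
```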
6. Changing column order
Now that you have the appropriate column names, you can reorder the columns.
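Reordering is a one-liner: selecting with a list of column names returns the columns in that order. A sketch on a toy frame (the full script would list all eighteen column names):

```python
import pandas as pd

df_employees = pd.DataFrame({
    "last_name": ["Hunman"], "id": ["A2R5H9"], "country": ["BE"],
})

# Selecting with a list returns the columns in that order.
new_column_order = ["id", "country", "last_name"]
df_employees = df_employees[new_column_order]
print(df_employees.columns.tolist())
```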
7. The last minute request
Last touches! You were ready to let People Ops know that the DataFrame was ready, but the department head just went over to your desk after lunch, asking about some last-minute requirements.
Let's polish the DataFrame before exporting the data, sending it over to People Ops, and deploying the pipeline:
All street numbers should be integers
The index should be the actual employee ID rather than the row number
If the value for office is NaN, then the employee is remote: add a column named "status" right after monthly_salary, indicating whether the employee is "On-site" or "Remote".
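The three requests above can be sketched on a toy frame (values are illustrative). `DataFrame.insert` places the new column at an explicit position, which handles the "right after monthly_salary" requirement:

```python
import pandas as pd

# Toy frame; column names follow the header People Ops asked for.
df_employees = pd.DataFrame({
    "id": ["A2R5H9", "H8K0L6"],
    "street_number": ["21", "7"],        # loaded as strings
    "monthly_salary": ["$4500", "$3000"],
    "office": ["Leuven Office", None],   # NaN means the employee is remote
})

# 1. Street numbers as integers
df_employees["street_number"] = df_employees["street_number"].astype(int)

# 2. Status column placed right after monthly_salary
status = df_employees["office"].isna().map({True: "Remote",
                                            False: "On-site"})
position = df_employees.columns.get_loc("monthly_salary") + 1
df_employees.insert(position, "status", status)

# 3. Employee ID as the index
df_employees = df_employees.set_index("id")
print(df_employees)
```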
8. Saving your work
You now have everything People Ops requested. The people responsible for these various files can keep working on them if they want: as long as they save them in the datasets folder, People Ops will only need to run this one script to obtain a single consolidated file from the ones scattered across different teams.
You bumped into the Head of People Ops and shared a few caveats and areas for improvement, so she booked a meeting where you can explain them in detail.
In any case, you still need to show up with what was requested - so let's export your DataFrame to a CSV file.
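The export is a single `to_csv` call. A sketch on a toy stand-in for the finished DataFrame; the output filename `employee_data.csv` is illustrative, not prescribed by the source. Called without a path, `to_csv` returns the CSV text, which is handy for a quick sanity check:

```python
import pandas as pd

# Toy stand-in for the finished df_employees.
df_employees = pd.DataFrame(
    {"country": ["BE"], "status": ["On-site"]},
    index=pd.Index(["A2R5H9"], name="id"),
)

# Without a path, to_csv returns the CSV text for inspection.
csv_text = df_employees.to_csv()
print(csv_text)

# For the real export (filename is illustrative):
# df_employees.to_csv("employee_data.csv")
```

Since the employee ID is the index, it is written as the first column by default, which is exactly what People Ops wants.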