Streamlining People Analytics Data

1. Loading data from CSV and Excel files

You just got hired as the first and only data practitioner at a business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files, SQL databases.

The Head of People Operations wants a single view that gathers all available information about each employee. Your job is to gather it all in a file that will serve as the reference moving forward. You will merge all of this data into a pandas DataFrame before exporting it to CSV.

Data management at your company is not the best, but you need to start somewhere. You decide to tackle the most straightforward tasks first, and to begin by loading the company office addresses. They are currently saved into a CSV file, office_addresses.csv, which the Office Manager sent over to you. Additionally, an HR manager you remember interviewing with gave you access to the Excel file, employee_information.xlsx, where the employee addresses are saved. You need to load these datasets in two separate DataFrames.

# Import the library you need
import pandas as pd

# Load office_addresses.csv
df_office_addresses = pd.read_csv("office_addresses.csv")

# Load employee_information.xlsx
df_employee_addresses = pd.read_excel("employee_information.xlsx")

# Take a look at the first rows of the DataFrames
print(df_office_addresses.head())
print(df_employee_addresses.head())

2. Loading employee data from Excel sheets

It turns out the employee_information.xlsx file also holds information about emergency contacts for each employee in a second sheet titled emergency_contacts. However, this sheet was edited at some point, and the header was removed!

Looking at the data, you were able to figure out what the column names should be, and you confirmed with the HR manager that they were appropriate: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, relationship.

# Load data from the second sheet of employee_information.xlsx
df_emergency_contacts = pd.read_excel("employee_information.xlsx", sheet_name=1, header=None)

# Declare a list of new column names
emergency_contacts_header = ["employee_id", "last_name", "first_name", "emergency_contact", "emergency_contact_number", "relationship"]

# Rename the columns
df_emergency_contacts.columns = emergency_contacts_header

# Take a look at the first rows of the DataFrame
df_emergency_contacts.head()
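As an alternative to renaming the columns after loading, the pandas readers accept header=None together with a names argument. The sketch below uses pd.read_csv on an in-memory string so that it runs without any file; pd.read_excel takes the same two arguments. The sample rows are invented for illustration and are not taken from the real sheet.

```python
import io

import pandas as pd

# Hypothetical header-less rows standing in for the emergency_contacts sheet
raw = io.StringIO(
    "A2R5H9,Doe,Jane,John Doe,555-0100,Spouse\n"
    "Z9Y8X7,Roe,Rick,Rita Roe,555-0101,Sister\n"
)

emergency_contacts_header = [
    "employee_id",
    "last_name",
    "first_name",
    "emergency_contact",
    "emergency_contact_number",
    "relationship",
]

# header=None says the data has no header row; names supplies the column names
df_contacts_sketch = pd.read_csv(
    raw,
    header=None,
    names=emergency_contacts_header,
    dtype={"emergency_contact_number": str},  # keep phone numbers as text
)

print(df_contacts_sketch.head())
```

Reading the phone numbers as strings is a design choice in this sketch: it avoids losing leading zeros or "+" prefixes if a number happens to look numeric.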

3. Loading role data from JSON files

Now the next step is to gather information about employee roles, teams, and salaries. This information usually lives in a human resources management system, but the Head of People Operations exported the data for you into a JSON file titled employee_roles.json.

Looking at the JSON file, you see that the entries follow a specific structure, similar to a Python dictionary: the keys are employee IDs, and each employee ID has a corresponding dictionary value holding role, salary, and team information. Here are the first few lines of the file:

{
  "A2R5H9": {
    "title": "CEO",
    "monthly_salary": "$4500",
    "team": "Leadership"
  },
  ...
}

Load the JSON file into a DataFrame named df_employee_roles, choosing the appropriate orientation.

# Load employee_roles.json; orient="index" turns each employee ID into a row label
df_employee_roles = pd.read_json("employee_roles.json", orient="index")

# Sort the columns alphabetically
df_employee_roles = df_employee_roles.reindex(sorted(df_employee_roles.columns), axis=1)

# Take a look at the first rows of the DataFrame
df_employee_roles.head()
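To see why the orientation matters, here is a small sketch that reads the same JSON text twice, once with orient="index" and once with orient="columns". The first entry comes from the excerpt above; the second entry ("Z9Y8X7") is hypothetical and only there to give the frame two rows.

```python
import io

import pandas as pd

# First entry from the excerpt; the "Z9Y8X7" entry is made up for illustration
sample_text = """{
  "A2R5H9": {"title": "CEO", "monthly_salary": "$4500", "team": "Leadership"},
  "Z9Y8X7": {"title": "Analyst", "monthly_salary": "$3000", "team": "Data"}
}"""

# orient="index": outer keys (employee IDs) become the row index
df_by_index = pd.read_json(io.StringIO(sample_text), orient="index")

# orient="columns": outer keys become the columns, so the table is transposed
df_by_columns = pd.read_json(io.StringIO(sample_text), orient="columns")

print(df_by_index)
print(df_by_columns)
```

With one row per employee, df_by_index is the shape needed for the merge on employee IDs later on.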

4. Merging several DataFrames into one

You now have all the data required! All that's left is to bring it together into a single DataFrame, which will enable the Head of People Operations to access all employee data at once.

In this step, you will merge all DataFrames.

In the next step, you will remove duplicates and reorganize the columns - don't worry about this for now.

# Merge df_employee_addresses with df_emergency_contacts
df_employees = df_employee_addresses.merge(df_emergency_contacts, how="left", on="employee_id")

# Merge df_employees with df_employee_roles (employee IDs are the index of df_employee_roles)
df_employees = df_employees.merge(df_employee_roles, how="left", left_on="employee_id", right_on=df_employee_roles.index)

# Merge df_employees with df_office_addresses
df_employees = df_employees.merge(df_office_addresses, how="left", left_on="employee_country", right_on="office_country")

# Take a look at the first rows of the DataFrame and its columns
print(df_employees.head())
print(df_employees.columns)
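The behavior of these left merges can be checked on miniature data. The sketch below uses hypothetical frames (left, roles, offices) with invented values. It joins against the index with right_index=True, which is one alternative to passing the index as right_on, and adds validate="many_to_one" so that pandas raises an error if a country appears twice in the office table.

```python
import pandas as pd

# Hypothetical miniature versions of the real DataFrames
left = pd.DataFrame(
    {"employee_id": ["A1", "B2", "C3"], "employee_country": ["BE", "GB", "FR"]}
)
roles = pd.DataFrame({"title": ["CEO", "Analyst"]}, index=["A1", "B2"])
offices = pd.DataFrame(
    {"office": ["Brussels HQ", "London"], "office_country": ["BE", "GB"]}
)

# Join the employee_id column against the index of roles
merged = left.merge(roles, how="left", left_on="employee_id", right_index=True)

# Many employees may share one office; each country must be unique in offices
merged = merged.merge(
    offices,
    how="left",
    left_on="employee_country",
    right_on="office_country",
    validate="many_to_one",
)

print(merged)
```

A left merge keeps every employee: "C3" has no role and no office in this toy data, so those columns hold NaN for that row rather than the row being dropped.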

5. Editing column names

Now that you have merged all of your DataFrames into one, let's make sure you have the information required by People Ops.

Currently, your df_employees DataFrame has the following column titles: employee_id, employee_last_name, employee_first_name, employee_country, employee_city, employee_street, employee_street_number, last_name, first_name, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.

The columns employee_last_name and last_name are duplicates. The columns employee_first_name and first_name are duplicates as well. On top of this, People Ops wants to rename some of the columns:

employee_id should be id

employee_country should be country

employee_city should be city

employee_street should be street

employee_street_number should be street_number

emergency_contact_number should be emergency_number

relationship should be emergency_relationship

So your header should look like this in the end: id, country, city, street, street_number, last_name, first_name, emergency_contact, emergency_number, emergency_relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.

# Drop the duplicated name columns
df_employees_renamed = df_employees.drop(["employee_first_name", "employee_last_name"], axis=1)

# New column names
new_column_names = {"employee_id": "id", "employee_country": "country", "employee_city": "city", "employee_street": "street", "employee_street_number": "street_number", "relationship": "emergency_relationship", "emergency_contact_number": "emergency_number"}

# Rename the columns
df_employees_renamed = df_employees_renamed.rename(columns=new_column_names)

# Take a look at the first rows of the DataFrame
df_employees_renamed.head()
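Before dropping columns that are assumed to be duplicates, it can be worth confirming that they really hold the same values. This is a minimal sketch on invented rows; the deliberate mismatch in the first names is hypothetical and only there to show what the check reports.

```python
import pandas as pd

# Hypothetical rows; the real df_employees comes from the merges above
df_names_sketch = pd.DataFrame(
    {
        "employee_last_name": ["Doe", "Roe"],
        "last_name": ["Doe", "Roe"],
        "employee_first_name": ["Jane", "Rick"],
        "first_name": ["Jane", "Richard"],
    }
)

pairs = [
    ("employee_last_name", "last_name"),
    ("employee_first_name", "first_name"),
]

# Count rows where the two columns disagree.
# Note: NaN != NaN is True, so rows missing in both columns also count.
mismatches = {
    a: int((df_names_sketch[a] != df_names_sketch[b]).sum()) for a, b in pairs
}

print(mismatches)
```

A count of zero for a pair suggests that dropping one of the two columns loses no information.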

6. Changing column order

Now that you have the appropriate column names, you can reorder the columns.

# Declare a list with the new column order
new_column_order = ["id", "last_name", "first_name", "title", "team", "monthly_salary", "country", "city", "street", "street_number", "emergency_contact", "emergency_number", "emergency_relationship", "office", "office_country", "office_city", "office_street", "office_street_number"]

# Reorder the columns
df_employees_ordered = df_employees_renamed[new_column_order]

# Take a look at the result
df_employees_ordered.head()
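Selecting columns with a list raises a KeyError for a misspelled name, but it silently drops any column left out of the list. A quick set comparison, sketched here on a hypothetical three-column frame, guards against both cases.

```python
import pandas as pd

# Hypothetical small frame standing in for df_employees_renamed
df_order_sketch = pd.DataFrame(
    {"id": ["A1"], "team": ["Data"], "title": ["Analyst"]}
)
new_order = ["id", "title", "team"]

# Columns that would be dropped, and names that do not exist in the frame
forgotten = set(df_order_sketch.columns) - set(new_order)
unknown = set(new_order) - set(df_order_sketch.columns)

if forgotten or unknown:
    raise ValueError(f"forgotten={forgotten}, unknown={unknown}")

df_reordered = df_order_sketch[new_order]
print(df_reordered.columns.tolist())
```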

7. The last minute request

Last touches! You were ready to let People Ops know that the DataFrame was ready, but the department head just came over to your desk after lunch with some last-minute requirements.

Let's polish the DataFrame before exporting the data, sending it over to People Ops, and deploying the pipeline:

All street numbers should be integers

The index should be the actual employee ID rather than the row number

If the value for office is NaN, then the employee is remote: add a column named "status", right after monthly_salary, indicating whether the employee is "On-site" or "Remote".

# Set the employee ID as the index; set_index drops the id column by default
df_employees_final = df_employees_ordered.set_index("id")

# Loop through the row values and append to status_list accordingly
status_list = []
for index, row in df_employees_final.iterrows():
    if pd.isnull(row["office"]):
        status_list.append("Remote")
    else:
        status_list.append("On-site")

# Or, equivalently, as a list comprehension
status_list = ["Remote" if pd.isnull(row["office"]) else "On-site" for index, row in df_employees_final.iterrows()]

# Insert status as a new column, right after monthly_salary (position 5)
df_employees_final.insert(loc=5, column="status", value=status_list)

# Take a look at the first rows of the DataFrame
df_employees_final.head()
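The street number requirement and the status column can also be handled without a row loop. The sketch below assumes the street number columns were loaded as floats (which is what pandas does when an integer column contains missing values) and uses the nullable "Int64" dtype so that remote employees keep a missing office street number. The data is invented, and the cast would fail on non-numeric values such as "12A".

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "B2" has no office and is therefore remote
df_final_sketch = pd.DataFrame(
    {
        "monthly_salary": ["$4500", "$3000"],
        "street_number": [12.0, 7.0],
        "office": ["Brussels HQ", np.nan],
        "office_street_number": [39.0, np.nan],
    },
    index=pd.Index(["A1", "B2"], name="id"),
)

# Nullable Int64 stores integers and keeps missing values as <NA>,
# whereas astype(int) raises an error when NaN is present
for col in ["street_number", "office_street_number"]:
    df_final_sketch[col] = df_final_sketch[col].astype("Int64")

# Vectorized equivalent of the loop: NaN office means remote
status = np.where(df_final_sketch["office"].isna(), "Remote", "On-site")

# Compute the position from the column name instead of hard-coding it
position = df_final_sketch.columns.get_loc("monthly_salary") + 1
df_final_sketch.insert(loc=position, column="status", value=status)

print(df_final_sketch)
print(df_final_sketch.dtypes)
```

Looking up the position with get_loc keeps the status column next to monthly_salary even if the column order changes later.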

8. Saving your work

You now have everything People Ops requested. The people responsible for the various source files can keep working on them if they want. As long as they save them in the datasets folder, People Ops only has to run this single script to combine the files scattered across teams into one.

You bumped into the Head of People Ops and shared a few caveats and areas of improvement. She booked a meeting with you so you can explain:

How the current structure isn't robust to role changes: what if an existing employee takes on a new role?

How the current structure doesn't fit best practices in terms of database schema:

Having data all over the place, as is the case right now, is a no-go

But gathering everything in a single table is inefficient: you have to query all information even if all you want is a phone number

There should be a single SQL database for employee data, with several tables that can be joined

Views can be built on top of the database to simplify access for non-data practitioners.
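The points above can be made concrete with a small sketch using Python's built-in sqlite3 module. The table names, column names, and rows are all hypothetical and are not the company's actual schema. The roles table has one row per role period, so a role change becomes a new row, and the view gives a merged overview without storing everything in one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: several joinable tables plus a view on top
conn.executescript(
    """
    CREATE TABLE employees (
        id TEXT PRIMARY KEY,
        last_name TEXT,
        first_name TEXT
    );
    CREATE TABLE roles (
        employee_id TEXT REFERENCES employees(id),
        title TEXT,
        team TEXT,
        start_date TEXT
    );
    CREATE TABLE emergency_contacts (
        employee_id TEXT REFERENCES employees(id),
        contact_name TEXT,
        contact_number TEXT
    );
    CREATE VIEW employee_overview AS
    SELECT e.id, e.last_name, r.title, c.contact_number
    FROM employees e
    LEFT JOIN roles r ON r.employee_id = e.id
    LEFT JOIN emergency_contacts c ON c.employee_id = e.id;
    """
)

conn.execute("INSERT INTO employees VALUES ('A1', 'Doe', 'Jane')")
# A role change is a second row, not an overwrite
conn.executemany(
    "INSERT INTO roles VALUES (?, ?, ?, ?)",
    [
        ("A1", "Analyst", "Data", "2020-01-01"),
        ("A1", "Manager", "Data", "2022-06-01"),
    ],
)
conn.execute("INSERT INTO emergency_contacts VALUES ('A1', 'John Doe', '555-0100')")

# Fetching a phone number touches only one small table
phone = conn.execute(
    "SELECT contact_number FROM emergency_contacts WHERE employee_id = ?",
    ("A1",),
).fetchone()[0]

# The view returns one row per role period for the employee
overview = conn.execute(
    "SELECT * FROM employee_overview ORDER BY title"
).fetchall()

print(phone)
print(overview)
```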

In any case, you still need to show up with what was requested - so let's export your DataFrame to a CSV file.

# Write to CSV
df_employees_final.to_csv("employee_data.csv")
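Because the employee ID is the index, to_csv writes it as the first column of the file. A short round trip, sketched here with an in-memory buffer and invented data rather than the real file, shows how a reader can restore that index with index_col.

```python
import io

import pandas as pd

# Hypothetical one-row frame with the employee ID as a named index
df_export_sketch = pd.DataFrame(
    {"last_name": ["Doe"], "street_number": [12]},
    index=pd.Index(["A1"], name="id"),
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_export_sketch.to_csv(buffer)

# Read it back, telling pandas which column holds the index
buffer.seek(0)
df_reloaded = pd.read_csv(buffer, index_col="id")

print(df_reloaded)
```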
