Simple ETL with Pandas

What is "ETL"?

ETL stands for extract, transform, and load. It is a data integration process that combines data from multiple sources into a single, consistent data store, which is then loaded into a data warehouse or other target system (source: "https://www.ibm.com/cloud/learn/etl"). Put simply, ETL is the first, foundational step of data processing.

What is "Pandas"?

Pandas (Python Data Analysis) is an open-source Python library that provides data structures and data analysis tools that are easy to use. Pandas can read and write many data formats, such as:

CSV

XLSX

JSON

SQL

HTML, and

XML

Pandas loads them into rows and columns in a structure called a DataFrame.
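As a small illustration of the point above, the sketch below reads the same records from CSV text and from JSON text. The sample records and variable names are made up for this example, and the "files" are in-memory strings rather than real files.

```python
import io

import pandas as pd

# Two tiny in-memory "files" holding the same made-up records.
csv_text = "name,score\nAyu,90\nBudi,85\n"
json_text = '[{"name": "Ayu", "score": 90}, {"name": "Budi", "score": 85}]'

# Each reader returns a DataFrame with the same rows and columns.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Whatever the source format, the result is the same kind of object, so the later steps do not need to know where the data came from.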

Let's Begin!!!

Extract

Extract is the process of pulling data from its sources. A source can be a relational database (SQL) or tables, a non-relational store (NoSQL), or any of the file formats listed earlier. First, we import the library we need for this step.

import pandas as pd #pd is the common alias of pandas library

df = pd.read_csv('https://storage.googleapis.com/dqlab-dataset/dqthon-participants.csv')

print(df.head(5))

print(df.info())

Transform

Transform is the process of changing the data. Typical transformations include:

Changing the values of a column to new values,

Creating a new column from another column,

Transposing rows to columns (or vice versa),

Changing the data format to a more standard form (for example, date and datetime columns, or mobile phone numbers, which often hold values that do not follow a standard format), and others.
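The four kinds of transformation listed above can be sketched on a toy table. The column names and values here are invented for illustration and are not part of the participants dataset.

```python
import pandas as pd

# A made-up table; "joined" holds day/month/year strings.
df_demo = pd.DataFrame({
    "name": ["ayu", "budi"],
    "joined": ["07/04/2021", "15/12/2020"],
})

# 1. Change the values of a column.
df_demo["name"] = df_demo["name"].str.title()

# 2. Create a new column from another column.
df_demo["initial"] = df_demo["name"].str[0]

# 3. Bring a date column into a standard form.
df_demo["joined"] = pd.to_datetime(df_demo["joined"], format="%d/%m/%Y")

# 4. Transpose rows to columns.
transposed = df_demo.T

print(df_demo)
print(transposed)
```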

Postal Code

First, we create a new column, "postal_code", that holds the postal code. We take the postal code from the digits at the end of the full address.

df["postal_code"] = df["address"].str.extract(r"(\d+)$")
print(df["postal_code"].head())
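To see what the pattern does, here is a minimal sketch on two made-up addresses (they are placeholders, not rows from the dataset). The second address has no trailing digits, so no postal code is found.

```python
import pandas as pd

# Hypothetical addresses: street line, newline, then city and province.
addresses = pd.Series([
    "Jl. Contoh No. 1\nKota Contoh, Provinsi Contoh 12345",
    "Gg. Sampel No. 22\nDesa Sampel, Provinsi Sampel",  # no postal code
])

# (\d+)$ captures the run of digits that ends the whole string.
# expand=False returns a Series instead of a one-column DataFrame.
postal = addresses.str.extract(r"(\d+)$", expand=False)

print(postal.tolist())
```

The street numbers (1 and 22) are not captured because they are not at the end of the string. Rows without a match come back as missing values, which is worth checking for before loading.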

City

Second, we get the city, which comes after the street number and \n (the newline character) and before the comma.

df['city'] = df['address'].str.extract(r'(?<=\n)(\w.+)(?=,)')
print(df['city'].head())
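The sketch below applies the same pattern to two made-up addresses. It also shows one caveat: because `.+` is greedy, an address whose second line contains more than one comma is captured up to the last comma, not the first.

```python
import pandas as pd

# Hypothetical addresses; the second one has two commas on its last line.
addresses = pd.Series([
    "Jl. Contoh No. 1\nKota Contoh, Provinsi Contoh 12345",
    "Jl. Sampel No. 2\nKota Sampel, Kab. Sampel, Provinsi Sampel 54321",
])

# (?<=\n) : start right after the newline
# (\w.+)  : capture text starting with a word character
# (?=,)   : stop just before a comma (the last one, since .+ is greedy)
city = addresses.str.extract(r"(?<=\n)(\w.+)(?=,)", expand=False)

print(city.tolist())
```

If the real data can contain several commas after the newline, a non-greedy `.+?` would stop at the first comma instead. Whether that is the right choice depends on the address format, which this sketch does not assume.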

Github Profile

To see what projects the participants have worked on, we check their GitHub profiles. We assume that each GitHub username is the first name and last name joined together in lowercase.

df['github_profile'] = 'https://github.com/' + df['first_name'].str.lower() + df['last_name'].str.lower()
print(df['github_profile'].head())

Phone Number

The phone numbers in this CSV come in different formats, so we convert them to a single format under these rules:

If the number starts with 62 or +62, the Indonesian country code, the prefix is replaced with 0.

The number contains no punctuation such as opening parentheses, closing parentheses, or hyphens: ( ) -

The number contains no spaces. The cleaned result is stored in a new column named cleaned_phone_number.

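# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df['cleaned_phone_number'] = df['phone_number'].str.replace(r'^(\+62|62)', '0', regex=True)  # country code -> 0
df['cleaned_phone_number'] = df['cleaned_phone_number'].str.replace(r'[()-]', '', regex=True)  # drop ( ) -
df['cleaned_phone_number'] = df['cleaned_phone_number'].str.replace(r'\s+', '', regex=True)  # drop whitespace
print(df['cleaned_phone_number'].head())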
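The three rules can be checked on a few made-up numbers written in different styles. The numbers are placeholders chosen so that all three clean to the same value.

```python
import pandas as pd

# Hypothetical phone numbers in three different formats.
phones = pd.Series([
    "+62 (812) 345-6789",
    "62812 3456 789",
    "(0812) 3456789",
])

cleaned = (
    phones
    .str.replace(r"^(\+62|62)", "0", regex=True)  # country code -> 0
    .str.replace(r"[()-]", "", regex=True)        # drop ( ) -
    .str.replace(r"\s+", "", regex=True)          # drop whitespace
)

print(cleaned.tolist())
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for this parameter has changed over time.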

Team Name

The team name combines the first name, last name, country, and institute. The format is "initials of the first and last name-country-abbreviation of the institute, formed from the first letter of each word".

def team(col):
    # Initials of the first and last name
    abbrev_name = "%s%s" % (col['first_name'][0], col['last_name'][0])
    country = col['country']
    # First letter of each word in the institute name
    abbrev_institute = ''.join(word[0] for word in col['institute'].split())
    return "%s-%s-%s" % (abbrev_name, country, abbrev_institute)

df['team_name'] = df.apply(team, axis=1)
print(df['team_name'].head())
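As a worked example of the format, the sketch below builds team names for two made-up participants. The helper name `team_name` and the sample rows are illustrative only.

```python
import pandas as pd

def team_name(row):
    """Build "<initials>-<country>-<institute abbreviation>" for one row."""
    initials = row["first_name"][0] + row["last_name"][0]
    institute_abbrev = "".join(word[0] for word in row["institute"].split())
    return f"{initials}-{row['country']}-{institute_abbrev}"

# Hypothetical participants.
sample = pd.DataFrame({
    "first_name": ["Budi", "Sari"],
    "last_name": ["Santoso", "Dewi"],
    "country": ["Indonesia", "South Korea"],
    "institute": ["PT Maju Jaya", "Universitas Contoh"],
})

# axis=1 passes each row to the function.
sample["team_name"] = sample.apply(team_name, axis=1)
print(sample["team_name"].tolist())
```

Note that the country is kept as written, including any spaces, while the name and institute are abbreviated.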

Email

The email format is xxyy@aa.bb.[ac/com].[cc]

Where:
- xx -> first name (first_name) in lowercase
- yy -> last name (last_name) in lowercase
- aa -> institution name

The values of bb and cc depend on aa. The rules are:
- If the institution is a university, bb is the first letters of each word of the university name, in lowercase. It is followed by .ac, which indicates an academic institution, and then by the cc pattern.
- If the institution is not a university, bb is the first letters of each word of the institution name, in lowercase. It is followed by .com. Note that the cc pattern does not apply in this case.

cc is the participant's country of origin, under these rules:
- If the country name has more than one word, take the first letter of each word, in lowercase.
- If the country name has only one word, take its first 3 letters, in lowercase.

Example:
First name: Citra
Last name: Nurdiyanti
Institution: UD Prakasa Mandasari
Country: Georgia
The institution is not a university, so the email is: citranurdiyanti@upm.com
--------------------------------------------------
First name: Aris
Last name: Setiawan
Institution: Universitas Diponegoro
Country: North Korea
So the email is: arissetiawan@ud.ac.nk

def func(col):
    first_name_lower = col['first_name'].lower()
    last_name_lower = col['last_name'].lower()
    # Abbreviation of the institution name, in lowercase
    institute = ''.join(word[0] for word in col['institute'].lower().split())
    if 'Universitas' in col['institute']:
        # Check whether the country name has more than one word
        if len(col['country'].split()) > 1:
            country = ''.join(word[0] for word in col['country'].lower().split())
        else:
            country = col['country'][:3].lower()
        return "%s%s@%s.ac.%s" % (first_name_lower, last_name_lower, institute, country)
    return "%s%s@%s.com" % (first_name_lower, last_name_lower, institute)

df['email'] = df.apply(func, axis=1)
print(df['email'].head())
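To check the rules case by case, here is a small sketch that applies the same logic to individual values instead of DataFrame rows. The helper names `abbreviate` and `build_email` and the sample participants are made up for illustration. The sketch assumes, as the code above does, that a university is recognized by the word "Universitas" in its name.

```python
def abbreviate(text):
    """Join the lowercase first letters of each word."""
    return "".join(word[0] for word in text.lower().split())

def build_email(first_name, last_name, institute, country):
    local = first_name.lower() + last_name.lower()
    domain = abbreviate(institute)
    # Not a university: ".com", and the country part does not apply.
    if "Universitas" not in institute:
        return f"{local}@{domain}.com"
    # University: ".ac" plus a country code.
    if len(country.split()) > 1:
        cc = abbreviate(country)   # several words: first letters
    else:
        cc = country[:3].lower()   # one word: first three letters
    return f"{local}@{domain}.ac.{cc}"

# Hypothetical participants covering the three cases.
company = build_email("Budi", "Santoso", "PT Maju Jaya", "Indonesia")
uni_one_word = build_email("Sari", "Dewi", "Universitas Contoh", "Indonesia")
uni_two_words = build_email("Rina", "Putri", "Universitas Sampel", "South Korea")

print(company, uni_one_word, uni_two_words)
```

Testing the function on plain values like this makes it easier to confirm each branch before applying it to the whole DataFrame.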

Birth Date

We follow the MySQL DATE format for the birth date, YYYY-MM-DD, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day

df['birth_date'] = pd.to_datetime(df['birth_date'], format='%d %b %Y')
print(df['birth_date'].head())
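The format string '%d %b %Y' means day, abbreviated month name, and four-digit year. A minimal sketch on two made-up dates shows the conversion and how to render the result back as YYYY-MM-DD text. It assumes English month abbreviations, which is the default locale behavior.

```python
import pandas as pd

# Hypothetical raw values in "day month-abbreviation year" form.
raw = pd.Series(["05 Feb 1991", "11 Jan 1993"])

# Parse the strings into datetime values.
parsed = pd.to_datetime(raw, format="%d %b %Y")

# Render them as YYYY-MM-DD strings.
as_text = parsed.dt.strftime("%Y-%m-%d")

print(as_text.tolist())
```

Once the column holds datetime values, pandas prints them in YYYY-MM-DD form, so the explicit `strftime` step is only needed when a string is required.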

Competition Register Date

Besides the DATE format, MySQL also defines a format for data of type DATETIME, namely YYYY-MM-DD HH:mm:ss, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day
- HH: 2 digits indicating the hour
- mm: 2 digits indicating the minute
- ss: 2 digits indicating the second

For example: 2021-04-07 15:10:55

df['register_at'] = pd.to_datetime(df['register_time'], unit='s')
print(df['register_at'].head())
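The `unit='s'` argument tells pandas that the values are seconds since 1970-01-01 00:00:00 UTC (Unix time). The sketch below converts two made-up values, one of which corresponds to the example timestamp 2021-04-07 15:10:55.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds.
epochs = pd.Series([0, 1617808255])

# Interpret the integers as seconds since the Unix epoch.
converted = pd.to_datetime(epochs, unit="s")

# Render as YYYY-MM-DD HH:mm:ss strings.
as_text = converted.dt.strftime("%Y-%m-%d %H:%M:%S")

print(as_text.tolist())
```

The result carries no time zone, so the values should be read as UTC.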

Load

In the load step, the data that has been transformed to fit the needs of the analyst team is written back to a database, namely the data warehouse (DWH). Usually, the database schema is defined first, including:
- Column name
- Column type
- Whether the column is a primary key, a unique key, an index, or none of these
- Column length

Because data warehouses are generally structured databases, they need a schema before the data is loaded.

Pandas already provides a function for writing data to a database, namely to_sql().

Details of the function can be found in the Pandas documentation: "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html"
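As a minimal sketch of the load step, the example below defines a schema first and then appends rows with to_sql(). It uses an in-memory SQLite database as a stand-in for a real data warehouse. The table name, column names, and sample rows are assumptions made for this illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical transformed data.
participants = pd.DataFrame({
    "participant_id": [1, 2],
    "email": ["a@example.com", "b@example.com"],
})

# An in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")

# Define the schema before loading: names, types, keys, lengths.
conn.execute("""
    CREATE TABLE participants (
        participant_id INTEGER PRIMARY KEY,
        email VARCHAR(100) UNIQUE NOT NULL
    )
""")

# if_exists="append" inserts into the existing table and keeps its schema;
# index=False skips writing the DataFrame index as a column.
participants.to_sql("participants", conn, if_exists="append", index=False)

# Read the rows back to confirm the load.
loaded = pd.read_sql("SELECT * FROM participants ORDER BY participant_id", conn)
print(loaded)

conn.close()
```

For other databases, to_sql() expects a SQLAlchemy connection or engine instead of a sqlite3 connection. The call itself stays the same.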