Simple ETL with Pandas
What is "ETL"?
ETL, which stands for extract, transform, and load, is a data integration process that combines data from multiple data sources into a single, consistent data store, which is then loaded into a data warehouse or other target system (Source: "https://www.ibm.com/cloud/learn/etl"). Simply put, ETL is the first, foundational step of data processing.
What is "Pandas"?
Pandas (Python Data Analysis) is an open source Python library that provides data structures and data analysis tools that are easy to use. There are many data formats that Pandas can read and write, such as CSV, JSON, Excel, SQL tables, and HTML.
Pandas converts them into rows and columns called a DataFrame.
Let's Begin!!!
Extract
Extract is the process of pulling data out of its sources. A source can be relational data (SQL) or tables, non-relational data (NoSQL), or any of the formats mentioned above. We need to import the libraries required for this step.
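A minimal sketch of the extract step, assuming the participant data lives in a CSV file (here a small inline sample stands in for the real file, whose name is not shown in the article; with a real file you would pass its path, e.g. pd.read_csv("participants.csv")):

```python
import io
import pandas as pd

# Inline stand-in for the participants CSV used in this article.
csv_data = io.StringIO(
    "participant_id,first_name,last_name,birth_date\n"
    "bd9b6f88,Citra,Nurdiyanti,05 Feb 1991\n"
    "7dfe3391,Aris,Setiawan,11 Jan 1993\n"
)

# read_csv parses the CSV into a DataFrame of rows and columns.
df = pd.read_csv(csv_data)
print(df.head())
```

head() shows the first five rows, which is where the output below comes from.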
participant_id first_name last_name birth_date \
0 bd9b6f88-b84f-4c4d-90f8-b67fe2f1a29a Citra Nurdiyanti 05 Feb 1991
1 7dfe3391-6f40-47b6-b4db-0c76ebaf5fc3 Aris Setiawan 11 Jan 1993
2 19582d7f-b824-4fe5-a517-d5bf573fc768 Cornelia Handayani 31 Jul 1993
3 aeb6d817-25f3-4867-8a74-8d92e0a0f633 Soleh Rajasa 04 Nov 1991
4 1fdabdd9-5444-4c97-87b2-fe8833ad0d27 Vivi Astuti 22 Jan 2003
address phone_number \
0 Gg. Monginsidi No. 08\nMedan, Aceh 80734 (0151) 081 2706
1 Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434 +62 (036) 461 7027
2 Jalan Kebonjati No. 0\nAmbon, SS 57739 089 833 6695
3 Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156 +62 (418) 329-4756
4 Jalan Gardujati No. 53\nKediri, Sulawesi Tenga... 0812511835
country institute occupation \
0 Georgia UD Prakasa Mandasari Business Intelligence Engineer
1 Korea Utara Universitas Diponegoro Frontend Engineer
2 Komoro UD Hardiansyah Puspasari Business Analyst
3 Eritrea Perum Tampubolon Yuliarti DevOps Engineer
4 Aljazair PT Hardiansyah Rahimah Data Analyst
register_time
0 1.617634e+09
1 1.617634e+09
2 1.617634e+09
3 1.617634e+09
4 1.617634e+09
Transform
Transform is the process of changing the extracted data so that it fits the needs of the target system, for example by cleaning values, deriving new columns, or standardizing formats.
First, we will create a new column, "postal_code". We will take the postal code from the end of the full address.
Second, we will extract the city, which appears after the street number and the \n (newline) character.
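Both of these steps can be done with regular expressions on the "address" column. A sketch, using sample addresses in the same shape as the data above (postal code = trailing digits, city = the text between the newline and the following comma):

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
        "Jalan Kebonjati No. 0\nAmbon, SS 57739",
    ]
})

# The postal code is the run of digits at the very end of the address.
df["postal_code"] = df["address"].str.extract(r"(\d+)$", expand=False)

# The city sits between the newline and the following comma.
df["city"] = df["address"].str.extract(r"\n(.*?),", expand=False)

print(df[["postal_code", "city"]])
```

str.extract with expand=False returns a Series containing the first capture group, which is assigned directly to the new column.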
0 Medan
1 Prabumulih
2 Ambon
3 Lubuklinggau
4 Kediri
Name: city, dtype: object
We will figure out what projects the participants have done by checking their GitHub profiles. Assume that the GitHub username is the combination of the first and last name, in lowercase.
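Under that assumption, the profile URL is simple string concatenation on the name columns:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
})

# Assumed convention: username = first_name + last_name, lowercased.
df["github_profile"] = (
    "https://github.com/" + (df["first_name"] + df["last_name"]).str.lower()
)
print(df["github_profile"])
```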
0 https://github.com/citranurdiyanti
1 https://github.com/arissetiawan
2 https://github.com/corneliahandayani
3 https://github.com/solehrajasa
4 https://github.com/viviastuti
Name: github_profile, dtype: object
The phone numbers in this CSV come in different formats, so we will normalize them into one consistent format.
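One possible normalization, sketched below; the exact target format is not spelled out in the article, so this sketch simply drops the +62 country code and then keeps digits only:

```python
import pandas as pd

df = pd.DataFrame({
    "phone_number": ["(0151) 081 2706", "+62 (036) 461 7027", "089 833 6695"]
})

df["cleaned_phone_number"] = (
    df["phone_number"]
    .str.replace(r"^\+62\s*", "", regex=True)  # drop the +62 country code prefix
    .str.replace(r"\D", "", regex=True)        # remove everything that is not a digit
)
print(df["cleaned_phone_number"])
```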
The team name is a combination of first name, last name, country, and institute. The format is: the initials of the first and last name, then the country, then the abbreviation of the institute (the first letter of each word), separated by dashes.
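A sketch of that rule as a row-wise function (the helper name build_team_name is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
    "country": ["Georgia", "Korea Utara"],
    "institute": ["UD Prakasa Mandasari", "Universitas Diponegoro"],
})

def build_team_name(row):
    # Initials of the first and last name, e.g. Citra Nurdiyanti -> CN.
    abbrev_name = row["first_name"][0] + row["last_name"][0]
    # First letter of each word of the institute, e.g. UD Prakasa Mandasari -> UPM.
    abbrev_institute = "".join(w[0] for w in row["institute"].split())
    return f"{abbrev_name}-{row['country']}-{abbrev_institute}"

df["team_name"] = df.apply(build_team_name, axis=1)
print(df["team_name"])
```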
0 CN-Georgia-UPM
1 AS-Korea Utara-UD
2 CH-Komoro-UHP
3 SR-Eritrea-PTY
4 VA-Aljazair-PHR
Name: team_name, dtype: object
Email format: xxyy@bb[.ac.cc | .com]
Information:
- xx -> first name (first_name) in lowercase
- yy -> last name (last_name) in lowercase
- bb -> abbreviation of the institution name (aa): the first letter of each word, in lowercase
The suffix after bb follows the institution. The rules:
- If the institution is a university, bb is followed by .ac, which indicates an academy/institution of learning, and then by the cc pattern.
- If the institution is not a university, bb is followed by .com. Note that the cc pattern does not apply in this case.
cc -> the participant's country of origin. The rules:
- If the country name has more than one word, take the initials of the country in lowercase.
- If the country name is a single word, take the first 3 letters of the country in lowercase.
Example:
First name: Citra, Last name: Nurdiyanti, Institution: UD Prakasa Mandasari, Country: Georgia. The institution is not a university, so her email is: citranurdiyanti@upm.com
First name: Aris, Last name: Setiawan, Institution: Universitas Diponegoro, Country: Korea Utara (two words). The institution is a university, so his email is: arissetiawan@ud.ac.ku
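The rules above can be sketched as a row-wise function (build_email is my own helper name; the university check assumes "Universitas" appears in the institute name, as in this data):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
    "country": ["Georgia", "Korea Utara"],
    "institute": ["UD Prakasa Mandasari", "Universitas Diponegoro"],
})

def build_email(row):
    username = (row["first_name"] + row["last_name"]).lower()
    domain = "".join(w[0] for w in row["institute"].split()).lower()
    if "universitas" in row["institute"].lower():
        # Universities get .ac plus a country suffix (cc).
        words = row["country"].split()
        if len(words) > 1:
            cc = "".join(w[0] for w in words).lower()  # e.g. Korea Utara -> ku
        else:
            cc = row["country"][:3].lower()            # e.g. Georgia -> geo
        return f"{username}@{domain}.ac.{cc}"
    # Non-universities get .com and no country suffix.
    return f"{username}@{domain}.com"

df["email"] = df.apply(build_email, axis=1)
print(df["email"])
```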
0 citranurdiyanti@upm.com
1 arissetiawan@ud.ac.ku
2 corneliahandayani@uhp.com
3 solehrajasa@pty.com
4 viviastuti@phr.com
Name: email, dtype: object
We will follow the MySQL DATE format for the birth date, that is YYYY-MM-DD, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day
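Since the source dates look like "05 Feb 1991" (day, abbreviated month, year), pd.to_datetime with an explicit format string handles the conversion:

```python
import pandas as pd

df = pd.DataFrame({"birth_date": ["05 Feb 1991", "11 Jan 1993"]})

# %d %b %Y matches the "05 Feb 1991" pattern in the source data.
df["birth_date"] = pd.to_datetime(df["birth_date"], format="%d %b %Y")
print(df["birth_date"])
```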
0 1991-02-05
1 1993-01-11
2 1993-07-31
3 1991-11-04
4 2003-01-22
Name: birth_date, dtype: datetime64[ns]
In addition to the rules for the DATE format, MySQL also provides rules for data of type DATETIME, namely YYYY-MM-DD HH:mm:ss, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day
- HH: 2 digits indicating the hour
- mm: 2 digits indicating the minute
- ss: 2 digits indicating the second
For example: 2021-04-07 15:10:55
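The register_time column extracted earlier is a Unix timestamp in seconds (e.g. 1.617634e+09), so passing unit="s" to pd.to_datetime converts it to a proper datetime:

```python
import pandas as pd

df = pd.DataFrame({"register_time": [1.617634e+09]})

# unit="s" interprets the floats as seconds since the Unix epoch.
df["register_at"] = pd.to_datetime(df["register_time"], unit="s")
print(df["register_at"])
```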
0 2021-04-05 14:47:26
1 2021-04-05 14:46:58
2 2021-04-05 14:47:15
3 2021-04-05 14:47:14
4 2021-04-05 14:46:50
Name: register_at, dtype: datetime64[ns]
Load
In this load step, the data that has been transformed to fit the needs of the analyst team is loaded back into a database, namely the Data Warehouse (DWH). Usually, the database schema is defined first, including:
- Column name
- Column type
- Whether it is a primary key, unique key, index, or not
- Column length
Because data warehouses are generally structured databases, they need a schema before the data is loaded.
Pandas already provides a function to write data into a database, namely to_sql().
Details of the function can be found in the Pandas documentation: "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html"
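A sketch of the load step using an in-memory SQLite database as a stand-in for the real data warehouse (to_sql also accepts a SQLAlchemy engine for MySQL, PostgreSQL, and other databases; the table name "participants" is my own choice):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    "participant_id": ["bd9b6f88", "7dfe3391"],
    "first_name": ["Citra", "Aris"],
})

# SQLite stands in for the data warehouse here.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table; if_exists="replace" recreates it on rerun.
df.to_sql("participants", conn, index=False, if_exists="replace")

# Read it back to confirm the load worked.
loaded = pd.read_sql("SELECT * FROM participants", conn)
print(loaded)
```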