Simple ETL with Pandas
What is "ETL"?
ETL, which stands for extract, transform, and load, is a data integration process that combines data from multiple data sources into a single, consistent data store, which is then loaded into a data warehouse or other target system (Source: "https://www.ibm.com/cloud/learn/etl"). Simply put, ETL is the first, foundational step of data processing.
What is "Pandas"?
Pandas (Python Data Analysis) is an open source Python library that provides data structures and data analysis tools that are easy to use. There are many data formats that Pandas can read and write, such as CSV, JSON, Excel, SQL tables, and HTML.
Pandas converts them into rows and columns called a DataFrame.
Let's Begin!!!
Extract
Extract is the process of pulling data out of its sources. A source can be relational data (SQL) or tables, non-relational data (NoSQL), or any of the formats mentioned above. We need to import the libraries required for this step.
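A minimal sketch of the extract step, assuming the participant data lives in a CSV file (here a small inline sample stands in for the real file, whose name is not shown in the article; with a real file you would pass its path, e.g. pd.read_csv("participants.csv")):

```python
import io
import pandas as pd

# Inline stand-in for the participants CSV used in this article.
csv_data = io.StringIO(
    "participant_id,first_name,last_name,birth_date\n"
    "bd9b6f88,Citra,Nurdiyanti,05 Feb 1991\n"
    "7dfe3391,Aris,Setiawan,11 Jan 1993\n"
)

# read_csv parses the CSV into a DataFrame of rows and columns.
df = pd.read_csv(csv_data)
print(df.head())
```

head() shows the first five rows, which is where the output below comes from.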
participant_id first_name last_name birth_date \
0 bd9b6f88-b84f-4c4d-90f8-b67fe2f1a29a Citra Nurdiyanti 05 Feb 1991
1 7dfe3391-6f40-47b6-b4db-0c76ebaf5fc3 Aris Setiawan 11 Jan 1993
2 19582d7f-b824-4fe5-a517-d5bf573fc768 Cornelia Handayani 31 Jul 1993
3 aeb6d817-25f3-4867-8a74-8d92e0a0f633 Soleh Rajasa 04 Nov 1991
4 1fdabdd9-5444-4c97-87b2-fe8833ad0d27 Vivi Astuti 22 Jan 2003
address phone_number \
0 Gg. Monginsidi No. 08\nMedan, Aceh 80734 (0151) 081 2706
1 Gg. Rajawali Timur No. 7\nPrabumulih, MA 09434 +62 (036) 461 7027
2 Jalan Kebonjati No. 0\nAmbon, SS 57739 089 833 6695
3 Jl. Yos Sudarso No. 109\nLubuklinggau, SR 76156 +62 (418) 329-4756
4 Jalan Gardujati No. 53\nKediri, Sulawesi Tenga... 0812511835
country institute occupation \
0 Georgia UD Prakasa Mandasari Business Intelligence Engineer
1 Korea Utara Universitas Diponegoro Frontend Engineer
2 Komoro UD Hardiansyah Puspasari Business Analyst
3 Eritrea Perum Tampubolon Yuliarti DevOps Engineer
4 Aljazair PT Hardiansyah Rahimah Data Analyst
register_time
0 1.617634e+09
1 1.617634e+09
2 1.617634e+09
3 1.617634e+09
4 1.617634e+09
Transform
Transform is the process of changing the extracted data so that it fits the needs of the target system, for example by cleaning values, deriving new columns, or standardizing formats.
First, we will create a new column, "postal_code". We will take the postal code from the end of the full address.
Second, we will extract the city, which appears after the street number and the \n (newline) character.
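Both of these steps can be done with regular expressions on the "address" column. A sketch, using sample addresses in the same shape as the data above (postal code = trailing digits, city = the text between the newline and the following comma):

```python
import pandas as pd

df = pd.DataFrame({
    "address": [
        "Gg. Monginsidi No. 08\nMedan, Aceh 80734",
        "Jalan Kebonjati No. 0\nAmbon, SS 57739",
    ]
})

# The postal code is the run of digits at the very end of the address.
df["postal_code"] = df["address"].str.extract(r"(\d+)$", expand=False)

# The city sits between the newline and the following comma.
df["city"] = df["address"].str.extract(r"\n(.*?),", expand=False)

print(df[["postal_code", "city"]])
```

str.extract with expand=False returns a Series containing the first capture group, which is assigned directly to the new column.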
0 Medan
1 Prabumulih
2 Ambon
3 Lubuklinggau
4 Kediri
Name: city, dtype: object
We will figure out what projects the participants have done by checking their GitHub profiles. Assume that the GitHub username is the combination of the first and last name, in lowercase.
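Under that assumption, the profile URL is simple string concatenation on the name columns:

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
})

# Assumed convention: username = first_name + last_name, lowercased.
df["github_profile"] = (
    "https://github.com/" + (df["first_name"] + df["last_name"]).str.lower()
)
print(df["github_profile"])
```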
0 https://github.com/citranurdiyanti
1 https://github.com/arissetiawan
2 https://github.com/corneliahandayani
3 https://github.com/solehrajasa
4 https://github.com/viviastuti
Name: github_profile, dtype: object
The phone numbers in this CSV come in different formats, so we will normalize them into one consistent format.
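One possible normalization, sketched below; the exact target format is not spelled out in the article, so this sketch simply drops the +62 country code and then keeps digits only:

```python
import pandas as pd

df = pd.DataFrame({
    "phone_number": ["(0151) 081 2706", "+62 (036) 461 7027", "089 833 6695"]
})

df["cleaned_phone_number"] = (
    df["phone_number"]
    .str.replace(r"^\+62\s*", "", regex=True)  # drop the +62 country code prefix
    .str.replace(r"\D", "", regex=True)        # remove everything that is not a digit
)
print(df["cleaned_phone_number"])
```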
The team name is a combination of first name, last name, country, and institute. The format is: the initials of the first and last name, then the country, then the abbreviation of the institute (the first letter of each word), separated by dashes.
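A sketch of that rule as a row-wise function (the helper name build_team_name is my own):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
    "country": ["Georgia", "Korea Utara"],
    "institute": ["UD Prakasa Mandasari", "Universitas Diponegoro"],
})

def build_team_name(row):
    # Initials of the first and last name, e.g. Citra Nurdiyanti -> CN.
    abbrev_name = row["first_name"][0] + row["last_name"][0]
    # First letter of each word of the institute, e.g. UD Prakasa Mandasari -> UPM.
    abbrev_institute = "".join(w[0] for w in row["institute"].split())
    return f"{abbrev_name}-{row['country']}-{abbrev_institute}"

df["team_name"] = df.apply(build_team_name, axis=1)
print(df["team_name"])
```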
0 CN-Georgia-UPM
1 AS-Korea Utara-UD
2 CH-Komoro-UHP
3 SR-Eritrea-PTY
4 VA-Aljazair-PHR
Name: team_name, dtype: object
Email format: xxyy@bb[.ac.cc | .com]
Information:
- xx -> first name (first_name) in lowercase
- yy -> last name (last_name) in lowercase
- bb -> abbreviation of the institution name (aa): the first letter of each word, in lowercase
The suffix after bb follows the institution. The rules:
- If the institution is a university, bb is followed by .ac, which indicates an academy/institution of learning, and then by the cc pattern.
- If the institution is not a university, bb is followed by .com. Note that the cc pattern does not apply in this case.
cc -> the participant's country of origin. The rules:
- If the country name has more than one word, take the initials of the country in lowercase.
- If the country name is a single word, take the first 3 letters of the country in lowercase.
Example:
First name: Citra, Last name: Nurdiyanti, Institution: UD Prakasa Mandasari, Country: Georgia. The institution is not a university, so her email is: citranurdiyanti@upm.com
First name: Aris, Last name: Setiawan, Institution: Universitas Diponegoro, Country: Korea Utara (two words). The institution is a university, so his email is: arissetiawan@ud.ac.ku
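The rules above can be sketched as a row-wise function (build_email is my own helper name; the university check assumes "Universitas" appears in the institute name, as in this data):

```python
import pandas as pd

df = pd.DataFrame({
    "first_name": ["Citra", "Aris"],
    "last_name": ["Nurdiyanti", "Setiawan"],
    "country": ["Georgia", "Korea Utara"],
    "institute": ["UD Prakasa Mandasari", "Universitas Diponegoro"],
})

def build_email(row):
    username = (row["first_name"] + row["last_name"]).lower()
    domain = "".join(w[0] for w in row["institute"].split()).lower()
    if "universitas" in row["institute"].lower():
        # Universities get .ac plus a country suffix (cc).
        words = row["country"].split()
        if len(words) > 1:
            cc = "".join(w[0] for w in words).lower()  # e.g. Korea Utara -> ku
        else:
            cc = row["country"][:3].lower()            # e.g. Georgia -> geo
        return f"{username}@{domain}.ac.{cc}"
    # Non-universities get .com and no country suffix.
    return f"{username}@{domain}.com"

df["email"] = df.apply(build_email, axis=1)
print(df["email"])
```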
0 citranurdiyanti@upm.com
1 arissetiawan@ud.ac.ku
2 corneliahandayani@uhp.com
3 solehrajasa@pty.com
4 viviastuti@phr.com
Name: email, dtype: object
We will follow the MySQL DATE format for the birth date, that is YYYY-MM-DD, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day
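Since the source dates look like "05 Feb 1991" (day, abbreviated month, year), pd.to_datetime with an explicit format string handles the conversion:

```python
import pandas as pd

df = pd.DataFrame({"birth_date": ["05 Feb 1991", "11 Jan 1993"]})

# %d %b %Y matches the "05 Feb 1991" pattern in the source data.
df["birth_date"] = pd.to_datetime(df["birth_date"], format="%d %b %Y")
print(df["birth_date"])
```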
0 1991-02-05
1 1993-01-11
2 1993-07-31
3 1991-11-04
4 2003-01-22
Name: birth_date, dtype: datetime64[ns]
In addition to the rules for the DATE format, MySQL also provides rules for data of type DATETIME, namely YYYY-MM-DD HH:mm:ss, where:
- YYYY: 4 digits indicating the year
- MM: 2 digits indicating the month
- DD: 2 digits indicating the day
- HH: 2 digits indicating the hour
- mm: 2 digits indicating the minute
- ss: 2 digits indicating the second
For example: 2021-04-07 15:10:55
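The register_time column extracted earlier is a Unix timestamp in seconds (e.g. 1.617634e+09), so passing unit="s" to pd.to_datetime converts it to a proper datetime:

```python
import pandas as pd

df = pd.DataFrame({"register_time": [1.617634e+09]})

# unit="s" interprets the floats as seconds since the Unix epoch.
df["register_at"] = pd.to_datetime(df["register_time"], unit="s")
print(df["register_at"])
```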
0 2021-04-05 14:47:26
1 2021-04-05 14:46:58
2 2021-04-05 14:47:15
3 2021-04-05 14:47:14
4 2021-04-05 14:46:50
Name: register_at, dtype: datetime64[ns]
Load
In this load step, the data that has been transformed to fit the needs of the analyst team is loaded back into a database, namely the Data Warehouse (DWH). Usually, the database schema is defined first, including:
- Column name
- Column type
- Whether it is a primary key, unique key, index, or not
- Column length
Because data warehouses are generally structured databases, they need a schema before the data is loaded.
Pandas already provides a function to write data into a database, namely to_sql().
Details of the function can be found in the Pandas documentation: "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html"
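A sketch of the load step using an in-memory SQLite database as a stand-in for the real data warehouse (to_sql also accepts a SQLAlchemy engine for MySQL, PostgreSQL, and other databases; the table name "participants" is my own choice):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({
    "participant_id": ["bd9b6f88", "7dfe3391"],
    "first_name": ["Citra", "Aris"],
})

# SQLite stands in for the data warehouse here.
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table; if_exists="replace" recreates it on rerun.
df.to_sql("participants", conn, index=False, if_exists="replace")

# Read it back to confirm the load worked.
loaded = pd.read_sql("SELECT * FROM participants", conn)
print(loaded)
```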