Simple ETL with Pandas
What is "ETL"?
ETL, which stands for extract, transform and load, is a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system (Sc : "https://www.ibm.com/cloud/learn/etl"). The simple, ETL is the first or basis step of data processing.
What is "Pandas"?
Pandas (Python Data Analysis) is a Python open source library that provides data structure and data analysis tools that easy use for analysis. There are many data format that able to be read and written by pandas, such as :
Pandas will convert them into rows and columns called data frame.
Let's Begin!!!
Extract
Extract is the process of extracting data from sources, this data source can be relational data (SQL) or tables, nonrelational (NoSQL) or others like we have written before. We need to import the libraries that we need to its process.
Transform
Transform is the process of transforming data, or changes to data. Generally like:
First, we will make a new column "postal_code" that include postal code. We will take the postal code from the end sentence of the full address.
Second, we will get the city which is after street number and \n (newline character or enter)
We will figure out what project they have done with check their github profile. Assume that github profile is combination of first and last name that have been lower (lowercase)
There are differences formats in this csv. So we will change it into one same format with condition:
Team name is a combination of first name, last name, country, and institute. The format are "the abbreviation of the first letter of the first and the last name-country-the abbreviation of the institute from first letter of each words"
Email formats: xxyy@aa.bb.[ac/com].[cc]
Information: xx -> first name (first_name) in lowercase yy -> last name (last_name) in lowercase aa -> institution name
For the value of bb, and cc follow the value of aa. The rules: - If the institution is a university, then bb -> combination of the first letters of each word of the University name in lowercase Then, followed by .ac which indicates the academy/institution of learning and followed by the pattern cc - If the institution is not a University, then bb -> combination of the first letters of each word of the institution name in lowercase Then, followed by .com. Please note that the cc pattern does not apply in this condition
cc -> is the participant's country of origin, as for the rules: - If the number of words in the country is more than 1 then take the abbreviation of that country in lowercase - However, if the number of words is only 1 then take the first 3 letters of the country in lowercase
Example: First name: Citra Last name: Nurdiyanti Institution: UD Prakasa Mandasari Country: Georgia So, his email: citranurdiyanti@upm.geo -------------------------------------------------- First name: Aris Last name: Setiawan Institution: Universitas Diponegoro Country: North Korea So, Email: arissetiawan@ud.ac.ku
We will follow MySQL rules for birth date, that is YYYY-MM-DD with description: - YYYY: 4 digits indicating year - MM: 2 digits indicating month - DD: 2 digits indicating the date
In addition to having rules regarding the DATE format, MySQL also provides rules for data of type DATETIME, namely YYYY-MM-DD HH:mm:ss with the information: - YYYY: 4 digits indicating year - MM: 2 digits indicating month - DD: 2 digits indicating the date - HH: 2 digits indicating the hour - mm: 2 digits indicating the minute - ss: 2 digits indicating seconds Examples are: 2021-04-07 15:10:55
Load
In this load section, the data that has been transformed in such a way that it fits the needs of the analyst team is entered back into the database, namely the Data Warehouse (DWH). Usually, the database schema is defined first, such as: - Column name - Column type - Is it primary key, unique key, index or not - Column length Because data warehouses are generally structured databases, they need a schema before the data is entered.
Pandas already provides a function to enter data into the database, namely to_sql().
Details of the function can be found in the Pandas documentation: "https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html"