Data Wrangling with Python
Data wrangling is the process of transforming raw data into a more structured format. It covers collecting, processing, cleaning, and tidying raw data so that it can be read and analyzed easily. In Python, the most widely used library for this task is pandas.
Sources: https://storage.googleapis.com/dqlab-dataset/shopping_data.csv and https://storage.googleapis.com/dqlab-dataset/shopping_data_missingvalue.csv
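A minimal sketch of loading the first dataset with pandas (the variable name dataset is my own choice):

    import pandas as pd

    # read the shopping dataset directly from the URL above
    dataset = pd.read_csv("https://storage.googleapis.com/dqlab-dataset/shopping_data.csv")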
If the dataset is large, we can preview just its first rows with the head() function. By default, head() shows the first five rows.
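Continuing with the dataset loaded above, a quick preview could look like this:

    # show the first five rows (the default for head())
    print(dataset.head())

    # head(n) shows the first n rows instead
    print(dataset.head(10))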
Pandas provides the .columns attribute to access the columns of the data source.
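For example, continuing with the same dataset:

    # list all column names in the data source
    print(dataset.columns)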
From the output above we know that the data source contains 5 columns. If we want to access just one column, for example "Age", we can select it by name.
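A minimal example of that selection, continuing with the dataset loaded earlier:

    # select the "Age" column by name
    print(dataset["Age"])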
In addition to accessing data by column, pandas can also access data by row. In contrast to column access, the function for displaying a row is .iloc[i], where i is the position of the row to display and the index starts at 0. For example, suppose we want to see what row 5 contains.
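Assuming row 5 refers to positional index 5 (counting from 0), the code would be:

    # show the row at positional index 5
    print(dataset.iloc[5])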
We can combine both of these functions to show a specific row of a specific column. For example, suppose we want to show the value of the "Age" column in the first row (remember that row indexing starts at 0).
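Assuming the first row is the one at index 0, the combined access looks like this:

    # value of the "Age" column in the first row (index 0)
    print(dataset["Age"].iloc[0])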
After displaying a dataset, what if you want to display only rows 5 to 20 of it? Pandas can also display data within a certain range: a range of rows only, of columns only, or of rows and columns together.
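A sketch of that range access, assuming the slice starts at index 5 (note that the end of a Python slice is exclusive):

    # rows 5 up to, but not including, 20, for all columns
    print(dataset.iloc[5:20])

    # the same range of rows for a single column
    print(dataset["Age"].iloc[5:20])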
The describe() function lets us quickly obtain statistical information about a dataset, such as the count, mean, standard deviation, minimum, quartiles (including the median), and maximum. Make sure NumPy is installed before using describe(), since pandas relies on it for these calculations.
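For example, assuming the summary discussed below was produced with include="all", which also lists the non-numeric columns:

    # summary statistics for every column, including non-numeric ones
    print(dataset.describe(include="all"))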
As shown above, some cells in the summary are empty. That is because the dataset contains string (object) columns, for which these statistics cannot be computed, so NaN appears instead. We can pass exclude=["O"] so that describe() ignores non-numeric (object) columns.
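A sketch of the same summary with the object columns excluded:

    # "O" stands for the object (string) dtype; these columns are skipped
    print(dataset.describe(exclude=["O"]))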
As the first step, we will check whether the dataset contains any missing values. To practice this, we will use another data source, one that does contain missing values.
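Continuing with pandas as imported earlier, a sketch of loading this second dataset and checking for missing values (the variable name dataset_missing is my own choice):

    # load the version of the dataset that contains missing values
    dataset_missing = pd.read_csv("https://storage.googleapis.com/dqlab-dataset/shopping_data_missingvalue.csv")

    # True if any cell in the dataset is NaN
    print(dataset_missing.isnull().values.any())

    # number of missing values per column
    print(dataset_missing.isnull().sum())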
We can handle missing data by following this scheme (Source: DQLab).
Based on the scheme above, deleting data is a valid solution if filling in the blank values would badly influence the analysis, or if the amount of missing data is small and does not contribute much to the analysis to be carried out. Deletion can be applied to an entire row or to an entire column. The second solution is imputation (filling in the empty values), and the appropriate method depends on the problem; for forecasting problems in particular, the choice depends on the existing data (more details can be seen in the picture).
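A minimal sketch of the deletion option with dropna():

    # option 1: drop every row that contains at least one missing value
    cleaned_rows = dataset_missing.dropna()

    # option 2: drop every column that contains at least one missing value
    cleaned_columns = dataset_missing.dropna(axis=1)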
The mean is used for data that has few outliers/noise/anomalies in its distribution. This value will fill in the empty cells of a dataset that has missing values. To fill in empty values, use the fillna() function.
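For example, filling the empty cells of the numeric columns with their column means (numeric_only=True is my addition to skip any string columns):

    # replace NaN in each numeric column with that column's mean
    filled_mean = dataset_missing.fillna(dataset_missing.mean(numeric_only=True))

    # verify that the numeric columns no longer contain missing values
    print(filled_mean.isnull().sum())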
The median is used when the data has strong outliers. The median is chosen because it is the middle value, which means it is not the result of a calculation involving the outlier data. In some cases, outliers are considered disturbing and are often treated as noise because they can distort the class distribution and interfere with clustering analysis.
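The median-based imputation looks the same, just with median() instead of mean():

    # replace NaN in each numeric column with that column's median
    filled_median = dataset_missing.fillna(dataset_missing.median(numeric_only=True))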
Sometimes the data presented has widely differing ranges. For example, age may lie in the range 25-50 while income lies in the range 5,000,000-25,000,000. The two columns cannot be compared directly, and this can be a big problem in clustering or classification, so the data needs to be normalized. Normalization is especially important for methods that rely on distance calculations (e.g. K-Means). There are various normalization methods, such as Min-Max, Z-score, decimal scaling, sigmoid, and softmax; which one to use depends on the dataset and the type of analysis performed (we will cover this topic in a future article).
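As a sketch of Min-Max normalization with scikit-learn's MinMaxScaler (the column names below are my assumption about which numeric columns of the shopping dataset you want to scale):

    from sklearn.preprocessing import MinMaxScaler

    # numeric columns to scale; adjust these to your own dataset
    numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

    scaler = MinMaxScaler()
    dataset[numeric_cols] = scaler.fit_transform(dataset[numeric_cols])

    # every scaled column now lies in the range 0 - 1
    print(dataset.head())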
We can easily see the difference in the data source before and after normalization.
That's all for today. I would be honored if some of you could give me suggestions about this article or invite me to work together on a data project.
See you😊!!!