Data Wrangling with Python
Data wrangling is the process of transforming raw data into a more structured format. It covers collecting, cleaning, processing, and tidying the raw data so that it can be easily read and analyzed. In Python, the most commonly used library for this is pandas.
Source: https://storage.googleapis.com/dqlab-dataset/shopping_data.csv & https://storage.googleapis.com/dqlab-dataset/shopping_data_missingvalue.csv
Read the Data with Pandas
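To follow along, first load the dataset. A minimal sketch, assuming pandas is installed and that we read the CSV straight from the source URL into a DataFrame (the variable name csv_data is our choice):

import pandas as pd

# Read the shopping dataset from the source URL into a DataFrame
csv_data = pd.read_csv("https://storage.googleapis.com/dqlab-dataset/shopping_data.csv")

# Print the whole DataFrame
print(csv_data)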
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
.. ... ... ... ... ...
195 196 Female 35 120 79
196 197 Female 45 126 28
197 198 Male 32 126 74
198 199 Male 32 137 18
199 200 Male 30 137 83
[200 rows x 5 columns]
If the dataset is large, we can display just the first few rows with the head() function. By default it shows the first 5 rows.
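Assuming the csv_data DataFrame from before:

# Show only the first five rows of the DataFrame
print(csv_data.head())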
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
Access the Column
Pandas provides the .columns attribute to access the columns of the data source.
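For example:

# List all column names of the DataFrame
print(csv_data.columns)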
Index(['CustomerID', 'Genre', 'Age', 'Annual Income (k$)',
'Spending Score (1-100)'],
dtype='object')
The output above shows that the data source contains 5 columns. If we want to access just one column, for example "Age", we can use:
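# Select a single column by its name
print(csv_data["Age"])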
0 19
1 21
2 20
3 23
4 31
..
195 35
196 45
197 32
198 32
199 30
Name: Age, Length: 200, dtype: int64
Access the Row
In addition to accessing data through columns, pandas can also access data by row. In contrast to column access, the function to display data from a row is .iloc[i], where i indicates the position of the row to be displayed, with the index starting from 0. For example, if we want to know what row 5 contains, the code to run is:
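# Select the row at position 5 (the sixth row, since indexing starts at 0)
print(csv_data.iloc[5])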
CustomerID 6
Genre Female
Age 22
Annual Income (k$) 17
Spending Score (1-100) 76
Name: 5, dtype: object
We can combine both of those functions to show the value at a specific row and column. For example, to show the value of the "Age" column at row index 1 (the second row, since indexing starts at 0), the function and output are:
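# Select the "Age" column, then take the value at row index 1
print(csv_data["Age"].iloc[1])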
21
Show Data Based on Range
After displaying a dataset, what if you want to display only, say, rows 5 to 20? Pandas can also display data within a certain range: a range of rows only, of columns only, or of both rows and columns at once.
Show rows 5 up to (but not including) 10:
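# Slice rows by position: start index 5 is included, stop index 10 is not
print(csv_data.iloc[5:10])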
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
Using NumPy to Show Statistical Information
The describe() function makes it easy to get statistical information about a dataset: the count, mean, standard deviation, minimum, maximum, and quartiles (the 50% row is the median), plus the most frequent value (top) for text columns. Pandas relies on NumPy under the hood, so make sure NumPy is installed before using describe().
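A sketch, assuming include="all" is passed so the non-numeric Genre column is summarized as well (which matches the unique/top/freq rows in the output below):

# Summary statistics for every column, numeric or not
print(csv_data.describe(include="all"))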
CustomerID Genre Age Annual Income (k$) \
count 200.000000 200 200.000000 200.000000
unique NaN 2 NaN NaN
top NaN Female NaN NaN
freq NaN 112 NaN NaN
mean 100.500000 NaN 38.850000 60.560000
std 57.879185 NaN 13.969007 26.264721
min 1.000000 NaN 18.000000 15.000000
25% 50.750000 NaN 28.750000 41.500000
50% 100.500000 NaN 36.000000 61.500000
75% 150.250000 NaN 49.000000 78.000000
max 200.000000 NaN 70.000000 137.000000
Spending Score (1-100)
count 200.000000
unique NaN
top NaN
freq NaN
mean 50.200000
std 25.823522
min 1.000000
25% 34.750000
50% 50.000000
75% 73.000000
max 99.000000
As shown above, several cells contain NaN. That is because the dataset has a string column (Genre), for which numeric statistics cannot be computed, and vice versa. We can pass exclude=["O"] so that describe() ignores columns of object (string) dtype:
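# Summary statistics for numeric columns only; "O" is the object (string) dtype
print(csv_data.describe(exclude=["O"]))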
CustomerID Age Annual Income (k$) Spending Score (1-100)
count 200.000000 200.000000 200.000000 200.000000
mean 100.500000 38.850000 60.560000 50.200000
std 57.879185 13.969007 26.264721 25.823522
min 1.000000 18.000000 15.000000 1.000000
25% 50.750000 28.750000 41.500000 34.750000
50% 100.500000 36.000000 61.500000 50.000000
75% 150.250000 49.000000 78.000000 73.000000
max 200.000000 70.000000 137.000000 99.000000
Handling Missing Values
As a first step, we check whether the dataset contains any missing values.
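One way to check, using the DataFrame loaded earlier:

# True if any cell in the DataFrame is NaN, False otherwise
print(csv_data.isnull().values.any())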
False
The first dataset has no missing values, so to practice this part we will use another data source that does contain them.
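A sketch, assuming we load the second CSV into a new DataFrame (csv_missing is our name for it):

# Read the version of the dataset that contains missing values
csv_missing = pd.read_csv("https://storage.googleapis.com/dqlab-dataset/shopping_data_missingvalue.csv")
print(csv_missing.isnull().values.any())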
True
We can handle missing data by following this scheme (source: DQLab).
Based on the scheme above, deleting data is a solution when filling in the blank values would have a bad influence on the analysis, or when the amount of missing data is small and contributes little to the analysis being carried out. Deletion can be applied directly to a row of data or to a whole column. The second solution is imputation (filling in the empty values), and the right strategy depends on the problem; for forecasting problems in particular, it depends on the existing data (see the picture for more details).
Filling with Mean Values
The mean is used for data that has few outliers/noise/anomalies in its distribution. The mean of each column will fill the empty values of a dataset with missing values. To fill in an empty value, use the fillna() function.
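A minimal sketch, assuming the csv_missing DataFrame from before (numeric_only=True keeps the mean calculation away from the Genre text column):

# Column means that will replace the NaN values
print(csv_missing.mean(numeric_only=True))

# Fill every NaN with the mean of its column
csv_missing_mean = csv_missing.fillna(csv_missing.mean(numeric_only=True))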
CustomerID 100.500000
Age 38.939698
Annual Income (k$) 61.005051
Spending Score (1-100) 50.489899
dtype: float64
Filling with Median
The median is used when the data has significant outliers. The median is chosen because it is the middle value, so it is not the result of a calculation that involves the outlier values. In some cases, outlier data is considered disturbing and often noisy because it can affect the class distribution and interfere with clustering analysis.
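The same sketch with the median instead of the mean (again assuming the csv_missing DataFrame):

# Column medians that will replace the NaN values
print(csv_missing.median(numeric_only=True))

# Fill every NaN with the median of its column, then compare the first 10 rows
csv_missing_median = csv_missing.fillna(csv_missing.median(numeric_only=True))
print(csv_missing.head(10))
print(csv_missing_median.head(10))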
CustomerID 100.5
Age 36.0
Annual Income (k$) 62.0
Spending Score (1-100) 50.0
dtype: float64
Dataset with missing values:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19.0 15.0 39.0
1 2 Male NaN 15.0 81.0
2 3 Female 20.0 NaN 6.0
3 4 Female 23.0 16.0 77.0
4 5 Female 31.0 17.0 NaN
5 6 Female 22.0 NaN 76.0
6 7 Female 35.0 18.0 6.0
7 8 Female 23.0 18.0 94.0
8 9 Male 64.0 19.0 NaN
9 10 Female 30.0 19.0 72.0
Dataset after handling the missing values with the median:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19.0 15.0 39.0
1 2 Male 36.0 15.0 81.0
2 3 Female 20.0 61.5 6.0
3 4 Female 23.0 16.0 77.0
4 5 Female 31.0 17.0 50.0
5 6 Female 22.0 61.5 76.0
6 7 Female 35.0 18.0 6.0
7 8 Female 23.0 18.0 94.0
8 9 Male 64.0 19.0 50.0
9 10 Female 30.0 19.0 72.0
Data Normalization with Scikit-Learn
Sometimes the data has columns with very different ranges. For example, age may be in the range 25 - 50 while income is in the range 5,000,000 - 25,000,000. The two columns cannot be compared directly, which can be a big problem in clustering or classification, so the data needs to be normalized. Normalization is very important, especially for methods that rely on distance calculations (e.g. K-Means). There are various normalization methods, such as MinMax, Z-score, decimal scaling, sigmoid, and softmax; which one to use depends on the dataset and the type of analysis performed (we will cover this topic in a future article).
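A sketch using scikit-learn's MinMaxScaler, which squeezes each column into the [0, 1] range. The renamed, reordered columns (Customer ID, Gender, Income, Spending Score) in the tables below suggest the columns were renamed first; the renaming here is our assumption to match that output:

from sklearn.preprocessing import MinMaxScaler

# Hypothetical renaming so the columns match the tables below
df = csv_data.rename(columns={
    "CustomerID": "Customer ID",
    "Genre": "Gender",
    "Annual Income (k$)": "Income",
    "Spending Score (1-100)": "Spending Score",
})

# First 10 rows before normalization
print(df.head(10))

# Scale only the numeric columns to [0, 1], fitting on the full dataset
numeric_cols = ["Age", "Income", "Spending Score"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Put the scaled columns first, as in the output below
df = df[numeric_cols + ["Customer ID", "Gender"]]
print(df.head(10))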
Dataset before normalization:
Customer ID Gender Age Income Spending Score
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
5 6 Female 22 17 76
6 7 Female 35 18 6
7 8 Female 23 18 94
8 9 Male 64 19 3
9 10 Female 30 19 72
Dataset after normalization:
Age Income Spending Score Customer ID Gender
0 0.019231 0.000000 0.387755 1 Male
1 0.057692 0.000000 0.816327 2 Male
2 0.038462 0.008197 0.051020 3 Female
3 0.096154 0.008197 0.775510 4 Female
4 0.250000 0.016393 0.397959 5 Female
5 0.076923 0.016393 0.765306 6 Female
6 0.326923 0.024590 0.051020 7 Female
7 0.096154 0.024590 0.948980 8 Female
8 0.884615 0.032787 0.020408 9 Male
9 0.230769 0.032787 0.724490 10 Female
We can easily see the difference between the dataset before and after normalization.
That's all for today. I would be honored if some of you could give me suggestions on this article or invite me to work together on a data project.
See you😊!!!