Data Cleansing in Telco
Case
The management of Telco company wants to reduce the number of churn customers by using machine learning. Therefore, the Data Scientist team was asked to prepare data as well as create the right prediction model to determine whether customers will churn or not.
As a data scientist, you are required to prepare data before modeling.
In this task, you will perform Data Preprocessing (Data Cleansing) last month, namely June 2020. The steps to be taken are :
Data Source and Library
Columns Definition :
Let's Start to Our Task
We will use count( ) to count the number of rows. It's because we have differences of its format so we will make it into one format with astype( ), str.match and regex to match the specified criteria.
We have to make sure there is no duplication of the CustomerID for the effectiveness and accuracy of the analysis. The duplication appears because of either its inserting more than once with the same value in each column or its inserting different data retrieval periods. We will use drop_duplicates( ) to delete it and sort_values( ) to check the latest data collection.
The validity of the customer ID number is very necessary to ensure that the data we take is correct. Based on these results, there are differences in the number of ID numbers from the first data loaded to the final result. The number of rows of data when it was first loaded was 7113 rows and 22 columns with 7017 unique ID numbers. Then after checking the validity of the customer ID, the remaining 6993 rows of data.
Next we will delete rows from undetected data whether it churns or not. It is assumed that the data modeller will only accept data that has a churn flag or not. We will use isnull( ) to detect missing values and dropna( ) to remove missing values.
It is assumed that the data modeller asks for missing values to fill in with the following criteria:
The data modeller's asks for each row that has missing values for the length of the subscription to be filled with 11. Variables that are numeric other than Tenure are filled with the median of each of these variables. Define:
After further analysis, it turns out that there are still missing values from the data that we have validated for the customer ID number. Missing values are listed in the Churn, tenure, MonthlyCharges and TotalCharges columns. After we handle it by deleting rows and filling rows with certain values, it is proven that there are no more missing values in the data, as evidenced by the number of missing values for each variable which is worth 0. Next, we will handle outliers.
A boxplot is a summary of the sample distribution presented graphically that can describe the shape of the data distribution (skewness), a measure of central tendency and a measure of dispersion (diversity). The following is a general view of the boxplot representing outliers.
After we know which variables have outliers, then we will overcome outliers by using the interquartile range (IQR) method.
Before :
After :
From the three boxplots with variables 'tenure', 'MonthlyCharges' & 'TotalCharges' it is clear that there are outliers. This can be identified from the points that are far from the boxplot image. Then if we look at the distribution of the data from the max column, there is also a very high value.
Then the outlier value is handled by changing its value to the maximum & minimum value of the interquartile range (IQR). After handling the outliers, and looking at the distribution of the data, it appears that there are no more outlier values.
Detects whether there are values of non-standard categorical variables. This usually occurs due to data input errors. Differences in terms are one of the factors that often occur, for that we need standardization of the data that has been inputted.
We will use the value_counts( ) function to see the number of unique data per variable.
print("The bold text is",'\033[1m' + 'Python' + '\033[0m') Here, '\033[0m' ends the bold formatting. If it is not added, the next print statement will keep print the bold text.
After we know which variables have non-standard values, then we standardize them with the most patterns, provided that they do not change their meaning. Example: Yes -> Yes Then look again at the unique values of each variable that has been changed. Use the replace() function to standardize the values.
After we standardize the value, and we observe again the shape of the data, it is standardized well for the unique value.
So, that's all for this article. I hope I can write it consistently and make an improvement for the future for my code. Feel free to contact me on e-mail if there is something wrong or collaboration project.
Thanks All!!!
Addition : We can use built-in ANSI escape sequences for making text bold, italic or colored, etc. By using the special ANSI escape sequences, the text can be printed in different formats. The ANSI escape sequence to print bold text is: '\033[1m'. To print the bold text, we use the following statement.
Reference : https://www.delftstack.com/howto/python/python-bold-text/