Data Cleansing in Telco
The management of Telco company wants to reduce the number of churn customers by using machine learning. Therefore, the Data Scientist team was asked to prepare data as well as create the right prediction model to determine whether customers will churn or not.
As a data scientist, you are required to prepare data before modeling.
In this task, you will perform Data Preprocessing (Data Cleansing) last month, namely June 2020. The steps to be taken are :
Data Source and Library
UpdatedAt customerID gender SeniorCitizen Partner Dependents tenure \ 0 202006 45759018157 Female 0 Yes No 1.0 1 202006 45557574145 Male 0 No No 34.0 2 202006 45366876421 Male 0 No No 2.0 3 202006 45779536532 Male 0 No No 45.0 4 202006 45923787906 Female 0 No No 2.0 PhoneService MultipleLines InternetService ... DeviceProtection \ 0 No No phone service DSL ... No 1 Yes No DSL ... Yes 2 Yes No DSL ... No 3 No No phone service DSL ... Yes 4 Yes No Fiber optic ... No TechSupport StreamingTV StreamingMovies Contract PaperlessBilling \ 0 No No No Month-to-month Yes 1 No No No One year No 2 No No No Month-to-month Yes 3 Yes No No One year No 4 No No No Month-to-month Yes PaymentMethod MonthlyCharges TotalCharges Churn 0 Electronic check 29.85 29.85 No 1 Mailed check 56.95 1889.50 No 2 Mailed check 53.85 108.15 Yes 3 Bank transfer (automatic) 42.30 1840.75 No 4 Electronic check 70.70 151.65 Yes [5 rows x 22 columns]
Columns Definition :
Let's Start to Our Task
We will use count( ) to count the number of rows. It's because we have differences of its format so we will make it into one format with astype( ), str.match and regex to match the specified criteria.
The Number of the Filtered CustomerID is 7006
We have to make sure there is no duplication of the CustomerID for the effectiveness and accuracy of the analysis. The duplication appears because of either its inserting more than once with the same value in each column or its inserting different data retrieval periods. We will use drop_duplicates( ) to delete it and sort_values( ) to check the latest data collection.
The Number of Customer ID without Duplication (disticnt) 6993
The validity of the customer ID number is very necessary to ensure that the data we take is correct. Based on these results, there are differences in the number of ID numbers from the first data loaded to the final result. The number of rows of data when it was first loaded was 7113 rows and 22 columns with 7017 unique ID numbers. Then after checking the validity of the customer ID, the remaining 6993 rows of data.
Next we will delete rows from undetected data whether it churns or not. It is assumed that the data modeller will only accept data that has a churn flag or not. We will use isnull( ) to detect missing values and dropna( ) to remove missing values.
Total of Missing Values from Churn Columns 43
Total of Rows and Columns after Deleting Missing Values (6950, 22)
It is assumed that the data modeller asks for missing values to fill in with the following criteria:
The data modeller's asks for each row that has missing values for the length of the subscription to be filled with 11. Variables that are numeric other than Tenure are filled with the median of each of these variables. Define:
After further analysis, it turns out that there are still missing values from the data that we have validated for the customer ID number. Missing values are listed in the Churn, tenure, MonthlyCharges and TotalCharges columns. After we handle it by deleting rows and filling rows with certain values, it is proven that there are no more missing values in the data, as evidenced by the number of missing values for each variable which is worth 0. Next, we will handle outliers.
A boxplot is a summary of the sample distribution presented graphically that can describe the shape of the data distribution (skewness), a measure of central tendency and a measure of dispersion (diversity). The following is a general view of the boxplot representing outliers.
Distribution of Data Before the Outliers Being Handled: tenure MonthlyCharges TotalCharges count 6950.000000 6950.000000 6950.000000 mean 32.477266 65.783741 2305.083460 std 25.188910 50.457871 2578.651143 min 0.000000 0.000000 19.000000 25% 9.000000 36.462500 406.975000 50% 29.000000 70.450000 1400.850000 75% 55.000000 89.850000 3799.837500 max 500.000000 2311.000000 80000.000000
After we know which variables have outliers, then we will overcome outliers by using the interquartile range (IQR) method.
From the three boxplots with variables 'tenure', 'MonthlyCharges' & 'TotalCharges' it is clear that there are outliers. This can be identified from the points that are far from the boxplot image. Then if we look at the distribution of the data from the max column, there is also a very high value.
Then the outlier value is handled by changing its value to the maximum & minimum value of the interquartile range (IQR). After handling the outliers, and looking at the distribution of the data, it appears that there are no more outlier values.
Detects whether there are values of non-standard categorical variables. This usually occurs due to data input errors. Differences in terms are one of the factors that often occur, for that we need standardization of the data that has been inputted.
We will use the value_counts( ) function to see the number of unique data per variable.
Unique Values Count Before Standardized Variable gender Male 3499 Female 3431 Wanita 14 Laki-Laki 6 Name: gender, dtype: int64 Unique Values Count Before Standardized Variable SeniorCitizen 0 5822 1 1128 Name: SeniorCitizen, dtype: int64 Unique Values Count Before Standardized Variable Partner No 3591 Yes 3359 Name: Partner, dtype: int64 Unique Values Count Before Standardized Variable Dependents No 4870 Yes 2060 Iya 20 Name: Dependents, dtype: int64 Unique Values Count Before Standardized Variable PhoneService Yes 6281 No 669 Name: PhoneService, dtype: int64 Unique Values Count Before Standardized Variable MultipleLines No 3346 Yes 2935 No phone service 669 Name: MultipleLines, dtype: int64 Unique Values Count Before Standardized Variable InternetService Fiber optic 3057 DSL 2388 No 1505 Name: InternetService, dtype: int64 Unique Values Count Before Standardized Variable OnlineSecurity No 3454 Yes 1991 No internet service 1505 Name: OnlineSecurity, dtype: int64 Unique Values Count Before Standardized Variable OnlineBackup No 3045 Yes 2400 No internet service 1505 Name: OnlineBackup, dtype: int64 Unique Values Count Before Standardized Variable DeviceProtection No 3054 Yes 2391 No internet service 1505 Name: DeviceProtection, dtype: int64 Unique Values Count Before Standardized Variable TechSupport No 3431 Yes 2014 No internet service 1505 Name: TechSupport, dtype: int64 Unique Values Count Before Standardized Variable StreamingTV No 2774 Yes 2671 No internet service 1505 Name: StreamingTV, dtype: int64 Unique Values Count Before Standardized Variable StreamingMovies No 2747 Yes 2698 No internet service 1505 Name: StreamingMovies, dtype: int64 Unique Values Count Before Standardized Variable Contract Month-to-month 3823 Two year 1670 One year 1457 Name: Contract, dtype: int64 Unique Values Count Before Standardized Variable PaperlessBilling Yes 4114 No 2836 Name: PaperlessBilling, dtype: int64 Unique Values Count Before Standardized Variable PaymentMethod Electronic check 2337 Mailed check 1594 Bank transfer (automatic) 1519 Credit card (automatic) 1500 Name: PaymentMethod, dtype: int64 Unique Values Count Before Standardized Variable Churn No 5114 Yes 1827 Churn 9 Name: Churn, dtype: int64
Here, 'ends the bold formatting. If it is not added, the next print statement will keep print the bold text.
After we know which variables have non-standard values, then we standardize them with the most patterns, provided that they do not change their meaning. Example: Yes -> Yes Then look again at the unique values of each variable that has been changed. Use the replace() function to standardize the values.
After we standardize the value, and we observe again the shape of the data, it is standardized well for the unique value.
So, that's all for this article. I hope I can write it consistently and make an improvement for the future for my code. Feel free to contact me on e-mail if there is something wrong or collaboration project.
Addition : We can use built-in ANSI escape sequences for making text bold, italic or colored, etc. By using the special ANSI escape sequences, the text can be printed in different formats. The ANSI escape sequence to print bold text is: . To print the bold text, we use the following statement.
Reference : https://www.delftstack.com/howto/python/python-bold-text/