Delivery App Analysis

This is an analysis of a dataset from a food delivery app. It describes customers and their purchasing preferences, and it records which advertising campaigns each customer has accepted. Our goal is to optimize these campaigns so that they are better targeted to our customers and achieve a higher conversion rate.

We first clean the dataset and look at atypical values, then run an initial correlation analysis to decide how to approach the data and prepare for a customer classification algorithm.

!pip install missingno==0.5.2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno
from datetime import datetime, timedelta
import seaborn as sns
from sklearn.cluster import KMeans

%run pandas-missing-extension.ipynb

df_raw = pd.read_csv("ml_project1_data.csv")
df_raw

Let's take a look at how many variables and records make up our dataset.

df_raw.shape

Check for missing data

This dataset contains several outliers, and the Income column has missing values. We will make some adjustments to clean the dataset as well as possible.

(
    df_raw
    .isna()
    .sum()
)

(
    df_raw
    .missing
    .sort_variables_by_missingness()
    .pipe(missingno.matrix)
)

(
    df_raw
    .missing
    .missing_case_summary()
    .sort_values(by="pct_missing", ascending=False)
    .reset_index(drop=True)
)
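The `missing` accessor used above is assumed to be registered by the external `pandas-missing-extension.ipynb` notebook. If that notebook is not at hand, similar counts can be approximated with plain pandas. This is only a minimal sketch: the toy frame `toy` and the helper `missing_summary` are hypothetical, and the output layout is an assumption rather than what the extension returns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_raw (values are made up for illustration).
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, np.nan],
    "Kidhome": [0, 1, 0, 1],
    "Education": ["Graduation", "PhD", None, "Master"],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": n_missing / len(frame) * 100,
    })
    return summary.sort_values("pct_missing", ascending=False)

col_summary = missing_summary(toy)
print(col_summary)

# Number of missing values in each row (case).
row_missing = toy.isna().sum(axis=1)
print(row_missing)
```

The per-column view shows which variables need attention, while the per-row view shows whether a few customers account for most of the gaps.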

Looking at the unique values, we can see there is no zero Income. There could be cases in which the customer has no income and the amount spent in the app comes from a relative. We can also see an atypical amount of 666,666.0, which is suspicious both because of the exact repeated digits and because of how far it is from the rest of the data.

df_raw['Income'].drop_duplicates().reset_index(drop=True).sort_values(ascending=True)

df_raw['Income'].plot(kind='hist', bins=100)

df_raw = df_raw[df_raw['Income'] != 666666.0]
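The 666,666 value was spotted by eye and removed by exact match. One common way to flag such values programmatically is the interquartile range (IQR) rule. This is a different technique from the one the notebook uses; the incomes below are made up, and the 1.5 multiplier is a convention assumed here for illustration.

```python
import pandas as pd

# Made-up incomes standing in for df_raw['Income'].
income = pd.Series([30000.0, 42000.0, 51000.0, 58000.0, 64000.0, 75000.0, 666666.0])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences; the multiplier is a convention, not a law.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing incomes are never flagged here.
is_outlier = (income < lower) | (income > upper)
print(income[is_outlier])
```

On real data this rule may flag more than one value, so the flagged rows are worth inspecting before any are dropped.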

# Assign back: fillna(inplace=True) on a column selection may modify a copy.
df_raw = df_raw.assign(Income=df_raw['Income'].fillna(0))
df_raw

With the Income column filled in, the summary below shows more of the variables. The minimum Year_Birth is 1893, which is not plausible: in 2024 that customer would be 131 years old.

df_raw.describe()

In fact, there are other records with ages over 100 years.

df_raw['Year_Birth'].plot(kind='hist', bins=100)

df_raw = df_raw[2024 - df_raw['Year_Birth'] < 100]
df_raw['Year_Birth'].plot(kind='hist', bins=100)

df_raw.describe()

variable_types = pd.DataFrame(df_raw.dtypes).reset_index()
variable_types

numerical_var = (
    variable_types[
        (variable_types[0] == 'int64') | (variable_types[0] == 'float64')
    ]
    .reset_index(drop=True)
)
numerical_var

categorical_var = (
    variable_types[
        (variable_types[0] != 'int64') & (variable_types[0] != 'float64')
    ]
    .reset_index(drop=True)
)
categorical_var

We can convert Dt_Customer into the time each person has been a customer. Days would work, but years are easier to read.

df_raw['Dt_Customer'] = pd.to_datetime(df_raw['Dt_Customer'])

df_raw['years_being_customer'] = (datetime.today() - df_raw['Dt_Customer']).dt.days / 365
df_raw['years_being_customer'].plot(kind='hist', bins=100)
plt.show()
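Because `datetime.today()` changes every day, the tenure values shift each time the notebook is rerun. A sketch of the same calculation against a fixed reference date makes the result reproducible. The sign-up dates and the `reference_date` value are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Made-up sign-up dates standing in for df_raw['Dt_Customer'].
signup = pd.to_datetime(pd.Series(["2012-09-04", "2014-03-08", "2013-08-21"]))

# Hypothetical snapshot date; ideally the date the data was extracted.
reference_date = pd.Timestamp("2024-01-01")

# Same formula as above, but anchored to a fixed point in time.
years_being_customer = (reference_date - signup).dt.days / 365
print(years_being_customer.round(2))
```

Since the correlations only depend on relative differences in tenure, the choice of reference date shifts every value by the same amount and does not change them.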

Now we can delete Dt_Customer, and at the same time drop ID, which does not contribute to this analysis.

df_raw.drop(['ID', 'Dt_Customer'], axis=1, inplace=True)

Next we drop every customer with a value of 0 in any of the Mnt columns, each of which holds the amount spent on one product category. The analysis is therefore based only on customers who have bought from every category on the platform, so that our targets rest on historical purchases.

product_columns = df_raw.filter(like='Mnt').columns
df_raw = df_raw[(df_raw[product_columns] > 0).all(axis=1)]
df_raw.describe()
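The `.all(axis=1)` condition is strict: a single zero in any Mnt column removes the customer. The sketch below contrasts it with `.any(axis=1)`, which keeps anyone who bought at least one kind of product. The toy frame and its values are made up, and the looser filter is shown only as an alternative, not as what this analysis uses.

```python
import pandas as pd

# Made-up spending data; only the Mnt* columns take part in the filter.
toy = pd.DataFrame({
    "MntWines": [120, 0, 35, 0],
    "MntFruits": [10, 0, 0, 0],
    "MntMeatProducts": [80, 15, 20, 0],
    "Income": [60000, 25000, 40000, 0],
})

mnt_cols = toy.filter(like="Mnt").columns

# Strict: keep customers who bought from every category (the filter used above).
bought_all = toy[(toy[mnt_cols] > 0).all(axis=1)]

# Looser alternative: keep customers who bought from at least one category.
bought_any = toy[(toy[mnt_cols] > 0).any(axis=1)]

print(len(bought_all), len(bought_any))
```

Comparing the two row counts on the real data shows how many customers the strict filter discards.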

df = df_raw

View correlations

Accepted Campaign #1

corr = df.select_dtypes(include=['int64', 'float64']).corr()
corr['AcceptedCmp1'].sort_values(ascending=False)

We can observe a moderate correlation between accepting campaign 1 and accepting campaign 5.

# Rows are AcceptedCmp1 values and columns are AcceptedCmp5 values.
contingency_tb = pd.crosstab(df['AcceptedCmp1'], df['AcceptedCmp5'])
contingency_tb = contingency_tb.rename(
    index={0: 'No AcceptedCmp1', 1: 'AcceptedCmp1'},
    columns={0: 'No AcceptedCmp5', 1: 'AcceptedCmp5'},
)
contingency_tb
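For two 0/1 variables, the Pearson correlation reported by `.corr()` is the same number as the phi coefficient computed from the four cells of the contingency table. The sketch below checks this on a small made-up sample; the toy values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up acceptance flags for eight customers.
toy = pd.DataFrame({
    "AcceptedCmp1": [1, 1, 0, 0, 0, 1, 0, 0],
    "AcceptedCmp5": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Rows index AcceptedCmp1, columns index AcceptedCmp5.
table = pd.crosstab(toy["AcceptedCmp1"], toy["AcceptedCmp5"])
n00, n01 = table.loc[0, 0], table.loc[0, 1]
n10, n11 = table.loc[1, 0], table.loc[1, 1]

# Phi coefficient: cross-product difference over the root of the marginal products.
phi = (n11 * n00 - n10 * n01) / np.sqrt(
    (n10 + n11) * (n00 + n01) * (n01 + n11) * (n00 + n10)
)

pearson = toy["AcceptedCmp1"].corr(toy["AcceptedCmp5"])
print(round(phi, 4), round(pearson, 4))
```

Reading the table and the coefficient together helps: the coefficient summarizes the association, while the cell counts show how many customers it is based on.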

Comparing the correlation columns for campaigns 1 and 5, we can see that they share some correlated variables. For campaign 1, the only moderate correlation is with campaign 5; the rest are weak. Campaign 5, in addition, has a moderate correlation with MntWines.

AcceptedCmp1_corr = corr['AcceptedCmp1'].sort_values(ascending=False)
AcceptedCmp5_corr = corr['AcceptedCmp5'].sort_values(ascending=False)
pd.DataFrame({
    'AcceptedCmp1_corr_var': AcceptedCmp1_corr.index,
    'AcceptedCmp1_corr': AcceptedCmp1_corr.values,
    'AcceptedCmp5_corr_var': AcceptedCmp5_corr.index,
    'AcceptedCmp5_corr': AcceptedCmp5_corr.values,
})

Now let's see correlations for the products

product_columns

MntWines_corr = corr['MntWines'].sort_values(ascending=False)
MntFruits_corr = corr['MntFruits'].sort_values(ascending=False)
MntMeatProducts_corr = corr['MntMeatProducts'].sort_values(ascending=False)
MntFishProducts_corr = corr['MntFishProducts'].sort_values(ascending=False)
MntSweetProducts_corr = corr['MntSweetProducts'].sort_values(ascending=False)
MntGoldProds_corr = corr['MntGoldProds'].sort_values(ascending=False)

pd.DataFrame({
    'MntWines_corr_var': MntWines_corr.index,
    'MntWines_corr': MntWines_corr.values,
    'MntFruits_corr_var': MntFruits_corr.index,
    'MntFruits_corr': MntFruits_corr.values,
    'MntMeatProducts_corr_var': MntMeatProducts_corr.index,
    'MntMeatProducts_corr': MntMeatProducts_corr.values,
    'MntFishProducts_corr_var': MntFishProducts_corr.index,
    'MntFishProducts_corr': MntFishProducts_corr.values,
    'MntSweetProducts_corr_var': MntSweetProducts_corr.index,
    'MntSweetProducts_corr': MntSweetProducts_corr.values,
    'MntGoldProds_corr_var': MntGoldProds_corr.index,
    'MntGoldProds_corr': MntGoldProds_corr.values
})

There are no strong correlations in this dataset, but there are moderate ones. They approach strong in the case of wines, which have a correlation of 0.67 with the Income variable and 0.61 with catalog purchases. Meat products behave similarly, with a correlation of 0.68 with catalog sales and 0.64 with Income. The two products have a moderate correlation of 0.57 with each other, which, together with the correlations seen for campaigns 1 and 5, suggests that they attract moderately related groups of customers.
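The labels "weak", "moderate" and "strong" are used here without explicit cut-offs. The sketch below makes one possible convention explicit and applies it to the wine correlations quoted above. The thresholds (0.3 and 0.7) are an assumption chosen to be consistent with how the text uses the labels; they are not a standard and not necessarily the author's.

```python
import pandas as pd

# Correlations with MntWines as quoted in the text above.
r = pd.Series({
    "Income": 0.67,
    "catalog purchases": 0.61,
    "meat products": 0.57,
})

# Assumed convention: |r| <= 0.3 weak, 0.3 < |r| <= 0.7 moderate, above that strong.
strength = pd.cut(
    r.abs(),
    bins=[0, 0.3, 0.7, 1.0],
    labels=["weak", "moderate", "strong"],
    include_lowest=True,
)
print(pd.DataFrame({"r": r, "strength": strength}))
```

Under these cut-offs all three values fall in the moderate band, with 0.67 close to its upper edge.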

sns.scatterplot(
    data=df,
    x='Income',
    y='MntWines',
    hue='AcceptedCmp1',
    palette=['tab:blue', 'tab:orange'],
    legend='full',
    alpha=0.5
)

sns.scatterplot(
    data=df,
    x='Income',
    y='MntWines',
    hue='AcceptedCmp5',
    palette=['tab:blue', 'tab:orange'],
    legend='full',
    alpha=0.5
)

sns.scatterplot(
    data=df,
    x='Income',
    y='MntMeatProducts',
    hue='AcceptedCmp1',
    palette=['tab:blue', 'tab:orange'],
    legend='full',
    alpha=0.5
)

sns.scatterplot(
    data=df,
    x='Income',
    y='MntMeatProducts',
    hue='AcceptedCmp5',
    palette=['tab:blue', 'tab:orange'],
    legend='full',
    alpha=0.5
)

sns.scatterplot(
    data=df,
    x='MntWines',
    y='MntMeatProducts',
    legend='full',
    alpha=0.5
)

Since these campaigns are similar, we can perform a second, more detailed analysis of the wine and meat products: look at the demographic ranges of the customers who buy them, combine both campaigns into one dataset, and extract more specific customer segments by clustering with K-means.
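As a preview of that clustering step, here is a minimal sketch of K-means on income and spending features. Everything in it is an assumption made for illustration: the customers are synthetic, the three feature columns are only one possible selection, and `n_clusters=2` matches the synthetic data rather than anything learned from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic customers: two loose groups, invented purely for illustration.
low = pd.DataFrame({
    "Income": rng.normal(30000, 4000, 50),
    "MntWines": rng.normal(50, 15, 50),
    "MntMeatProducts": rng.normal(40, 10, 50),
})
high = pd.DataFrame({
    "Income": rng.normal(80000, 5000, 50),
    "MntWines": rng.normal(600, 80, 50),
    "MntMeatProducts": rng.normal(450, 60, 50),
})
customers = pd.concat([low, high], ignore_index=True)

# Scale first: K-means uses Euclidean distance, so Income would otherwise dominate.
scaled = StandardScaler().fit_transform(customers)

# n_clusters=2 is an assumption that fits the synthetic data; on real data,
# compare several values (for example with the elbow method or silhouette score).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["cluster"] = kmeans.fit_predict(scaled)

# Per-cluster averages give a first description of each segment.
print(customers.groupby("cluster").mean().round(0))
```

The per-cluster means are a starting point for describing each segment; on the real data they could be extended with the demographic variables and the campaign acceptance flags.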
