# Customer Classification with a Logistic Regression Model

## About the Data Set

## Libraries

## Load Data Set

| | User ID | Gender |
|---|---|---|
| 0 | 385 | Male |
| 1 | 681 | Male |
| 2 | 353 | Male |
| 3 | 895 | Male |
| 4 | 661 | Male |

## Analysis

### Understanding our Data Set

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   User ID       1000 non-null   int64
 1   Gender        1000 non-null   object
 2   Age           1000 non-null   int64
 3   AnnualSalary  1000 non-null   int64
 4   Purchased     1000 non-null   int64
dtypes: int64(4), object(1)
memory usage: 39.2+ KB
```

The data set is complete: every column has 1000 non-null values, so missing data will not be a problem in our analysis. One thing we must take into account, however, is the data type of two variables, Gender and Purchased, which are stored as `object` and `int64` but which, as far as we know, are categorical. As we move forward in the analysis, we will confirm the nature of these two variables and take action based on that.
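The completeness and data type checks described above can be sketched with pandas as follows. The rows are made up and only mimic the column layout reported by `info()`; the variable name `missing_per_column` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of the data set.
df = pd.DataFrame({
    "User ID": [1, 2, 3, 4, 5, 6],
    "Gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "Age": [35, 40, 49, 40, 25, 58],
    "AnnualSalary": [20000, 43500, 74000, 107500, 79000, 120000],
    "Purchased": [0, 0, 0, 1, 0, 1],
})

# Count missing values per column; all zeros means the data is complete.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Inspect the storage type pandas inferred for each column.
print(df.dtypes)
```

A column of zeros from `isna().sum()` corresponds to the "1000 non-null" counts in the `info()` output.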

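| | Age | AnnualSalary |
|---|---|---|
| count | 1000 | 1000 |
| mean | 40.106 | 72689 |
| std | 10.707072681429104 | 34488.34186685009 |
| min | 18 | 15000 |
| 25% | 32 | 46375 |
| 50% | 40 | 72000 |
| 75% | 48 | 90000 |
| max | 63 | 152500 |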

In these graphs we can see how people are distributed within the data set. There is a slight difference between the number of men and women registered, but it is not large enough to be a problem for our model, so we can continue.
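The same balance check can be done numerically rather than graphically. This is a minimal sketch on a made-up Gender column, not the real data; `counts` and `shares` are illustrative names.

```python
import pandas as pd

# Made-up Gender column standing in for the real one.
gender = pd.Series(["Male", "Female", "Female", "Male", "Female"], name="Gender")

# Absolute number of rows per gender.
counts = gender.value_counts()

# Share of each gender; values near 0.5 indicate a balanced column.
shares = gender.value_counts(normalize=True)

print(counts)
print(shares)
```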

### Exploratory Data Analysis

As we said before, two variables in our data set possibly had an incorrect data type, and at this point we can confirm that. Gender and Purchased each take only two distinct values, so we can treat both as categorical variables, which helps both the analysis and the model handle them correctly.
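One way to confirm that a column is binary and then mark it as categorical is sketched below. The data is made up, and using the pandas `category` dtype is an assumption about how the conversion was done.

```python
import pandas as pd

# Made-up rows with the two columns under inspection.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Purchased": [0, 1, 0, 1],
})

# Number of distinct values per column; 2 suggests a binary categorical variable.
distinct = df[["Gender", "Purchased"]].nunique()
print(distinct)

# Convert each column to the pandas categorical dtype.
for col in ["Gender", "Purchased"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```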

## Data Preparation

| | Age | AnnualSalary |
|---|---|---|
| 0 | 35 | 20000 |
| 1 | 40 | 43500 |
| 2 | 49 | 74000 |
| 3 | 40 | 107500 |
| 4 | 25 | 79000 |

Now it is time to prepare our data. First we drop the User ID column, because it is not useful to our model. We also modify the values of the Gender variable: a Logistic Regression model does not accept words or letters, so we transform Gender into an indicator variable in which 1 marks men and 0 marks women.
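These two preparation steps could look like the following sketch. The rows are made up, and the use of `Series.map` for the encoding is an assumption; other encoders would give the same result.

```python
import pandas as pd

# Made-up rows with the same columns as the data set.
df = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [35, 40, 49, 25],
    "AnnualSalary": [20000, 43500, 74000, 79000],
    "Purchased": [0, 0, 1, 0],
})

# Drop the identifier column, which carries no predictive information.
df = df.drop(columns=["User ID"])

# Encode Gender as an indicator variable: 1 for men, 0 for women.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

print(df)
```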

Next we create our X and Y variables. Before separating the data into different sets, we standardize the values of X, because its columns have very different ranges (Age runs from 18 to 63, AnnualSalary from 15000 to 152500) and this could distort the results of our model. Once the data is standardized, we split it into training and evaluation sets.
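The standardize-then-split order described above can be sketched with scikit-learn. The rows are made up. The 20% evaluation share is an assumption, chosen because it matches the 200 samples in the confusion matrix; the seed and the `stratify` argument are illustrative choices as well.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up, already-encoded rows (Gender: 1 = men, 0 = women).
df = pd.DataFrame({
    "Gender": [1, 0, 0, 1, 0, 1, 1, 0, 0, 1],
    "Age": [35, 40, 49, 40, 25, 58, 22, 61, 30, 47],
    "AnnualSalary": [20000, 43500, 74000, 107500, 79000,
                     120000, 18000, 95000, 52000, 133000],
    "Purchased": [0, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Features and target.
X = df.drop(columns=["Purchased"])
y = df["Purchased"]

# Standardize every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# Hold out 20% for evaluation (assumed ratio); fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

A common alternative is to split first and fit the scaler on the training set only, so that no statistics from the evaluation set influence the preprocessing.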

## Model

| | | |
|---|---|---|
| True | 106 | 16 |
| False | 22 | 56 |

```
------ ROC AUC SCORE ------
91.331
```

Finally, after creating and training our model, we make predictions and verify its effectiveness in two ways. The confusion matrix shows 106 + 56 = 162 of the 200 evaluation samples on the diagonal, which, reading the diagonal as correct predictions, is an accuracy of 81%. We also use cross-validation with a ROC AUC score, which comes to about 91.3%. With these results, we can use our model to try to predict whether or not a person will buy a car.
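The training and evaluation steps can be sketched as follows. This is not the original notebook code: the data is synthetic (generated with `make_classification`), and the 5-fold setting, the seeds and the default `LogisticRegression` hyperparameters are all assumptions, so the numbers it prints will differ from those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data: 1000 samples, 3 features.
X, y = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Standardize, then hold out 20% for evaluation (assumed ratio).
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the model and predict on the evaluation set.
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Cross-validated ROC AUC over the full data (5 folds is an assumption).
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print("------ ROC AUC SCORE ------")
print(f"{scores.mean() * 100:.3f}")
```

In scikit-learn's `confusion_matrix`, the diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total.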