   ID       CODE_GENDER
0  5008804  M
1  5008805  M
2  5008806  M
3  5008808  F
4  5008809  F
   ID       MONTHS_BALANCE
0  5001711   0
1  5001711  -1
2  5001711  -2
3  5001711  -3
4  5001712   0
(1048575, 3)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype
---  ------          --------------    -----
 0   ID              1048575 non-null  int64
 1   MONTHS_BALANCE  1048575 non-null  int64
 2   STATUS          1048575 non-null  object
dtypes: int64(2), object(1)
memory usage: 24.0+ MB
(438557, 18)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 438557 entries, 0 to 438556
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype
---  ------               --------------   -----
 0   ID                   438557 non-null  int64
 1   CODE_GENDER          438557 non-null  object
 2   FLAG_OWN_CAR         438557 non-null  object
 3   FLAG_OWN_REALTY      438557 non-null  object
 4   CNT_CHILDREN         438557 non-null  int64
 5   AMT_INCOME_TOTAL     438557 non-null  float64
 6   NAME_INCOME_TYPE     438557 non-null  object
 7   NAME_EDUCATION_TYPE  438557 non-null  object
 8   NAME_FAMILY_STATUS   438557 non-null  object
 9   NAME_HOUSING_TYPE    438557 non-null  object
 10  DAYS_BIRTH           438557 non-null  int64
 11  DAYS_EMPLOYED        438557 non-null  int64
 12  FLAG_MOBIL           438557 non-null  int64
 13  FLAG_WORK_PHONE      438557 non-null  int64
 14  FLAG_PHONE           438557 non-null  int64
 15  FLAG_EMAIL           438557 non-null  int64
 16  OCCUPATION_TYPE      304354 non-null  object
 17  CNT_FAM_MEMBERS      438557 non-null  float64
dtypes: float64(2), int64(8), object(8)
memory usage: 60.2+ MB
0   ID                   0
1   CODE_GENDER          0
2   FLAG_OWN_CAR         0
3   FLAG_OWN_REALTY      0
4   CNT_CHILDREN         0
5   AMT_INCOME_TOTAL     0
6   NAME_INCOME_TYPE     0
7   NAME_EDUCATION_TYPE  0
8   NAME_FAMILY_STATUS   0
9   NAME_HOUSING_TYPE    0
10  DAYS_BIRTH           0
11  DAYS_EMPLOYED        0
12  FLAG_MOBIL           0
13  FLAG_WORK_PHONE      0
14  FLAG_PHONE           0
15  FLAG_EMAIL           0
16  OCCUPATION_TYPE      134203
17  CNT_FAM_MEMBERS      0
0  5008804   0
1  5008804  -1
2  5008804  -2
3  5008804  -3
4  5008804  -4
5  5008804  -5
6  5008804  -6
7  5008804  -7
8  5008804  -8
9  5008804  -9
0    5008804  0
16   5008805  0
31   5008806  0
61   5008808  0
71   5008810  0
98   5008811  0
154  5008813  0
188  5008815  0
211  5008821  0
234  5008824  0
<ipython-input-12-735bc8430817>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['Edad'] = (data['DAYS_BIRTH'] / 365) * -1
<ipython-input-12-735bc8430817>:8: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['Empleo'] = data.DAYS_EMPLOYED.apply(tiene_empleo)
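The SettingWithCopyWarning messages in this log appear when columns are assigned on a DataFrame that pandas treats as a slice of another frame. The following is a minimal sketch of one common way to avoid it, assuming `data` was produced by filtering a larger frame (the log does not show how it was built). The toy frame `raw`, its values, and the body of `tiene_empleo` are hypothetical; only the column names come from the log.

```python
import pandas as pd

# Hypothetical stand-in for the real frame; only the column names come from the log.
raw = pd.DataFrame({
    "ID": [1, 2, 3],
    "DAYS_BIRTH": [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, 365243, -3051],
})

def tiene_empleo(days_employed):
    # Assumed rule (not shown in the log): a negative day count means currently employed.
    return 1 if days_employed < 0 else 0

# Take an explicit copy of the filtered rows so later assignments
# modify an independent frame instead of a slice of `raw`.
data = raw[raw["ID"] > 0].copy()

data["Edad"] = (data["DAYS_BIRTH"] / 365) * -1
data["Empleo"] = data["DAYS_EMPLOYED"].apply(tiene_empleo)

# Returning a new frame avoids an in-place drop on a possible slice.
data = data.drop(columns=["DAYS_BIRTH", "DAYS_EMPLOYED"])
data["Edad"] = data["Edad"].astype(int)
```

Copying once, right after the filter, keeps every later assignment unambiguous; `raw` is left untouched.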
0    5008804  0
16   5008805  0
31   5008806  0
61   5008808  0
71   5008810  0
98   5008811  0
154  5008813  0
188  5008815  0
211  5008821  0
234  5008824  0
<ipython-input-13-160f3f9f692b>:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data.drop(['DAYS_BIRTH', 'DAYS_EMPLOYED'], axis=1, inplace=True)
float64
int64
<ipython-input-14-7a285713298a>:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['Edad'] = data['Edad'].astype(int)
<ipython-input-15-9f1714e3feac>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data['Target'] = data.STATUS.apply(target)
0    5008804  0
16   5008805  0
31   5008806  0
61   5008808  0
71   5008810  0
98   5008811  0
154  5008813  0
188  5008815  0
211  5008821  0
234  5008824  0
0    5008804  Y
16   5008805  Y
31   5008806  Y
61   5008808  N
71   5008810  N
98   5008811  N
154  5008813  N
188  5008815  Y
211  5008821  Y
234  5008824  Y
5008804  1  1
5008805  1  1
5008806  1  1
5008808  0  1
5008810  0  1
5008811  0  1
5008813  0  1
5008815  1  1
5008821  1  1
5008824  1  1
count  24672  24672
mean   0      1
std    0      0
min    0      0
25%    0      0
50%    0      1
75%    1      1
max    1      1
/shared-libs/python3.9/py/lib/python3.9/site-packages/seaborn/distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
Variable FLAG_OWN_CAR outliers = 0.00%
Variable FLAG_OWN_REALTY outliers = 0.00%
Variable CNT_CHILDREN outliers = 1.41%
Variable AMT_INCOME_TOTAL outliers = 4.27%
Variable CNT_FAM_MEMBERS outliers = 1.34%
Variable Edad outliers = 0.00%
Variable Empleo outliers = 100.00%
Variable Target outliers = 100.00%
<ipython-input-25-bf4e26441da1>:2: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
data_limpia = data_limpia[data_procesada.CNT_CHILDREN <= 2]
<ipython-input-25-bf4e26441da1>:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
data_limpia = data_limpia[data_procesada.CNT_FAM_MEMBERS <= 4]
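The two UserWarnings above are raised because the boolean mask is computed from `data_procesada` while the frame being filtered is `data_limpia`, so pandas has to realign the mask by index. A minimal sketch of one way to avoid that, assuming both conditions are meant to apply to the same rows: build the mask from a single frame and apply it in one step. The toy values and index are hypothetical; the column names and thresholds come from the log.

```python
import pandas as pd

# Hypothetical toy data; only the column names and thresholds come from the log.
data_procesada = pd.DataFrame(
    {"CNT_CHILDREN": [0, 1, 3, 2, 0], "CNT_FAM_MEMBERS": [2, 3, 5, 4, 6]},
    index=[5008804, 5008805, 5008806, 5008808, 5008810],
)

# Build both conditions from the same frame so no index realignment is needed.
mask = (data_procesada["CNT_CHILDREN"] <= 2) & (data_procesada["CNT_FAM_MEMBERS"] <= 4)

# .loc with an aligned mask, plus an explicit copy for later modifications.
data_limpia = data_procesada.loc[mask].copy()

print(data_procesada.shape)
print(data_limpia.shape)
```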
(24672, 8)
(23409, 8)
5008806  1  1
5008808  0  1
5008810  0  1
5008811  0  1
5008813  0  1
5008815  1  1
5008821  1  1
5008824  1  1
5008825  1  0
5008827  1  1
Collecting imblearn
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
Downloading imbalanced_learn-0.9.0-py3-none-any.whl (199 kB)
Requirement already satisfied: joblib>=0.11 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (3.1.0)
Requirement already satisfied: scipy>=1.1.0 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.8.0)
Requirement already satisfied: scikit-learn>=1.0.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.0.2)
Requirement already satisfied: numpy>=1.14.6 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from imbalanced-learn->imblearn) (1.22.3)
Installing collected packages: imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.9.0 imblearn-0.0
Accuracy on the training set: 0.6411326470679155
Accuracy on the evaluation set: 0.6436350522899387
Accuracy on the training set: 0.96343009057467
Accuracy on the evaluation set: 0.9542733501622791
Recall:
Recall on the training set: 0.9488685544701373
Recall on the evaluation set: 0.939564402134718
Precision:
Precision on the training set: 0.9773291727695345
Precision on the evaluation set: 0.9680487442413435
F1 Score:
F1 score on the training set: 0.9628886030680428
F1 score on the evaluation set: 0.9535939101156492
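The F1 score is the harmonic mean of precision and recall, so the figures printed above can be cross-checked against each other. A small sketch, using the evaluation-set precision and recall from the block above; the helper name `f1_from` is only illustrative.

```python
# Evaluation-set precision, recall, and F1, copied from the output above.
precision = 0.9680487442413435
recall = 0.939564402134718
reported_f1 = 0.9535939101156492

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_from(precision, recall)

# Difference between the recomputed and the reported value.
print(abs(f1 - reported_f1))
```

The same check can be applied to any of the other metric blocks in this log by swapping in their precision and recall.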
Accuracy on the training set: 0.5015920121178398
Accuracy on the evaluation set: 0.4980887125856473
Recall:
Recall on the training set: 0.6349078768393718
Recall on the evaluation set: 0.630318765325256
Precision:
Precision on the training set: 0.5012446917557476
Precision on the evaluation set: 0.4985169974903034
F1 Score:
F1 score on the training set: 0.5602138512315539
F1 score on the evaluation set: 0.55672335817568
Evaluation-set accuracy of the decision tree: 0.6436350522899387
Evaluation-set accuracy of the random forest: 0.9542733501622791
Evaluation-set accuracy of the logistic regression: 0.4980887125856473
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning:
30 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 450, in fit
trees = Parallel(
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
if self.dispatch_one_batch(iterator):
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/parallel.py", line 861, in dispatch_one_batch
self._dispatch(tasks)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/parallel.py", line 779, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
result = ImmediateResult(func)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 572, in __init__
self.results = batch()
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/parallel.py", line 262, in __call__
return [func(*args, **kwargs)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/joblib/parallel.py", line 262, in <listcomp>
return [func(*args, **kwargs)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/fixes.py", line 216, in __call__
return self.function(*args, **kwargs)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 185, in _parallel_build_trees
tree.fit(X, y, sample_weight=curr_sample_weight, check_input=False)
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 937, in fit
super().fit(
File "/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 250, in fit
raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1
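The ValueError above indicates that the search sampled `min_samples_split=1`, which the tree estimator rejects. A hedged sketch of one way to prevent this: start the integer range at 2 and check the candidates against the rule quoted in the error message before launching the search. The search space itself is not shown in the log, so `param_distributions`, its bounds, and the helper `is_valid_min_samples_split` are hypothetical.

```python
# Hypothetical search space; the original one is not shown in the log.
param_distributions = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 21)),
    "min_samples_leaf": list(range(1, 11)),
    # Start at 2: the error message says integers must be greater than 1.
    "min_samples_split": list(range(2, 11)),
}

def is_valid_min_samples_split(value):
    """Mirror the rule quoted in the error: int > 1, or float in (0.0, 1.0]."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 1
    if isinstance(value, float):
        return 0.0 < value <= 1.0
    return False

# Collect any candidate that would make a fit fail.
invalid = [
    v for v in param_distributions["min_samples_split"]
    if not is_valid_min_samples_split(v)
]
print(invalid)
```

Validating the space up front avoids spending fits on candidates whose scores would be recorded as nan.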
/shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/model_selection/_search.py:969: UserWarning:
One or more of the test scores are non-finite: [0.94151292 0.86339601 0.91805017 0.89838946 0.86293226 0.64323466
nan 0.76568062 0.66329739 0.92636557 0.84181858 0.9156388
0.87718321 0.93347565 0.85764626 0.92988965 0.811277 0.73866325
nan 0.90481937 0.76849368 0.93492847 0.89090855 0.66329739
0.72741025 0.86188127 0.9266748 0.89829667 0.90182076 0.78413541
0.90092432 0.82330179 0.91495881 0.80630016 0.88506593 0.92639657
0.87600841 0.75551013 0.68308085 0.86491054 0.92451084 0.76342373
0.75931246 0.94352217 0.91464964 0.88228384 0.89208328 0.92642741
0.64416213 nan nan 0.78694838 0.94240934 0.92315077
0.89359794 0.93007523 0.72920314 0.68657391 0.76153829 0.91789551
0.91427871 0.85356557 0.90475758 0.91718443 0.83931474 0.77684001
0.77449071 0.90689052 0.90132609 0.70119613 0.84209724 0.77201757
0.63739221 0.84231339 nan 0.88930104 0.8680948 0.88985749
0.81925244 0.78447524 0.89786391 0.78781394 0.71000667 0.82008723
0.88549881 0.92308888 0.65532137 0.76966831 0.8987914 nan
0.72586538 0.82135438 0.82614607 0.8957 0.85359642 0.93041519
0.87641037 0.80614546 0.75643748 0.7876597 ]
Best parameters: {'criterion': 'entropy', 'max_depth': 19, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 6, 'random_state': 6}
Best score: 0.9435221743552594
0  1.741157627    0.04761143871
1  1.570005083    0.03744789727
2  1.578596067    0.05041297753
3  1.765538073    0.06826824218
4  1.35671196     0.01747660497
5  0.6017141342   0.01189156825
6  0.06520042419  0.002758269377
7  0.929206562    0.039771927
8  0.5702732086   0.01169563289
9  1.616813755    0.07309009584
Accuracy: 0.9509556437071763
Accuracy on the training set: 0.7924510804043402
Accuracy on the evaluation set: 0.7886043995672557
Recall:
Recall on the training set: 0.8537569026493764
Recall on the evaluation set: 0.8509298998569385
Precision:
Precision on the training set: 0.7595076447535464
Precision on the evaluation set: 0.7589638892433329
F1 Score:
F1 score on the training set: 0.8038791844365251
F1 score on the evaluation set: 0.8023200917245565
Collecting xgboost
Downloading xgboost-1.5.2-py3-none-manylinux2014_x86_64.whl (173.6 MB)
Requirement already satisfied: scipy in /shared-libs/python3.9/py/lib/python3.9/site-packages (from xgboost) (1.8.0)
Requirement already satisfied: numpy in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from xgboost) (1.22.3)
Installing collected packages: xgboost
Successfully installed xgboost-1.5.2
/root/venv/lib/python3.9/site-packages/xgboost/compat.py:36: FutureWarning:
pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
/root/venv/lib/python3.9/site-packages/xgboost/sklearn.py:1224: UserWarning:
The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
/root/venv/lib/python3.9/site-packages/xgboost/data.py:262: FutureWarning:
pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
[03:08:06] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
Accuracy on the training set: 0.9264583140127979
Accuracy on the evaluation set: 0.9191489361702128
Recall:
Recall on the training set: 0.9371944805395706
Recall on the evaluation set: 0.9297437374028218
Precision:
Precision on the training set: 0.9173834039975772
Precision on the evaluation set: 0.9107319136934142
F1 Score:
F1 score on the training set: 0.9271831287686326
F1 score on the evaluation set: 0.9201396309752796