R2C Exploratory Data Analysis & Churn Study
The purpose of this study is to conduct exploratory data analysis (EDA) and churn analysis on an anonymized sample of Semgrep’s usage data, in which each row represents a single scan of a project using Semgrep. We want to understand which metrics are associated with churn, and which additional data features we can recommend collecting.
Executive Summary
Number of rows = 100000
Number of features = 17
=== Data Types ===
Unnamed: 0 int64
PROJECT_HASH object
EVENT_TIMESTAMP object
DERIVED_PLATFORM object
USER_AGENT object
SEMGREP_VERSION object
N_FINDINGS int64
N_MUTED int64
CONFIG_NAMES_HASH object
RULES_HASH object
ERRORS object
RETURN_CODE float64
SCAN_RUN_TIME float64
TOTAL_BYTES_SCANNED int64
N_RULES int64
N_TARGETS int64
RULES_WITH_FINDINGS object
dtype: object
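The summary above looks like the output of a standard pandas shape and dtype check. A minimal sketch of how it could be reproduced is shown here on a tiny synthetic frame; the values and the timestamp format are assumptions made for illustration, not taken from the real data. It also shows one possible follow-up: EVENT_TIMESTAMP is listed as `object`, so it would need to be parsed before any date arithmetic.

```python
import pandas as pd

# Tiny synthetic stand-in for the scan data; all values are made up.
df = pd.DataFrame(
    {
        "PROJECT_HASH": ["a1", "a1", "b2"],
        "EVENT_TIMESTAMP": [
            "2021-03-01 10:00:00",  # assumed format, for illustration only
            "2021-03-02 11:30:00",
            "2021-03-01 09:15:00",
        ],
        "N_FINDINGS": [3, 0, 7],
        "SCAN_RUN_TIME": [1.2, 0.8, 5.4],
    }
)

n_rows, n_features = df.shape
print(f"Number of rows = {n_rows}")
print(f"Number of features = {n_features}")

# Timestamps stored as text are not datetimes until they are parsed.
df["EVENT_TIMESTAMP"] = pd.to_datetime(df["EVENT_TIMESTAMP"])
print(df.dtypes)
```

Once the column is a datetime, per-project quantities such as the time since the last scan can be derived with ordinary pandas date operations.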
Number of churned projects: 1824
% of churned projects: 62.9%
df_model
df_model.describe()
Class Imbalance (Count)
0 94408
1 5592
Name: isChurned, dtype: int64
Class Imbalance (Ratio)
0 94.41
1 5.59
Name: isChurned, dtype: float64
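The two class-imbalance summaries above can be reproduced with `value_counts`. The sketch below rebuilds a label column from the reported counts only (the real rows are not used), so it is a consistency check on the numbers rather than the original notebook code.

```python
import pandas as pd

# Label column reconstructed from the reported counts (0: 94408, 1: 5592).
is_churned = pd.Series([0] * 94408 + [1] * 5592, name="isChurned")

# Absolute counts per class.
counts = is_churned.value_counts()

# Share of each class, as a percentage rounded to two decimals.
ratios = (is_churned.value_counts(normalize=True) * 100).round(2)

print(counts)
print(ratios)
```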
== New Class Imbalance (Ratio) in cross_val_df ==
0 66.67
1 33.33
Name: isChurned, dtype: float64
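The report does not say how the 66.67 / 33.33 split in cross_val_df was obtained. One common way to reach a 2:1 ratio is random undersampling of the majority class, sketched below on synthetic data. This is an assumption: the original may have used oversampling or another technique, and `train_df` and its values are hypothetical.

```python
import pandas as pd

# Synthetic training frame: 900 retained scans and 100 churned scans.
train_df = pd.DataFrame({"isChurned": [0] * 900 + [1] * 100})
train_df["N_FINDINGS"] = range(len(train_df))  # placeholder feature

churned = train_df[train_df["isChurned"] == 1]
retained = train_df[train_df["isChurned"] == 0]

# Keep every churned row and sample two retained rows per churned row.
retained_sample = retained.sample(n=2 * len(churned), random_state=42)

# Concatenate and shuffle so the classes are interleaved.
cross_val_df = pd.concat([churned, retained_sample]).sample(frac=1, random_state=42)

print((cross_val_df["isChurned"].value_counts(normalize=True) * 100).round(2))
```

Whatever the resampling method, it should be applied to the training data only. The test-set support reported later (28409 vs. 1591) keeps roughly the original imbalance, which is what makes the test metrics representative.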
=== XGB classification_report on TEST dataset ===
              precision    recall  f1-score  support
0              0.986026  0.948819  0.967065    28409
1              0.453999  0.759899  0.568406     1591
accuracy                           0.938800    30000
macro avg      0.720013  0.854359  0.767736    30000
weighted avg   0.957811  0.938800  0.945923    30000
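To make the relationships between the rows of this report explicit, the sketch below recomputes the F1 scores, the macro and weighted averages, and the accuracy from the per-class precision, recall, and support alone. The inputs are copied from the report; the helper `f1` and the variable names are introduced here for illustration.

```python
# Per-class precision, recall, and support copied from the report above.
report = {
    0: {"precision": 0.986026, "recall": 0.948819, "support": 28409},
    1: {"precision": 0.453999, "recall": 0.759899, "support": 1591},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(row["support"] for row in report.values())

f1_scores = {
    label: f1(row["precision"], row["recall"]) for label, row in report.items()
}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: mean over classes weighted by support.
weighted_f1 = (
    sum(f1_scores[label] * report[label]["support"] for label in report) / total
)

# Accuracy equals the support-weighted recall.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

print(f1_scores, macro_f1, weighted_f1, accuracy)
```

The gap between the macro F1 (about 0.77) and the weighted F1 (about 0.95) reflects the class imbalance: the weighted figures and the accuracy are dominated by class 0, so the class 1 row is the more informative one for judging churn detection.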