You built your first model in the previous segment. Based on the summary statistics, you inferred that many of the variables might be insignificant, and hence you need to do some feature elimination. Since the number of features is large, let's first start with an automated feature selection technique (RFE) and then move to manual feature elimination (using p-values and VIFs) - this is exactly the same process you followed in linear regression.
So let's start off with the automatic feature selection technique - RFE.
Let's summarise the steps you just performed one by one. First, you imported the LogisticRegression class from sklearn and created a logistic regression object using:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
Then you ran RFE on the dataset using the same command as in linear regression. In this case, we chose to select 15 features first (15 is, of course, an arbitrary number). Note that in current versions of scikit-learn, the number of features must be passed as the keyword argument n_features_to_select:
from sklearn.feature_selection import RFE

rfe = RFE(logreg, n_features_to_select=15)  # running RFE with 15 variables as output
rfe = rfe.fit(X_train, y_train)
RFE selected 15 features for you; the output shown in the notebook lists each column along with whether RFE retained it.
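As a minimal sketch (assuming X_train is a pandas DataFrame), you can reproduce this output and store the selected columns - the 'col' variable used in the next step - as follows:

# Each column paired with RFE's keep/drop flag (support_) and its ranking (ranking_)
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

# Keep only the 15 columns RFE selected
col = X_train.columns[rfe.support_]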
You can see that RFE has eliminated certain features such as 'MonthlyCharges', 'Partner', 'Dependents', etc.
We decided to go ahead with this model. But since we're also interested in the statistics, we took the columns selected by RFE and used them to build a model with statsmodels:
import statsmodels.api as sm

X_train_sm = sm.add_constant(X_train[col])
logm2 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm2.fit()
Here, you use the GLM (Generalized Linear Models) method of the statsmodels library. 'Binomial()' in the 'family' argument tells statsmodels that it needs to fit a logit curve to binomial data, i.e. data in which the target has just two classes (here, 'Churn' and 'Non-Churn').
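Since the whole point of moving to statsmodels is to see these statistics, you can print the fitted model's summary; the p-values it reports are what you'd use for the manual feature elimination mentioned at the start:

# Coefficients, standard errors and p-values of the fitted logistic model
print(res.summary())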
Now, recall that the logistic regression curve gives you the probabilities of churning and not churning. You can get these probabilities by simply using the 'predict' function as shown in the notebook.
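As a quick sketch (using the variable names from the snippet above), the predicted churn probabilities on the training set are obtained with:

# Predicted probability of churn for every customer in the training set
y_train_pred = res.predict(X_train_sm)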
Since the logistic curve gives you just the probabilities and not the actual classification of 'Churn' and 'Non-Churn', you need to choose a threshold probability to classify customers. Here, we pick 0.5 as an arbitrary cutoff: if the probability of a particular customer churning is 0.5 or less, you'd classify them as 'Non-Churn', and if it's greater than 0.5, you'd classify them as 'Churn'. The choice of 0.5 is completely arbitrary at this stage; you'll learn how to find the optimal cutoff in 'Model Evaluation', but for now, we'll move forward with 0.5.
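A minimal sketch of applying this cutoff, assuming 'y_train_pred' holds the predicted probabilities from the previous step:

# 1 ('Churn') if the predicted probability exceeds 0.5, else 0 ('Non-Churn')
y_train_pred_class = (y_train_pred > 0.5).astype(int)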