In the last lecture, you learnt about segmentation, which is done primarily to increase the predictive power of a model. So far, though, you’ve only seen topics such as sample selection and segmentation, which concern the data used in the model building process.
Now, the next steps, as you may recall from the last session, are dummy variable creation, standardising scales of continuous variables, etc. These processes are generally referred to as variable transformation. Can other types of variable transformations be performed before building a logistic regression model?
Let's hear from Hindol about that.
From earlier sessions, you already know that categorical variables have to be transformed into dummies. Also, you were told that numeric variables have to be standardised, so that they all have the same scale. However, you could also convert numeric variables into dummy variables, using the techniques mentioned by Hindol in the video above.
Transforming variables into dummies has pros and cons. Creating dummies for categorical variables is straightforward: if a categorical variable has n levels, you can directly create n-1 new variables from it. For continuous variables, however, you first need some exploratory data analysis (EDA) to decide how to bin them.
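As a minimal sketch of both cases, the snippet below uses pandas on a small made-up data frame. The column names (`contract`, `age`), the bin edges and the band labels are assumptions chosen purely for illustration; in practice the bin edges would come out of your own EDA.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one continuous variable.
df = pd.DataFrame({
    "contract": ["monthly", "yearly", "two_year", "monthly", "yearly"],
    "age": [23, 37, 45, 61, 29],
})

# Categorical variable with n = 3 levels -> n - 1 = 2 dummy columns.
# drop_first=True drops the first level, which becomes the base category.
contract_dummies = pd.get_dummies(df["contract"], prefix="contract", drop_first=True)

# Continuous variable: bin it first (edges are illustrative assumptions),
# then create dummies from the resulting bands.
age_bins = [0, 30, 50, 120]
age_labels = ["young", "middle", "senior"]
df["age_band"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)
age_dummies = pd.get_dummies(df["age_band"], prefix="age", drop_first=True)

print(contract_dummies.columns.tolist())
print(age_dummies.columns.tolist())
```

Here `pd.cut` uses right-closed intervals by default, so an age of exactly 30 would fall in the "young" band.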
The major advantage offered by dummies, especially for continuous variables, is that they make the model more stable. In other words, small variations in a variable would not have a very big impact on a model built using dummies, but they could still have a sizeable impact on a model built using the continuous variables as is.
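The stability argument can be illustrated with a small sketch, again with made-up values and bin edges (all assumptions). Each value is shifted by a small amount, and the resulting bands are compared with the original ones.

```python
import pandas as pd

# Illustrative bin edges and labels (assumptions, not from the lecture).
bins = [0, 30, 50, 120]
labels = ["young", "middle", "senior"]

original = pd.Series([23, 37, 45, 61])
perturbed = original + 1  # a small variation in the raw variable

orig_band = pd.cut(original, bins=bins, labels=labels)
pert_band = pd.cut(perturbed, bins=bins, labels=labels)

# The raw values all changed, but every observation stays in the same band,
# so the dummy variables fed to the model would be identical.
unchanged = bool((orig_band.astype(str) == pert_band.astype(str)).all())
print(unchanged)
```

Note that this only holds for values that are not right next to a bin edge; an observation sitting exactly on a boundary could still switch bands after a small change.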
On the other hand, there are some major disadvantages. For example, if you convert a continuous variable to dummies, all the data is compressed into very few categories, which might result in data clumping: many distinct values end up in the same bin, and the model can no longer tell them apart.
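To see what this compression looks like, here is a short sketch with hypothetical income values and illustrative bin edges (both are assumptions). It compares the number of distinct values before and after binning.

```python
import pandas as pd

# Hypothetical incomes (in thousands); values and bin edges are assumptions.
income = pd.Series([12, 18, 25, 31, 38, 44, 52, 67, 85, 140])
bands = pd.cut(income, bins=[0, 50, 100, 200], labels=["low", "mid", "high"])

distinct_before = income.nunique()  # every raw value is different
distinct_after = bands.nunique()    # only the band labels remain
counts = bands.value_counts()       # how many observations fall in each band

print(distinct_before, distinct_after)
print(counts)
```

In this toy example, most observations land in the "low" band, and an income of 12 is treated exactly the same as an income of 44.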