So, creating dummy variables is one way of transforming variables. Let’s now move on to another technique commonly used for transforming variables: weight of evidence (WOE) analysis.
So, to summarise, you learnt three important things in this lecture:
Calculating WOE values for fine binning and coarse binning
The importance of WOE for fine binning and coarse binning
The usage of WOE transformation
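WOE can be calculated using the following equation:
WOE = ln(good in the bucket / total good) - ln(bad in the bucket / total bad)
Or, it can be expressed as:
WOE = ln(percentage of good in the bucket / percentage of bad in the bucket)
Here, the percentage of good is the share of all good customers that falls in the bucket, and likewise for bad.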
Once you have calculated the WOE values, note that they should follow an increasing or decreasing trend across bins. If the trend is not monotonic, you need to merge adjacent bins of that variable (coarse binning) and then calculate the WOE values again.
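The check-and-merge step can be sketched in a few lines of Python. This is only a minimal illustration, assuming the good-over-bad form of WOE and that every bin contains at least one good and one bad. The counts and the helper names (woe_per_bin, is_monotonic, merge_bins) are invented for this example and do not come from the lecture data.

```python
import math

def woe_per_bin(goods, bads):
    """WOE for each bin: ln(share of all goods in bin / share of all bads in bin)."""
    total_good = sum(goods)
    total_bad = sum(bads)
    return [math.log((g / total_good) / (b / total_bad))
            for g, b in zip(goods, bads)]

def is_monotonic(values):
    """True if the values only increase or only decrease from bin to bin."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

def merge_bins(counts, i):
    """Coarse binning step: combine bin i with its right-hand neighbour."""
    return counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]

# Hypothetical fine-binned counts (assumption, for illustration only).
goods = [100, 150, 140, 300]
bads = [80, 60, 70, 40]

fine_woe = woe_per_bin(goods, bads)
print("fine WOE:", [round(w, 3) for w in fine_woe])
print("monotonic?", is_monotonic(fine_woe))

# The trend breaks between the second and third bins, so merge them
# and recalculate the WOE values on the coarser bins.
coarse_goods = merge_bins(goods, 1)
coarse_bads = merge_bins(bads, 1)
coarse_woe = woe_per_bin(coarse_goods, coarse_bads)
print("coarse WOE:", [round(w, 3) for w in coarse_woe])
print("monotonic?", is_monotonic(coarse_woe))
```

Which pair of bins to merge is a judgment call. Here the two bins around the break in the trend are combined, but other merges may work equally well on real data.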
As mentioned in the lecture, there are two main advantages of WOE:
WOE helps you treat missing values logically for both types of variables, categorical and continuous. For example, in the credit card case, if you replace the continuous variable credit card utilisation with WOE values, every category mentioned above (0%-45%, 45%-60%, etc.) is replaced with its own specific WOE value. This includes the category "missing", which is also replaced with a WOE value.
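As a rough sketch of this idea, the snippet below treats missing utilisation values as a bin of their own and replaces every record with the WOE of its bin. The records, the "60%+" bin, the coding of 1 as good and 0 as bad, and the function names are all assumptions made for illustration. The code also assumes each bin holds at least one good and one bad record.

```python
import math

def assign_bin(utilisation):
    """Map a utilisation percentage (or None for missing) to a bin label."""
    if utilisation is None:
        return "missing"
    if utilisation < 45:
        return "0%-45%"
    if utilisation < 60:
        return "45%-60%"
    return "60%+"  # assumed top bin; the text only lists the first two ranges

def woe_table(records):
    """Build {bin label: WOE} from (utilisation, is_good) pairs."""
    good, bad = {}, {}
    for value, is_good in records:
        label = assign_bin(value)
        counts = good if is_good else bad
        counts[label] = counts.get(label, 0) + 1
    total_good = sum(good.values())
    total_bad = sum(bad.values())
    # Assumes every bin appears in both dictionaries.
    return {label: math.log((good[label] / total_good) / (bad[label] / total_bad))
            for label in good}

# Hypothetical records: (credit card utilisation in %, 1 = good, 0 = bad).
records = [
    (10, 1), (20, 1), (30, 1), (40, 0),
    (50, 1), (55, 0), (58, 1),
    (70, 0), (80, 0), (90, 1),
    (None, 1), (None, 0), (None, 0),
]

table = woe_table(records)
# Replace each raw value, including the missing ones, with its bin's WOE.
transformed = [table[assign_bin(value)] for value, _ in records]
print(table)
```

No imputation is needed in this sketch: the missing records simply receive the WOE of the "missing" bin, just like any other category.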
Let’s also understand the positives and negatives of WOE transformation from Hindol.
So, basically, the pros and cons of a WOE transformation are similar to those of dummy variables.
This is because when you use WOE values in your model, you are doing something similar to creating dummy variables: you are replacing a range of values with an indicative variable. The difference is that, instead of replacing it with an arbitrary 1 or 0, you are replacing it with a well-thought-out WOE value. Hence, the chances of undesired score clumping are much lower here.
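To make the comparison concrete, here is a small sketch that encodes the same bin both ways. The bin labels and WOE values are invented for illustration, and dropping the first bin as the dummy baseline is just one common convention, not something prescribed in the lecture.

```python
# Hypothetical bins and WOE values (assumptions, for illustration only).
woe_by_bin = {"0%-45%": 0.9, "45%-60%": 0.1, "60%+": -0.8, "missing": -0.4}
bins = list(woe_by_bin)

def dummy_encode(label):
    """One 0/1 indicator per bin, skipping the first bin as the baseline."""
    return [1 if label == b else 0 for b in bins[1:]]

def woe_encode(label):
    """A single numeric value that reflects the bin's good/bad balance."""
    return woe_by_bin[label]

for label in bins:
    print(label, dummy_encode(label), woe_encode(label))
```

Both encodings replace a range with a representative value. The dummy version needs several 0/1 columns, while the WOE version keeps one column whose values differ by bin.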
Let's now move on to IV (Information Value), which is a very important concept.
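So, information value can be calculated using the following expression:
IV = sum over all buckets of WOE * (good in the bucket / total good - bad in the bucket / total bad)
Or, it can be expressed as:
IV = sum over all buckets of WOE * (percentage of good in the bucket - percentage of bad in the bucket)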
It is an important indicator of predictive power.
Mainly, it helps you understand how the binning of variables should be done. The binning should be done such that the WOE trend across bins is monotonic, that is, either always increasing or always decreasing. In addition, the IV (information value) should be high.
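The trade-off between a monotonic trend and a high IV can be illustrated with a short sketch. It assumes the good-over-bad form of WOE and sums the per-bucket contributions to get the IV. The counts and the function name are hypothetical and are not taken from the attached data set.

```python
import math

def information_value(goods, bads):
    """IV = sum over bins of (share of goods - share of bads) * WOE."""
    total_good = sum(goods)
    total_bad = sum(bads)
    iv = 0.0
    for g, b in zip(goods, bads):
        pct_good = g / total_good
        pct_bad = b / total_bad
        woe = math.log(pct_good / pct_bad)  # assumes no bin has a zero count
        iv += (pct_good - pct_bad) * woe
    return iv

# Hypothetical fine bins (WOE trend is not monotonic for these counts).
fine_goods, fine_bads = [100, 150, 140, 300], [80, 60, 70, 40]
# The same data after merging the second and third bins (monotonic trend).
coarse_goods, coarse_bads = [100, 290, 300], [80, 130, 40]

iv_fine = information_value(fine_goods, fine_bads)
iv_coarse = information_value(coarse_goods, coarse_bads)
print("IV with fine bins:  ", round(iv_fine, 4))
print("IV with coarse bins:", round(iv_coarse, 4))
```

In this example, merging the bins costs a little IV in exchange for a monotonic WOE trend. When several merges give a monotonic trend, one reasonable choice is the merge that keeps the IV highest.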
Comprehension 1: WOE and Information Value Analysis
You are required to download the data set from below for answering the questions that follow:
In the attached file, there are three sheets. The first sheet contains three variables (Tenure, Second Contract and Churn) from the telecom data. The second sheet contains the distribution of the binned tenure variable. The third sheet contains the distribution of good and bad information for the contract variable.