So far, you have worked with numerical variables. But many times, you will have non-numeric variables in the data sets. These variables are also known as categorical variables. Obviously, these variables cannot be used directly in the model, as they are non-numeric.
When you have a categorical variable with, say, 'n' levels, the idea of dummy variable creation is to build 'n-1' variables, indicating the levels. For a variable, say, 'Relationship' with three levels, namely, 'Single', 'In a relationship', and 'Married', you would create a dummy table like the following:
Relationship Status | Single | In a Relationship | Married |
Single | 1 | 0 | 0 |
In a Relationship | 0 | 1 | 0 |
Married | 0 | 0 | 1 |
As you can see, you do not need three separate dummy variables. If you drop one of them, say, 'Single', the remaining two columns still identify all three levels.
Let's drop the dummy variable 'Single' from the columns and see what the table looks like:
Relationship Status | In a Relationship | Married |
Single | 0 | 0 |
In a Relationship | 1 | 0 |
Married | 0 | 1 |
If both the dummy variables, i.e., 'In a relationship' and 'Married', are equal to zero, it means that the person is single. If 'In a relationship' is denoted by 1 and 'Married' by 0, it means that the person is in a relationship. Finally, if 'In a relationship' is denoted by 0 and 'Married' by 1, it means that the person is married.
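The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `make_dummies` and the sample list are hypothetical and not part of the course material. In practice, a library function such as pandas `get_dummies` is commonly used for the same job.

```python
def make_dummies(values, drop_level):
    """Return one 0/1 column per level of `values`, except `drop_level`.

    Hypothetical helper for illustration: a variable with n levels
    yields n - 1 dummy columns.
    """
    # Collect the distinct levels in order of first appearance.
    levels = []
    for v in values:
        if v not in levels:
            levels.append(v)
    if drop_level not in levels:
        raise ValueError(f"{drop_level!r} is not a level of the variable")

    # Keep every level except the dropped one and build its 0/1 column.
    kept = [lvl for lvl in levels if lvl != drop_level]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in kept}


# Illustrative data: three levels, so two dummy columns remain.
relationship = ["Single", "In a relationship", "Married", "Single"]
dummies = make_dummies(relationship, drop_level="Single")

for status, in_rel, married in zip(
    relationship, dummies["In a relationship"], dummies["Married"]
):
    print(f"{status:<18} {in_rel} {married}")
```

Rows where both columns are 0 correspond to the dropped level, 'Single'.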
Before you move on to the next segment, there is one more concept to address: scaling the variables. Now that you have dummy variables in the picture, let's revisit the different aspects of scaling.
Note that scaling changes only the coefficients. The other outputs of the model, such as the t-statistics and p-values of the scaled variables, the F-statistic and R-squared, stay the same.
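A small check can make this concrete. The sketch below fits a one-variable least-squares line by hand, once on raw values and once after rescaling the predictor to the 0-1 range, and compares the results. The helper `simple_ols` and the numbers are made up for illustration and do not come from the course data set.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x; return (slope, r_squared, t_slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares give R-squared.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - sse / sst

    # Standard error of the slope and its t-statistic.
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, r_squared, slope / se_slope


# Illustrative (made-up) data: area as predictor, price as response.
area = [1200, 1500, 1700, 2100, 2500]
price = [30, 38, 41, 55, 60]

# Rescale the predictor linearly so that it runs from 0 to 1.
lo, hi = min(area), max(area)
area_scaled = [(a - lo) / (hi - lo) for a in area]

slope_raw, r2_raw, t_raw = simple_ols(area, price)
slope_scaled, r2_scaled, t_scaled = simple_ols(area_scaled, price)

print("slope:    ", slope_raw, "->", slope_scaled)
print("R-squared:", r2_raw, "->", r2_scaled)
print("t (slope):", t_raw, "->", t_scaled)
```

Only the slope changes (by a factor equal to the range of the predictor). R-squared and the slope's t-statistic agree up to floating-point rounding.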
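Two major methods are employed to scale the variables: standardisation and MinMax scaling. Standardisation rescales each variable so that it has mean 0 and standard deviation 1. It does not change the shape of the distribution, so the result is normally distributed only if the original data were. MinMax scaling, on the other hand, brings all the data into the range 0 to 1. The formulae used in the background for each of these methods are as given below:
Standardisation: x' = (x - mean(x)) / sd(x)
MinMax scaling: x' = (x - min(x)) / (max(x) - min(x))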
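Both methods are short enough to write out by hand, as in the sketch below. The function names and sample values are hypothetical. The sketch assumes the population standard deviation; using the sample standard deviation instead would give slightly different values.

```python
import statistics


def standardise(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD
    return [(v - mean) / sd for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative (made-up) values.
area = [1200, 1500, 1700, 2100, 2500]

area_std = standardise(area)
area_mm = min_max_scale(area)

print("standardised:", [round(v, 3) for v in area_std])
print("min-max:     ", [round(v, 3) for v in area_mm])
```

Both functions assume the values are not all identical; otherwise the denominator would be zero.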
Additional reading