Categorical variables: How to encode them and why?

A glance into the reality behind

Anirban Chakraborty
Analytics Vidhya

--


Abstract:

Categorical variables can be encoded either through ordinal (1, 2, 3, …) or one-hot (001, 010, 100, …) encoding schemes. For categorical variables with no meaningful ordering, e.g., (male, female, other), where no class is greater than another in any sense, ordinal encoding can cause problems, because the ordering inherent in (1, 2, 3) may find its way into the calculations. We will try to find out how real a problem this is, whether it affects feature and response variables in the same way, and what causes it. In the process we expect to gain some insight into the reality behind it. So fasten your belts!

Questions:


We will try to answer two specific questions to begin with -

  1. Suppose a response variable with 3 unordered classes is ordinally encoded (1, 2, 3). Does the model output differ as the class-number mapping is varied?
  2. Does a similar effect arise when categorical feature variables are ordinally encoded?

Question I — the case of response variables:

From now on the three classes will be called Red, Blue and Green, and they will be plotted in those colours. To begin with, they are encoded as Red-1, Blue-2 and Green-3.

Let us have a feature variable x distributed uniformly over the range (-1.0, 7.0). For any x, each of the Red, Blue and Green classes has some probability of occurring. Below is the probability distribution of the respective classes over the range of x.

class-wise probability distribution over x

We generate (X_train, Y_train) that follows this probability distribution. The training set is as below -

Training set data
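The exact class-wise probability curves and the data-generation code live in the linked notebook; the sketch below is only a minimal stand-in, with made-up piecewise-linear probabilities over the same range of x and the Red-1, Blue-2, Green-3 encoding, so the rest of the walk-through has something concrete to refer to.

```python
import numpy as np

rng = np.random.default_rng(0)

def class_probs(x):
    # Hypothetical class-wise probabilities over x in (-1, 7):
    # Red dominates on the left, Blue in the middle, Green on the right.
    # The actual curves used in the article are in the linked notebook.
    p_red = np.clip((3.0 - x) / 4.0, 0.0, 1.0)
    p_green = np.clip((x - 3.0) / 4.0, 0.0, 1.0)
    p_blue = 1.0 - p_red - p_green
    return np.stack([p_red, p_blue, p_green], axis=-1)

# Ordinal encoding: Red -> 1, Blue -> 2, Green -> 3
labels = np.array([1, 2, 3])

X_train = rng.uniform(-1.0, 7.0, size=500)
Y_train = np.array([rng.choice(labels, p=p) for p in class_probs(X_train)])
```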

We train a linear regression model on this data and predict (X_test, Y_testPredict). The following is the result we get -

Test set predictions
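Continuing the sketch above, fitting and predicting takes only a few lines with scikit-learn; the evenly spaced test grid X_test here is an assumption, standing in for whatever test set the notebook uses.

```python
from sklearn.linear_model import LinearRegression

# Treat the ordinal class labels as a numeric target and fit a plain linear regression.
model = LinearRegression().fit(X_train.reshape(-1, 1), Y_train)

# Predict on an evenly spaced test grid covering the same range of x.
X_test = np.linspace(-1.0, 7.0, 200)
Y_test_predict = model.predict(X_test.reshape(-1, 1))
```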

So we see that the distribution is captured very well by the linear regression model. Now, keeping X_train, X_test and the class-wise probability distributions the same, we change only the encoding scheme — Red-2, Blue-1 and Green-3.

With this change introduced the training set (X_train, Y_train) now looks as below -

Training set with modified encoding R-2, B-1, G-3

Applying linear regression, the test set (X_test, Y_testPredict) now looks as below -

Test set with modified encoding R-2, B-1, G-3
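In code, the only change is a relabelling of Y_train before refitting — a sketch continuing from the ones above:

```python
# Same X_train, same underlying classes; only the mapping changes:
# Red: 1 -> 2, Blue: 2 -> 1, Green stays 3.
remap = {1: 2, 2: 1, 3: 3}
Y_train_remapped = np.array([remap[y] for y in Y_train])

model_remapped = LinearRegression().fit(X_train.reshape(-1, 1), Y_train_remapped)
Y_test_predict_remapped = model_remapped.predict(X_test.reshape(-1, 1))
```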

Now the model produces very poor results, drastically different from those under the previous encoding scheme.

Question II — the case of feature variables:

Instead of conducting a similar study for feature variables, we will make a very simple observation that leads us to the key insight (which some of you may have already guessed from the figures above).

Suppose we have a categorical feature variable with class labels Red, Blue and Green, whose corresponding Y values are 4, 5 and 6. The class-label-to-number mapping is Red-1, Blue-2 and Green-3. The overall data set looks like the following -

A simple data-set

This data when plotted —

Coding scheme R-1, B-2, G-3 (look at the x-axis)

The figure shows that linear regression would do very well here: the points (1, 4), (2, 5) and (3, 6) lie exactly on the line y = x + 3.
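A quick check with scikit-learn (not from the original notebook, just the three points from the table above) confirms the perfect fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data set from the text, with the coding Red=1, Blue=2, Green=3.
x = np.array([[1], [2], [3]])   # encoded categorical feature
y = np.array([4, 5, 6])

fit = LinearRegression().fit(x, y)
print(fit.coef_[0], fit.intercept_)  # slope ≈ 1, intercept ≈ 3  ->  y = x + 3
print(fit.score(x, y))               # R^2 ≈ 1.0, a perfect fit
```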

With a different coding scheme of Red-2, Blue-1 and Green-3 the data set will be —

Coding scheme R-2, B-1, G-3

And its plot will be like —

Coding scheme R-2, B-1, G-3

A linear case has now become non-linear, and clearly linear regression will not be able to capture it. The same holds for both feature and response variables — changing the coding convention can introduce non-linearities. Look at the (X_train, Y_train) plot for the R-2, B-1, G-3 coding scheme once more to convince yourself that the same thing happened for the response variable.
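The same three points under the new coding make this concrete — another small check (again not from the original notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same Y values, but the coding is now Red=2, Blue=1, Green=3,
# so the points become (2, 4), (1, 5), (3, 6) and are no longer collinear.
x = np.array([[2], [1], [3]])
y = np.array([4, 5, 6])

fit = LinearRegression().fit(x, y)
print(fit.score(x, y))   # R^2 = 0.25: no straight line passes through all three points
```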

Conclusion — pièce de résistance

Hence, under ordinal encoding, a relationship that is linear with one coding scheme can become non-linear with a different one.

Now to capture non-linearities we need our model to be more flexible. How is this flexibility achieved?

  1. Either the model itself is non-linear, e.g., QDA, polynomial regression, etc.
  2. Or we add more features to a linear model — this is what one-hot encoding a categorical variable does for linear regression (a non-linearity in a lower dimension can become relatively more linear in higher dimensions; see the sketch below). This is also what adding polynomial features to linear regression does.
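To see item 2 in action, here is a small sketch (again not from the original notebook) that one-hot encodes the toy feature with scikit-learn's OneHotEncoder; because the dummy columns carry no ordering, the fit is perfect no matter how the classes are arranged.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical feature instead of numbering the classes.
# Each dummy column only says "is this class present", so no artificial
# ordering sneaks in and the class-to-column mapping cannot affect the fit.
colours = np.array([["Red"], ["Blue"], ["Green"]])
y = np.array([4, 5, 6])

X = OneHotEncoder().fit_transform(colours).toarray()
print(LinearRegression().fit(X, y).score(X, y))   # R^2 = 1.0, regardless of class order
```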

Hence, when using linear models, encoding categorical variables ordinally might lead to issues (if you are unlucky enough to choose a coding convention that introduces a bad non-linearity into an otherwise well-behaved linear problem). On the other hand, one-hot encoding risks exploding the number of features, which can lead to overfitting or hurt model interpretability. So with linear models there is a trade-off between the two; which scheme is better depends on the specific case at hand and can be decided only through experimental runs.

For non-linear models, ordinal encoding might pass, though, because they have the inherent capability to capture non-linearities. Even that should be verified through experimental runs, because I don't think there is a general analytical proof covering all the different scenarios.

So that concludes our study with a possibly valuable insight at hand. I hope it was a fruitful journey. Thanks and bye, till next time.

PS: The code notebook (containing all the plots and the methodology used to generate the datasets) can be found at the following GitHub link.
