Manipulating class weights and decision threshold

Anirban Chakraborty · Published in Analytics Vidhya · 9 min read · Jun 29, 2021


Comparing two methods to keep your balance in classification


Context:

Handling dataset imbalance in classification problems is a very hot topic in the machine learning (ML) community. Here we refer to an imbalance in the proportion of samples belonging to each response class. While going through ISLR by James, Witten, Hastie and Tibshirani, I came across this point; the approach suggested there for handling the adverse effects of such imbalance is to manipulate the threshold probability that decides the prediction decision boundary. The documentation of Logistic Regression in sklearn, however, introduced me to the concept of class weights, which seems to serve the same purpose. From here arises my query:

Does manipulating class weights lead to the same result as manipulating the threshold limit?

We will attempt to move towards a resolution of this query by conducting some simulated empirical studies.

I. Abstract:

An initial note — we will keep our study limited to Logistic Regression so as to focus on the essential nature of the question itself.

There are two types of errors in a classification problem: false negatives and false positives. Manipulating the threshold limit determines the false negative and false positive rates.

We guess that the excessive false negative rate, as compared to the false positive rate, obtained when using 0.5 as the threshold is due to the imbalance in the dataset: too few y=1 entries compared to y=0 entries.

This seems logical because y=0 entries have greater influence over the classifier than y=1 entries, so the error rate in the y=0 class is lower than in the y=1 class. First, we will put this assumption to the test.

If the assumption is valid, we will check whether manipulating class weights achieves the same end as manipulating the threshold limit. Hence there are two questions to answer:

  • Does the false negative rate indeed depend on dataset imbalance?
  • If the answer to the above is yes, does manipulating the threshold limit achieve the same end as manipulating class weights?

II. Preparing the simulated training data

We first decide the features of the training dataset.

Features of training data:

  1. There will be one feature, credit balance: a number between 0 and 20,000, and a response, the default status, where 1 denotes default=yes and 0 denotes default=no. The training set size will be 1,000.
  2. We will take a probability distribution P(y=1|x) such that people with a higher credit balance have a higher chance of default. This probability distribution will be used to generate the training dataset (X, Y).
  3. In the training set, the proportion of positive class entries is (entries with y=1)/(number of samples); we will control this proportion directly.

We use the logistic function, P(y=1|x) = e^(b_0 + b_1·x) / (1 + e^(b_0 + b_1·x)), with b_0 and b_1 chosen to meet the above requirements, to generate y for any given x. The following is the probability distribution curve obtained:

Figure 1

We also ensure a given proportion of y=1 entries in the training set.
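A minimal sketch of such a simulation in Python follows. The coefficient values b_0 and b_1, the helper names and the rejection-sampling trick used to enforce the class proportion are illustrative assumptions on my part; the actual code lives in the GitHub repository linked at the end.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def default_probability(balance, b0=-10.0, b1=7e-4):
    """P(y=1 | x) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); rises with balance."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * balance)))

def simulate_training_set(n=1000, positive_proportion=0.10):
    """Sample (balance, default) pairs, rejecting draws once a class
    has filled its quota so the positive proportion is enforced."""
    balances, defaults = [], []
    n_pos_target = int(n * positive_proportion)
    n_neg_target = n - n_pos_target
    while len(balances) < n:
        x = rng.uniform(0, 20_000)
        y = int(rng.random() < default_probability(x))
        n_pos = sum(defaults)
        n_neg = len(defaults) - n_pos
        if (y == 1 and n_pos < n_pos_target) or (y == 0 and n_neg < n_neg_target):
            balances.append(x)
            defaults.append(y)
    return pd.DataFrame({"balance": balances, "default": defaults})

train = simulate_training_set()
print(train.head())                 # cf. Table 1
print(train["default"].mean())      # ≈ 0.10
```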

The following is the credit balance distribution for the default=yes and default=no entries in the training set, generated with 10% default=yes entries.

Figure 2: Credit balance distribution for each default status

And the training set thus generated is (top 5 entries) —

Table 1: Training set (top 5 entries)

This is a classic case for Logistic Regression.

III. Finding the false negative rate

Applying Logistic Regression to the training set above, the following predictions are obtained:

Table 2: Prediction obtained on training set

The associated confusion matrix is below —

Table 3: Confusion matrix with each row normalized

From the confusion matrix, the false negative and true positive rates are 44% and 56% respectively. The false positive rate is 2.22% and overall error rate (not shown here) is 6.4%.
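As a concrete illustration of this step, here is a minimal sklearn sketch, assuming the simulated train DataFrame from the earlier snippet; the exact numbers depend on the random draw and will differ slightly from Table 3.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X = train[["balance"]].values       # single feature
y = train["default"].values

clf = LogisticRegression()
clf.fit(X, y)
y_pred = clf.predict(X)             # default threshold of 0.5 on P(y=1|x)

# Row-normalised confusion matrix: each true class sums to 1 (cf. Table 3).
cm = confusion_matrix(y, y_pred, normalize="true")
fn_rate, tp_rate = cm[1, 0], cm[1, 1]
fp_rate = cm[0, 1]
overall_error = (y_pred != y).mean()
print(f"FNR={fn_rate:.1%}  TPR={tp_rate:.1%}  FPR={fp_rate:.1%}  error={overall_error:.1%}")
```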

We will now study the false negative, false positive and overall error rates against the training dataset imbalance (the default=yes proportion in the training set). This study will confirm whether the false negative rate indeed depends on dataset imbalance.

IV. Error rates vs imbalance in training dataset

We vary the default=yes proportion from 1% to 50% in 100 steps and, for each proportion value, plot the false negative, false positive and overall error rates.
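The sweep can be sketched as follows, reusing the hypothetical simulate_training_set helper from the earlier snippet; each step refits the model on a freshly simulated set and records the three rates.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

proportions = np.linspace(0.01, 0.50, 100)   # default=yes proportion grid
fn_rates, fp_rates, err_rates = [], [], []

for p in proportions:
    data = simulate_training_set(n=1000, positive_proportion=p)
    X, y = data[["balance"]].values, data["default"].values
    y_pred = LogisticRegression().fit(X, y).predict(X)
    cm = confusion_matrix(y, y_pred, normalize="true")
    fn_rates.append(cm[1, 0])                # false negative rate
    fp_rates.append(cm[0, 1])                # false positive rate
    err_rates.append((y_pred != y).mean())   # overall error rate

plt.scatter(proportions, fn_rates, c="green", s=10, label="false negative rate")
plt.scatter(proportions, fp_rates, c="red", s=10, label="false positive rate")
plt.scatter(proportions, err_rates, c="blue", s=10, label="overall error rate")
plt.xlabel("default=yes proportion")
plt.ylabel("rate")
plt.legend()
plt.show()
```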

The below plot is obtained —

Figure 3: Error rates vs default=yes proportion

As the training dataset goes from imbalanced (lower y=1 proportion) to balanced, the false negative rate decreases (green dots). This is logical, as the model becomes better 'learned' in the y=1 domain and hence makes fewer errors in that region. Likewise, the false positive rate increases (red dots), as the model gets comparatively less scope to learn the y=0 domain.

However, why does the overall error rate, which denotes the model's overall 'learning shortfall', increase? It could equally have stayed the same or decreased.

To answer this question we will take a closer look at the original and predicted decision boundaries, i.e., x values for which P(y=1|x)=0.5.

IV a. Why does the overall error rate increase with increasing balance?

The original decision boundary is governed by the distribution used to simulate the training dataset. The prediction boundary is obtained from the learned-model.
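Since there is a single feature, both boundaries have a closed form: P(y=1|x) = 0.5 exactly when b_0 + b_1·x = 0, i.e., x = -b_0/b_1. A small sketch, assuming the hypothetical simulating coefficients above and the clf fitted earlier on the 10% training set:

```python
# The decision boundary is the x where b0 + b1*x = 0, i.e. x = -b0/b1.
b0_true, b1_true = -10.0, 7e-4              # hypothetical simulating coefficients
original_boundary = -b0_true / b1_true      # fixed throughout the study

b0_hat = clf.intercept_[0]                  # learned intercept
b1_hat = clf.coef_[0, 0]                    # learned slope
predicted_boundary = -b0_hat / b1_hat       # moves with the class balance
print(f"original: {original_boundary:.0f}, predicted: {predicted_boundary:.0f}")
```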

We take three training sets with 10, 30 and 50 percent default=yes proportion respectively, moving from imbalance to balance and study the original and predicted boundaries.

Below is the plot we obtain —

Figure 4: Default=yes 10%
Figure 5: Default=yes 30%
Figure 6: Default=yes 50%

The three rows of figures correspond to the 10, 30 and 50 percent default=yes proportions in the data, respectively. The left-hand plots show the original data for which default=yes (in green). The right-hand plots show the original data for which default=no (in red). In each figure, the black vertical line shows the decision boundary governed by the simulating function; it stays the same throughout. The blue vertical line shows the predicted decision boundary.

First, we go through the 3 left-hand plots, where only default=yes cases are shown.

At the top the default=yes proportion is 10%.

  • The cases to the left of the original decision boundary are the ones where the probability of default, according to the governing function, is less than 50%; yet a default happens. That's nature's way of throwing in unavoidable random errors.
  • Had the prediction boundary been exactly the same as the original boundary, the model would have learned nature exactly; the best possible scenario. Even then, all data points to the left of the prediction boundary would be predicted as default=no yet originally have default=yes; these are the false negative errors. When the prediction boundary is the same as the original boundary, the only false negative errors are those caused by the unavoidable randomness of nature.
  • However, the prediction boundary lies to the right of the original boundary because the training set has a low number of default=yes samples. To motivate this point: had there been no default=yes samples, the prediction boundary would have been at the extreme right, identifying all samples as default=no.
  • This is the source of the higher-than-usual false negative rate: the prediction boundary lies to the right of the original boundary due to the low proportion of default=yes (positive) cases.

Below it in the second row, default=yes proportion is 30%.

  • As the default=yes sample proportion increases, the prediction boundary moves left. To motivate this movement: if all training samples were default=yes, the prediction boundary would have been at the extreme left, predicting default=yes for all cases.
  • As the prediction boundary moves to the left, the false negative errors decrease. This tends to bring the overall error down.

At the bottom, where default=yes proportion is 50% -

  • The prediction boundary moves even more left further decreasing the false negative errors.

Next, we go through the 3 right-hand figures where only default=no cases are shown.

At the top the default=yes proportion is 10%.

  • The cases to the right of the original decision boundary are the ones where probability for default is more than 50%; yet a default does not happen. Here this is the unavoidable random error.
  • The prediction boundary is to the right of the original boundary. Cases to the right of the prediction boundary are the ones where default=yes or y=1 is predicted, yet originally default=no or y=0. These are the false positive errors.

Below it in the second row, default=yes proportion is 30%.

  • As the prediction boundary moves to the left the false positive errors increase. This would try to bring the overall error up.

At the bottom, where default=yes proportion is 50% -

  • The prediction boundary moves even more left further increasing the false positive errors.

With the prediction boundary moving left, the false negative error goes down while the false positive error goes up. But why does the overall error go up? Here lies a very tricky point.

  • Look at the left-hand plots in figures 4-6. As the blue line moves left, the number of cases to its left, which constitute the false negative errors, decreases. However, when the default=yes proportion is low, the density of these green points is low (few y=1 samples). Hence, even with an appreciable left movement of the blue line, the decrease in the number of points to its left is small.
  • Look at the right-hand figures now. As the blue line moves left, the number of cases to its right increases, i.e., the false positive error increases. However, the density of the red dots is higher (many y=0 samples). Hence the same left movement of the blue line can lead to a relatively larger increase in the number of cases to its right.
  • Hence, in increasing the default=yes proportion from 10 to 30 percent, the increase in false positives normally trumps the decrease in false negatives.
  • However, starting from a more balanced scenario, say default=yes at 30%, the leftward movement of the blue line might not give a stark difference between the decrease in false negatives and the increase in false positives. Thus, moving from 30 to 50 percent, the overall error rate might be relatively more constant (the short counting sketch after this list makes this bookkeeping concrete).
  • Check figure 3 now. The overall error rate initially rises then gradually stabilizes beyond 30% positive proportion.
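As a sanity check on the counting argument, here is a tiny sketch that does the bookkeeping directly, assuming the train DataFrame and predicted_boundary from the earlier snippets: with one increasing feature, the false negatives are exactly the y=1 points left of the predicted boundary and the false positives are the y=0 points to its right.

```python
balance = train["balance"].values
default = train["default"].values

# y=1 points left of the predicted boundary are predicted default=no: false negatives.
false_negatives = ((default == 1) & (balance < predicted_boundary)).sum()
# y=0 points right of the predicted boundary are predicted default=yes: false positives.
false_positives = ((default == 0) & (balance >= predicted_boundary)).sum()
print(false_negatives, false_positives)
```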

That concludes the first part of our study: does the false negative rate depend on dataset imbalance? The answer is yes, it does.

V. Do threshold limit manipulation and class-weights achieve the same end?

Towards this end, given an imbalanced training set, we first draw an ROC curve, which gives the false positive vs true positive rates at various threshold levels. Then classifiers with different class weights are fit to the same data, and their false positive vs true positive rates are plotted on the same figure.
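A minimal sketch of this comparison, assuming the hypothetical simulated data from earlier and an in-sample evaluation: a single unweighted fit supplies the ROC curve via a threshold sweep, while separate refits with an increasing positive-class weight each supply one (false positive rate, true positive rate) point.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, confusion_matrix

# Imbalanced training set: 10% default=yes (hypothetical helper from earlier).
data = simulate_training_set(n=1000, positive_proportion=0.10)
X, y = data[["balance"]].values, data["default"].values

# ROC curve: sweep the threshold on the predicted probabilities of one fit.
base = LogisticRegression().fit(X, y)
fpr, tpr, _ = roc_curve(y, base.predict_proba(X)[:, 1])
plt.plot(fpr, tpr, label="ROC (threshold sweep)")

# One refit per positive-class weight; each yields a single (FPR, TPR) point.
weights = np.linspace(1.0, 5.0, 9)
fprs, tprs = [], []
for w in weights:
    clf_w = LogisticRegression(class_weight={0: 1.0, 1: float(w)}).fit(X, y)
    cm = confusion_matrix(y, clf_w.predict(X), normalize="true")
    fprs.append(cm[0, 1])
    tprs.append(cm[1, 1])

points = plt.scatter(fprs, tprs, c=weights, cmap="viridis")
plt.colorbar(points, label="positive class weight")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.legend()
plt.show()
```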

Below is the figure obtained —

Figure 7: ROC (threshold) and class-weights

The blue curve is the ROC curve. The colour bar denotes the class weight of the positive (default=yes) class; the class weights vary from 5.0 down to 1.0 (which amounts to not using class weights at all). That the scatter points all lie on the ROC curve indicates that manipulating the class weights is equivalent to manipulating the threshold. The red dot is the performance of the classifier with class_weight='balanced', i.e., a class weight of 5.0 for the positive class. We see that without any class weights, i.e., a class weight of 1.0, the true positive rate is poorest.
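For reference, sklearn's class_weight='balanced' option computes each weight as n_samples / (n_classes * n_samples_in_class); with 1,000 samples of which 100 are positive, that gives 1000 / (2 * 100) = 5.0 for the positive class, matching the red dot. A tiny check, assuming the y array from the sketch above:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights: n_samples / (n_classes * n_samples_in_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], np.round(weights, 3))))   # e.g. {0: 0.556, 1: 5.0}
```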

VI. Conclusion:

So it seems we have two options at our disposal with imbalanced sets.

  1. balance via class-weights
  2. choose a suitable threshold

I have some doubts regarding the first option though. Balancing via class weights gives extra weight to the few positive response samples. If those samples are indeed representative of the real-world positive response class, that is great. However, if those samples are corrupted, say due to incorrect measurements, then putting extra weight on them can adversely affect the learning outcome. This challenge arguably does not exist for threshold manipulation; there it is a human decision how much overall error one is willing to accept in order to achieve a low false negative rate. True, this line of reasoning needs further empirical study, based on test set performance, to be established.

PS:

The code used to generate the above plots can be found in my GitHub repository.
