The instability in logistic regression

Anirban Chakraborty · Published in Analytics Vidhya · Jul 4, 2021


The easier the problem, the poorer the solution


The predictions from logistic regression can be surprisingly unstable across training sets obtained from the same population. To introduce the issue, let me first describe the context in which I faced it.

1. Context:

I was performing logistic regression on a dataset: the response was credit default status (yes/no) for a given credit balance (varying from 0 to 20,000). The training sets were generated (simulated) from a governing logistic function. I noticed that, for the same simulating function, the fit scores for the training sets generated in different attempts varied between 50% and 87%. This was the source of the question: is logistic regression intrinsically unstable?

2. Aim:


We will try to understand why this instability in fit scores occurs for training sets generated from the same simulation function. A hint was found in 'Introduction to Statistical Learning' (James, Witten, Hastie, Tibshirani), section 4.4, page 138 -

When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

We will try to verify and understand the nature of this instability via empirical studies and then see if linear discriminant analysis, as claimed above, solves the problem.

3. Simulation model:

Below is the probability curve used to generate the training set, together with a sample training set -

Figure 1: Simulation model and sample training set
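As a rough illustration, a training set like this can be simulated in a few lines. The coefficients b0 and b1 below are illustrative assumptions, not the exact values behind Figure 1 (the actual code is in the repository linked at the end):

```python
import numpy as np

rng = np.random.default_rng(0)

# Governing logistic function: P(default = yes | balance).
# b0 and b1 are illustrative; the steeper the curve, the sharper
# the transition between the two classes.
b0, b1 = -10.0, 0.002

def p_default(balance):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * balance)))

# One simulated training set: balances in [0, 20000], Bernoulli labels.
n = 1000
balance = rng.uniform(0, 20000, size=n)
default = rng.binomial(1, p_default(balance))  # 1 = yes, 0 = no
```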

The separation of the two response classes (default=yes and default=no) in terms of the feature (credit balance) is displayed below —

Figure 2: Separation of response classes

4. Instability vs class separation:

We prepare three training sets as above, each with a lower class separation than the last. The three decreasing separation levels are displayed below.

Figure 3: Decreasing class separation

For each of the above three class separation levels -

  • we generate 20 training sets,
  • apply logistic regression to each and record the fit scores,
  • thus we have 20 fit scores for each level of class separation. How unstable these fit scores are as the class separation level changes is what interests us (a sketch of this loop follows Figure 4).

Figure 4: Three separation levels with variability of fit scores
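A minimal sketch of this loop, assuming scikit-learn's LogisticRegression and taking the model's training accuracy as the 'fit score' (the simulator restates the governing function from the earlier snippet):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate(b0, b1, n=1000):
    """One training set drawn from the governing logistic function."""
    x = rng.uniform(0, 20000, size=n)
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
    return x.reshape(-1, 1), rng.binomial(1, p)

def fit_scores(b0, b1, n_sets=20):
    """Fit scores of logistic regression over 20 training sets
    simulated from the same governing function."""
    scores = []
    for _ in range(n_sets):
        X, y = simulate(b0, b1)
        model = LogisticRegression().fit(X, y)
        scores.append(model.score(X, y))  # training accuracy
    return scores
```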

The separation values in the above plots are obtained as follows -

  • There is a credit balance, x_high, for which the probability of default=yes is 0.9, as per the governing probability function (shown in Figure 1).
  • Similarly, there is an x_low for which the probability of default=yes is 0.1.
  • (x_high−x_low) is the separation value, a good metric to represent the separation of the classes.
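For a standard logistic curve p(x) = 1/(1 + exp(-(b0 + b1*x))), both points can be found by inverting the curve, and the separation value turns out to depend only on the slope b1. A small sketch, assuming that form:

```python
import numpy as np

def separation(b0, b1):
    """x_high - x_low for p(x) = 1 / (1 + exp(-(b0 + b1*x)))."""
    logit = lambda p: np.log(p / (1 - p))
    x_high = (logit(0.9) - b0) / b1  # p(x_high) = 0.9
    x_low = (logit(0.1) - b0) / b1   # p(x_low)  = 0.1
    return x_high - x_low            # = 2*ln(9)/b1, independent of b0
```

So varying the slope b1 alone is enough to sweep the separation value, which is exactly what the experiment below needs.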

From the above three plots it becomes clear how the instability of the fit scores decreases as the class separation decreases. However, a test across three separation values is not robust enough. We need to run the test over many more separation levels to verify that the variability of the fit scores indeed comes down as the separation value decreases. To that end, we need a single metric to capture the instability of the fit scores at a given separation value. The variance of the 20 fit scores at any given separation value will serve as that measure of instability. We will study this variance against the separation value.
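A sketch of that sweep, reusing the fit_scores and separation helpers from the snippets above (the grid of slopes is an illustrative assumption, not the grid behind Table 1):

```python
import numpy as np

# Sweep the slope b1; b0 is chosen so the p = 0.5 point stays at
# balance 10000 while the separation value (x_high - x_low) varies.
for b1 in np.linspace(0.0002, 0.005, 25):
    sep = separation(-b1 * 10000, b1)
    scores = fit_scores(b0=-b1 * 10000, b1=b1, n_sets=20)
    print(f"separation {sep:8.0f}: variance of fit scores {np.var(scores):.5f}")
```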

The data that we thus generate is as follows —

Table 1: Fit scores for given separation

The column headers above are the separation values, and each column contains the 20 fit scores obtained at that separation value.

When the variance of the fit scores is plotted against the separation level, we get the figure below -

Figure 5: Separation vs fit score variability

The above plot shows that below a separation value of 5000 (the blue vertical line), the variance vanishes altogether (note that the x-axis decreases towards the right). True, the fit scores are very poor at that end: 50% (as evident from the right-hand columns of the table). But they are 'consistently' poor (which is good!), and that consistency is what we are concerned with right now. Note that, as the separation was lowered, the scale of the x-values remained the same, O(10³). Hence it seems that it is not the scale of the feature values but the separation that causes the instability in the fit scores.

5. Remedy — linear discriminant analysis:

Linear discriminant analysis was suggested as the solution for this problem in the ‘Aim’ section of this piece. So we will now see whether or not LDA suffers from a similar instability.
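In the experiment above, this is a one-line swap of the classifier, assuming scikit-learn's LinearDiscriminantAnalysis and the simulate helper sketched earlier:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_fit_scores(b0, b1, n_sets=20):
    """Same experiment as fit_scores, with LDA in place of
    logistic regression."""
    scores = []
    for _ in range(n_sets):
        X, y = simulate(b0, b1)
        scores.append(LinearDiscriminantAnalysis().fit(X, y).score(X, y))
    return scores
```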

When the same data of fit scores variance against separation was generated for linear discriminant analysis, the following plot was obtained —

Figure 6: Separation vs fit score variability for LDA

It can be seen that linear discriminant analysis does not suffer from the variability issue. One thing worth mentioning: the fit scores for LDA were both consistent and good. However, it is the 'consistency' we have focused on here. Why logistic regression settles for a poor fit at low separation values while LDA fits better is a question for another time.

6. Conclusion:

Hence, as claimed in ISLR, logistic regression indeed becomes unstable for well-separated classes.

It’s easy to classify well-separated classes. However, the easier the situation gets, the more erratically logistic regression behaves. Hence the subtitle of this piece. It seems logistic regression does very well in classifying different breeds of dogs but does poorly when classifying dogs and cats.

LDA does not suffer from this problem. Hence, while using logistic regression in such scenarios, we have two approaches at our disposal -

  • Scale the features; that might reduce the separation. (Whether it will also make the fit better, as a general rule, needs further investigation. In this case one can verify that it actually does; see my stackoverflow question. A sketch follows this list.)
  • Or use LDA.
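A sketch of the first option, assuming scikit-learn's StandardScaler and the simulated X, y from the earlier snippets (whether scaling also improves the fit in general is the open question noted above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the balance feature before the logistic fit.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))  # fit score after scaling
```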

PS:

The code related to this presentation can be found in my github repository. The repository contains other similar questions and answers in the field of statistical learning methodologies.
