Lack of Fit in Ordinal Regression -- Analysis/Alternatives?

In summary, the conversation revolves around a binary logistic regression of a response Y on three continuous numerical variables, A, B, and C. Separation occurs in the binary fits, and after the response is recoded into three ordered levels, the ordinal regression shows a significant lack of fit on the Chi-squared test. Because the covariates are continuous, the goodness-of-fit test itself may be unreliable. A Likert scale is tried to improve the fit, but with no success. The conversation also delves into the concept of proportionality and its application in analyzing the data.
  • #1
WWGD
Hi All,
I ran a binary logistic regression of Y on three different numerical variables A, B, C respectively. I am having an issue of separation with all of them, meaning that there are values Ao, Bo, Co for each of A, B, C (different values for each, of course) so that for ## A>Ao, B>Bo, C>Co ## all the responses are successes (I guess this forces the slope estimate to diverge to infinity, so that the curve can accommodate the abrupt jump from 0 to 1). Then I increased the number of response levels to three: high, medium and low, to use an ordinal regression. But now I have a significant lack of fit, with p --> 0 on the Chi-squared test. How does one interpret lack-of-fit issues with a logistic regression? I know that a lack of fit in a simple linear regression means that the data is not linear, but what does it mean for a logistic one? Does it mean the (log of) the data is not distributed like an S-curve ##e^L/(1+e^L)## (##L=\beta_0+ \beta_1 x+...##)? If so, are there any standard, or any, alternatives (e.g. for a distribution for the data)? Any ideas?
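A minimal sketch of why separation prevents a finite slope estimate, assuming made-up, perfectly separated data and a logistic curve centred at a hypothetical threshold of 2.5. The Bernoulli log-likelihood keeps increasing toward 0 as the slope grows in magnitude, so no finite slope maximizes it.

```python
import numpy as np

def log_likelihood(slope, x, y, threshold):
    """Bernoulli log-likelihood of a logistic curve centred at `threshold`."""
    eta = slope * (x - threshold)
    # log p = -log(1 + exp(-eta)) and log(1 - p) = -log(1 + exp(eta))
    terms = -y * np.logaddexp(0.0, -eta) - (1.0 - y) * np.logaddexp(0.0, eta)
    return float(np.sum(terms))

# Hypothetical data: every x above 2.5 is a success, every x below is a failure.
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
y = (x > 2.5).astype(float)

slopes = [1.0, 10.0, 100.0]
values = [log_likelihood(b, x, y, threshold=2.5) for b in slopes]
for b, ll in zip(slopes, values):
    print(f"slope={b:6.1f}  log-likelihood={ll:.6g}")
```

The printed log-likelihoods rise with each larger slope, which is the numerical signature of separation that fitting software usually reports as non-convergence or very large standard errors.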
 
  • #2
What are your covariates? What is the nature of the covariates? Are they continuous? categorical?
 
  • #3
They are all continuous, thanks.
 
  • #4
WWGD said:
They are all continuous, thanks.
That could be cause for problems in your hypothesis tests then. I don't know which test you used for lack of fit, but usually they don't work for continuous covariates.
 
  • #5
micromass said:
That could be cause for problems in your hypothesis tests then. I don't know which test you used for lack of fit, but usually they don't work for continuous covariates.
No, I had no problem with the Chi-Squared, which AFAIK does not require discrete/categorical variables. I just got a pretty low p-value.
 
  • #6
WWGD said:
No, I had no problem with the Chi-Squared, which AFAIK does not require discrete/categorical variables. I just got a pretty low p-value.

I don't understand. What is chi-squared? There are many chi-square tests in regression.
 
  • #7
micromass said:
I don't understand. What is chi-squared? There are many chi-square tests in regression.
It is, I believe, the standard goodness-of-fit statistic, ##\sum (\text{observed} - \text{expected})^2 / \text{expected}##. I am not aware of any other Chi-square goodness-of-fit tests.
 
  • #8
Are you talking about the Pearson residuals? In either case, that chi-square test in your post doesn't always work for continuous variables.
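For reference, a small sketch of the Pearson chi-square statistic for ungrouped 0/1 outcomes, computed as the sum of squared Pearson residuals. The outcomes and fitted probabilities are made up for illustration and do not come from the thread's data. With continuous covariates, nearly every observation has its own covariate pattern, so each "cell" contains a single trial, which is one reason the chi-square reference distribution may not apply.

```python
import numpy as np

def pearson_chi_square(y, p_hat):
    """Sum of squared Pearson residuals for 0/1 outcomes and fitted probabilities."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    # Pearson residual: (observed - expected) / sqrt(variance of a Bernoulli trial)
    residuals = (y - p_hat) / np.sqrt(p_hat * (1.0 - p_hat))
    return float(np.sum(residuals ** 2))

# Hypothetical outcomes and fitted probabilities, one trial per covariate pattern.
y = np.array([0, 0, 1, 0, 1, 1])
p_hat = np.array([0.1, 0.2, 0.4, 0.5, 0.8, 0.9])
stat = pearson_chi_square(y, p_hat)
print(f"Pearson chi-square = {stat:.4f}")
```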
 
  • #9
Thanks, I'll look into it.
 
  • #10
Still, it would be nice if someone knew of a good interpretation for a lack of fit in ordinal logistic regression, other than the obvious ones about collinearity, etc. Lack of fit for ordinary least squares means a line is not an effective way of describing a dataset, but it is not so clear for logistic regression. I have worked through the linearity of the log-odds, ##\log(\text{odds}) = \beta_0 + \beta_1x_1+... ##, and how ##\beta_0## shifts the S-curve while ##\beta_1 ## "speeds it up or slows it down", etc., but I am having trouble finding a clear understanding of the lack of fit.
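The roles of the two coefficients can be checked numerically. This is a sketch for a single covariate with hypothetical coefficient values: the curve crosses probability 0.5 at x = -beta0/beta1, and its steepest slope, reached at that midpoint, is beta1/4.

```python
import numpy as np

def logistic_curve(x, beta0, beta1):
    """P(Y = 1 | x) when the log-odds are beta0 + beta1 * x."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))

def midpoint(beta0, beta1):
    """Value of x at which the fitted probability equals 0.5."""
    return -beta0 / beta1

def max_slope(beta1):
    """Slope of the probability curve at its midpoint."""
    return beta1 / 4.0

# Hypothetical coefficients: changing beta0 moves the midpoint, changing beta1 rescales the steepness.
for beta0, beta1 in [(-2.0, 1.0), (-4.0, 1.0), (-4.0, 2.0)]:
    print(beta0, beta1, midpoint(beta0, beta1), max_slope(beta1))
```

Note that changing ##\beta_1## alone also moves the midpoint unless ##\beta_0 = 0##, so the two effects are only cleanly separated when x is centred.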
 
  • #11
Is this a proportional odds model?
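For readers unfamiliar with the term, here is a sketch of what a proportional odds model assumes, using one common parameterization, logit P(Y <= j | x) = alpha_j - beta * x, with a single slope shared by all cut points. The thresholds and slope below are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cumulative_log_odds(x, thresholds, beta):
    """log-odds of P(Y <= j | x) for each cut point j."""
    cumulative = sigmoid(np.asarray(thresholds, dtype=float) - beta * x)
    return np.log(cumulative / (1.0 - cumulative))

def category_probabilities(x, thresholds, beta):
    """P(Y = j | x) for ordered categories, from differences of cumulative probabilities."""
    cumulative = sigmoid(np.asarray(thresholds, dtype=float) - beta * x)
    cumulative = np.concatenate(([0.0], cumulative, [1.0]))
    return np.diff(cumulative)

thresholds = [-1.0, 1.5]   # hypothetical cut points separating low / medium / high
beta = 0.8                 # hypothetical slope shared by both cut points

probs = category_probabilities(2.0, thresholds, beta)
shift = cumulative_log_odds(3.0, thresholds, beta) - cumulative_log_odds(2.0, thresholds, beta)
print(probs, shift)
```

A one-unit change in x shifts every cumulative log-odds by the same amount, -beta. If the data need a different slope at each cut point, that assumption fails, which is one possible source of lack of fit in an ordinal model.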
 
  • #12
It's easier to analyze real life situations as real life situations rather than mathematical skeletons. What phenomenon do the data represent?
 
  • #13
Stephen Tashi said:
It's easier to analyze real life situations as real life situations rather than mathematical skeletons. What phenomenon do the data represent?
I did a regression of control vs. compliance/effectiveness. Specifically, control measures vs. the existence of Fraud (F), Error (E) and Waste (W). A linear regression for each separately produces the expected results: increased control leads to a decrease in each of F, E, W. I was trying to do a logit of Control vs. each, to get a measure of proportionality and have some idea of the odds of a certain level of control leading to a value above or below a cutoff point (selected as a standard level of 2.5 on a scale of 0 to 5) in each of the variables F, E, W. I got a horrible fit for the binary regressions with the Chi-squared and Pearson goodness-of-fit methods, with a p-value of 0.00. (Actually, I had a separation-of-points issue, since, beyond a certain level of control, all responses were successes.) I tried using a Likert scale to change the binary logistic regression into an ordinal one, to see if I got a better fit, with no success (EDIT: and low concordance, so low Kruskal's, etc.).
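The recoding described here can be sketched as follows. The scores are made up, the 2.5 cutoff is the one mentioned in the post, and the three-level cut points (equal thirds of the 0 to 5 scale) are purely an assumption, since the thread does not say how low, medium and high were defined.

```python
import numpy as np

# Hypothetical F, E or W scores on the 0-5 scale.
scores = np.array([0.4, 1.9, 2.5, 2.6, 3.8, 4.7])

# Binary response: 1 if the score is above the 2.5 cutoff, else 0.
binary = (scores > 2.5).astype(int)

# Ordinal response with assumed cut points at 5/3 and 10/3: 0 = low, 1 = medium, 2 = high.
ordinal_edges = [5.0 / 3.0, 10.0 / 3.0]
ordinal = np.digitize(scores, ordinal_edges)

print(binary)
print(ordinal)
```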
 
  • #14
WWGD said:
I did a regress of control v compliance/ effectiveness. Specifically, control measures vs the existence of Fraud (F), Error (E) and Waste (W).

An elementary question: Is each sample datum defined by a 4-tuple of numbers (c, f, e, w), so that all four values apply to a single "situation" that provides one sample?
 
  • #15
Stephen Tashi said:
An elementary question: Is each sample datum defined by a 4-tuple of numbers (c, f, e, w), so that all four values apply to a single "situation" that provides one sample?
Yes, for a certain fixed level of control we evaluate the associated levels of fraud, error and waste.
 
  • #16
WWGD said:
I was trying to do a Logit of Control vs each, to get a measure of proportionality to have some ideas of the odds of a certain level of control leading above or below a cutoff point ( selected as a standard level of 2.5 in a scale of 0 to 5 ) in each of the variables F,E,W.

That's a hard sentence to parse. For example, "odds of" and "probability of" have different meanings. It's easier for me to think about probability than odds.

I don't understand what "proportionality" means in that context. I think of a "proportion" as a ratio of a part to a whole. So what quantity is "the part" and what quantity is "the whole"?

When you say "in each of the variables", are you asking about all of them simultaneously? Or are you analyzing them individually? For example, if the level of control is (say) 8, are you asking something about the probability that a situation where the control is 8 will have less than a level of 2.5 in all three of F, E, W?
 
  • #17
Stephen Tashi said:
That's a hard sentence to parse. For example, "odds of" and "probability of" have different meanings. It's easier for me to think about probability than odds.

I don't understand what "proportionality" means in that context. I think of a "proportion" as a ratio of a part to a whole. So what quantity is "the part" and what quantity is "the whole"?

When you say "in each of the variables", are you asking about all of them simultaneously? Or are you analyzing them individually? For example, if the level of control is (say) 8, are you asking something about the probability that a situation where the control is 8 will have less than a level of 2.5 in all three of F, E, W?

Hi, sorry for the mess, they were closing the coffee shop and I wrote things in a hurry.
1) I meant probability. I am new to logistic regression. As I understand it (please correct me if I am wrong) the input is a collection of Bernoulli trials (or at least their outcomes) and the outcome is a smooth family of Bernoulli distributions obtained through the use of maximum likelihood estimators for the collection of outcomes. In other words, our output is a PDF from the family of S-curves with parameters the dependent variables.

2) Re proportionality, I was being loose again. I meant a PDF relates the dependent variable to the independent ones, assigning a probability to input values for each independent variable.

3) Re "in each of the variables": both. Linearly, I regress C against each individually and then against all of them (I ultimately do a "best subsets" analysis, considering all possible combinations of regressions, the best one being the one with the lowest Mallows' Cp and highest adjusted R^2; in case of a tie, select the model with the fewest variables. The 3-variable model was the best). I also regressed each independent variable (i.e., F, E, W) logistically against Control. But I don't know how to do a logistic regression in the opposite sense, i.e., to have a control input and get probabilities for each of the 3 variables.
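The "best subsets" selection described in point 3 can be sketched with plain least squares. This is an illustrative implementation on synthetic data, not the poster's software or dataset: the helper names, the simulated coefficients, and the lexicographic ranking (lowest Cp, then highest adjusted R^2, then fewest variables) are all assumptions made for the example.

```python
from itertools import combinations
import numpy as np

def fit_sse(X, y):
    """Least-squares fit with intercept; returns (residual sum of squares, parameter count)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals), design.shape[1]

def best_subsets(X, y, names):
    """Adjusted R^2 and Mallows' Cp for every non-empty subset of the columns of X."""
    n, k = X.shape
    sst = float(np.sum((y - y.mean()) ** 2))
    sse_full, p_full = fit_sse(X, y)
    mse_full = sse_full / (n - p_full)          # error variance estimate from the full model
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            sse, p = fit_sse(X[:, list(cols)], y)
            adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
            cp = sse / mse_full - n + 2 * p     # Mallows' Cp, p counts the intercept
            rows.append((tuple(names[c] for c in cols), adj_r2, cp))
    # Assumed ranking: lowest Cp, then highest adjusted R^2, then fewest variables.
    rows.sort(key=lambda r: (r[2], -r[1], len(r[0])))
    return rows

# Synthetic stand-in data (hypothetical): three scores on a 0-5 scale and a noisy linear response.
rng = np.random.default_rng(0)
n = 60
X = rng.uniform(0, 5, size=(n, 3))
y = 5.0 - 0.5 * X[:, 0] - 0.3 * X[:, 1] - 0.4 * X[:, 2] + rng.normal(0, 0.3, n)

table = best_subsets(X, y, ["F", "E", "W"])
for subset, adj_r2, cp in table:
    print(subset, round(adj_r2, 3), round(cp, 2))
```

By construction, the full model's Cp equals its number of parameters (4 here, counting the intercept), so Cp is mainly informative for comparing the smaller subsets against it.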
 
  • #18
WWGD said:
1) I meant probability. I am new to logistic regression. As I understand it (please correct me if I am wrong) the input is a collection of Bernoulli trials (or at least their outcomes) and the outcome is a smooth family of Bernoulli distributions obtained through the use of maximum likelihood estimators for the collection of outcomes.

Calling the outcome a "smooth family" of distributions is an interesting way to look at it. The outcome gives the parameter p of a Bernoulli distribution as a function of some independent variable x. The fitted equation p = f(x) implicitly defines a smooth family of Bernoulli distributions because for each given x we have a Bernoulli distribution with parameter p(x).

WWGD said:
In other words, our output is a PDF from the family of S-curves with parameters the dependent variables.

A "family" of distributions defines more than one PDF. Each member of the family has a PDF.

WWGD said:
2) Re proportionality, I was being loose again. I meant a PDF relates the dependent variable to the independent ones, assigning a probability to input values for each independent variable.

We can investigate that concept, but it seems "backwards" to how the usual sort of analysis goes. When I think of "fraud" and "control" (like frequent audits), I think of a level of "control" causing (or allowing) a level of "fraud". I don't think of "fraud" as being what causes "control" (although I suppose one could look at it that way).

WWGD said:
3) Re "in each of the variables": both. Linearly, I regress C against each individually and then against all of them (I ultimately do a "best subsets" analysis).

That would say that C is the dependent variable, so for logistic regression it has 2 possible outcomes, which I'll call "low control" and "high control". Each of the variables F, E, W is regarded as having a continuous range of values. Is that correct?
 
  • #19
Stephen Tashi said:
Calling the outcome a "smooth family" of distributions is an interesting way to look at it. The outcome gives the parameter p of a Bernoulli distribution as a function of some independent variable x. The fitted equation p = f(x) implicitly defines a smooth family of Bernoulli distributions because for each given x we have a Bernoulli distribution with parameter p(x).

A "family" of distributions defines more than one PDF. Each member of the family has a PDF.
We can investigate that concept, but it seems "backwards" to how the usual sort of analysis goes. When I think of "fraud" and "control" (like frequent audits), I think of a level of "control" causing (or allowing) a level of "fraud". I don't think of "fraud" as being what causes "control" (although I suppose one could look at it that way).
That would say that C is the dependent variable, so for logistic regression it has 2 possible outcomes , which I'll call "low control" and "high control". Each of the variables F,E,W is regarded as having a continuous range of values. Is that correct ?

Yes, I wrote the variable order backwards. I was hoping to actually do a multi-valued regression, i.e., with control as the independent variable and a triple (F, E, W) of values as functions of control, assigning a probability triple for fixed values. Obviously, these three, F, E, W, depend on C and not vice versa. Still, going back to the initial question: how do we interpret a lack of fit in this case (or, better, in general)?
 
  • #20
Just a followup: Say we are working on the same situation as above: we have a logistic regression of control (C) vs. each of F, E, W (Fraud, Error, Waste).
Say we also assume each of F, E, W to have the same importance. It seems to make sense to logistically regress (binary) C against the arithmetic average:

D := (F+E+W)/3. Any caveat to consider? I am trying to take the cutoff point for this D to be the mean value of the means of each of F, E, W, i.e., anything above this cutoff point determines a case (and anything below determines a non-case; in the case of equality we can randomly decide yes or no).
Is this a meaningful and standard way of doing things?
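The composite and cutoff described above can be written out as a short sketch. The scores are randomly generated stand-ins, not the thread's data, and treating "above the cutoff" as a case follows the wording of the post.

```python
import numpy as np

rng = np.random.default_rng(1)
scores = rng.uniform(0, 5, size=(10, 3))    # hypothetical columns: F, E, W

D = scores.mean(axis=1)                      # D = (F + E + W) / 3 for each observation
cutoff = scores.mean(axis=0).mean()          # mean of the three variable means

case = np.where(D > cutoff, 1, 0)            # 1 = case (above cutoff), 0 = non-case
ties = np.isclose(D, cutoff)
case[ties] = rng.integers(0, 2, size=int(ties.sum()))   # random yes/no on equality

print(np.round(D, 2), round(float(cutoff), 2), case)
```

One caveat the sketch makes visible: when every observation has all three scores, the mean of the three means is exactly the sample mean of D, so this rule amounts to splitting D at its own average.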
 
  • #22
This is the way I visualize the model:

Plot F,E,W on the x,y,z axes. For simplicity, I'll imagine the values scaled so the data points fall inside the unit cube. At each data point (x,y,z) in space there is a value C, which I'll imagine as some sort of "density of matter" or "intensity" of something. You are interested in using the data to estimate the region in 3-D space where this density is "high".

The estimation is done by fitting a function C(x,y,z) to the data which predicts the density at each point in space. Describing the region where C(x,y,z) is "high" is done by picking a function g(x,y,z) that defines a boundary by a rule such as: if g(x,y,z) > 0 then C is "high", otherwise C is "low". For example, you might try g(x,y,z) = (x+y+z)/3 - .73, which would separate the "high" and "low" regions by a plane.

If you want to visualize the plausibility of various models, it would be helpful if you use some 3-D visualization software. There are all sorts of ways that data may fail to fit a particular model. For example, the "high" values of C might occur in isolated blobs that aren't well described by a volume with planar sides.
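The planar boundary rule from this post can be tried on a few points. This is only a sketch: the three points are made up, and the offset 0.73 is the illustrative value used in the post.

```python
import numpy as np

def g(x, y, z, offset=0.73):
    """Boundary function from the post: positive on the 'high' side of the plane."""
    return (x + y + z) / 3.0 - offset

def classify(points, offset=0.73):
    """Label each (x, y, z) point in the unit cube as 'high' or 'low'."""
    points = np.asarray(points, dtype=float)
    values = g(points[:, 0], points[:, 1], points[:, 2], offset)
    return np.where(values > 0, "high", "low")

# Hypothetical scaled (F, E, W) points inside the unit cube.
points = [[0.9, 0.8, 0.7], [0.2, 0.3, 0.9], [1.0, 1.0, 0.5]]
labels = classify(points)
print(labels)
```

Comparing such labels with the observed C values at each point is one simple way to see whether a single plane is a plausible boundary or whether the "high" region comes in isolated blobs.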
 

1. What is "Lack of Fit" in ordinal regression analysis?

"Lack of Fit" in ordinal regression analysis refers to the situation where the chosen model does not adequately fit the data. This means that the model is not able to accurately predict the relationship between the independent and dependent variables, and there is a significant difference between the actual data and the predicted values.

2. How can I detect "Lack of Fit" in ordinal regression?

"Lack of Fit" can be detected through goodness-of-fit tests such as the Pearson Chi-square test or the deviance (likelihood ratio) test. These tests compare the predicted values from the model to the actual data and determine if there is a significant difference. As noted in the thread, they can be unreliable when the covariates are continuous.

3. What are some alternatives to dealing with "Lack of Fit" in ordinal regression?

Some alternatives include modeling the covariates more flexibly, for example with polynomial or spline terms, or adding more relevant independent variables to the model. Another option is to relax the proportional odds assumption, for example with a partial proportional odds model or a multinomial logit model.

4. Can "Lack of Fit" be prevented in ordinal regression analysis?

While it is not always possible to prevent "Lack of Fit" in ordinal regression, there are some steps that can be taken to minimize its occurrence. These include carefully selecting the appropriate model, ensuring that the independent variables are relevant and properly measured, and checking assumptions such as linearity of the log-odds in the covariates and proportional odds.

5. How does "Lack of Fit" affect the results of ordinal regression analysis?

If "Lack of Fit" is present in a regression analysis, it can lead to inaccurate and unreliable results. This means that the relationships between the independent and dependent variables may not be accurately captured, and the predictions made by the model may not be valid. It is important to address "Lack of Fit" in order to obtain meaningful and accurate results from ordinal regression analysis.
