Lack of Fit in Ordinal Regression -- Analysis/Alternatives?

Discussion Overview

The discussion revolves around issues related to lack of fit in ordinal regression, particularly in the context of logistic regression with continuous covariates. Participants explore interpretations of lack of fit, the nature of the data, and potential alternatives for modeling.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes encountering separation issues with continuous variables in binary logistic regression, leading to a significant lack of fit in ordinal regression.
  • Another participant questions the nature of the covariates and suggests that continuous covariates may complicate hypothesis testing.
  • There is a discussion about the interpretation of the Chi-squared test and its applicability to continuous variables, with some participants expressing uncertainty about its use in this context.
  • A participant seeks clarification on the meaning of lack of fit in logistic regression compared to ordinary least squares, noting the challenges in understanding this concept.
  • Questions arise regarding the specifics of the data being analyzed, including whether the analysis involves simultaneous or individual assessments of multiple variables.
  • One participant expresses confusion about the terms "odds" and "probability," indicating a need for clearer definitions in the context of logistic regression.
  • Another participant discusses their approach to regression analysis, including the use of best subsets analysis and the challenges faced in performing logistic regression in reverse.

Areas of Agreement / Disagreement

Participants express varying levels of understanding regarding the application of Chi-squared tests to continuous variables and the interpretation of lack of fit in logistic regression. Multiple competing views remain on these topics, and the discussion does not reach a consensus.

Contextual Notes

Participants highlight limitations in their understanding of the relationship between control measures and the outcomes of fraud, error, and waste, as well as the complexities of interpreting logistic regression outputs.

WWGD
Hi All,
I ran a binary logistic regression of Y on each of three different numerical variables A, B, C. I am having a separation issue with all of them, meaning that there are values Ao, Bo, Co for A, B, C (different values for each, of course) so that for ## A>Ao, B>Bo, C>Co ## all the responses are successes (I guess this forces the magnitude of the slope to diverge to infinity so that the curve can accommodate the abrupt jump between 0 and 1). Then I increased the number of response levels to three: high, medium and low, to use an ordinal regression. But now I have a significant lack of fit, with p --> 0 on the Chi-squared test. How does one interpret lack-of-fit issues with a logistic regression? I know that a lack of fit in a simple linear regression means that the data is not linear, but what does it mean for a logistic? Does it mean the (log of the) data is not distributed like an S-curve ## e^L/(1+e^L) ## (## L=\beta_0+ \beta_1 x+... ##)? If so, are there any standard (or any) alternatives, e.g. another distribution for the data? Any ideas?
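A minimal sketch of how the separation described in this post could be checked for a single covariate. It only tests the complete-separation case (every failure lies strictly below every success) in the direction "large values mean success". The function name and the data are made up for illustration and are not from the thread.

```python
def separating_threshold(x, y):
    """Return x0 such that y == 1 exactly when x > x0, or None if no such x0.

    x: covariate values; y: 0/1 responses of the same length.
    Only the direction "large x means success" is checked.
    """
    failures = [xi for xi, yi in zip(x, y) if yi == 0]
    successes = [xi for xi, yi in zip(x, y) if yi == 1]
    if not failures or not successes:
        return None  # only one class present, nothing to separate
    # Complete separation: every failure lies strictly below every success.
    if max(failures) < min(successes):
        return max(failures)
    return None

control = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # made-up covariate values
outcome = [0, 0, 0, 1, 1, 1]                # made-up binary responses
threshold = separating_threshold(control, outcome)
overlapping = separating_threshold(control, [0, 1, 0, 1, 1, 1])
```

When such a threshold exists, the likelihood keeps increasing as the fitted curve gets steeper, so no finite maximum likelihood slope exists. When the two classes overlap, as in the second call, the check returns `None`.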
 
What are your covariates? What is the nature of the covariates? Are they continuous? categorical?
 
They are all continuous, thanks.
 
WWGD said:
They are all continuous, thanks.
That could be cause for problems in your hypothesis tests then. I don't know which test you used for lack of fit, but usually they don't work for continuous covariates.
 
micromass said:
That could be cause for problems in your hypothesis tests then. I don't know which test you used for lack of fit, but usually they don't work for continuous covariates.
No, I had no problem with the Chi-squared, which AFAIK does not require discrete/categorical variables. I just got a pretty low p-value.
 
WWGD said:
No, I had no problem with the Chi-squared, which AFAIK does not require discrete/categorical variables. I just got a pretty low p-value.

I don't understand. What is chi-squared? There are many chi-square tests in regression.
 
micromass said:
I don't understand. What is chi-squared? There are many chi-square tests in regression.
It is, I believe, the standard goodness-of-fit statistic ## \sum (\text{observed} - \text{expected})^2 / \text{expected} ##. I am not aware of any other Chi-square goodness-of-fit tests.
 
Are you talking about the Pearson residuals? In either case, that chi-square test in your post doesn't always work for continuous variables.
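As a rough illustration of the statistic under discussion, here is a sketch of the Pearson chi-square sum computed on grouped binary data. The group sizes and counts are invented, and the "expected" counts stand in for whatever a fitted model would predict. The approach relies on several trials per covariate pattern. With continuous covariates each observation is typically its own pattern, so each cell holds a single trial and the chi-square approximation becomes unreliable.

```python
def pearson_chi_square(observed, expected):
    """Sum of (O - E)^2 / E over cells; expected counts must be positive."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical grouped data: successes observed in 3 covariate groups of
# 20 trials each, versus the success counts a fitted model would predict.
observed_successes = [4, 11, 17]
expected_successes = [5.0, 10.0, 16.0]
group_size = 20

# Each group contributes two cells: successes and failures.
observed = observed_successes + [group_size - o for o in observed_successes]
expected = expected_successes + [group_size - e for e in expected_successes]
statistic = pearson_chi_square(observed, expected)
```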
 
Thanks, I'll look into it.
 
  • #10
Still, it would be nice if someone knew of a good interpretation for a lack of fit in ordinal logistic, other than the obvious ones on collinearity, etc. Lack of fit for ordinary least squares means a line is not an effective way of describing a dataset, but it is not so clear for logistic. I have worked through the linearity of log(odds) = ## \beta_0 + \beta_1x_1+... ##, and how ## \beta_0 ## shifts the S-curve while ## \beta_1 ## "speeds it up or slows it down", etc., but I am having trouble finding a clear understanding of the lack of fit.
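The roles of the two coefficients mentioned here can be sketched numerically for the one-covariate case. The coefficient values are illustrative, not fitted to any data from the thread.

```python
import math

def logistic(x, beta0, beta1):
    """P(Y = 1 | x) when log(odds) = beta0 + beta1 * x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

beta0, beta1 = -3.0, 1.5           # illustrative values, not fitted
midpoint = -beta0 / beta1          # x where the log-odds are 0, so p = 0.5
p_mid = logistic(midpoint, beta0, beta1)

# Changing beta0 alone moves the midpoint without changing the steepness.
shifted_midpoint = -(beta0 - 1.5) / beta1

# The slope of the curve at its midpoint is beta1 / 4 (checked numerically).
h = 1e-6
numeric_slope = (logistic(midpoint + h, beta0, beta1)
                 - logistic(midpoint - h, beta0, beta1)) / (2 * h)
```

So the intercept sets where the curve crosses 0.5, while the slope coefficient sets how abruptly it rises there.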
 
  • #11
Is this a proportional odds model?
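For readers unfamiliar with the term, one common parameterization of the proportional odds model for an ordinal response with ## J ## ordered levels (here ## J = 3 ##: low, medium, high) is sketched below. The symbols ## \alpha_j ## (cutpoints) and ## \beta ## are notation assumed for this illustration, and sign conventions vary between software packages.

```latex
\operatorname{logit} P(Y \le j \mid x)
  = \log \frac{P(Y \le j \mid x)}{P(Y > j \mid x)}
  = \alpha_j - \beta x,
\qquad j = 1, \dots, J-1,
\qquad \alpha_1 < \alpha_2 < \dots < \alpha_{J-1}
```

The "proportional odds" assumption is that the same ## \beta ## applies at every cutpoint ## j ##, and only the intercepts ## \alpha_j ## differ. A violation of that assumption is one way an ordinal logistic model can show lack of fit.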
 
  • #12
It's easier to analyze real life situations as real life situations rather than mathematical skeletons. What phenomenon does the data represent?
 
  • #13
Stephen Tashi said:
It's easier to analyze real life situations as real life situations rather than mathematical skeletons. What phenomenon does the data represent?
EDIT2: I did a regression of control vs. compliance/effectiveness. Specifically, control measures vs. the existence of Fraud (F), Error (E) and Waste (W). A linear regression for each separately produces the expected results: increased control leads to a decrease of each of F, E, W. I was trying to do a logit of Control vs. each, to get a measure of proportionality, to have some idea of the odds of a certain level of control leading above or below a cutoff point (selected as a standard level of 2.5 on a scale of 0 to 5) in each of the variables F, E, W. I got a horrible fit for the binary regressions with the Chi-squared and Pearson goodness-of-fit methods, with a p of 0.00 (actually, I had a separation-of-points issue, since, beyond a certain level of control, all responses were successes). I tried using a Likert scale to change the binary into an ordinal logistic, to see if I got a better fit, with no success. EDIT: (and low concordance, so low Kruskal's, etc.).
 
  • #14
WWGD said:
I did a regression of control vs. compliance/effectiveness. Specifically, control measures vs. the existence of Fraud (F), Error (E) and Waste (W).

An elementary question: Is each sample datum defined by a 4-tuple of numbers (c, f, e, w), so that all four values apply to a single "situation" that provides one sample?
 
  • #15
Stephen Tashi said:
An elementary question: Is each sample datum defined by a 4-tuple of numbers (c, f, e, w), so that all four values apply to a single "situation" that provides one sample?
Yes, for a certain fixed level of control we evaluate the associated levels of fraud, error and waste.
 
  • #16
WWGD said:
I was trying to do a logit of Control vs. each, to get a measure of proportionality, to have some idea of the odds of a certain level of control leading above or below a cutoff point (selected as a standard level of 2.5 on a scale of 0 to 5) in each of the variables F, E, W.

That's a hard sentence to parse. For example, "odds of" and "probability of" have different meanings. It's easier for me to think about probability than odds.

I don't understand what "proportionality" means in that context. I think of a "proportion" as a ratio of a part to a whole. So what quantity is "the part" and what quantity is "the whole"?

When you say "in each of the variables", are you asking about all of them simultaneously? Or are you analyzing them individually? For example, if the level of control is (say) 8, are you asking something about the probability that a situation where the control is 8 will have less than a level of 2.5 in all three of F, E, W?
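On the distinction between "odds of" and "probability of" raised in this post, the two quantities are related by odds = p / (1 - p). A small sketch of the conversion in both directions, with helper names invented for illustration:

```python
def odds_from_probability(p):
    """Odds in favour of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Inverse map: probability of an event given its odds (odds >= 0)."""
    return odds / (1.0 + odds)

p = 0.8
odds = odds_from_probability(p)           # 0.8 / 0.2, i.e. about 4 to 1
round_trip = probability_from_odds(odds)  # recovers p up to rounding
```

Logistic regression is linear in the logarithm of the odds, which is why its coefficients are usually read as changes in log-odds and not as changes in probability.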
 
  • #17
Stephen Tashi said:
That's a hard sentence to parse. For example, "odds of" and "probability of" have different meanings. It's easier for me to think about probability than odds.

I don't understand what "proportionality" means in that context. I think of a "proportion" as a ratio of a part to a whole. So what quantity is "the part" and what quantity is "the whole"?

When you say "in each of the variables", are you asking about all of them simultaneously? Or are you analyzing them individually? For example, if the level of control is (say) 8, are you asking something about the probability that a situation where the control is 8 will have less than a level of 2.5 in all three of F, E, W?

Hi, sorry for the mess, they were closing the coffee shop and I wrote things in a hurry.
1) I meant probability. I am new to logistic regression. As I understand it (please correct me if I am wrong) the input is a collection of Bernoulli trials (or at least their outcomes) and the outcome is a smooth family of Bernoulli distributions obtained through the use of maximum likelihood estimators for the collection of outcomes. In other words, our output is a PDF from the family of S-curves with parameters the dependent variables.

2) Re proportionality, I was being loose again. I meant a PDF relates the dependent variable to the independent ones, assigning a probability to input values for each independent variable.

3) Re "In each of the variables". Both: linearly, I regress C against each individually and then against all of them (I ultimately do a "best subsets" analysis, considering all possible combinations of regressions, the best one being the one with the lowest Mallows' Cp and highest adjusted R^2; in case of a tie, select the model with the fewest variables. The 3-variable model was the best). I also regressed each independent variable (i.e., F, E, W) logistically against Control. But I don't know how to do a logistic regression in the opposite sense, i.e., to have a control input and get probabilities for each of the 3 variables.
 
  • #18
WWGD said:
1) I meant probability. I am new to logistic regression. As I understand it (please correct me if I am wrong) the input is a collection of Bernoulli trials (or at least their outcomes) and the outcome is a smooth family of Bernoulli distributions obtained through the use of maximum likelihood estimators for the collection of outcomes.

Calling the outcome a "smooth family" of distributions is an interesting way to look at it. The outcome gives the parameter p of a Bernoulli distribution as a function of some independent variable x. The fitted equation p = f(x) implicitly defines a smooth family of Bernoulli distributions because for each given x we have a Bernoulli distribution with parameter p(x).

WWGD said:
In other words, our output is a PDF from the family of S-curves with parameters the dependent variables.

A "family" of distributions defines more than one PDF. Each member of the family has a PDF.

WWGD said:
2) Re proportionality, I was being loose again. I meant a PDF relates the dependent variable to the independent ones, assigning a probability to input values for each independent variable.

We can investigate that concept, but it seems "backwards" to how the usual sort of analysis goes. When I think of "fraud" and "control" (like frequent audits), I think of a level of "control" causing (or allowing) a level of "fraud". I don't think of "fraud" as being what causes "control" (although I suppose one could look at it that way).

WWGD said:
3) Re "In each of the variables". Both, linearly I regress C against each individually and then against all of them (I ultimately do a "best subsets" analysis.

That would say that C is the dependent variable, so for logistic regression it has 2 possible outcomes, which I'll call "low control" and "high control". Each of the variables F, E, W is regarded as having a continuous range of values. Is that correct?
 
  • #19
Stephen Tashi said:
Calling the outcome a "smooth family" of distributions is an interesting way to look at it. The outcome gives the parameter p of a Bernoulli distribution as a function of some independent variable x. The fitted equation p = f(x) implicitly defines a smooth family of Bernoulli distributions because for each given x we have a Bernoulli distribution with parameter p(x).

A "family" of distributions defines more than one PDF. Each member of the family has a PDF.

We can investigate that concept, but it seems "backwards" to how the usual sort of analysis goes. When I think of "fraud" and "control" (like frequent audits), I think of a level of "control" causing (or allowing) a level of "fraud". I don't think of "fraud" as being what causes "control" (although I suppose one could look at it that way).

That would say that C is the dependent variable, so for logistic regression it has 2 possible outcomes, which I'll call "low control" and "high control". Each of the variables F, E, W is regarded as having a continuous range of values. Is that correct?

Yes, I wrote the variable order backwards. I was hoping to actually do a multi-valued one, i.e., control as independent and a triple (F, E, W) of values as functions of control, assigning a probability triple for fixed values. Obviously, these three, F, E, W, depend on C and not vice-versa. Still, going back to the initial question: how do we interpret a lack of fit in this case (or, better, in general)?
 
  • #20
Just a followup: Say we are working on the same situation as above: we have a logistic of control (C) vs each of F, E, W (Fraud, Error, Waste).
Say we also assume each of F, E, W to have the same importance. It seems to make sense to logistically regress (binary) C against the arithmetic average:

## D := (F+E+W)/3 ##

Any caveat to consider? I am trying to take the cutoff point for this D to be the mean of the means of F, E, W, i.e., anything above this cutoff point determines a case (and anything below determines a non-case; in the case of an equality we can randomly decide a yes or no).
Is this a meaningful and standard way of doing things?
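A small sketch of the construction proposed in this post, using invented scores on the 0-to-5 scale. It computes the composite D for each sample, takes the cutoff as the mean of the three variable means, and labels each sample as a case or a non-case. Ties are flagged here and not broken at random, which is a simplification of the rule in the post.

```python
from statistics import mean

# Made-up scores on the 0-to-5 scale; one (F, E, W) triple per sample.
fraud = [1.0, 2.0, 3.0, 4.0]
error = [2.0, 2.0, 4.0, 4.0]
waste = [0.0, 2.0, 2.0, 4.0]

composite = [(f + e + w) / 3 for f, e, w in zip(fraud, error, waste)]
cutoff = mean([mean(fraud), mean(error), mean(waste)])

# 1 = case (above the cutoff), 0 = non-case; ties are left as None here
# instead of being decided at random.
labels = [None if d == cutoff else int(d > cutoff) for d in composite]
```

When all three variables are observed on the same samples, the mean of the means equals the sample mean of D, so this cutoff splits the data at the average composite score. Two caveats seem worth keeping in mind: the average hides which of F, E, W drives a high score, and dichotomizing a continuous score discards information about how far a sample lies from the cutoff.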
 
  • #22
This is the way I visualize the model:

Plot F,E,W on the x,y,z axes. For simplicity, I'll imagine the values scaled so the data points fall inside the unit cube. At each data point (x,y,z) in space there is a value C, which I'll imagine as some sort of "density of matter" or "intensity" of something. You are interested in using the data to estimate the region in 3-D space where this density is "high".

The estimation is done by fitting a function C(x,y,z) to the data which predicts the density at each point in space. Describing the region where C(x,y,z) is "high" is done by picking a function g(x,y,z) that defines a boundary by a rule such as: if g(x,y,z) > 0 then C is "high"; otherwise C is "low". For example, you might try g(x,y,z) = (x+y+z)/3 - .73, which would separate the "high" and "low" regions by a plane.

If you want to visualize the plausibility of various models, it would be helpful if you use some 3-D visualization software. There are all sorts of ways that data may fail to fit a particular model. For example, the "high" values of C might occur in isolated blobs that aren't well described by a volume with planar sides.
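The planar boundary rule from this post can be sketched directly. The offset .73 is the example value from the post, while the test points and the helper name `region` are made up for illustration.

```python
def g(x, y, z, offset=0.73):
    """Planar boundary from the post: positive on the "high" side."""
    return (x + y + z) / 3 - offset

def region(x, y, z):
    """Classify a point of the unit cube by the sign of g."""
    return "high" if g(x, y, z) > 0 else "low"

# Made-up points inside the unit cube, given as (F, E, W) after scaling.
points = [(0.9, 0.8, 0.7), (0.2, 0.3, 0.1), (0.7, 0.7, 0.7)]
labels = [region(*p) for p in points]
```

If the observed "high" values of C were scattered in isolated blobs, no single choice of offset in a rule like this would classify them well, which is one concrete form a lack of fit can take.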
 
