"Many-to-One" Mapping of Variables in Logistic Regression

WWGD · Jan 8, 2017

Hi all, I have logistically regressed three different numerical variables, v1, v2, v3, separately against the same variable w. All variables have the same type of S-curve (meaning, in this case, that probabilities increase as vi, i=1,2,3, increases). Is there a way of somehow joining the three variables v1, v2, v3 into a single variable V to be regressed against w? Specifically, w is a numerical measure of control methods in business, while v1 is theft, v2 is inefficiency, and v3 is loss (money). Is there a meaningful way to get a new variable V to regress logistically against w?
Thanks.

micromass · Jan 8, 2017

Principal component analysis?

WWGD · Jan 8, 2017

micromass said:

Principal component analysis?

But this seems to be going in the opposite direction. I know that each of these variables contributes to the dependent variable, so each one is a factor. I just wonder if there is a method to put together all of these three variables into a single one.

micromass · Jan 8, 2017

Principal component analysis does that in the way that it carries over the most information from the three variables. You can also do factor analysis and try to find latent variables, etc. etc.

You can invent other ways like summing them, taking the average, etc. But what you do depends essentially on the question you want to answer and on other specifics.
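As a rough illustration of the options above, here is a minimal sketch using NumPy. The data are made up, the function name `combine_features` is hypothetical, and standardizing each column first is my assumption (it keeps one variable from dominating just because of its units). It computes both a plain average and a first-principal-component score as candidates for a single V.

```python
import numpy as np

def combine_features(X):
    """Return two candidate one-column summaries of the columns of X."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (zero mean, unit sample variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Option 1: plain average of the standardized variables.
    mean_score = Z.mean(axis=1)
    # Option 2: first principal component, i.e. the direction of largest variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    w = Vt[0]
    if w.sum() < 0:
        w = -w  # the sign of a component is arbitrary; pick the "increasing" one
    pc1_score = Z @ w
    return mean_score, pc1_score, w

# Made-up data: three noisy copies of one underlying quantity,
# standing in for v1 (theft), v2 (inefficiency), v3 (loss).
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([base + 0.3 * rng.normal(size=200) for _ in range(3)])

mean_score, pc1_score, w = combine_features(X)
print("PC1 weights:", w)
```

When the three variables are strongly correlated, as here, the two scores end up nearly proportional, so the choice between them matters less than the question you want the combined variable to answer.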

WWGD · Jan 8, 2017

Yes, I understand that factor analysis identifies the factors that contribute the most to variability. But yes, good point. I will think about the weight I should assign to each to make a separate variable. I was just wondering if there was a standard format for this type of situation. Thanks.

StoneTemplePython · Jan 9, 2017

If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing SVD and setting all singular values = 1. In a sense PCA looks to differences among singular values for information on what to do, and whitening is the other side of the coin that homogenizes the singular values. After whitening, you'd still have 3 features but (with respect to linear combinations) the data will be uncorrelated.
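A minimal sketch of that idea in NumPy, on made-up correlated data. The function name `whiten` and the rescaling by sqrt(n - 1) (so each whitened feature has unit sample variance) are my choices for illustration, not something stated above.

```python
import numpy as np

def whiten(X):
    """PCA-style whitening: center, take the SVD, and drop the singular values."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # zero-mean each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Xc = U @ diag(s) @ Vt. Setting every singular value to 1 (and ignoring
    # the rotation Vt) leaves U; the factor below gives unit sample variance.
    return U * np.sqrt(Xc.shape[0] - 1)

# Made-up data: three features driven by one shared latent factor plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 1))
X = latent + 0.5 * rng.normal(size=(500, 3))

W = whiten(X)
cov_after = np.cov(W, rowvar=False)
print(np.round(cov_after, 6))                    # approximately the identity
```

This assumes no singular value is zero, as in the note above; with a zero singular value one of the whitened columns would be meaningless.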

WWGD · Jan 9, 2017

StoneTemplePython said:

If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing SVD and setting all singular values = 1. In a sense PCA looks to differences among singular values for information on what to do, and whitening is the other side of the coin that homogenizes the singular values. After whitening, you'd still have 3 features but (with respect to linear combinations) the data will be uncorrelated.

Thank you, do you have any refs for whitening and SVD? I know a bit about SVD in terms of matrices, but not in the more general sense.

WWGD · Jan 9, 2017

StoneTemplePython said:

If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing SVD and setting all singular values = 1. In a sense PCA looks to differences among singular values for information on what to do, and whitening is the other side of the coin that homogenizes the singular values. After whitening, you'd still have 3 features but (with respect to linear combinations) the data will be uncorrelated.

I am also having the issue of perfect separation with each of the three variables. Does that make a difference?

StoneTemplePython · Jan 9, 2017

WWGD said:

Thank you, do you have any refs for whitening and SVD? I know a bit about SVD in terms of matrices, but not in the more general sense.

Off the top of my head I can't think of any free references, though they surely exist.

However, if you live in the US, let me recommend "Learning From Data". The book is available for < $30 in the US. Outside the US pricing is more up in the air -- e.g. somehow it is ~$100 in Canada due to vagaries of licensing. Technically the book covers VC dimension, and basic linear models almost exclusively. However, if you buy the book you get 5 or so free e-chapters which cover more advanced items, including SVD, PCA and whitening. One of the co-authors has a great machine learning mooc by the same name if you're interested in that sort of thing. More information is available here: http://work.caltech.edu/

WWGD said:

I am also having the issue of perfect separation with each of the three variables. Does that make a difference?

I am less sure what this means. Can you expand on what it would look like if the variables had perfect separation?

WWGD · Jan 9, 2017

StoneTemplePython said:

Off the top of my head I can't think of any free references, though they surely exist.

However, if you live in the US, let me recommend "Learning From Data". The book is available for < $30 in the US. Outside the US pricing is more up in the air -- e.g. somehow it is ~$100 in Canada due to vagaries of licensing. Technically the book covers VC dimension, and basic linear models almost exclusively. However, if you buy the book you get 5 or so free e-chapters which cover more advanced items, including SVD, PCA and whitening. One of the co-authors has a great machine learning mooc by the same name if you're interested in that sort of thing. More information is available here: http://work.caltech.edu/

I am less sure what this means. Can you expand on what it would look like if the variables had perfect separation?

Thanks. Re: Perfect (EDIT: aka Complete) Separation in a standard binary logistic regression: basically, there is an input value that cleanly separates the cases from the non-cases.
An example before the formal definition. Say you want to logistically regress passing or not passing (as determined by grades) on hours studied, and you have defined 70 points as passing.
Then perfect separation would mean that there is a value for hours studied that perfectly separates the passing from the not-passing, e.g., 7 hours, so that all of those who studied 7 or more hours passed, while all of those who studied fewer than 7 hours did not pass. Then 7 hours perfectly separates the cases from the non-cases.

More generally, take a data set D := D(X, Y) for a binary logistic regression of X vs. Y, both numerical variables*, where a value y in Y separates the cases from the non-cases, i.e., if y_j > y we consider it a case (yes), and if y_j < y we have a non-case. We say that there is perfect separation if there is a value x in X so that all x_j < x are cases (non-cases) and all x_j > x are non-cases (cases). In my example above, X = Hours Studied, Y = Grades, y = 70 and x = 7.

* Y can also be categorical.
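To make the definition concrete, here is a small sketch in plain Python that checks a single variable for perfect separation. The function name `separating_threshold` and the sample hours and grades are made up for illustration.

```python
def separating_threshold(x, is_case):
    """Return a value t that splits cases from non-cases perfectly, or None.

    Perfect separation holds if every case lies strictly on one side of t
    and every non-case lies strictly on the other side.
    """
    cases = [xi for xi, c in zip(x, is_case) if c]
    non_cases = [xi for xi, c in zip(x, is_case) if not c]
    if not cases or not non_cases:
        return None  # only one class present, nothing to separate
    if max(non_cases) < min(cases):      # non-cases below, cases above
        return (max(non_cases) + min(cases)) / 2
    if max(cases) < min(non_cases):      # cases below, non-cases above
        return (max(cases) + min(non_cases)) / 2
    return None

# Made-up data in the spirit of the example: passing means a grade of 70 or more.
hours = [2, 3, 5, 6, 7, 8, 9]
grades = [40, 55, 60, 65, 72, 80, 91]
passed = [g >= 70 for g in grades]

print(separating_threshold(hours, passed))  # 6.5
```

The function returns the midpoint of the gap, but any value between the largest non-case and the smallest case would do; in this data that is anything strictly between 6 and 7 hours, or 7 itself if the rule is "7 or more".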

FactChecker · Jan 9, 2017

Suppose you do a linear regression of w versus your three variables and come up with w = a₀ + a₁x₁ + a₂x₂ + a₃x₃.
Then what is wrong with using a₁x₁ + a₂x₂ + a₃x₃ as your combined variable?
If your objection is that there is no clear physical interpretation of that variable, you will have that issue with any of the statistical methods. Statistics is strictly mathematical and knows nothing about the application subject.
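A minimal sketch of this suggestion with NumPy least squares. The data are made up and noise-free so the coefficients are recovered exactly, and the helper name `linear_index` is hypothetical.

```python
import numpy as np

def linear_index(X, w):
    """Fit w ~ a0 + a1*x1 + a2*x2 + a3*x3 and return (a0, a, X @ a)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, w, rcond=None)
    a0, a = coef[0], coef[1:]
    combined = X @ a                             # a1*x1 + a2*x2 + a3*x3
    return a0, a, combined

# Made-up, noise-free data so the fit is exact.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
w = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]

a0, a, V = linear_index(X, w)
print("intercept:", a0, "weights:", a)
```

The combined variable V is just the fitted value minus the intercept, so by construction it is the single linear index of the three variables that tracks w most closely in the least-squares sense.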

WWGD · Jan 9, 2017

FactChecker said:

Suppose you do a linear regression of w versus your three variables and come up with w = a₀ + a₁x₁ + a₂x₂ + a₃x₃.
Then what is wrong with using a₁x₁ + a₂x₂ + a₃x₃ as your combined variable?
If your objection is that there is no clear physical interpretation of that variable, you will have that issue with any of the statistical methods. Statistics is strictly mathematical and knows nothing about the application subject.

Thank you, but how would I then go about dealing with the perfect separation issue? If each of X1, X2, X3 has perfect separation at, say, x1, x2, x3, then any linear combination aX1+bX2+cX3 of them (at least with positive a, b, c) will have ax1+bx2+cx3 as a separating value.

StoneTemplePython · Jan 9, 2017

WWGD said:

Thanks. Re: Perfect (EDIT: aka Complete) Separation in a standard binary logistic regression: basically, there is an input value that cleanly separates the cases from the non-cases.
An example before the formal definition. Say you want to logistically regress passing or not passing (as determined by grades) on hours studied, and you have defined 70 points as passing.
Then perfect separation would mean that there is a value for hours studied that perfectly separates the passing from the not-passing, e.g., 7 hours, so that all of those who studied 7 or more hours passed, while all of those who studied fewer than 7 hours did not pass. Then 7 hours perfectly separates the cases from the non-cases.

Yea, that was my worry...

At its simplest form you're talking about a binary classification problem and the fact that it isn't perfectly linearly separable with your current dimensions. Technically it depends on the domain you're in, but in general when dealing with humans, the world is messy and I wouldn't expect clean separation. In fact, if things are completely separable I might get worried.

You can always map your data to a higher dimensional space (either explicitly or with kernels -- or as a limiting case, with radial basis functions) to get complete separation, but this opens Pandora's box with respect to rampant overfitting. There is quite a bit to read and think about here with regularization, soft margin penalties, validation sets, and so on. I am not sure how to thoughtfully address all this in a post.

At times I think it's important to step back and recall that prediction models are supposed to be about minimizing out of sample prediction errors. Given a few power tools, anyone can minimize in sample prediction errors (i.e. drive them to zero, while hurting out of sample prediction quality).
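For what it's worth, here is a small sketch of why regularization matters under perfect separation, using plain gradient descent in NumPy on made-up data. The function `fit_logistic`, the step size, and the penalty weight are all illustrative assumptions, not a recommended recipe. Without a penalty the fitted slope keeps growing the longer you train; with a small L2 penalty it settles at a finite value.

```python
import numpy as np

def fit_logistic(x, y, l2=0.0, steps=2000, lr=0.5):
    """Gradient descent on mean log-loss + 0.5 * l2 * slope**2 (one feature)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b0 = b1 = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))   # predicted probabilities
        b0 -= lr * np.mean(p - y)                    # intercept is not penalized
        b1 -= lr * (np.mean((p - y) * x) + l2 * b1)
    return b0, b1

# Made-up, perfectly separated data: every x < 0 is a non-case, every x > 0 a case.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

_, slope_short = fit_logistic(x, y, steps=2000)
_, slope_long = fit_logistic(x, y, steps=20000)
_, slope_reg_short = fit_logistic(x, y, l2=0.1, steps=2000)
_, slope_reg_long = fit_logistic(x, y, l2=0.1, steps=20000)

print("no penalty:", slope_short, "->", slope_long)        # keeps increasing
print("L2 penalty:", slope_reg_short, "->", slope_reg_long)  # essentially unchanged
```

The penalty weight itself is something you would normally pick with a validation set rather than by hand.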

"Learning from Data" will walk you through all of this and more. And of course if you already know a section you can speed read or skip it.

WWGD · Jan 9, 2017

StoneTemplePython said:

Yea, that was my worry...

At its simplest form you're talking about a binary classification problem and the fact that it isn't perfectly linearly separable with your current dimensions. Technically it depends on the domain you're in, but in general when dealing with humans, the world is messy and I wouldn't expect clean separation. In fact, if things are completely separable I might get worried.

You can always map your data to a higher dimensional space (either explicitly or with kernels -- or as a limiting case, with radial basis functions) to get complete separation, but this opens Pandora's box with respect to rampant overfitting. There is quite a bit to read and think about here with regularization, soft margin penalties, validation sets, and so on. I am not sure how to thoughtfully address all this in a post.

At times I think it's important to step back and recall that prediction models are supposed to be about minimizing out of sample prediction errors. Given a few power tools, anyone can minimize in sample prediction errors (i.e. drive them to zero, while hurting out of sample prediction quality).

"Learning from Data" will walk you through all of this and more. And of course if you already know a section you can speed read or skip it.

What do you think about the idea of introducing an imaginary case to break the separation, e.g., in our case of exams and study, to assume a sample point where someone studied more than 7 hours and did not pass, a case where someone studied less than 7 hours and did pass, or both? I would only need to change the output in one of my data pairs by less than 0.01 (and all output values are larger than 1), i.e., I have a data pair (1, 1.2) and if I were to change it into a pair (1, 1.19), it would break the separation. I could justify this by saying that my measurement may have been inaccurate to start with.

StoneTemplePython · Jan 9, 2017

WWGD said:

What do you think about the idea of introducing an imaginary case to break the separation, e.g., in our case of exams and study, to assume a sample point where someone studied more than 7 hours and did not pass, a case where someone studied less than 7 hours and did pass, or both? I would only need to change the output in one of my data pairs by less than 0.01 (and all output values are larger than 1), i.e., I have a data pair (1, 1.2) and if I were to change it into a pair (1, 1.19), it would break the separation. I could justify this by saying that my measurement may have been inaccurate to start with.

If your goal is high quality predictions and you don't know with a high degree of confidence that the value was an error, then at a high level, I would be very uncomfortable with this. (Note: there are times when you know something is wrong -- e.g. looking at car insurance data and one of the drivers is listed at 160 years old -- with a very high degree of confidence it should say either 16 years old or something else under 100. Choosing between adjusting to 16, inserting some interpolated value, or just tossing out the data point needs some domain knowledge.)

If your goal is more promotional, i.e. you're selling a thesis / story of some kind, then I'd have to say variations of this kind of thing are done all the time -- with a bunch of hard to follow footnotes documenting said 'adjustments'.

Adjusting data values to 'improve' your model feels like data snooping to me. Adjusting data values before you see their impact because they look obviously wrong (e.g. 160 year old drivers) does not.

