"Many-to-One" Mapping of Variables in Logistic Regression

In summary, the conversation discusses the possibility of combining three variables (v1, v2, v3) into a single variable (V) to be regressed against a dependent variable (w). The suggestion is to use principal component analysis or factor analysis to identify the most informative features and potentially use whitening to homogenize the data. Perfect separation in each of the individual variables may also pose an issue. "Learning From Data" is recommended as a resource for further understanding.
  • #1
WWGD
Hi all, I have logistically regressed three different numerical variables, v1, v2, v3, separately against the same variable w. All three show the same type of S-curve (meaning, in this case, that the predicted probability increases as vi increases, for i = 1, 2, 3). Is there a way of somehow combining the three variables v1, v2, v3 into a single variable V to be regressed against w? Specifically, w is a numerical measure of control methods in business, while v1 is theft, v2 is inefficiency, and v3 is monetary loss. Is there a meaningful way to construct a new variable V to regress logistically against w?
Thanks.
 
  • #2
Principal component analysis?
 
  • #3
micromass said:
Principal component analysis?

But this seems to be going in the opposite direction. I know that each of these variables contributes to the dependent variable, so each one is a factor. I just wonder if there is a method to combine all three of these variables into a single one.
 
  • #4
Principal component analysis does that, in the sense that it carries over as much information as possible from the three variables. You can also do factor analysis and try to find latent variables, etc. etc.

You can invent other ways like summing them, taking the average, etc. But what you do depends essentially on the question you want to answer and on other specifics.
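For concreteness, here is a minimal sketch in Python (scikit-learn) of the PCA route: standardize the three variables, take the first principal component as the single combined variable V, and regress it logistically. The synthetic data, the variable names, and the assumption that w has been coded as a binary case/non-case outcome are all made up for illustration, not from this thread.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                # columns stand in for v1, v2, v3
w = (X.sum(axis=1) + rng.normal(size=200) > 0).astype(int)   # binary stand-in for w

X_std = StandardScaler().fit_transform(X)      # PCA is scale-sensitive, so standardize first
V = PCA(n_components=1).fit_transform(X_std)   # first principal component = single combined variable

model = LogisticRegression().fit(V, w)
print(model.coef_, model.intercept_)           # one slope for the combined variable V
```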
 
  • #5
Yes, I understand that factor analysis identifies the factors that contribute the most to the variability. But, yes, good point: I will think about the weight I should assign to each to build a separate variable. I was just wondering if there was a standard approach for this type of situation. Thanks.
 
  • #6
If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing an SVD and setting all singular values equal to 1. In a sense, PCA looks to the differences among the singular values for information on what to do, and whitening is the other side of the coin: it homogenizes the singular values. After whitening, you'd still have 3 features, but (with respect to linear combinations) the data will be uncorrelated.
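A minimal numerical sketch of that idea, assuming zero-mean data as above (the toy data here is made up): take the SVD of the centered feature matrix, replace the singular values by 1, and rescale so the sample covariance comes out as the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.8, 0.6],
              [0.0, 1.0, 0.7],
              [0.0, 0.0, 1.0]])
X = rng.normal(size=(200, 3)) @ A          # three correlated features
Xc = X - X.mean(axis=0)                    # zero-mean each feature, as assumed above

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_white = (U @ Vt) * np.sqrt(len(Xc) - 1)  # SVD with all singular values set to 1, rescaled

# sample covariance of the whitened data is (approximately) the identity
print(np.round(np.cov(X_white, rowvar=False), 3))
```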
 
  • #7
StoneTemplePython said:
If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing an SVD and setting all singular values equal to 1. In a sense, PCA looks to the differences among the singular values for information on what to do, and whitening is the other side of the coin: it homogenizes the singular values. After whitening, you'd still have 3 features, but (with respect to linear combinations) the data will be uncorrelated.

Thank you. Do you have any references for whitening and SVD? I know a bit about SVD in terms of matrices, but not in this more general sense.
 
  • #8
StoneTemplePython said:
If the variables / features all move together in the same sort of S-curve with respect to your output value (label)... this would seem to suggest you are breaking the independence assumption that is implicitly there amongst your data.

Note: This post assumes you've already done basic data preprocessing so that what you have is zero mean / centered data for each feature. I also assume you have a lot of real world data relative to these 3 features and hence no singular values of 0.

Given the above, it sounds a lot like you may want to consider whitening the data. (The term comes from signal processing and white noise.)

In a nutshell, whitening your data can be interpreted as doing an SVD and setting all singular values equal to 1. In a sense, PCA looks to the differences among the singular values for information on what to do, and whitening is the other side of the coin: it homogenizes the singular values. After whitening, you'd still have 3 features, but (with respect to linear combinations) the data will be uncorrelated.

I am also having the issue of perfect separation with each of the three variables. Does that make a difference?
 
  • #9
WWGD said:
Thank you, do you have any refs for whitenning and SVD? I know a bit about SVD in terms of matrices, but not in the more general sense

Off the top of my head I can't think of any free references, though they surely exist.

However, if you live in the US, let me recommend "Learning From Data". The book is available for under $30 in the US. Outside the US, pricing is more up in the air -- e.g. somehow it is ~$100 in Canada due to the vagaries of licensing. Technically the book covers VC dimension and basic linear models almost exclusively, but if you buy it you get 5 or so free e-chapters which cover more advanced items, including SVD, PCA and whitening. One of the co-authors has a great machine learning MOOC by the same name, if you're interested in that sort of thing. More information is available here: http://work.caltech.edu/
WWGD said:
I am also having the issue of perfect separation with each of the three variables. Does that make a difference?

I am less sure what this means. Can you expand on what it would look like if the variables had perfect separation?
 
  • #10
StoneTemplePython said:
Off the top of my head I can't think of any free references, though they surely exist.

However, if you live in the US, let me recommend "Learning From Data". The book is available for under $30 in the US. Outside the US, pricing is more up in the air -- e.g. somehow it is ~$100 in Canada due to the vagaries of licensing. Technically the book covers VC dimension and basic linear models almost exclusively, but if you buy it you get 5 or so free e-chapters which cover more advanced items, including SVD, PCA and whitening. One of the co-authors has a great machine learning MOOC by the same name, if you're interested in that sort of thing. More information is available here: http://work.caltech.edu/

I am less sure what this means. Can you expand on what it would look like if the variables had perfect separation?

Thanks. Re perfect (aka complete) separation in a standard binary logistic regression: basically, there is an input value that cleanly separates the cases from the non-cases.
An example before the formal definition: say you want to logistically regress hours studied against passing or not passing, where you have defined 70 points as a passing grade.
Then perfect separation would mean there is a value of hours studied that perfectly separates the passing from the not-passing, e.g., 7 hours: everyone who studied 7 or more hours passed, while everyone who studied fewer than 7 hours did not pass. In that case, 7 hours perfectly separates the cases from the non-cases.

More generally, take a data set D := D(X, Y) for a binary logistic regression of X vs Y, both numerical variables*, where a value y in Y separates the cases from the non-cases, i.e., if y_j > y we consider observation j a case/yes and if y_j < y a non-case. We say there is perfect separation if there is a value x in X such that all x_j < x are cases (respectively non-cases) and all x_j > x are non-cases (respectively cases). In my example above, X = hours studied, Y = grade, y = 70 and x = 7.

* Y can also be categorical.
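To make the hours-studied example concrete, here is a small sketch (made-up numbers) of how complete separation shows up when you actually try to fit the model: the maximum-likelihood estimate does not exist, and statsmodels either raises a perfect-separation error or reports a huge, unstable slope, depending on the version.

```python
import numpy as np
import statsmodels.api as sm

hours = np.array([2, 3, 4, 5, 6, 6.5, 7, 7.5, 8, 9, 10, 11], dtype=float)
passed = (hours >= 7).astype(int)        # everyone at 7+ hours passed: complete separation

X = sm.add_constant(hours)               # intercept + hours studied
try:
    res = sm.Logit(passed, X).fit(disp=0)
    print(res.params)                    # if the fit returns at all, the slope is huge / unstable
except Exception as e:                   # e.g. a perfect-separation error in some versions
    print("Fit failed:", e)
```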
 
  • #11
Suppose you do a linear regression of w versus your three variables and come up with w = a0 + a1x1 + a2x2 + a3x3.
Then what is wrong with using a1x1 + a2x2 + a3x3 as your combined variable?
If your objection is that there is no clear physical interpretation of that variable, you will have that issue with any of the statistical methods. Statistics is strictly mathematical and knows nothing about the application subject.
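A minimal sketch of that suggestion (made-up data and coefficients): fit the linear regression, then reuse the fitted linear combination a1x1 + a2x2 + a3x3, dropping the intercept a0, as the single combined regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                                  # columns x1, x2, x3
w = 1.0 + X @ np.array([0.5, 1.5, -0.7]) + rng.normal(scale=0.3, size=100)

lin = LinearRegression().fit(X, w)
V = X @ lin.coef_                                              # a1*x1 + a2*x2 + a3*x3 (a0 dropped)
print(lin.intercept_, lin.coef_)
# V can now serve as the single combined variable in a subsequent (logistic) regression
```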
 
  • #12
FactChecker said:
Suppose you do a linear regression of w versus your three variables and come up with w = a0 + a1x1 + a2x2 + a3x3.
Then what is wrong with using a1x1 + a2x2 + a3x3 as your combined variable?
If your objection is that there is no clear physical interpretation of that variable, you will have that issue with any of the statistical methods. Statistics is strictly mathematical and knows nothing about the application subject.

Thank you, but how would I then go about dealing with the perfect separation issue? If each of X1, X2, X3 has perfect separation at, say, x1, x2, x3 (all in the same direction), then any linear combination aX1 + bX2 + cX3 with positive weights will have ax1 + bx2 + cx3 as a separating value.
 
  • #13
WWGD said:
Thanks. Re perfect (aka complete) separation in a standard binary logistic regression: basically, there is an input value that cleanly separates the cases from the non-cases.
An example before the formal definition: say you want to logistically regress hours studied against passing or not passing, where you have defined 70 points as a passing grade.
Then perfect separation would mean there is a value of hours studied that perfectly separates the passing from the not-passing, e.g., 7 hours: everyone who studied 7 or more hours passed, while everyone who studied fewer than 7 hours did not pass. In that case, 7 hours perfectly separates the cases from the non-cases.
Yeah, that was my worry...

In its simplest form, you're talking about a binary classification problem and whether it is perfectly linearly separable in your current dimensions. Technically it depends on the domain you're in, but in general when dealing with humans, the world is messy and I wouldn't expect clean separation. In fact, if things are completely separable I might get worried.

You can always map your data to a higher dimensional space (either explicitly or with kernels -- or, as a limiting case, with radial basis functions) to get complete separation, but this opens Pandora's box with respect to rampant overfitting. There is quite a bit to read and think about here with regularization, soft margin penalties, validation sets, and so on. I am not sure how to thoughtfully address all of this in a single post.

At times I think it's important to step back and recall that prediction models are supposed to be about minimizing out of sample prediction errors. Given a few power tools, anyone can minimize in sample prediction errors (i.e. drive them to zero, while hurting out of sample prediction quality).

"Learning from Data" will walk you through all of this and more. And of course if you already know a section you can speed read or skip it.
 
  • #14
StoneTemplePython said:
Yeah, that was my worry...

In its simplest form, you're talking about a binary classification problem and whether it is perfectly linearly separable in your current dimensions. Technically it depends on the domain you're in, but in general when dealing with humans, the world is messy and I wouldn't expect clean separation. In fact, if things are completely separable I might get worried.

You can always map your data to a higher dimensional space (either explicitly or with kernels -- or, as a limiting case, with radial basis functions) to get complete separation, but this opens Pandora's box with respect to rampant overfitting. There is quite a bit to read and think about here with regularization, soft margin penalties, validation sets, and so on. I am not sure how to thoughtfully address all of this in a single post.

At times I think it's important to step back and recall that prediction models are supposed to be about minimizing out of sample prediction errors. Given a few power tools, anyone can minimize in sample prediction errors (i.e. drive them to zero, while hurting out of sample prediction quality).

"Learning from Data" will walk you through all of this and more. And of course if you already know a section you can speed read or skip it.

What do you think about the idea of introducing an imaginary case to break the separation, e.g., in our case of exams and study, assuming a sample point where someone studied more than 7 hours and did not pass, or one where someone studied fewer than 7 hours and passed, or both? I would only need to change the output in one of my data pairs by less than 0.01 (all output values are larger than 1), i.e., I have a data pair (1, 1.2), and if I were to change it into the pair (1, 1.19), it would break the separation. I could justify this by saying that my measurement may have been inaccurate to start with.
 
  • #15
WWGD said:
What do you think about the idea of introducing an imaginary case to break the separation, e.g., in our case of exams and study, assuming a sample point where someone studied more than 7 hours and did not pass, or one where someone studied fewer than 7 hours and passed, or both? I would only need to change the output in one of my data pairs by less than 0.01 (all output values are larger than 1), i.e., I have a data pair (1, 1.2), and if I were to change it into the pair (1, 1.19), it would break the separation. I could justify this by saying that my measurement may have been inaccurate to start with.

If your goal is high quality predictions and you don't know with a high degree of confidence that there was an error, then at a high level I would be very uncomfortable with this. (Note: there are times when you know something is wrong -- e.g. looking at car insurance data, one of the drivers is listed as 160 years old -- with very high confidence the entry should say 16 years old, or at least something under 100. Choosing between adjusting it to 16, inserting some interpolated value, or just tossing out the data point needs some domain knowledge.)

If your goal is more promotional, i.e. you're selling a thesis / story of some kind, then I'd have to say variations of this kind of thing are done all the time -- with a bunch of hard-to-follow footnotes documenting said 'adjustments'.

Adjusting data values to 'improve' your model feels like data snooping to me. Adjusting data values before you see their impact because they look obviously wrong (e.g. 160 year old drivers) does not.
 

1. What is meant by "many-to-one" mapping in logistic regression?

Many-to-one mapping in logistic regression refers to combining multiple independent variables into a single composite variable in order to simplify the model and improve interpretability. This is often done when the independent variables are highly correlated with each other and could potentially cause multicollinearity issues.

2. How is the "many-to-one" mapping performed in logistic regression?

The "many-to-one" mapping is typically performed by creating a new dummy variable that represents the combination of the original variables. For example, if we have two independent variables A and B, we can create a new variable AB that takes the value of 1 when both A and B are present, and 0 otherwise.

3. What are the advantages of using "many-to-one" mapping in logistic regression?

One of the main advantages of "many-to-one" mapping is the simplification of the model. By combining multiple variables into one, we can reduce the number of independent variables and potentially improve the interpretability of the model. Additionally, this can help to reduce the risk of multicollinearity, which can lead to unstable and unreliable results.

4. Are there any limitations to using "many-to-one" mapping in logistic regression?

Yes, there are some limitations to using "many-to-one" mapping in logistic regression. One limitation is that it assumes the combined variable relates linearly to the log-odds of the outcome; if this assumption is violated, the results may be biased. Additionally, the interpretation of coefficients becomes more complex when using "many-to-one" mapping, as a coefficient represents the effect of a combination of variables rather than of an individual variable.

5. When should "many-to-one" mapping be used in logistic regression?

"Many-to-one" mapping should be used in logistic regression when the independent variables are highly correlated and could potentially cause multicollinearity issues. It can also be useful when there are a large number of independent variables, as it can help to simplify the model and improve interpretability. However, it is important to assess the assumptions and limitations before using "many-to-one" mapping in order to ensure reliable results.
