Discussion Overview
The discussion concerns how to combine several numerical variables (v1, v2, v3) into a single variable (V) that can serve as the predictor in a logistic regression with dependent variable w. Participants consider statistical methods such as principal component analysis and factor analysis, and raise questions about independence among the variables and about perfect separation in logistic regression.
Discussion Character
- Exploratory
- Technical explanation
- Debate/contested
- Mathematical reasoning
Main Points Raised
- One participant asks how to combine three variables (theft, inefficiency, loss) into a single variable for a logistic regression in which a control measure (w) is the dependent variable.
- Some participants suggest principal component analysis (PCA) as a method to consolidate the variables while retaining information.
- Others propose factor analysis as an alternative, to identify latent variables that underlie the three observed variables and may in turn relate to the dependent variable.
- Some suggest creating the new variable by summing or averaging the three variables, though whether this is appropriate depends on the specific research question.
- Concerns are raised that the variables may violate an independence assumption if each shows a similar S-curve relationship with the output.
- One participant discusses whitening the data (transforming it so that the features are uncorrelated and have unit variance) to address potential issues with correlated features, and mentions singular value decomposition (SVD) as a related technique.
- Another participant seeks clarification on the implications of perfect separation in logistic regression, providing an example to illustrate the concept.
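Two of the combination options raised above, averaging and PCA, can be sketched as follows. This is a minimal illustration rather than a method any participant posted. The data are synthetic, the helper names (`standardize`, `covariance_matrix`, `first_principal_axis`) are hypothetical, and power iteration is used only to keep the example free of third-party libraries. The variables are standardized first on the assumption that theft, inefficiency, and loss are measured on different scales.

```python
import math
import random

def standardize(column):
    """Center a column to mean 0 and scale it to unit sample variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / (n - 1)
    sd = math.sqrt(var)
    return [(x - mean) / sd for x in column]

def covariance_matrix(columns):
    """Sample covariance matrix of columns that are already centered."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def first_principal_axis(cov, iterations=500):
    """Leading eigenvector of a symmetric matrix via power iteration.

    Starts from the uniform vector, which is adequate when the variables
    are positively correlated, as they are in the synthetic data below.
    """
    k = len(cov)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

# Synthetic stand-ins for v1, v2, v3 (assumption: they share one latent driver).
random.seed(0)
n = 200
latent = [random.gauss(0, 1) for _ in range(n)]
v1 = [2.0 * t + random.gauss(0, 0.5) for t in latent]
v2 = [1.0 * t + random.gauss(0, 0.5) for t in latent]
v3 = [0.5 * t + random.gauss(0, 0.5) for t in latent]

z = [standardize(c) for c in (v1, v2, v3)]

# Option 1: average of the standardized variables.
V_mean = [sum(row) / 3 for row in zip(*z)]

# Option 2: score on the first principal component.
axis = first_principal_axis(covariance_matrix(z))
V_pca = [sum(w * x for w, x in zip(axis, row)) for row in zip(*z)]

# Share of total variance kept by the first component. The total is 3
# because each standardized column has variance 1.
explained = sum(s * s for s in V_pca) / (n - 1) / 3

print("first principal axis:", [round(a, 3) for a in axis])
print("variance explained:", round(explained, 3))
```

Either `V_mean` or `V_pca` could then be used as the single predictor V. The `explained` ratio gives one way to judge how much information the reduction to a single component discards.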
Areas of Agreement / Disagreement
Participants disagree on the best way to combine the variables, and no consensus emerges on a standard approach. Questions about data independence and the implications of perfect separation also remain unresolved.
Contextual Notes
Participants assume that basic data preprocessing, including centering the data, has already been completed. They also note potential limitations related to the independence of the variables and to perfect separation in logistic regression.
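Perfect separation can be illustrated with a toy example. The sketch below is not the example given in the thread. It assumes a one-predictor logistic model with no intercept, fitted by plain gradient ascent, and uses hypothetical data in which the sign of x determines y exactly. Under perfect separation the likelihood has no finite maximizer, so the fitted slope keeps growing as iterations increase. With overlapping classes it settles at a finite value.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_slope(xs, ys, steps, lr=0.1):
    """Gradient ascent on the log-likelihood of a no-intercept logistic model."""
    beta = 0.0
    for _ in range(steps):
        grad = sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys))
        beta += lr * grad
    return beta

xs = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]

# Hypothetical separated data: y = 1 exactly when x > 0.
ys_sep = [0, 0, 0, 1, 1, 1]

# Hypothetical overlapping data: two labels near zero are swapped.
ys_mix = [0, 0, 1, 0, 1, 1]

beta_sep_short = fit_slope(xs, ys_sep, steps=100)
beta_sep_long = fit_slope(xs, ys_sep, steps=10_000)
beta_mix_short = fit_slope(xs, ys_mix, steps=100)
beta_mix_long = fit_slope(xs, ys_mix, steps=10_000)

print("separated:  ", beta_sep_short, "->", beta_sep_long)
print("overlapping:", beta_mix_short, "->", beta_mix_long)
```

In the separated case the slope estimate drifts upward without converging, which is why software fitting such data typically reports very large coefficients or convergence warnings.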