Using an odds ratio when data is sparse

  • Thread starter: snowfox2004
  • Tags: Data, Ratio
AI Thread Summary
Calculating odds ratios from logistic regression can be problematic when working with sparse data, particularly when many exposures influence an outcome. The discussion highlights concerns about whether exposure effects are independent and about the complexity of interactions among multiple exposures. While ANOVA can help identify the main drivers of an outcome, its effectiveness with sparse data is questioned. The very large number of possible exposure combinations complicates the analysis, so interactions need careful consideration. The method may work if the exposures are assumed not to interact, but the challenges posed by sparsity and interaction effects should not be overlooked.
snowfox2004
Suppose I have around 20 exposures that potentially affect an outcome, and I want to see which exposures have the biggest impact on it. My plan is to calculate each exposure's odds ratio by exponentiating the coefficients obtained from logistic regression. I have the following input set and output set, where 1 means it (exposure or outcome) is present and 0 means not present:

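[Image 4GTNy.png: table of 0/1 values, one row per sample, with columns for exposure 1 through exposure 20 and the outcome]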


So, for example, the first row represents a sample where exposure 1 wasn't present, exposure 2 was present, ..., exposure 20 was present, and the outcome was present. I fit a logistic regression model to this data and exponentiate the coefficients to get odds ratios. The potential problem is that I am going to be working with a VERY sparse data set with many samples: in many samples, only one or maybe two exposures are going to be present. My question is whether this sparsity is something to be concerned about, and whether it makes my method of comparing exposures using odds ratios a bad idea.
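
For concreteness, the fit-then-exponentiate step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `fit_logistic`, the learning rate, and the toy counts are made up for the example, and plain gradient ascent stands in for whatever fitting routine a real statistics package would use.

```python
import math

def fit_logistic(rows, outcomes, lr=1.0, iters=5000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    rows: list of 0/1 exposure lists; outcomes: list of 0/1 values.
    Returns [intercept, beta_1, ..., beta_p].
    """
    n, p = len(rows), len(rows[0])
    beta = [0.0] * (p + 1)
    for _ in range(iters):
        grad = [0.0] * (p + 1)
        for x, y in zip(rows, outcomes):
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
            resid = y - 1.0 / (1.0 + math.exp(-z))  # observed minus predicted
            grad[0] += resid
            for j, xi in enumerate(x):
                grad[j + 1] += resid * xi
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data with a single exposure:
# 30 exposed cases, 10 exposed non-cases, 20 unexposed cases, 40 unexposed non-cases.
rows = [[1]] * 30 + [[1]] * 10 + [[0]] * 20 + [[0]] * 40
outcomes = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40

beta = fit_logistic(rows, outcomes)
odds_ratios = [math.exp(b) for b in beta[1:]]  # one odds ratio per exposure
print(odds_ratios)
```

With a single binary exposure, the fitted odds ratio should agree with the cross-product ratio of the 2x2 table, (30 × 40) / (10 × 20) = 6, which gives a quick sanity check on the fit.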

Page 6 of this paper, http://www.epidemiology.ch/history/PDF%20bg/Greenland%20S%201987%20interpretation%20and%20choice%20of%20effect%20measures.pdf, seems to imply that sparsity won't matter too much, but I want to see what the statisticians here say. Any links to papers would be appreciated.
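
One simple way to see where sparsity might bite is to tabulate each exposure against the outcome and look for empty cells. The sketch below is an assumption-laden illustration (the helper names `two_by_two` and `sparse_exposures` and the toy data are hypothetical, not taken from the paper): when a cell is empty, the crude odds ratio ad/bc is 0 or undefined, and the corresponding logistic regression coefficient may have no finite maximum likelihood estimate.

```python
def two_by_two(rows, outcomes, j):
    """Return (a, b, c, d) for exposure j:
    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    a = b = c = d = 0
    for x, y in zip(rows, outcomes):
        if x[j] and y:
            a += 1
        elif x[j]:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d

def sparse_exposures(rows, outcomes):
    """List the exposure indices whose 2x2 table has at least one empty cell."""
    n_exposures = len(rows[0])
    return [j for j in range(n_exposures)
            if 0 in two_by_two(rows, outcomes, j)]

# Hypothetical toy data: 8 samples, 3 exposures; exposure 2 appears only once.
rows = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
outcomes = [1, 0, 1, 0, 1, 1, 0, 1]

flagged = sparse_exposures(rows, outcomes)
print(flagged)
```

Exposures flagged this way are the ones whose odds ratios deserve the most suspicion, whatever model is fitted afterwards.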
 
Your question is the basic question addressed by the statistical subjects of Analysis of Variance (ANOVA) and design of experiments. ANOVA tries to tell which factors are the main drivers of the outcome. Design of experiments tries to tell you what combinations of factors need to be included in a set of experiments to obtain valid statistical results. There are statistical software packages that can help you do the analysis.

One problem that your post does not mention is that the effects of exposures might depend on how they are combined with other exposures. If you are really sure that the effects are independent, the problem is much simpler.
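
To make the point about combined effects concrete, here is a sketch of the two situations in logistic regression notation (the symbols ##x_i##, ##\beta_i## and ##\beta_{12}## are introduced here for illustration only). With independent effects, the model has main-effect terms only:

$$\operatorname{logit} P(Y=1 \mid x) = \beta_0 + \sum_{i=1}^{20} \beta_i x_i$$

so the odds ratio for exposure ##i## is ##e^{\beta_i}## no matter which other exposures are present. If, say, exposures 1 and 2 interact, an extra product term is needed:

$$\operatorname{logit} P(Y=1 \mid x) = \beta_0 + \sum_{i=1}^{20} \beta_i x_i + \beta_{12}\, x_1 x_2$$

and the odds ratio for exposure 1 is then ##e^{\beta_1}## when exposure 2 is absent but ##e^{\beta_1 + \beta_{12}}## when exposure 2 is present, so a single number no longer describes its effect.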
 
I thought that fitting a logistic regression model to the ENTIRE input set, with all the exposures included, would automatically adjust the odds ratios for possible confounding variables, according to page 319 of this paper: http://www.iarc.fr/en/publications/pdfs-online/epi/cancerepi/CancerEpi-14.pdf

Thanks for the input on ANOVA. Does ANOVA work well even with sparse data?
 
I have to admit that I can't make it through the unfamiliar (to me) terminology of your first reference and I couldn't open the link of your second reference. So I am not sure how they address the issue of interacting factors.

If you think about how many possible combinations there are of 20 possible exposure types (2^20, which is over a million if each exposure is simply present or absent), I'm sure you will agree that you cannot hope to observe them all. Nothing can solve the problem unless the number of interacting effects is assumed to be very limited. A regression model with only main-effect terms will assume that each exposure adds a certain amount (in logistic regression, to the log-odds of the outcome), regardless of what other exposures are present. If you accept that, I think your method should work. You can also judiciously introduce additional variables for the combinations which you suspect might affect each other. ANOVA is a general study of the effects of multiple factors, including low-order interactions. Experimental design helps you to design experiments that are efficient. Its main concern is to design experiments where you can draw valid conclusions from sparse data. If you already have your data, it may be too late for that.
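
As a rough illustration of the counting argument and of "additional variables for combinations", here is a small Python sketch. The helper `add_interactions` and the choice of suspected pair are hypothetical, picked only for the example; the idea is that each suspected pair contributes one extra 0/1 column equal to the product of the two exposure columns.

```python
import math

n_exposures = 20
n_patterns = 2 ** n_exposures        # distinct present/absent exposure patterns
n_pairs = math.comb(n_exposures, 2)  # possible two-way interaction terms

def add_interactions(row, pairs):
    """Append one product column x_i * x_j per suspected pair (0-based indices)."""
    return list(row) + [row[i] * row[j] for i, j in pairs]

# Hypothetical assumption: only exposures 1 and 2 (indices 0 and 1) are suspected to interact.
suspected = [(0, 1)]
row = [1, 1, 0] + [0] * 17           # a sample with exposures 1 and 2 present
augmented = add_interactions(row, suspected)

print(n_patterns, n_pairs, len(augmented))
```

Even restricting attention to two-way interactions adds many candidate terms, which is why adding them only for pairs you have a reason to suspect keeps the model estimable from sparse data.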
 