Using an odds ratio when data is sparse

1. Jul 24, 2014

snowfox2004

Suppose I have around 20 exposures that potentially affect an outcome, and I want to see which exposures have the biggest impact on it. My plan is to calculate each exposure's odds ratio by exponentiating the coefficients obtained from logistic regression. I have the following input set and output set, where 1 means the exposure or outcome is present and 0 means it is not present:

So, for example, the first row represents a sample where exposure 1 wasn't present, exposure 2 was present, ..., exposure 20 was present, and the outcome was present. I fit a logistic regression model to this data and exponentiate the coefficients to get odds ratios. The potential problem is that I am going to be working with a VERY sparse data set with many samples. In many samples, only one or maybe two exposures are going to be present. My question is whether this sparsity is something to be concerned about, and whether it makes comparing exposures by their odds ratios a bad idea.
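A minimal sketch of the fit-then-exponentiate step described above, in plain Python. Everything here is an assumption for illustration: the toy counts, the two-exposure layout, the helper name `fit_logistic`, and the use of simple gradient ascent in place of whatever fitting routine a statistics package would use.

```python
import math

def fit_logistic(X, y, lr=1.0, n_iter=1000):
    """Fit a main-effects logistic regression by gradient ascent on the
    mean log-likelihood. Returns [intercept, b1, ..., bk]."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(n_iter):
        grad = [0.0] * (k + 1)
        for row, yi in zip(X, y):
            z = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of outcome
            err = yi - p
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical toy data: (exposure 1, exposure 2, outcome, number of samples).
cells = [
    (0, 0, 1, 10), (0, 0, 0, 40),
    (1, 0, 1, 20), (1, 0, 0, 20),
    (0, 1, 1, 15), (0, 1, 0, 30),
    (1, 1, 1, 12), (1, 1, 0, 6),
]

# Expand the counts into one 0/1 row per sample, as in the layout above.
X, y = [], []
for e1, e2, outcome, count in cells:
    X.extend([[e1, e2]] * count)
    y.extend([outcome] * count)

beta = fit_logistic(X, y)
odds_ratios = [math.exp(b) for b in beta[1:]]  # skip the intercept
print(odds_ratios)
```

The toy counts were chosen so that exposure 1 has an odds ratio of 4 and exposure 2 an odds ratio of 2 at every level of the other exposure, so the fitted values should land very close to those numbers.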

Page 6 of this paper http://www.epidemiology.ch/history/PDF%20bg/Greenland%20S%201987%20interpretation%20and%20choice%20of%20effect%20measures.pdf seems to imply that sparsity won't matter too much but I want to see what the statisticians here say. Any links to papers would be appreciated.

2. Jul 24, 2014

FactChecker

Your question is the basic question addressed by the statistical subjects of Analysis of Variance (ANOVA) and design of experiments. ANOVA tries to tell which factors are the main drivers of the outcome. Design of experiments tries to tell you what combinations of factors need to be included in a set of experiments to obtain valid statistical results. There are statistical software packages that can help you do the analysis.

One problem that your post does not mention is that the effects of exposures might depend on how they are combined with other exposures. If you are really sure that the effects are independent, the problem is much simpler.

3. Jul 24, 2014

snowfox2004

I thought that fitting a logistic regression model to the ENTIRE input set, with all the exposures included, would automatically adjust the odds ratios for possible confounding variables, according to page 319 of this paper: http://www.iarc.fr/en/publications/pdfs-online/epi/cancerepi/CancerEpi-14.pdf

Thanks for the input on ANOVA. Does ANOVA work well even with sparse data?

4. Jul 24, 2014

FactChecker

I have to admit that I can't make it through the unfamiliar (to me) terminology of your first reference and I couldn't open the link of your second reference. So I am not sure how they address the issue of interacting factors.

If you think about how many possible combinations there are of 20 exposure types (2^20, or more than a million), I'm sure you will agree that you cannot expect to observe enough samples of each one. Nothing can solve the problem unless the number of interacting effects is assumed to be very limited. A regression model with only main effects assumes that each exposure adds a certain amount (in logistic regression, to the log-odds of the outcome), regardless of what other exposures are present. If you accept that, I think your method should work. You can also judiciously introduce additional variables for the combinations that you suspect might affect each other. ANOVA is a general study of the effects of multiple factors, including low-order interactions. Experimental design helps you to design experiments that are efficient. Its main concern is to design experiments where you can draw valid conclusions from sparse data. If you already have your data, it may be too late for that.
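To put numbers on the counting argument and on the idea of adding variables for suspected combinations, here is a small Python sketch. The helper `add_interactions`, the four-exposure sample row, and the list of suspected pairs are all hypothetical, chosen only for illustration.

```python
from math import comb

n_exposures = 20
n_patterns = 2 ** n_exposures    # every possible presence/absence pattern
n_pairs = comb(n_exposures, 2)   # candidate two-way interactions
print(n_patterns, n_pairs)

def add_interactions(row, pairs):
    """Append one product column per (i, j) pair to a 0/1 exposure row.
    The product is 1 only when both exposures are present."""
    return list(row) + [row[i] * row[j] for i, j in pairs]

row = [0, 1, 1, 0]               # hypothetical sample with 4 exposures
suspected = [(1, 2), (0, 3)]     # hypothetical pairs thought to interact
extended = add_interactions(row, suspected)
print(extended)
```

Each added column gets its own coefficient in the regression, so including only a handful of suspected pairs keeps the model far smaller than one that tries to cover all 190 pairs, let alone every pattern.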