Using an odds ratio when data is sparse

snowfox2004 · Jul 24, 2014

Suppose I have around 20 exposures that potentially affect an outcome, and I want to see which exposures have the biggest impact on it. My plan is to calculate each exposure's odds ratio by exponentiating the coefficients obtained from logistic regression. I have the following input set and output set, where 1 means it (exposure or outcome) is present and 0 means it is not present:

So, for example, the first row represents a sample where exposure 1 wasn't present, exposure 2 was present, ..., exposure 20 was present, and the outcome was present. I fit a logistic regression model to this data and exponentiate the coefficients to get odds ratios. The potential problem is that I am going to be working with a VERY sparse data set with many samples. There are many instances where almost all exposures except one or maybe two are going to be present in a sample. My question is whether this sparsity is something to be concerned about, and whether it makes comparing exposures by their odds ratios a bad idea.
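For concreteness, here is a minimal sketch of the procedure described above: fit a logistic regression and exponentiate the coefficients. It uses only two exposures and made-up counts, and the helper names fit_logistic and odds_ratios are hypothetical. The fit is done with plain gradient ascent so the example needs no libraries; a real analysis would normally use a statistics package instead.

```python
import math

def fit_logistic(rows, outcomes, lr=1.0, iters=3000):
    """Fit logistic regression by gradient ascent on the mean log-likelihood.

    rows: list of lists of 0/1 exposure indicators; outcomes: list of 0/1.
    Returns [intercept, beta_1, ..., beta_k].
    The step size and iteration count suit this small example only.
    """
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)
    for _ in range(iters):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, outcomes):
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
            resid = y - 1.0 / (1.0 + math.exp(-z))  # observed minus fitted probability
            grad[0] += resid
            for j, xi in enumerate(x, start=1):
                grad[j] += resid * xi
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def odds_ratios(beta):
    """Exponentiate every coefficient except the intercept."""
    return [math.exp(b) for b in beta[1:]]

# Made-up data: (exposure_1, exposure_2) -> (count with outcome, count without)
counts = {(0, 0): (10, 30), (1, 0): (20, 20), (0, 1): (15, 25), (1, 1): (30, 10)}
rows, outcomes = [], []
for exposure, (cases, noncases) in counts.items():
    rows += [list(exposure)] * (cases + noncases)
    outcomes += [1] * cases + [0] * noncases

beta = fit_logistic(rows, outcomes)
for j, oratio in enumerate(odds_ratios(beta), start=1):
    print(f"exposure {j}: adjusted OR = {oratio:.2f}")
```

With a single exposure, the fitted odds ratio should match the usual 2x2 cross-product ratio, which is a handy sanity check on any implementation.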

Page 6 of this paper http://www.epidemiology.ch/history/PDF%20bg/Greenland%20S%201987%20interpretation%20and%20choice%20of%20effect%20measures.pdf seems to imply that sparsity won't matter too much but I want to see what the statisticians here say. Any links to papers would be appreciated.

FactChecker · Jul 24, 2014

Your question is the basic question addressed by the statistical subjects of Analysis of Variance (ANOVA) and design of experiments. ANOVA tries to tell which factors are the main drivers of the outcome. Design of experiments tries to tell you what combinations of factors need to be included in a set of experiments to obtain valid statistical results. There are statistical software packages that can help you do the analysis.

One problem that your post does not mention is that the effects of exposures might depend on how they are combined with other exposures. If you are really sure that the effects are independent, the problem is much simpler.

snowfox2004 · Jul 24, 2014

I thought that fitting a logistic regression model to the ENTIRE input set, with all the exposures included, would automatically adjust the odds ratios for possible confounding variables, according to page 319 of this paper: http://www.iarc.fr/en/publications/pdfs-online/epi/cancerepi/CancerEpi-14.pdf

Thanks for the input on ANOVA. Does ANOVA work well even with sparse data?

FactChecker · Jul 24, 2014

I have to admit that I can't make it through the unfamiliar (to me) terminology of your first reference and I couldn't open the link of your second reference. So I am not sure how they address the issue of interacting factors.

If you think about how many possible combinations there are of 20 possible exposure types, I'm sure you will agree that the number is enormous (2^20, which is more than a million). Nothing can solve the problem unless the number of interacting effects is assumed to be very limited. A regression model with only main-effect terms will assume that each exposure adds a certain amount (on the log-odds scale, in the case of logistic regression), regardless of what other exposures are present. If you accept that, I think your method should work. You can also judiciously introduce additional variables for the combinations which you suspect might affect each other. ANOVA is a general study of the effects of multiple factors, including low-order interactions. Experimental design helps you to design experiments that are efficient. Its main concern is to design experiments where you can draw valid conclusions from sparse data. If you already have your data, it may be too late for that.
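To put rough numbers on this, the sketch below counts the exposure patterns for 20 binary exposures and the number of coefficients a model needs when interactions up to a given order are included. The function name parameter_count is hypothetical, chosen for illustration.

```python
from math import comb

def parameter_count(n_exposures, max_order):
    """Coefficients in a model with an intercept plus all interaction terms
    among binary exposures up to max_order (order 1 = main effects only)."""
    return sum(comb(n_exposures, r) for r in range(max_order + 1))

n_exposures = 20
print("possible exposure patterns:", 2 ** n_exposures)
for order in (1, 2, 3):
    print(f"interactions up to order {order}:", parameter_count(n_exposures, order), "coefficients")
```

The main-effects model needs only 21 coefficients, while adding every pairwise interaction already brings the total to 211, which is why interaction terms are best added selectively.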

blue_raver22 · Jul 31, 2014

I would suggest considering an alternative approach to analyzing your data. While using odds ratios may seem like a straightforward method for comparing exposures, the sparsity of your data may lead to unreliable results. This is because odds ratio estimates become unstable when some combinations of exposure and outcome contain very few observations, and the estimates may then be inflated or deflated.

Instead, I would recommend using a different measure such as risk ratios or risk differences, which are less sensitive to rare events and can provide more stable estimates. Additionally, you may want to consider using a different statistical method such as Bayesian analysis, which can handle sparse data more effectively.
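How these measures behave in a sparse table can be checked directly. The sketch below computes the odds ratio, risk ratio, and risk difference from a 2x2 table and then adds a single exposed case to see how far each one moves. The counts are made up and the function name effect_measures is hypothetical.

```python
def effect_measures(a, b, c, d):
    """2x2 table: a = exposed with outcome, b = exposed without,
    c = unexposed with outcome, d = unexposed without.
    Returns (odds ratio, risk ratio, risk difference).
    Assumes no zero cells; a zero in b or c would divide by zero."""
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, risk_exposed / risk_unexposed, risk_exposed - risk_unexposed

# Made-up sparse table: 100 exposed and 1000 unexposed subjects, very few events.
before = effect_measures(1, 99, 1, 999)
after = effect_measures(2, 98, 1, 999)   # one more exposed case

for name, x, y in zip(("odds ratio", "risk ratio", "risk difference"), before, after):
    print(f"{name}: {x:.3f} -> {y:.3f}")
```

In this made-up table the odds ratio and the risk ratio both roughly double when one exposed case is added, while the risk difference moves by one percentage point, so it is worth checking on your own data which measure is actually more stable.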

Furthermore, I would also suggest carefully examining your sample size and considering if it is sufficient for the number of exposures you are trying to analyze. If your sample size is too small, it may be difficult to draw reliable conclusions about the impact of each exposure on the outcome.

In conclusion, while sparsity may not completely invalidate your method of comparing exposures using odds ratios, it is something to be cautious of. I would recommend exploring alternative statistical methods and carefully considering the sample size before drawing any conclusions from your data. Lastly, I would suggest consulting with a statistician for further guidance and to ensure that your analysis is appropriate for your specific research question and data.
