Question on Probability - been way too long since college

ckirmser · Sep 26, 2019

Summary: Probability of an event based on a data table

Good morning, all -

I'm working on a question involving obesity based on alcohol and tobacco consumption. The question is based on a table with five variables:

• (age) An age group (10-25, 26-50, 51-75, 76+)
• (alc) An alcohol consumption group in g/day (0-40, 41-80, 80-120, 121+)
• (tob) A tobacco consumption group in g/day (0-10, 11-20, 21-30, 31+)
• (num_case) A number of obese cases (X)
• (num_cont) A number of controls (Y)

The question is, "What is the probability that a subject in the highest alcohol consumption group is obese?"

I figured the answer would be to first select only those rows in the table where alc = "121+". Then, from those, sum the num_case entries and divide that by the sum of the num_case entries plus the sum of the num_cont entries. In pseudocode:

WHERE alc = "121+"
SUM(num_case) / (SUM(num_case) + SUM(num_cont))
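
The pseudocode above can be sketched in Python as follows. This is only an illustration: the rows and their counts are made up, the helper name `case_fraction` is hypothetical, and it assumes num_cont counts the non-obese subjects in each row.

```python
# Hypothetical rows; the counts are invented for illustration only.
rows = [
    {"age": "26-50", "alc": "121+", "tob": "0-10", "num_case": 2, "num_cont": 6},
    {"age": "51-75", "alc": "121+", "tob": "11-20", "num_case": 3, "num_cont": 4},
    {"age": "51-75", "alc": "0-40", "tob": "0-10", "num_case": 1, "num_cont": 20},
]


def case_fraction(rows, alc_group):
    """SUM(num_case) / (SUM(num_case) + SUM(num_cont)) over rows in alc_group."""
    # Filter first, then sum only the filtered rows.
    selected = [r for r in rows if r["alc"] == alc_group]
    cases = sum(r["num_case"] for r in selected)
    controls = sum(r["num_cont"] for r in selected)
    total = cases + controls
    if total == 0:
        raise ValueError(f"no subjects in alcohol group {alc_group!r}")
    return cases / total


# With the made-up rows: (2 + 3) / (2 + 3 + 6 + 4) = 5 / 15
print(case_fraction(rows, "121+"))
```

The filter is applied before either sum is taken, so both the numerator and the denominator describe the same subset of rows.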

Apparently, this is not the answer, but I can't think of what else it might be.

So, I was hoping someone here might be able to clear this mental roadblock for me.

Thanx in advance!

PeroK · Sep 26, 2019

What happens with your calculation if there is only one subject in the highest alcohol group and they are obese?

ckirmser · Sep 26, 2019

I made the presumption - since there was no guidance otherwise and the table data bears this out - that the num_cont value is never 0; there is always at least one control.

So, presumably, your scenario couldn't happen. If there is only one subject in a row, that subject must be a control value because num_cont != 0, not an obese value. But, if it can happen, then my calculation would yield 1 / (1 + 0) = 1; a probability of 1.

Obviously, my calculation is wrong. And, my presumption is probably wrong, even though the table data supports it. Maybe there can be a situation where num_case > 0 and num_cont = 0, but it never happens in the table. I'm sure that there is some formula necessary to determine this, but I've been searching for two days and have yet to stumble upon the proper wording for the search to yield it.

Thanx for your reply, PeroK!

PeroK · Sep 26, 2019

Forgive my ignorance, but what's a "control"?

ckirmser · Sep 26, 2019

A control is something involved in a test that does not have whatever is being tested applied to it.

For example, if one is testing a new drug, controls are those subjects who are not given the drug; they are there to see what happens if nothing is done, against whom the test subjects - the ones receiving the drug - are compared.

I'm not sure how that applies to this table, but because there is the variable num_cont, I figured that represented the control subjects. That, and because it was always non-zero.

But, maybe I'm wrong. Maybe I'm over-thinking this question and the associated data. But, I tried an answer using num_cont as the population. In that case, I used;

WHERE alc = "121+"
SUM(num_case) / SUM(num_cont)

But, that gave me the wrong answer, too. And, because the num_cont value is never zero, it would still result in some real number.
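
As a side note, SUM(num_case) / SUM(num_cont) is the odds of being a case rather than the probability, assuming num_cont counts the non-obese subjects. The two are related by p = odds / (1 + odds). A small sketch with made-up counts (the function name `odds_to_probability` is hypothetical):

```python
def odds_to_probability(odds):
    """Convert odds (cases per control) into a probability."""
    return odds / (1 + odds)


# Made-up counts for illustration only.
cases, controls = 5, 10

odds = cases / controls                 # cases per control
direct = cases / (cases + controls)     # cases per subject
converted = odds_to_probability(odds)   # same value as `direct`, up to rounding

print(odds, direct, converted)
```

Both routes give the same probability, so dividing by the controls alone changes the quantity being computed, not just its scale.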

So, I'm still lost. I'm sure the answer is in a Statistics book somewhere, but I sold mine back to the school after my classes were over and that was two decades ago - I've slept since then...

PeroK · Sep 26, 2019

ckirmser said:

A control is something involved in a test that does not have whatever is being tested applied to it.

What does that mean in this context?

PeroK · Sep 26, 2019

PS my limited understanding of statistical analysis is:

You have a group of people with certain factors: e.g. heavy drinkers. And you have a "control" group who do not have that factor.

You can then calculate the probability that a heavy drinker is obese, say, and the probability that a non-heavy-drinker is obese, and compare the two.

What I understand you are doing is counting the control group (of non-drinkers) in with the heavy drinkers?

ckirmser · Sep 26, 2019

Well, since the question is asking what is the probability of someone in the highest alcohol group being obese, I figure that is the number of obese in the group divided by the population (the obese and not obese) of that group. Like if I had 5 red straws and 10 not red straws, the probability of being a red straw is 5/15.

Are you thinking that the answer is to take those in the designated alcohol group and find their probability against the population of the entire table?

Hmm. Well, I can give it a shot...

PeroK · Sep 26, 2019

Are you saying that "control" in this context means "not obese"?

ckirmser · Sep 26, 2019

That was the guess I took, based on the names of the variables. But, honestly, I don't know for certain what it means and I have no way to ask whoever made the table.

That's maybe why I'm having such a problem; a lack of information on what the table represents. I had figured someone familiar with statistics would recognize what I'm asking at a glance and just rattle off the formula to get the answer. I'm sure it's something simple, I'm just not seeing it.

Given further thought, that must be the case; num_case is a count of those who are obese and meet the other conditions of age, alcohol and tobacco consumption, and num_cont is a count of those who are not obese. So, to get a probability, I have to divide by the sum of both variables, because that's the population.

I'm missing something else somewhere, but don't know what.

ckirmser · Sep 26, 2019

OK, I was right. After beating myself over the head for two days, I discovered that my code was not summing the data properly.

Rather than summing just the filtered data, it was summing the num_case and num_cont for the entire table. But, I think that's an issue with my session of Visual Studio Code, because the syntax is correct. I had to break the thing down into individual steps and manually run it to find the problem.
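
The actual code is not shown in this thread, so the following is only a hypothetical Python sketch of that kind of mistake: the filter is defined, but the sums are taken over the whole table. The rows, counts, and function names are all invented for illustration.

```python
# Hypothetical rows; the counts are invented for illustration only.
rows = [
    {"alc": "121+", "num_case": 2, "num_cont": 6},
    {"alc": "121+", "num_case": 3, "num_cont": 4},
    {"alc": "0-40", "num_case": 1, "num_cont": 20},
]


def fraction_whole_table(rows):
    """The mistake: sums run over every row, ignoring the alc filter."""
    cases = sum(r["num_case"] for r in rows)
    total = sum(r["num_case"] + r["num_cont"] for r in rows)
    return cases / total


def fraction_filtered(rows, alc_group):
    """The intended calculation: sums run over the filtered rows only."""
    selected = [r for r in rows if r["alc"] == alc_group]
    cases = sum(r["num_case"] for r in selected)
    total = sum(r["num_case"] + r["num_cont"] for r in selected)
    return cases / total


print(fraction_whole_table(rows))       # 6 / 36 with the made-up rows
print(fraction_filtered(rows, "121+"))  # 5 / 15 with the made-up rows
```

Printing the filtered row count or the denominator at each step, as described above, is a quick way to spot when the two calculations have been mixed up.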

I'm going to try this out in RStudio and see if that fixes it.

I really appreciate your time, PeroK.

WWGD · Sep 26, 2019

PeroK said:

What does that mean in this context?

Usually it is someone who is not given the treatment but whose outcome of interest is observed. There is (always, I think) a form of "blinding" in that either the subject, the experimenter, or both do not know ahead of time which subjects are controls. Maybe the clearest example is someone given a medical treatment to test whether it has some effect, say weight loss. Some subjects will not be given the treatment, and it will be observed whether they lose weight. This helps determine the psychological (placebo) component of the effect.

WWGD · Sep 26, 2019

ckirmser, are you familiar with conditional probabilities? I think you can frame this question in those terms.
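
One way to write that framing out, assuming (as guessed earlier in the thread) that num_cont counts the non-obese subjects. The symbols are notation introduced here, not taken from the table: $N$ is the total number of subjects, and $C$ and $K$ are the sums of num_case and num_cont over the rows with alc = "121+".

```latex
P(\text{obese} \mid \text{alc} = 121{+})
  = \frac{P(\text{obese} \cap \text{alc} = 121{+})}{P(\text{alc} = 121{+})}
  = \frac{C / N}{(C + K) / N}
  = \frac{C}{C + K}
```

The total $N$ cancels, which is why only the rows in the highest alcohol group enter the calculation.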
