Question on Probability - been way too long since college

AI Thread Summary
The discussion revolves around calculating the probability of obesity in individuals from the highest alcohol consumption group (121+ g/day) based on a data table with variables including age, alcohol, tobacco consumption, and counts of obese cases and controls. The initial approach involved summing the number of obese cases and dividing by the total of obese cases and controls, but this was deemed incorrect. Participants clarified the definition of "control" as subjects who do not exhibit the condition being studied, which in this case refers to non-obese individuals. The confusion stemmed from the need to correctly filter and sum the data specific to the alcohol group rather than the entire dataset. Ultimately, the solution requires a proper understanding of conditional probabilities and accurate data manipulation to derive the correct probability.
ckirmser
Summary: Probability of an event based on a data table

Good morning, all -

I'm working on a question involving obesity based on alcohol and tobacco consumption. The question is based on a table with five variables:

• (age) An age group (10-25, 26-50, 51-75, 76+)
• (alc) An alcohol consumption group in g/day (0-40, 41-80, 80-120, 121+)
• (tob) A tobacco consumption group in g/day (0-10, 11-20, 21-30, 31+)
• (num_case) A number of obese cases (X)
• (num_cont) A number of controls (Y)

The question is, "What is the probability that a subject in the highest alcohol consumption group is obese?"

I figured the answer would be to first select only those rows in the table where alc = "121+". Then, from those, sum the num_case entries and divide that by the sum of the num_case entries plus the sum of the num_cont entries. In pseudocode:

WHERE alc = "121+"
SUM(num_case) / (SUM(num_case) + SUM(num_cont))
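
A minimal Python sketch of the filter-then-sum calculation described above, for anyone who wants to try it. The rows are invented placeholders (the real table's counts are not given in this thread), and the function name `prob_obese` is an assumption made for illustration.

```python
# Hypothetical rows: the real table's counts are not given in this thread.
rows = [
    {"age": "26-50", "alc": "0-40", "tob": "0-10", "num_case": 2, "num_cont": 38},
    {"age": "26-50", "alc": "121+", "tob": "0-10", "num_case": 3, "num_cont": 5},
    {"age": "51-75", "alc": "121+", "tob": "11-20", "num_case": 6, "num_cont": 4},
]


def prob_obese(rows, alc_group):
    """Estimate P(obese | alc == alc_group) as cases / (cases + controls)."""
    # Keep only the rows for the requested alcohol group.
    selected = [r for r in rows if r["alc"] == alc_group]
    cases = sum(r["num_case"] for r in selected)
    controls = sum(r["num_cont"] for r in selected)
    total = cases + controls
    if total == 0:
        raise ValueError(f"no subjects in alcohol group {alc_group!r}")
    return cases / total


print(prob_obese(rows, "121+"))  # 9 cases out of 18 subjects -> 0.5
```

The important detail is that both sums run over `selected`, not over `rows`.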

Apparently, this is not the answer, but I can't think of what else it might be.

So, I was hoping someone here might be able to clear this mental roadblock for me.

Thanx in advance!
 
What happens with your calculation if there is only one subject in the highest alcohol group and they are obese?
 
I made the presumption - since there was no guidance otherwise and the table data bears this out - that the num_cont value is never 0; there is always at least one control.

So, presumably, your scenario couldn't happen. If there is only one subject in a row, that subject must be a control value because num_cont != 0, not an obese value. But, if it can happen, then my calculation would yield 1 / (1 + 0) = 1; a probability of 1.

Obviously, my calculation is wrong. And, my presumption is probably wrong, even though the table data supports it. Maybe there can be a situation where num_case > 0 and num_cont = 0, but it never happens in the table. I'm sure that there is some formula necessary to determine this, but I've been searching for two days and have yet to stumble upon the proper wording for the search to yield it.

Thanx for your reply, PeroK!
 
Forgive my ignorance, but what's a "control"?
 
A control is something involved in a test that does not have whatever is being tested applied to it.

For example, if one is testing a new drug, controls are those subjects who are not given the drug; they are there to see what happens if nothing is done, against whom the test subjects - the ones receiving the drug - are compared.

I'm not sure how that applies to this table, but because there is the variable num_cont, I figured that represented the control subjects. That, and because it was always non-zero.

But, maybe I'm wrong. Maybe I'm over-thinking this question and the associated data. But, I tried an answer using num_cont as the population. In that case, I used:

WHERE alc = "121+"
SUM(num_case) / SUM(num_cont)

But, that gave me the wrong answer, too. And, because the num_cont value is never zero, it would still result in some real number.
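
One possible reading of why this second attempt gives a different number: cases divided by controls is the odds of being a case within the group, not the proportion of cases. A short Python sketch with made-up counts (the function names `odds` and `probability` are assumptions for illustration):

```python
def odds(cases, controls):
    """Odds of being a case: cases per control."""
    return cases / controls


def probability(cases, controls):
    """Proportion of cases among all subjects (cases plus controls)."""
    return cases / (cases + controls)


# Made-up counts for illustration: 5 cases and 10 controls.
o = odds(5, 10)          # 5 / 10
p = probability(5, 10)   # 5 / 15

# The two quantities are related by p = o / (1 + o).
print(o, p, o / (1 + o))
```

So SUM(num_case) / SUM(num_cont) is still a real number whenever num_cont is non-zero, but it answers a different question.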

So, I'm still lost. I'm sure the answer is in a Statistics book somewhere, but I sold mine back to the school after my classes were over and that was two decades ago - I've slept since then...
 
ckirmser said:
A control is something involved in a test that does not have whatever is being tested applied to it.

What does that mean in this context?
 
PS my limited understanding of statistical analysis is:

You have a group of people with certain factors: e.g. heavy drinkers. And you have a "control" group who do not have that factor.

You can then calculate the probability that a heavy drinker is obese, say, and the probability that a non-heavy-drinker is obese and compare the two.

What I understand you are doing is counting the control group (of non-drinkers) in with the heavy drinkers?
 
Well, since the question is asking what is the probability of someone in the highest alcohol group being obese, I figure that is the number of obese in the group divided by the population (the obese and not obese) of that group. Like if I had 5 red straws and 10 not red straws, the probability of being a red straw is 5/15.

Are you thinking that the answer is to take those in the designated alcohol group and find their probability against the population of the entire table?

Hmm. Well, I can give it a shot...
 
Are you saying that "control" in this context means "not obese"?
 
  • #10
That was the guess I took, based on the names of the variables. But, honestly, I don't know for certain what it means and I have no way to ask whoever made the table.

That's maybe why I'm having such a problem; a lack of information on what the table represents. I had figured someone familiar with statistics would recognize what I'm asking at a glance and just rattle off the formula to get the answer. I'm sure it's something simple, I'm just not seeing it.

Given further thought, that must be the case; num_case is a count of those who are obese and meet the other conditions of age, alcohol and tobacco consumption, and num_cont is a count of those who are not obese. So, to get a probability, I have to divide by the sum of both variables, because that's the population.

I'm missing something else somewhere, but don't know what.
 
  • #11
OK, I was right. After beating myself over the head for two days, I discovered that my code was not summing the data properly.

Rather than summing just the filtered data, it was summing the num_case and num_cont for the entire table. But, I think that's an issue with my session of Visual Studio Code, because the syntax is correct. I had to break the thing down into individual steps and manually run it to find the problem.
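
The kind of mistake described here, summing over the whole table instead of the filtered rows, is easy to reproduce. The sketch below uses invented rows and Python; the poster's actual code is not shown in the thread, so this is only an illustration of the failure mode, not a reconstruction of it.

```python
# Invented rows for illustration only; not the poster's data or code.
table = [
    {"alc": "0-40", "num_case": 1, "num_cont": 49},
    {"alc": "41-80", "num_case": 4, "num_cont": 36},
    {"alc": "121+", "num_case": 6, "num_cont": 4},
]

filtered = [row for row in table if row["alc"] == "121+"]


def case_fraction(rows):
    """cases / (cases + controls) over the given rows."""
    cases = sum(row["num_case"] for row in rows)
    controls = sum(row["num_cont"] for row in rows)
    return cases / (cases + controls)


# Bug: the filter was built, but the sums still run over the full table.
wrong = case_fraction(table)     # 11 / 100
# Fix: sum only the rows that passed the filter.
right = case_fraction(filtered)  # 6 / 10

print(wrong, right)
```

Printing the intermediate sums, as the poster did by running the steps one at a time, is a quick way to catch this.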

I'm going to try this out in RStudio and see if that fixes it.

I really appreciate your time, PeroK.
 
  • #12
PeroK said:
What does that mean in this context?
Usually it is someone who is not given the treatment but whose outcome of interest is observed. There is (always, I think) a form of "blinding", in that either the subject, the experimenter, or both do not know ahead of time which subjects are controls. Maybe the clearest example is someone given a medical treatment to test whether it has some effect, say weight loss. Some subjects will then not be given the treatment, and it will be observed whether they lose weight anyway. This helps determine the psychological (placebo) component of the effect.
 
  • #13
ckirmser, are you familiar with conditional probabilities? I think you can frame this question in those terms.
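
As a sketch of that framing, assuming (as the thread concludes) that num_cont counts the non-obese subjects, the conditional probability can be written as below. Each probability is a count divided by the total number of subjects in the table; that total cancels, leaving only sums over the rows with alc = "121+".

$$
P(\text{obese} \mid \text{alc} = \text{121+})
= \frac{P(\text{obese} \cap \text{alc} = \text{121+})}{P(\text{alc} = \text{121+})}
= \frac{\sum_{\text{alc} = \text{121+}} \text{num\_case}}{\sum_{\text{alc} = \text{121+}} \left(\text{num\_case} + \text{num\_cont}\right)}
$$

This is the same quantity as the original pseudocode, SUM(num_case) / (SUM(num_case) + SUM(num_cont)) restricted to alc = "121+".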
 
