Does the statistical weight of data depend on the generating process?

PeterDonis · Dec 6, 2019

The specific example I'm going to give is from a discussion I am having elsewhere, but the question itself, as given in the thread title and summary, is a general one.

We have two couples, each of which has seven children that, in order, are six boys and one girl (i.e., the girl is the youngest of the seven). We ask the two couples how they came to have this set of children, and they give the following responses:

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).

Suppose we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2. Given the information above, is the data from couple #2 stronger evidence in favor of such a bias than the (identical) data from couple #1?

BWV · Dec 6, 2019

Two gamblers each flip a coin seven times and each gets six tails and one head (in that chronological order). One gambler's motivation was to flip a coin seven times, while the other's was to flip a coin until heads came up.

Importantly, each gambler flips a separate coin. Can you conclude that either coin is biased?

I don't think so, but what you do have is two separate distributions, each with a single sample.

PeterDonis · Dec 6, 2019

BWV said:

what you do have is two separate distributions

Yes, but the question is whether the difference in distributions makes any difference if we are trying to determine whether the coin is biased. It seems that you think the answer to that is no. Can you explain why?

Dale · Dec 6, 2019

PeterDonis said:

Summary: If we have two identical data sets that were generated by different processes, will their statistical weight as evidence for or against a hypothesis be different?

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).

In general yes, these are different experiments and so the same data will constitute different levels of evidence. This is a big problem with science, particularly fields like psychology. They run an experiment like case 2 but analyze the data like case 1. The resulting p values are not correct as you can verify with a Monte Carlo simulation.

Here is a paper that describes the issue in detail. The author is a strong proponent of Bayesian methods in order to avoid problems like this. With Bayesian methods the intended experiment doesn’t matter, only the data.

https://www.ncbi.nlm.nih.gov/m/pubmed/22774788/

(The below is not exactly the reference I had in mind, but the idea is the same and it is not paywalled)

https://bookdown.org/ajkurz/DBDA_recoded/null-hypothesis-significance-testing.html
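
The Monte Carlo check mentioned above can be sketched as follows. This is a minimal illustration, not taken from the linked references: it simulates many families under each plan with ##p = 0.5## and estimates how often a result at least as boy-heavy as the observed one occurs. The function names, the trial count, and the one-sided definition of "at least as extreme" for couple #2 (a boy first, and a family of seven or more) are assumptions made for the example.

```python
import random


def boys_in_fixed_family(p_boy, n, rng):
    """Couple #1's plan: exactly n children. Returns the number of boys."""
    return sum(rng.random() < p_boy for _ in range(n))


def family_until_both(p_boy, rng):
    """Couple #2's plan: keep going until both genders are present.

    Returns (first_child_is_boy, family_size).
    """
    first_is_boy = rng.random() < p_boy
    size = 1
    while True:
        size += 1
        # Stop as soon as a child differs in gender from the first one.
        if (rng.random() < p_boy) != first_is_boy:
            return first_is_boy, size


def simulate(trials=100_000, p_boy=0.5, seed=0):
    """Estimate the tail probability of the observed data under each plan."""
    rng = random.Random(seed)
    # Plan 1: six or more boys among seven children.
    hits_fixed = sum(
        boys_in_fixed_family(p_boy, 7, rng) >= 6 for _ in range(trials)
    )
    # Plan 2 (assumed one-sided): boy first and at least seven children,
    # which means the first six children were all boys.
    hits_stop = 0
    for _ in range(trials):
        first_is_boy, size = family_until_both(p_boy, rng)
        if first_is_boy and size >= 7:
            hits_stop += 1
    return hits_fixed / trials, hits_stop / trials


if __name__ == "__main__":
    fixed_rate, stop_rate = simulate()
    print(f"fixed-size plan: {fixed_rate:.4f}")
    print(f"stop-at-both plan: {stop_rate:.4f}")
```

If the two designs were interchangeable, the two estimated rates would agree. Under the assumptions above they should come out near 1/16 and 1/64 respectively, so a p-value computed as if the family size had been fixed would not match the stopping-rule design.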

fresh_42 · Dec 6, 2019

Dale said:

With Bayesian methods the intended experiment doesn’t matter, only the data.

However, isn't it as stupid? Just on the other end of the scale? E.g. in the example with the family, we have two different observables, hence comparison of data doesn't mean anything. IMO the entire problem is a problem of how an experiment is modeled, rather than a mathematical or even scientific one.

Two experiments with different setups, ergo two distributions. Interesting would be the case of two different experiments with the same observable (= random variable). But this would imply the same constraints and objective functions. E.g. we could measure the frequency of a pendulum with two different methods, but this wouldn't affect the result, only the data. But if we measured the same pendulum at two different locations (heights), we cannot speak of the same observable anymore, regardless of whether the data match or not.

PeroK · Dec 6, 2019

PeterDonis said:

Summary: If we have two identical data sets that were generated by different processes, will their statistical weight as evidence for or against a hypothesis be different?

The specific example I'm going to give is from a discussion I am having elsewhere, but the question itself, as given in the thread title and summary, is a general one.

We have two couples, each of which has seven children that, in order, are six boys and one girl (i.e., the girl is the youngest of the seven). We ask the two couples how they came to have this set of children, and they give the following responses:

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).

Suppose we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2. Given the information above, is the data from couple #2 stronger evidence in favor of such a bias than the (identical) data from couple #1?

Being a frequentist, I would analyse it like this.

For #1, we have to imagine a large number of families who decided in advance to have seven children. A bias towards boys would result in more boys in general. We can't have a bias towards families based on the order the children are born. So, what we have found is a family that has 6 or more boys.

The question in this case is: how many families would have 6 or 7 boys given the hypothesis that boys and girls are equally likely?

We would expect ##\frac{8}{128} = \frac{1}{16}## to be in this category.

That's the likelihood in this case, given the hypothesis.

For #2, we have found a family that took 7 or more children to produce a girl.

The probability of this is ##(\frac{1}{2})^7 + (\frac{1}{2})^8 + \dots = \frac{1}{64}##

The likelihood is less in this case, given the hypothesis.
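
The two figures above can be checked with exact arithmetic. The sketch below is only an illustration (the variable names are made up for it): it enumerates all ##2^7## equally likely birth orders for the fixed-size case, and compares the closed form for the geometric tail with a long partial sum for the second case.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Case #1: every one of the 2**7 birth orders is equally likely when p = 1/2.
# Count the orders containing six or seven boys.
orders = list(product("BG", repeat=7))
extreme = sum(order.count("B") >= 6 for order in orders)
p_fixed = Fraction(extreme, len(orders))

# Case #2: the first girl is child 7 or later exactly when the first six
# children are all boys.
p_stop = half ** 6

# The series (1/2)**7 + (1/2)**8 + ... approaches the same value;
# stopping at k = 59 leaves a remainder of (1/2)**59.
partial_sum = sum(half ** k for k in range(7, 60))

if __name__ == "__main__":
    print(p_fixed, p_stop, float(p_stop - partial_sum))
```

Using `Fraction` rather than floats keeps the comparison exact, so the 1/16 and 1/64 values are reproduced without rounding.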

PS Just for the record, I analysed this problem with absolutely no a priori assumptions about the conclusion I would come to!

PeterDonis · Dec 6, 2019

Dale said:

With Bayesian methods the intended experiment doesn’t matter, only the data.

But if this is true, it would seem like Bayesian methods would say that both sets of data have the same statistical weight for estimating the probability p of having a boy. If that is not the case (and the frequentist analysis, as @PeroK showed for this case, says it isn't), how would Bayesian methods show that?

Dale · Dec 6, 2019

fresh_42 said:

However, isn't it as stupid? Just on the other end of the scale? E.g. in the example with the family, we have two different observables, hence comparison of data doesn't mean anything. IMO the entire problem is a problem of how an experiment is modeled, rather than a mathematical or even scientific one.

Not really. It is a fundamentally different approach. In the frequentist approach the hypothesis (usually p=0.5) is taken to be certain and the data is considered to be a random variable from some sample space. That is the issue, the two sample spaces are different. For the Bayesian approach the data is considered certain and the hypothesis is a random variable. You can certainly make different hypotheses for the two couples, but if you test the same hypothesis and prior with both couples then you will get the same posterior.
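
One way to see why the posterior comes out the same is sketched below, assuming independent births and a single probability ##\lambda## of a boy shared by both couples (##\lambda## and ##X## here are just labels for that probability and the observed birth order). The probability of the observed order, six boys followed by a girl, is

$$p(X \mid \lambda) = \lambda^6 (1 - \lambda)$$

under either plan, because each plan allows exactly that sequence and then stops. Bayes' theorem gives

$$p(\lambda \mid X) = \frac{p(X \mid \lambda)\, p(\lambda)}{\int_0^1 p(X \mid \lambda')\, p(\lambda')\, d\lambda'}$$

which depends on the plan only through ##p(X \mid \lambda)##. If couple #1's data is instead summarized as "6 boys out of 7", the likelihood picks up a constant factor ##\binom{7}{6}##, which cancels between numerator and denominator.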

PeterDonis · Dec 6, 2019

fresh_42 said:

in the example with the family, we have two different observables, hence comparison of data doesn't mean anything

We're not comparing the data from the two couples with each other; we're trying to use the data to estimate p, the probability of having a boy. The question is whether, given that the data are identical, the process used to generate the data makes a difference in the estimate for p that we come up with (or the strength with which we can accept or reject particular hypotheses about p, such as the hypothesis that p = 1/2).

Dale · Dec 6, 2019

PeterDonis said:

But if this is true, it would seem like Bayesian methods would say that both sets of data have the same statistical weight for estimating the probability p of having a boy. If that is not the case (and the frequentist analysis, as @PeroK showed for this case, says it isn't), how would Bayesian methods show that?

@PeroK is computing a different probability. He is computing ##p(X|\lambda=0.5)## where ##X## is the observed data and ##\lambda## is the probability of having a boy. That is a completely different quantity from ##p(\lambda|X)##, which is what Bayesian methods calculate.

Notice that what you are interested in is ##\lambda## and that your natural inclination was to treat the calculated probabilities as probabilities on ##\lambda## instead of what they are, probabilities on ##X##.

PeroK · Dec 6, 2019

Dale said:

@PeroK is computing a different probability. He is computing ##p(X|\lambda=0.5)## where ##X## is the observed data and ##\lambda## is the probability of having a boy. That is a completely different quantity from ##p(\lambda|X)##, which is what Bayesian methods calculate.

Notice that what you are interested in is ##\lambda## and that your natural inclination was to treat the calculated probabilities as probabilities on ##\lambda## instead of what they are, probabilities on ##X##.

What I was doing is the usual hypothesis testing. The hypothesis is that ##p = 0.5## and testing the likelihood of the data against that.

I'm not sure it makes much sense to test various values of ##p## against the data. Not in this context.

PeterDonis · Dec 6, 2019

PeroK said:

I'm not sure it makes much sense to test various values of ##p## against the data.

Sure it does. A different value of ##p## just changes the probabilities of individual outcomes in the sample space; you can still calculate p-values the same way you did.
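
As a rough sketch of what that looks like, the two tail probabilities can be written as functions of an arbitrary ##p##. The function names and defaults are invented for this example, and the second function assumes the same one-sided reading of couple #2's result used earlier in the thread (seven or more children before the first girl).

```python
from math import comb


def p_value_fixed(p_boy, n=7, boys=6):
    """P(at least `boys` boys among n children) for the fixed-size plan."""
    return sum(
        comb(n, k) * p_boy**k * (1 - p_boy) ** (n - k)
        for k in range(boys, n + 1)
    )


def p_value_stopping(p_boy, boys=6):
    """P(the first `boys` children are all boys), i.e. the first girl
    arrives at child boys + 1 or later."""
    return p_boy**boys


if __name__ == "__main__":
    for p in (0.5, 0.6, 0.7):
        print(p, p_value_fixed(p), p_value_stopping(p))
```

Both tail probabilities increase as the hypothesized ##p## increases, and at ##p = 0.5## they reduce to the 1/16 and 1/64 values computed above.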

Dale · Dec 6, 2019

PeroK said:

What I was doing is the usual hypothesis testing. The hypothesis is that ##p = 0.5## and testing the likelihood of the data against that.

Yes, that is correct.

PeroK said:

I'm not sure it makes much sense to test various values of p against the data. Not in this context.

Why not? It is pretty natural to wonder what ##\lambda## is, and the data provides information about that.

PeterDonis · Dec 6, 2019

Dale said:

He is computing ##p(X|\lambda=0.5)## where ##X## is the observed data and ##\lambda## is the probability of having a boy. That is a completely different quantity from ##p(\lambda|X)##, which is what Bayesian methods calculate.

Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?

fresh_42 · Dec 6, 2019

PeterDonis said:

We're not comparing the data from the two couples with each other; we're trying to use the data to estimate p, the probability of having a boy. The question is whether, given that the data are identical, the process used to generate the data makes a difference in the estimate for p that we come up with (or the strength with which we can accept or reject particular hypotheses about p, such as the hypothesis that p = 1/2).

It makes a difference as the conditions are different. We are not measuring ##p##, we are measuring ##p_i## under different assumptions. As Dale said, the sample space is a different one. Same with the pendulum. If we measure the same data for the same pendulum but at different locations, then all it says is, that we didn't consider all variables: an unknown fact is responsible for the measurement.

PeterDonis · Dec 6, 2019

fresh_42 said:

We are not measuring ##p##, we are measuring ##p_i## under different assumptions.

I don't understand what ##p_i## means. Are you hypothesizing that the two couples had different underlying probabilities of having a boy? I.e., that the value of ##p## is different (or could be different) for couple #1 and couple #2?

Dale · Dec 6, 2019

PeterDonis said:

Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?

The question is:

PeterDonis said:

we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2.

Since you want to know ##p(\lambda>0.5)## it seems to me that you are more interested in ##p(\lambda|X)## than ##p(X|\lambda)##.

PeroK · Dec 6, 2019

PeterDonis said:

Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?

I don't believe that ##p(\lambda|X)## makes much sense. First, you need ##\lambda## in some sort of range. Technically, ##p(\lambda|X) =0## or ##\approx 0##.

Formally, we assume ##\lambda## has a fixed but unknown value. ##\lambda## itself is not assumed to be distributed probabilistically in some way.

The assumption is that we are testing ##\lambda = 0.5##. In this context ##\lambda = 6/7## would make a lot more sense. But, we're not trying to establish something like that. It's obvious that ##\lambda = 6/7## fits the data better. But, that's not the issue.

Also, of course, the sample space is too small to do much here. The only question is whether the data in cases #1 and #2 provides more evidence to doubt that ##\lambda = 0.5##. That's all we can do here.

PeterDonis · Dec 6, 2019

Dale said:

Since you want to know ##p(\lambda>0.5)## it seems to me that you are more interested in ##p(\lambda|X)## than ##p(X|\lambda)##

Yes, and that would seem to mean that the data from the two couples has the same weight as far as estimating what I want to know; i.e., that the different processes used to generate the two data sets make no difference for that question. And the response to the frequentist who says that the data sets obviously must have different weights since the p-values are different would be that the p-value is not relevant for the question being asked.

Do you agree?

PeroK · Dec 6, 2019

Dale said:

Since you want to know ##p(\lambda>0.5)## it seems to me that you are more interested in ##p(\lambda|X)## than ##p(X|\lambda)##

My understanding of this, as I said above, is that ##\lambda## is assumed to have a fixed, definite but unknown value. ##p(\lambda > 0.5)## is not a statistically valid question.

fresh_42 · Dec 6, 2019

PeterDonis said:

I don't understand what ##p_i## means. Are you hypothesizing that the two couples had different underlying probabilities of having a boy? I.e., that the value of ##p## is different (or could be different) for couple #1 and couple #2?

This cannot be ruled out. IIRC, e.g. the age of the father plays a role.

We measure ##P(X|Y)## not ##P(X)##. The null hypothesis is ##P(X=x)=0.5## but one couple does it under the condition more than seven and the other under the condition as long as ##X\neq x##. I think that the two experiments cannot be used to test the null hypothesis as long as not all ##P(Y)## are taken into account, emphasis on all. Two experiments lead to two different sets of conditions, and in real life, some variables are always unknown, and this lack of information has an impact on the test and finally the null hypothesis.

Dale · Dec 6, 2019

PeterDonis said:

Do you agree?

Yes, but I am biased towards Bayesian methods.

PeterDonis · Dec 6, 2019

PeroK said:

I don't believe that ##p(\lambda|X)## makes much sense. First, you need ##\lambda## in some sort of range.

Yes, and in the Bayesian context, which was what @Dale was assuming when he talked about ##p(\lambda|X)##, our prior would include some distribution of ##\lambda## in the range ##(0, 1)##. And the question I am asking, in Bayesian terms, would be whether we should update our prior to a posterior distribution for ##\lambda## differently for the data from couple #2 vs. the data from couple #1 because the processes used to generate the two data sets were different. @Dale appears to be saying the Bayesian answer is no--if the data is identical, then updating from a given prior gives the same posterior no matter how the data was generated.

PeroK said:

Formally, we assume ##\lambda## has a fixed but unknown value. ##\lambda## itself is not assumed to be distributed probabilistically in some way.

That's true, but since we don't know the true value of ##\lambda##, we have to adopt some prior distribution for it. That distribution is not saying we think ##\lambda## itself is probabilistically distributed; it is describing our prior knowledge about ##\lambda##, based on whatever information we have.
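
To make the updating step concrete, here is a minimal sketch that assumes a uniform prior on ##(0, 1)##. The thread does not specify a prior, so that choice, like the function names, is an assumption made only for illustration. With a flat prior the posterior is proportional to the likelihood ##\lambda^6 (1 - \lambda)##, which is the same for both couples, so a single calculation covers both data sets.

```python
from fractions import Fraction


def antiderivative(lam):
    """Antiderivative of lam**6 * (1 - lam) = lam**6 - lam**7."""
    return lam**7 / 7 - lam**8 / 8


def posterior_prob_above(threshold):
    """P(lambda > threshold | six boys then a girl), assuming a uniform prior.

    The posterior density is lam**6 * (1 - lam) divided by its integral
    over [0, 1], so the probability is a ratio of two exact integrals.
    """
    one = Fraction(1)
    zero = Fraction(0)
    total = antiderivative(one) - antiderivative(zero)
    upper = antiderivative(one) - antiderivative(Fraction(threshold))
    return upper / total


if __name__ == "__main__":
    prob = posterior_prob_above(Fraction(1, 2))
    print(prob, float(prob))
```

Nothing in the calculation refers to how the family size was decided; only the prior and the observed sequence enter. A different prior would change the number but not that feature.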

PeroK said:

The only question is whether the data in cases #1 and #2 provides more evidence to doubt that ##\lambda = 0.5##.

So what is your answer to that question?

PeterDonis · Dec 6, 2019

fresh_42 said:

This cannot be ruled out. IIRC, e.g. the age of the father plays a role.

Ok, but suppose we know that, whatever the true value of ##\lambda## is, it is the same for both couples. To put it another way, suppose that, whatever variables you think could possibly make ##\lambda## different from couple to couple, are the same for both couples. The only relevant difference between the two couples is the process they used. What would your answer be then?

PeroK · Dec 6, 2019

PeterDonis said:

So what is your answer to that question?

See post #6. The second case is less likely, given the hypothesis that ##\lambda = 0.5##. I.e. there is less confidence in ##\lambda = 0.5## given the data in case #2.

Although, really, the sample space is too small.

PeterDonis · Dec 6, 2019

fresh_42 said:

one couple does it under the condition more than seven and the other under the condition as long as ##X\neq x##.

You are misdescribing the conditions. The first couple decides in advance to have exactly seven children, not at least seven. The second couple decides to have children until they have at least one of each gender, not until they have at least one boy.

PeterDonis · Dec 6, 2019

PeroK said:

really, the sample space is too small.

How would this be reflected in a p-value calculation?

Dale · Dec 6, 2019

PeroK said:

My understanding of this, as I said above, is that ##\lambda## is assumed to have a fixed, definite but unknown value. ##p(\lambda > 0.5)## is not a statistically valid question.

It is perfectly valid for Bayesian methods, but not for frequentist methods which are the usual methods.

Dale · Dec 6, 2019

PeterDonis said:

whether we should update our prior to a posterior distribution for ##\lambda## differently for the data from couple #2 vs. the data from couple #1 because the processes used to generate the two data sets were different.

The updating would be the same, but your prior could conceivably be different.

PeterDonis · Dec 6, 2019

PeroK said:

The second case is less likely, given the hypothesis that ##\lambda = 0.5##.

Yes, but that's not the question I asked. The question I asked was whether ##\lambda = 0.5## is less likely given the second case vs. the first.

PeroK said:

I.e. there is less confidence in ##\lambda = 0.5## given the data in case #2.

How does "the second data set is less likely given the hypothesis that ##\lambda = 0.5##" get transformed to "the hypothesis that ##\lambda = 0.5## is less likely given the second data set"? That is not a valid deductive syllogism; in fact it's a common error people make (assuming that if A then B is equivalent to if B then A).

fresh_42 · Dec 6, 2019

Dale said:

It is a fundamentally different approach. In the frequentist approach the hypothesis (usually p=0.5) is taken to be certain and the data is considered to be a random variable from some sample space. That is the issue, the two sample spaces are different. For the Bayesian approach the data is considered certain and the hypothesis is a random variable.

Sorry for my stubbornness, but I have difficulty figuring out the difference.

Let's say I test a coin and the null hypothesis is ##p=0.5##. Is it true that in the frequentist model, if I flip the coin in many different tests with different setups, I only measure how reliable my data are under the assumption of an ideal coin, whereas in the Bayesian model, I measure the bias of my coin under the assumption that my data will tell me?

Seems a bit linguistic to me.

PeterDonis · Dec 6, 2019

Dale said:

your prior could conceivably be different

How might the prior for couple #2 be different from the prior for couple #1?

PeterDonis · Dec 6, 2019

fresh_42 said:

have difficulty figuring out the difference

See the question I asked @PeroK in the last part of post #30.

fresh_42 · Dec 6, 2019

PeterDonis said:

The only relevant difference between the two couples is the process they used. What would your answer be then?

Given a reasonable sample size, we shouldn't be able to tell a difference. However, I don't think such an ideal case can be realized. Boy to girl is p to (1-p) regardless of the measurement. In reality this is not the case IMO.

PeterDonis · Dec 6, 2019

fresh_42 said:

Given a reasonable sample size, we shouldn't be able to tell a difference.

I already specified what the two samples are: the (identical) data from couples #1 and #2. So are you saying that, if the only difference between the couples is the process they used, the two data sets have the same statistical weight when estimating ##p##?

fresh_42 said:

I don't think such an ideal case can be realized.

I agree--no two couples are ever exactly the same except for just the process they used--but idealized cases are often useful for investigating questions even when they can't be realized.
