# Does the statistical weight of data depend on the generating process?

Mentor
2019 Award

## Summary:

If we have two identical data sets that were generated by different processes, will their statistical weight as evidence for or against a hypothesis be different?

## Main Question or Discussion Point

The specific example I'm going to give is from a discussion I am having elsewhere, but the question itself, as given in the thread title and summary, is a general one.

We have two couples, each of which has seven children that, in order, are six boys and one girl (i.e., the girl is the youngest of the seven). We ask the two couples how they came to have this set of children, and they give the following responses:

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).

Suppose we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2. Given the information above, is the data from couple #2 stronger evidence in favor of such a bias than the (identical) data from couple #1?

BWV
Two gamblers flip a coin seven times and each gets 6 tails and one head (in that chronological order). One gambler's motivation was to flip a coin seven times, while the other's was to flip a coin until heads came up.

Importantly, each gambler flips a separate coin. Can you conclude that either coin is biased?

I don't think so, but what you do have is two separate distributions, each with a single sample.

Mentor
2019 Award
what you do have is two separate distributions
Yes, but the question is whether the difference in distributions makes any difference if we are trying to determine whether the coin is biased. It seems that you think the answer to that is no. Can you explain why?

Dale
Mentor
Summary: If we have two identical data sets that were generated by different processes, will their statistical weight as evidence for or against a hypothesis be different?

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).
In general, yes: these are different experiments, so the same data will constitute different levels of evidence. This is a big problem in science, particularly in fields like psychology, where researchers run an experiment like case 2 but analyze the data like case 1. The resulting p-values are not correct, as you can verify with a Monte Carlo simulation.
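A minimal sketch of such a Monte Carlo check, under some assumptions that are not spelled out above: births are independent with boy probability `p_boy`, the "at least as extreme" event for couple #1 is six or more boys out of seven, and for couple #2 it is six or more boys before the first girl. The function names and the choice of tail events are illustrative only.

```python
import random


def simulate_fixed_size(rng, p_boy, n_children=7):
    """Couple #1's rule: have exactly n_children children (True = boy)."""
    return [rng.random() < p_boy for _ in range(n_children)]


def simulate_until_both(rng, p_boy):
    """Couple #2's rule: keep having children until both genders are present."""
    children = [rng.random() < p_boy]
    while all(children) or not any(children):
        children.append(rng.random() < p_boy)
    return children


def estimate_tail_probabilities(p_boy=0.5, trials=100_000, seed=0):
    """Estimate how often each design yields data at least as boy-heavy as observed."""
    rng = random.Random(seed)
    hits_fixed = 0
    hits_stop = 0
    for _ in range(trials):
        # Assumed tail event for design 1: six or more boys among seven children.
        if sum(simulate_fixed_size(rng, p_boy)) >= 6:
            hits_fixed += 1
        # Assumed tail event for design 2: the first six children are all boys.
        family = simulate_until_both(rng, p_boy)
        if len(family) >= 7 and all(family[:6]):
            hits_stop += 1
    return hits_fixed / trials, hits_stop / trials


if __name__ == "__main__":
    fixed, stop = estimate_tail_probabilities()
    print(f"fixed family size: {fixed:.4f}")
    print(f"stop at both genders: {stop:.4f}")
```

With a fair `p_boy = 0.5` the two estimates come out clearly different even though the observed families are identical, which is the point: the tail probability depends on the sampling plan, not only on the data.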

Here is a paper that describes the issue in detail. The author is a strong proponent of Bayesian methods in order to avoid problems like this. With Bayesian methods the intended experiment doesn’t matter, only the data.

https://www.ncbi.nlm.nih.gov/m/pubmed/22774788/

(The below is not exactly the reference I had in mind, but the idea is the same and it is not paywalled)

https://bookdown.org/ajkurz/DBDA_recoded/null-hypothesis-significance-testing.html

fresh_42
Mentor
With Bayesian methods the intended experiment doesn’t matter, only the data.
However, isn't it as stupid? Just on the other end of the scale? E.g. in the example with the family, we have two different observables, hence comparison of data doesn't mean anything. IMO the entire problem is a problem of how an experiment is modeled, rather than a mathematical or even scientific one.

Two experiments with different setups, ergo two distributions. Interesting would be the case of two different experiments with the same observable (= random variable). But this would imply the same constraints and objective functions. E.g. we could measure the frequency of a pendulum with two different methods, but this wouldn't affect the result, only the data. But if we measured the same pendulum at two different locations (heights), we cannot speak of the same observable anymore, regardless of whether the data match or not.

PeroK
Homework Helper
Gold Member
Being a frequentist, I would analyse it like this.

For #1, we have to imagine a large number of families who decided in advance to have seven children. A bias towards boys would result in more boys in general. We can't have a bias towards families based on the order the children are born. So, what we have found is a family that has 6 or more boys.

The question in this case is: how many families would have 6 or 7 boys given the hypothesis that boys and girls are equally likely?

We would expect $\frac{8}{128} = \frac{1}{16}$ to be in this category.

That's the likelihood in this case, given the hypothesis.

For #2, we have found a family that took 7 or more children to produce a girl.

The probability of this is $\left(\frac{1}{2}\right)^7 + \left(\frac{1}{2}\right)^8 + \dots = \frac{1}{64}$

The likelihood is less in this case, given the hypothesis.
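For anyone who wants to check the two numbers, here is a small sketch using exact fractions. The helper names are made up for illustration, and the tail events are the ones described above: six or more boys out of seven for #1, and six boys before the first girl for #2.

```python
from fractions import Fraction
from math import comb


def tail_fixed_size(p, n=7, k_min=6):
    """P(at least k_min boys among n children) when each child is a boy with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def tail_wait_for_girl(p, boys_first=6):
    """P(the first girl arrives at child boys_first + 1 or later), i.e. the first boys_first are boys."""
    return p**boys_first


half = Fraction(1, 2)
print(tail_fixed_size(half))     # 1/16
print(tail_wait_for_girl(half))  # 1/64
```

Passing a different `Fraction` for `p` gives the same two tail probabilities under any other hypothesised value.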

PS Just for the record, I analysed this problem with absolutely no a priori assumptions about the conclusion I would come to!

Mentor
2019 Award
With Bayesian methods the intended experiment doesn’t matter, only the data.
But if this is true, it would seem like Bayesian methods would say that both sets of data have the same statistical weight for estimating the probability p of having a boy. If that is not the case (and the frequentist analysis, as @PeroK showed for this case, says it isn't), how would Bayesian methods show that?

Dale
Mentor
However, isn't it as stupid? Just on the other end of the scale? E.g. in the example with the family, we have two different observables, hence comparison of data doesn't mean anything. IMO the entire problem is a problem of how an experiment is modeled, rather than a mathematical or even scientific one.
Not really. It is a fundamentally different approach. In the frequentist approach the hypothesis (usually p=0.5) is taken to be certain and the data is considered to be a random variable from some sample space. That is the issue, the two sample spaces are different. For the Bayesian approach the data is considered certain and the hypothesis is a random variable. You can certainly make different hypotheses for the two couples, but if you test the same hypothesis and prior with both couples then you will get the same posterior.
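To make the last point concrete, here is a short worked version. It assumes the children's genders are independent with a common boy probability $\lambda$, and it writes $\pi(\lambda)$ for whatever prior is chosen (that symbol is notation introduced here only for illustration). Under couple #1's plan, the birth order BBBBBBG has probability $\lambda^6(1-\lambda)$. Under couple #2's plan, the rule stops at the first child whose gender differs from the earlier ones, so the same birth order again has probability $\lambda^6(1-\lambda)$. Hence

$$p(\lambda \mid X) \propto \pi(\lambda)\,\lambda^{6}(1-\lambda) \quad \text{for both couples.}$$

If only the counts are kept for couple #1, the likelihood becomes $\binom{7}{6}\lambda^6(1-\lambda)$, but the constant factor of 7 cancels when the posterior is normalized.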

Mentor
2019 Award
in the example with the family, we have two different observables, hence comparison of data doesn't mean anything
We're not comparing the data from the two couples with each other; we're trying to use the data to estimate p, the probability of having a boy. The question is whether, given that the data are identical, the process used to generate the data makes a difference in the estimate for p that we come up with (or the strength with which we can accept or reject particular hypotheses about p, such as the hypothesis that p = 1/2).

Dale
Mentor
But if this is true, it would seem like Bayesian methods would say that both sets of data have the same statistical weight for estimating the probability p of having a boy. If that is not the case (and the frequentist analysis, as @PeroK showed for this case, says it isn't), how would Bayesian methods show that?
@PeroK is computing a different probability. He is computing $p(X|\lambda=0.5)$ where $X$ is the observed data and $\lambda$ is the probability of having a boy. That is a completely different quantity from $p(\lambda|X)$, which is what Bayesian methods calculate.

Notice that what you are interested in is $\lambda$ and that your natural inclination was to treat the calculated probabilities as probabilities on $\lambda$ instead of what they are, probabilities on $X$.

PeroK
Homework Helper
Gold Member
@PeroK is computing a different probability. He is computing $p(X|\lambda=0.5)$ where $X$ is the observed data and $\lambda$ is the probability of having a boy. That is a completely different quantity from $p(\lambda|X)$, which is what Bayesian methods calculate.

Notice that what you are interested in is $\lambda$ and that your natural inclination was to treat the calculated probabilities as probabilities on $\lambda$ instead of what they are, probabilities on $X$.
What I was doing is the usual hypothesis testing. The hypothesis is that $p = 0.5$ and testing the likelihood of the data against that.

I'm not sure it makes much sense to test various values of $p$ against the data. Not in this context.

Mentor
2019 Award
I'm not sure it makes much sense to test various values of $p$ against the data.
Sure it does. A different value of $p$ just changes the probabilities of individual outcomes in the sample space; you can still calculate p-values the same way you did.

Dale
Mentor
What I was doing is the usual hypothesis testing. The hypothesis is that $p = 0.5$ and testing the likelihood of the data against that.
Yes, that is correct.

I'm not sure it makes much sense to test various values of p against the data. Not in this context.
Why not? It is pretty natural to wonder what $\lambda$ is, and the data provides information about that.

Mentor
2019 Award
He is computing $p(X|\lambda=0.5)$ where $X$ is the observed data and $\lambda$ is the probability of having a boy. That is a completely different quantity from $p(\lambda|X)$, which is what Bayesian methods calculate.
Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?

fresh_42
Mentor
We're not comparing the data from the two couples with each other; we're trying to use the data to estimate p, the probability of having a boy. The question is whether, given that the data are identical, the process used to generate the data makes a difference in the estimate for p that we come up with (or the strength with which we can accept or reject particular hypotheses about p, such as the hypothesis that p = 1/2).
It makes a difference, as the conditions are different. We are not measuring $p$, we are measuring $p_i$ under different assumptions. As Dale said, the sample space is a different one. Same with the pendulum. If we measure the same data for the same pendulum but at different locations, then all it says is that we didn't consider all variables: an unknown fact is responsible for the measurement.

Mentor
2019 Award
We are not measuring $p$, we are measuring $p_i$ under different assumptions.
I don't understand what $p_i$ means. Are you hypothesizing that the two couples had different underlying probabilities of having a boy? I.e., that the value of $p$ is different (or could be different) for couple #1 and couple #2?

Dale
Mentor
Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?
The question is:
we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2.
Since you want to know $p(\lambda>0.5)$, it seems to me that you are more interested in $p(\lambda|X)$ than $p(X|\lambda)$.
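As a rough illustration of what $p(\lambda>0.5|X)$ looks like in practice, here is a sketch that assumes a uniform prior on $\lambda$ over $(0, 1)$. That prior is purely an assumption for the example; a different prior gives a different number. The likelihood used is $\lambda^6(1-\lambda)$ for six boys and one girl, and the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def likelihood_integral(upper, boys=6, girls=1):
    """Exact integral of lam**boys * (1 - lam)**girls for lam from 0 to upper."""
    upper = Fraction(upper)
    total = Fraction(0)
    # Expand (1 - lam)**girls with the binomial theorem and integrate term by term.
    for j in range(girls + 1):
        power = boys + j + 1
        total += comb(girls, j) * (-1) ** j * upper**power / power
    return total


def posterior_prob_bias(threshold=Fraction(1, 2), boys=6, girls=1):
    """P(lam > threshold | data) under an assumed uniform prior on lam."""
    evidence = likelihood_integral(1, boys, girls)
    below = likelihood_integral(threshold, boys, girls)
    return 1 - below / evidence


print(posterior_prob_bias())         # 247/256
print(float(posterior_prob_bias()))  # 0.96484375
```

Nothing in this calculation refers to how the couple decided when to stop having children, so the same code and the same answer apply to both data sets.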

PeroK
Homework Helper
Gold Member
Yes, I understand that. The question is, which of these quantities is the right one to answer the question I posed in the OP?
I don't believe that $p(\lambda|X)$ makes much sense. First, you need $\lambda$ in some sort of range. Technically, $p(\lambda|X) =0$ or $\approx 0$.

Formally, we assume $\lambda$ has a fixed but unknown value. $\lambda$ itself is not assumed to be distributed probabilistically in some way.

The assumption is that we are testing $\lambda = 0.5$. In this context $\lambda = 6/7$ would make a lot more sense. But, we're not trying to establish something like that. It's obvious that $\lambda = 6/7$ fits the data better. But, that's not the issue.

Also, of course, the sample space is too small to do much here. The only question is whether the data in cases #1 and #2 provides more evidence to doubt that $\lambda = 0.5$. That's all we can do here.

Mentor
2019 Award
Since you want to know $p(\lambda>0.5)$, it seems to me that you are more interested in $p(\lambda|X)$ than $p(X|\lambda)$.
Yes, and that would seem to mean that the data from the two couples has the same weight as far as estimating what I want to know; i.e., that the different processes used to generate the two data sets make no difference for that question. And the response to the frequentist who says that the data sets obviously must have different weights since the p-values are different would be that the p-value is not relevant for the question being asked.

Do you agree?

PeroK
Homework Helper
Gold Member
Since you want to know $p(\lambda>0.5)$, it seems to me that you are more interested in $p(\lambda|X)$ than $p(X|\lambda)$.
My understanding of this, as I said above, is that $\lambda$ is assumed to have a fixed, definite but unknown value. $p(\lambda > 0.5)$ is not a statistically valid question.

fresh_42
Mentor
I don't understand what $p_i$ means. Are you hypothesizing that the two couples had different underlying probabilities of having a boy? I.e., that the value of $p$ is different (or could be different) for couple #1 and couple #2?
This cannot be ruled out. IIRC, e.g. the age of the father plays a role.

We measure $P(X|Y)$ not $P(X)$. The null hypothesis is $P(X=x)=0.5$ but one couple does it under the condition more than seven and the other under the condition as long as $X\neq x$. I think that the two experiments cannot be used to test the null hypothesis as long as not all $P(Y)$ are taken into account, emphasis on all. Two experiments lead to two different sets of conditions, and in real life, some variables are always unknown, and this lack of information has an impact on the test and finally the null hypothesis.

Dale
Mentor
Do you agree?
Yes, but I am biased towards Bayesian methods.

Mentor
2019 Award
I don't believe that $p(\lambda|X)$ makes much sense. First, you need $\lambda$ in some sort of range.
Yes, and in the Bayesian context, which was what @Dale was assuming when he talked about $p(\lambda|X)$, our prior would include some distribution of $\lambda$ in the range $(0, 1)$. And the question I am asking, in Bayesian terms, would be whether we should update our prior to a posterior distribution for $\lambda$ differently for the data from couple #2 vs. the data from couple #1 because the processes used to generate the two data sets were different. @Dale appears to be saying the Bayesian answer is no--if the data is identical, then updating from a given prior gives the same posterior no matter how the data was generated.

Formally, we assume $\lambda$ has a fixed but unknown value. $\lambda$ itself is not assumed to be distributed probabilistically in some way.
That's true, but since we don't know the true value of $\lambda$, we have to adopt some prior distribution for it. That distribution is not saying we think $\lambda$ itself is probabilistically distributed; it is describing our prior knowledge about $\lambda$, based on whatever information we have.

The only question is whether the data in cases #1 and #2 provides more evidence to doubt that $\lambda = 0.5$.

Mentor
2019 Award
This cannot be ruled out. IIRC, e.g. the age of the father plays a role.
Ok, but suppose we know that, whatever the true value of $\lambda$ is, it is the same for both couples. To put it another way, suppose that, whatever variables you think could possibly make $\lambda$ different from couple to couple, are the same for both couples. The only relevant difference between the two couples is the process they used. What would your answer be then?

PeroK
See post #6. The second case is less likely, given the hypothesis that $\lambda = 0.5$. I.e. there is less confidence in $\lambda = 0.5$ given the data in case #2.