Does the statistical weight of data depend on the generating process?

  • #101
Dale said:
In this model each poll is considered to have some underlying probability of a win (analogous to a couple's probability of having a boy) which is considered a "hyperparameter", then the respondents to the poll are binomial draws from the prior (analogous to each child being a draw from the couple's probability). The observed data then informs us both about the probability for each couple as well as the distribution of probabilities for the population.

Hm, interesting! If I'm understanding this correctly, this methodology could provide a way of investigating questions like "does ##\lambda## depend on the criterion the couple uses to decide when to stop having children" by simply grouping the couples by that criterion--i.e., assuming that the same hyperparameter value applies to all couples in a group, but can vary between groups--and seeing whether the posterior distribution for the hyperparameter does in fact vary from group to group. And as I commented earlier, it would seem like the evidence described in the OP, where two couples are from different groups but produce the same outcome data, would be evidence against any hypothesis that the hyperparameter varied from group to group.
 
  • #102
PeterDonis said:
this methodology could provide a way of investigating questions like ...
Yes, you could do it that way. The details vary a little if you want to consider only these two stopping criteria or if you want to consider them as elements of a whole population of stopping criteria. The hierarchical model is more appropriate for the second case. Essentially this is the difference between a fixed effect and a random effect model.
PeterDonis said:
the evidence described in the OP ... would be evidence against any hypothesis that the hyperparameter varied from group to group
Yes
 
  • #103
PeterDonis said:
One way of rephrasing the question is whether and under what circumstances changing the stopping rule makes a difference. In particular, in the case under discussion we have two identical data sets that were collected under different stopping rules; the question is whether the different stopping rules should affect how we estimate the probability of having a boy given the data.

I won't weigh in on variance issues, but the long-run estimates for the probability of boy vs girl are the same with either strategy. (Mathematically it's via the Strong Law of Large Numbers, but in the real world we do have tons of demographic data spanning many years, which should give pretty good estimates.)

inspection paradox related items:

if you estimate/sample by children:
we should be able to see that our estimates are the same either way -- i.e. in all cases the model is a sequence of one child at a time (we can ignore the zero-probability event of two births at exactly the same time, so there is a nice ordering here) and each child's birth is a Bernoulli trial -- a coin toss with probability of heads given by some parameter ##p##. Depending on the "strategy" taken, what may change is who is tossing the coin (which parents), but it doesn't change the fact that in this model we have a Bernoulli process where the tosser/parent is irrelevant for modelling purposes.

if you estimate/sample by parents/couples:
this one is a bit more subtle.
PeterDonis said:
This is not the correct stopping rule for couple #2. The correct stopping rule is "when there is at least one child of each gender". It just so happens that they had a boy first, so they went on until they had a girl. But if they had had a girl first, they would have gone on until they had a boy.
I evidently misread the original post. Given this structure I opted to view it as a baby Markov chain (pun intended?) and use renewal rewards.

for strategy #2 we have a sequence of iid random variables ##X_k##, where ##X_k## denotes the number of kids that couple ##k## has.

Part 1) give a reward of 1 for each girl couple ##k## has, where each child is a girl with probability ##p \in (0,1)##
direct calculation (conditioning on the sex of the first child and using total expectation) gives
##E\big[R_k\big] = p\cdot\frac{1}{1-p} + (1-p)\cdot 1 = \frac{1-p + p^2}{1-p}##
(if the first child is a girl, the number of girls is geometric with mean ##\frac{1}{1-p}##; if a boy, there is exactly one girl)
Part 2) give a reward of 1 for each boy couple ##k## has, where each child is a boy with probability ##1-p##
either mimicking the above calculation, or just swapping ##p \leftrightarrow 1-p##, we get
##E\big[R_k'\big] = \frac{(1-p)^2 + p}{p}##

and the total time (i.e. number of kids) per couple ##k## is
##E\big[X_k\big] = E\big[R_k + R_k'\big] = E\big[R_k\big] + E\big[R_k'\big]##

with ##R(t)## as the reward function (##t## = integer time, by custom = total number of kids in our model), the Renewal Reward theorem gives
##\frac{R(t)}{t} \xrightarrow{a.s.} \frac{E\big[R_k\big]}{E\big[X_k\big]} = p##
##\frac{E\big[R(t)\big]}{t} \to \frac{E\big[R_k\big]}{E\big[X_k\big]} = p##
where Wolfram Alpha did the simplifications
https://www.wolframalpha.com/input/?i=((++(1-p)+++p^2)/(1-p))/(+(++(1-p)+++p^2)/(1-p)+++((1-p)^2+++p)/p)

I suppose the result may seem obvious to some, but a lot of things that are 'obviously true' actually aren't true in probability, which is why there are so many so-called paradoxes in probability. (The 'paradox paradox' of course tells us that they aren't really paradoxes, just a mismatch between math and intuition.) E.g. in the above, taking the expectation of ##X## in the denominator can break things if we don't have justification -- this is why I used the Renewal Reward theorem here.

We can apply the same argument to strategy one to see an expected reward of ##E\big[R_k\big] = 7p## and ##E\big[R_k'\big] = 7(1-p)##, so yes, this too tends to ##p##.
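
For concreteness, here is a minimal simulation sketch (Python/NumPy; the function names and the test value ##p = 0.3## are my own choices, not from the thread) checking that the long-run fraction of girls matches ##p## under both stopping rules:

```python
import numpy as np

rng = np.random.default_rng(0)
P_GIRL = 0.3  # assumed test value; in this post's convention p = P(girl)

def one_of_each(rng, p):
    """Have children until there is at least one of each sex; return (girls, kids)."""
    girls = boys = 0
    while girls == 0 or boys == 0:
        if rng.random() < p:
            girls += 1
        else:
            boys += 1
    return girls, girls + boys

def fixed_seven(rng, p):
    """Have exactly seven children; return (girls, kids)."""
    girls = int(rng.binomial(7, p))
    return girls, 7

for strategy in (one_of_each, fixed_seven):
    counts = np.array([strategy(rng, P_GIRL) for _ in range(200_000)])
    total_girls, total_kids = counts.sum(axis=0)
    print(strategy.__name__, total_girls / total_kids)  # both ratios approach 0.3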

PeterDonis said:
Can you give examples of each of the two possibilities you describe? I.e, can you give an example of a question, arising from the scenario described in the OP, for which stopping rules don't matter? And can you give an example of a question for which they matter a lot?
I can try... it's an enormously complex and broad question in terms of math, and more so when trying to map these approximations to the real world. A classical formulation for martingales and random walks is in terms of gambling. The idea behind martingales is that in finite time a fair game stays fair, and a skewed game stays skewed, no matter what 'strategy' the bettor uses in terms of bet sizing. Over an infinite horizon all kinds of things can happen and a lot of care is needed -- you can even have a formally fair game with finite first moments, but if you don't have finite variance (convergence in ##L^2## / access to the Central Limit Theorem) then extremely strange things can happen -- Feller vol 1 has a nice example of this (chapter 10, problem 15 in the 3rd edition).

With respect to your original post, I've shown that neither 'strategy' changes the long-run estimates of ##p##. The fact that both strategies not only have second moments but valid moment generating functions should allow for concentration inequalities around the mean, which can show that the mean convergence isn't 'too slow', but this is outside the scope, I think.
- - - -
For an explicit example / model:
As far as simple models and examples go, I suggest considering the simple random walk, where we move to the left with probability ##q = 1-p## and to the right with probability ##p##. Suppose we start at zero and have a stopping rule of "stop when we're ahead", i.e. once the net score is +1. For ##p \in [0,\frac{1}{2})##, our random variable ##T## for the number of moves until stopping is defective (i.e. not finite with probability 1), which is problematic. For ##p=\frac{1}{2}## the process stops with probability 1, but ##E\big[T\big] = \infty##, which is also problematic (e.g. see the earlier comment on wanting a finite 2nd moment...). Now for ##p \in (\frac{1}{2}, 1]##, from a modelling standpoint things are nice, but is this "ok"? Well, it depends on what we're looking into. This admittedly very simple model could be read as a construct for a (simplified) pharmaceutical trial -- say the experimenters used the stopping rule: stop when the experimental evidence looks good (i.e. when they're ahead). The result would be to publish only favorable results, even if the drug's effect were basically a fair coin toss (and possibly with significant negative side effects "when they're behind"). When things went badly, the results wouldn't be reported: the trial would be ongoing, or maybe funding would stop and it would just show up as 'no valid trial, as terminated before proper finish (stopping rule)'.
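
A quick sketch of the ##p = \frac{1}{2}## pathology (Python; the cap and sample sizes are arbitrary choices of mine): the walk stops with probability 1, but the sample mean of the stopping time never settles down because ##E[T] = \infty##.

```python
import numpy as np

rng = np.random.default_rng(1)

def time_until_ahead(p, cap=1_000_000):
    """Steps for a +/-1 random walk from 0 to first hit +1 (cap guards the heavy tail)."""
    pos = t = 0
    while pos < 1 and t < cap:
        pos += 1 if rng.random() < p else -1
        t += 1
    return t  # t == cap means the walk hadn't stopped yet

# For p = 0.6 the mean stopping time is finite and small; for p = 0.5 the
# (capped) sample mean is dominated by rare huge excursions and keeps growing
# as the cap is raised -- the signature of an infinite expectation.
for p in (0.6, 0.5):
    samples = [time_until_ahead(p) for _ in range(1_000)]
    print(p, np.mean(samples), max(samples))
```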

it reminds me a bit of this
https://www.statisticsdonewrong.com/regression.html
which has some nice discussion under 'truth inflation' that seems germane here
- - - -
edit: thanks to Fresh for resetting a LaTeX/server bug
 
Last edited:
  • #104
PeroK said:
What does a Bayesian analysis give numerically for the data in post #1?
So, the easiest way to do this analysis is using conjugate priors. As specified by @PeterDonis we assume that both couples have the same ##\lambda##. Now, in Bayesian statistics you always start with a prior. A conjugate prior is one for which the posterior belongs to the same family of distributions as the prior. In this case the conjugate prior is the Beta distribution. If these were the first two couples that we had ever studied then we would start with an ignorant prior, like so:
[Figure: LambdaIgnorantPrior -- the flat Beta(1,1) ignorant prior for ##\lambda##]


After observing 12 boys and 2 girls we would update our beliefs about the distribution of ##\lambda## from the Beta(1,1) prior to a Beta(13,3) posterior distribution (1 + 12 boys, 1 + 2 girls), like so:
[Figure: LambdaIgnorantPosterior -- the Beta(13,3) posterior for ##\lambda##]


From that posterior we can calculate any quantity we want regarding ##\lambda##. For example, the mean is 0.81, with a 95% Bayesian confidence region from 0.60 to 0.96, a median of 0.83, and a mode of 0.86. This confidence region should be close to the frequentist confidence interval.

Now, suppose that we did not want to pretend that these are the first couples that we had ever seen. We can incorporate the knowledge we have from other couples in the prior. That is something that cannot be done in frequentist statistics. Remember, ##\lambda## is not the proportion of boys in the overall population; it is the probability of a given couple producing boys. While the overall proportion of boys in the population is close to 0.5, individual couples can be highly variable. I know several couples with >80% girls and several with >80% boys, but we don't know if they would have started having more of the other gender had they continued. So let's set our prior to be symmetric about 0.5 and have 90% of couples within the range ##0.25<\lambda<0.75##. This can be achieved with an informed Beta(5,5) prior.
[Figure: LambdaInformedPrior -- the informed Beta(5,5) prior for ##\lambda##]


Now, after collecting data of 6 boys and 1 girl for each couple we find the posterior distribution is Beta(17,7) (5 + 12 boys, 5 + 2 girls), which leads to a lower estimate of the mean ##\lambda## of 0.71 with a 95% confidence region from 0.52 to 0.87.
[Figure: LambdaInformedPosterior -- the Beta(17,7) posterior for ##\lambda##]


Notice that the mean is substantially lower because we are informed by the fact that we have seen other couples before. When a couple has an unusual ratio we automatically suspect random chance may be skewing the results a bit, but we admit that there is some possibility that there is something different about this couple, so that the results are not totally random. The informed posterior shows that balanced assessment.
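
A minimal sketch of these conjugate updates (Python/SciPy; my own illustration, with the Beta parameters ordered as (prior + boys, prior + girls) so the mean is the probability of a boy):

```python
from scipy import stats

def beta_posterior(prior_a, prior_b, boys, girls):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + boys, b + girls)."""
    post = stats.beta(prior_a + boys, prior_b + girls)
    lo, hi = post.interval(0.95)  # equal-tailed 95% Bayesian confidence region
    return post.mean(), post.median(), (lo, hi)

# Ignorant Beta(1,1) prior, 12 boys and 2 girls -> Beta(13, 3):
print(beta_posterior(1, 1, 12, 2))  # mean ~0.81, region ~(0.60, 0.96)

# Informed Beta(5,5) prior, same data -> Beta(17, 7):
print(beta_posterior(5, 5, 12, 2))  # mean ~0.71, region ~(0.52, 0.87)
```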
 
Last edited:
  • Like
Likes PeroK
  • #105
Dale said:
we assume that both couples have the same ##\lambda##.

This doesn't seem to be quite what you're assuming. As you describe your analysis, you're not assuming that ##\lambda## is fixed for all couples; you're allowing for the possibility that different couples might have different unknown factors at work that could affect their respective probabilities of producing boys. But you are assuming that we have no reason to suppose that either couple #1 or couple #2 in our example is more or less likely to have unknown factors skewing them in one direction or the other, so we should use the same prior distribution (the "informed prior" Beta distribution) for both. I think that way of looking at it is fine.

Dale said:
When couples have a unusual ratio we automatically suspect random chance may be skewing the results a bit, but do admit that there is some possibility that there is something different with this couple so that the results are not totally random.

But, more importantly, the posterior distribution is the same for both couples, since they both have the same data. The different choice of stopping criterion does not affect the posterior distribution. In terms of the way of looking at it that I described above, we are assuming that a couple's choice of stopping criterion is independent of any unknown factors that might affect their propensity for favoring one gender over the other in births.
 
  • #106
PeterDonis said:
But, more importantly, the posterior distribution is the same for both couples, since they both have the same data. The different choice of stopping criterion does not affect the posterior distribution
Yes, the stopping criterion does not affect our retrospective belief about that couple's ##\lambda##, provided we use the same prior for both couples. Theoretically there could be reasons to use different priors for the two couples, but for this scenario all such reasons seem pretty far-fetched.
 
  • #107
PeterDonis said:
But, more importantly, the posterior distribution is the same for both couples, since they both have the same data. The different choice of stopping criterion does not affect the posterior distribution. In terms of the way of looking at it that I described above, we are assuming that a couple's choice of stopping criterion is independent of any unknown factors that might affect their propensity for favoring one gender over the other in births.

After some calculations, I agree with this. If we assume that there are some couples who are more likely to have girls than boys, say, then the conditional probability that each couple is in that category, given the data, is the same in both cases.

It appears that in general the stopping criteria are indeed irrelevant.
 
  • #108
PeroK said:
It appears that in general the stopping criteria are indeed irrelevant.
They are irrelevant for determining the estimate of ##\lambda##, but not for determining the p-value, as you calculated somewhere back on the first page.
 
  • Like
Likes PeroK
  • #109
Dale said:
They are irrelevant for determining the estimate of ##\lambda##, but not for determining the p-value, as you calculated somewhere back on the first page.

I can patch that up! First, because of the asymmetry in the data, we should take the p-value as the probability of outcomes strictly more extreme than the data.

In case 1, we need the probability of either 7 boys or 7 girls. That's ##\frac{1}{64}##.

In case 2, I also misread the question and assumed they were waiting for a girl, rather than wanting at least one of each. The probability of having a family of more than 7 is ##\frac{1}{64}##.

The p-values match.

The mistake was that the exact observed data was less likely in the second case, but because I was measuring numbers of boys against size of family, this created an asymmetry: there was no exact correspondence in what was observed. What I should really have calculated was the probability of getting up to six boys or girls against the probability of having a family size up to 7, i.e. the complement of the strictly-more-extreme outcome, as above.

(This must be a general point to be aware of: if you can't match up the data exactly, you need to take the strictly more unlikely outcomes for the p-value.)

But, there has to be a twist! Suppose that the second family were, indeed, waiting for a girl. Now, the likelihood of a family of more than 7 is only ##\frac{1}{128}##. And, again there is a difference in p-values.

This may be a genuine case where the stopping criterion does make a difference (*).

(*) PS As Peter points out below, this is just a case of limiting it to a one-tailed scenario.
 
Last edited:
  • Informative
Likes Dale
  • #110
PeroK said:
we should take the p-value as the probability of outcomes strictly more extreme than the data.

"Strictly more extreme" is ambiguous, though. Does it mean "one-tailed" or "two-tailed"? In this case, does it mean "at least that many boys" or "at least that many children of the same gender"?

This doesn't affect whether the p-values are the same or not, but it does affect their actual numerical values. I'll assume the "one-tailed" case in what follows.

PeroK said:
The p-values match.

I don't think they do.

For couple #1, the sample space is all possible combinations of 7 children, and the "at least as extreme" ones are those that have at least 6 boys. All combinations are equally probable so we can just take the ratio of the total numbers of each. There are ##2^7## of the former and 8 of the latter (one with 7 boys and 7 with 6 boys), so the p-value is ##8 / 2^7 = 1/16##.

For couple #2, the sample space is all possible combinations of 2 or more children that have at least one of each gender; but the combinations are not all equally probable so we would have to take that into account if we wanted to compute the p-value using a ratio as we did for couple #1 above. However, an easier way is to just compute the probability of getting 6 boys in a row, which is just ##1 / 2^6 = 1/64##. This covers all combinations at least as extreme as the one observed--half of that 1/64 probability is for the combination actually observed (6 boys and 1 girl), and the other half covers all the other possibilities that are at least as extreme, since all of them are just some portion of the combinations that start with 7 boys. So the p-value is ##1/64##.

PeroK said:
there has to be a twist! Suppose that the second family were, indeed, waiting for a girl.

Since they started out having a boy as their first child, they are waiting for a girl. Or are you considering a case where the stopping criterion is simply "stop when the first girl is born"? For that case, the p-value would be the same as the one I computed above for couple #2; the difference is that the underlying sample space is now "all combinations that end with a girl", which means that if you tried to compute the p-value using ratios, as I did for couple #1 above, you would end up computing a different set of combinations and a different set of associated probabilities.

The other twist in this case is that there is no "two-tailed" case, since the stopping criterion is not symmetric between genders. So you could say that the p-value for this case is different from both of the ones I computed above if you converted the ones I computed above to the two-tailed case (which means multiplying by 2).

PeroK said:
This may be a genuine case where the stopping criterion does make a difference.

It can make a difference in p-value, yes, as shown above.

However, it still doesn't make a difference in the posterior distribution for ##\lambda##, or, in your terms, the conditional probability of each couple being in a particular category as far as a propensity for having boys or girls.
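
For what it's worth, the two one-tailed p-values above can be checked with exact arithmetic (a Python sketch of my own, under the fair-coin null ##\lambda = \frac{1}{2}##):

```python
from fractions import Fraction
from math import comb

# Couple #1: seven children regardless; "at least as extreme" = at least 6 boys.
p1 = Fraction(sum(comb(7, k) for k in (6, 7)), 2**7)

# Couple #2: stop at one of each; everything at least as extreme as the
# observed 6-boys-then-a-girl family is contained in the event
# "the first six children are all boys".
p2 = Fraction(1, 2**6)

print(p1, p2)  # 1/16 and 1/64
```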
 
  • #111
Just for grins I also did a Monte Carlo simulation of the original problem. I assumed ##\lambda## starting at 0.01 and going to 0.99 in increments of 0.01. For each value of ##\lambda## I simulated 10000 couples using each stopping criterion. I then counted the number of couples that had exactly 6 boys. The plots of the counts are as follows. For the case where they stop after exactly 7 children regardless:
[Figure: MonteCarlo7Total -- count of couples with exactly 6 boys vs. ##\lambda##, "stop after 7 children" rule]


For the case where they stop after they get one of each
[Figure: MonteCarlo1ofEach -- count of couples with exactly 6 boys vs. ##\lambda##, "one of each" rule]


Notice that the shape is the same for both strategies; this is why the fact that we get the same data leads to the same estimate of ##\lambda##. However, note that the vertical scale is much different; this is why the probabilities are different for the two cases: it is simply much less likely to get 6 boys when trying for one of each than it is to get 6 boys when you simply have 7 children. This doesn't make the estimate any different, but it makes us more surprised to see the data.
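
A sketch of this Monte Carlo (Python/NumPy; my own reconstruction of the described experiment, with the "one of each" couples sampled via a geometric shortcut rather than child-by-child):

```python
import numpy as np

rng = np.random.default_rng(42)
lams = np.arange(0.01, 1.00, 0.01)  # lambda = probability of a boy
N_COUPLES = 10_000

def count_fixed_seven(lam):
    """Couples with exactly 7 children: how many had exactly 6 boys?"""
    boys = rng.binomial(7, lam, size=N_COUPLES)
    return int(np.sum(boys == 6))

def count_one_of_each(lam):
    """Couples stopping at one of each. The final boy count is the number of
    boys before the first girl, i.e. Geometric(1 - lam), if the first child
    is a boy; it is exactly 1 if the first child is a girl (they then stop
    at the first boy). Exactly 6 boys therefore means the sequence BBBBBBG."""
    first_is_boy = rng.random(N_COUPLES) < lam
    boys = np.where(first_is_boy, rng.geometric(1 - lam, N_COUPLES), 1)
    return int(np.sum(boys == 6))

counts1 = np.array([count_fixed_seven(l) for l in lams])
counts2 = np.array([count_one_of_each(l) for l in lams])
# Plotting counts1 and counts2 against lams reproduces the two figures: the
# curves peak at the same lambda (same likelihood shape), but the second is
# far lower (that outcome is rarer under the "one of each" stopping rule).
print(lams[counts1.argmax()], lams[counts2.argmax()])
```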
 
  • Like
Likes Auto-Didact and PeroK
  • #112
PeterDonis said:
"Strictly more extreme" is ambiguous, though. Does it mean "one-tailed" or "two-tailed"? In this case, does it mean "at least that many boys" or "at least that many children of the same gender"?

This doesn't affect whether the p-values are the same or not, but it does affect their actual numerical values.

I assumed two-tailed.

You can see that

##p(7-0, 0-7) = p(n \ge 8)##

Where that's the total probability of a unisex family of seven on the left and a family size of eight or more to get at least one of each sex. But:

##p(6-1, 1-6) \ne p(n =7)##

Which creates another interesting ambiguity. Is that genuinely a difference in p-values or just an asymmetry in the possible outcomes?
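
In exact arithmetic (a quick Python check of my own):

```python
from fractions import Fraction
from math import comb

p_unisex_7  = Fraction(2, 2**7)               # p(7-0 or 0-7) = 1/64
p_size_ge_8 = Fraction(2, 2**7)               # first 7 children all one sex
p_6_1       = Fraction(2 * comb(7, 6), 2**7)  # p(6-1 or 1-6) = 7/64
p_size_eq_7 = Fraction(2, 2**7)               # 6 of one sex, then the other
print(p_unisex_7 == p_size_ge_8, p_6_1 == p_size_eq_7)  # True, False
```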
 
  • #113
PS if the p-values for two sets of data cannot be the same because of the discrete structure of the data, then having different p-values loses some of its significance!
 
  • #114
I did a few calculations for the cases of different sizes of families. There is a clear pattern. The "strict" p-value agrees in all cases. But, the "inclusive" p-value becomes more different as the size of the family increases. This is all two-tailed:

For a family of size ##N##, the strict p-value (the probability of the data being more extreme) is ##\frac{1}{2^{N-1}}## for both case 1 and case 2.

For the "inclusive" p-values (the data being as observed or more extreme), the p-values are:

##N = 4, \ p_1 = 5/8, \ p_2 = 1/4##
##N = 5, \ p_1 = 3/8, \ p_2 = 1/8##
##N = 6, \ p_1 = 7/32, \ p_2 = 1/16##
##N = 7, \ p_1 = 1/8, \ p_2 = 1/32##
##N = 8, \ p_1 = 9/128, \ p_2 = 1/64##

There's a clear pattern: ##p_2 = \frac{1}{2^{N-2}}## and ##p_1 = \frac{N+1}{2} p_2##.

This raises an interesting question about whether the p-value should be "strict" or "inclusive". In this problem, there is a case for choosing the strict version, which reflects the fact that, after all, the data is the same.

Alternatively, the fact that the (inclusive) p-value in case 2 is lower for larger ##N## might be telling us something statistically significant.
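
These patterns can be verified by brute-force enumeration (a Python sketch of my own; "extremeness" is taken as the majority-sex count in case 1 and the family size in case 2, fair coin throughout):

```python
from fractions import Fraction
from itertools import product

def pvalues(N):
    half = Fraction(1, 2)
    # Case 1: family size fixed at N, observed N-1 children of one sex.
    strict1 = inclusive1 = Fraction(0)
    for seq in product((0, 1), repeat=N):
        majority = max(sum(seq), N - sum(seq))
        if majority >= N - 1:
            inclusive1 += half**N
        if majority > N - 1:
            strict1 += half**N
    # Case 2: stop at one of each, observed family size N.
    inclusive2 = 2 * half**(N - 1)  # size >= N: first N-1 children one sex
    strict2 = 2 * half**N           # size >  N: first N children one sex
    return strict1, inclusive1, strict2, inclusive2

for N in range(4, 9):
    s1, i1, s2, i2 = pvalues(N)
    assert s1 == s2 == Fraction(1, 2**(N - 1))   # strict values agree
    assert i2 == Fraction(1, 2**(N - 2)) and i1 == Fraction(N + 1, 2) * i2
    print(N, i1, i2)
```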
 
  • Like
Likes Auto-Didact and Dale
  • #115
PeroK said:
This raises an interesting question about whether the p-value should be "strict" or "inclusive".

The "inclusive" p-value is different for case #1 vs. case #2 because the number of combinations that are equally extreme as the one actually observed is different for the two cases; whereas, in this particular case, the number of combinations which are more extreme happens to be the same for both case #1 and case #2. I don't think either of those generalizes well.

PeroK said:
the fact that the (inclusive) p-value in case 2 is lower for larger ##N## might be telling us something statistically significant

It's telling you that, as ##N## goes up, the number of combinations that are equally extreme as the one actually observed increases for case #1, whereas for case #2 it remains constant (it's always just 2 combinations, the one actually observed and its counterpart with boys and girls interchanged).

However, the more fundamental point is that, no matter how you slice and dice p-values, they are answers to a different question than the question I posed in this thread. They are answers to questions about how likely the observed data are given various hypotheses. But the question I posed is a question about how likely various hypotheses are given the observed data. In most real-world cases, the questions we are actually interested in are questions of the latter type, not the former. For those kinds of questions, the Bayesian viewpoint seems more appropriate.
 
  • Like
Likes Auto-Didact and Dale
  • #116
PeterDonis said:
Summary:: If we have two identical data sets that were generated by different processes, will their statistical weight as evidence for or against a hypothesis be different?

The specific example I'm going to give is from a discussion I am having elsewhere, but the question itself, as given in the thread title and summary, is a general one.

We have two couples, each of which has seven children that, in order, are six boys and one girl (i.e., the girl is the youngest of the seven). We ask the two couples how they came to have this set of children, and they give the following responses:

Couple #1 says that they decided in advance to have seven children, regardless of their genders (they think seven is a lucky number).

Couple #2 says that they decided in advance to have children until they had at least one of each gender (they didn't want a family with all boys or all girls).

Suppose we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2. Given the information above, is the data from couple #2 stronger evidence in favor of such a bias than the (identical) data from couple #1?
Sorry if this was brought up already, but isn't something similar done in medicine with likelihood ratios, using a database of priors and adjusting? Then you can decide, assuming equal priors I guess, if the likelihood ratio is the same in both cases?

EDIT: e.g., given symptoms A, B, C, etc. and a given age, there is a certain prior attached, and then tests are given whose results have a likelihood ratio attached to them. I wonder if something similar can be done with your question, seeing if one has a higher likelihood ratio than the other?
 
Last edited:
  • Like
Likes Dale
  • #117
PeterDonis said:
the question I posed is a question about how likely various hypotheses are given the observed data. In most real-world cases, the questions we are actually interested in are questions of the latter type, not the former
That was actually the first thing that drew my attention and interest in Bayesian statistics. The outcome of Bayesian tests are more aligned with how I personally think of science and scientific questions. Plus, it naturally and quantitatively incorporates some philosophy of science in a non-philosophical way, specifically Popper’s falsifiability and Ockham’s razor.
 
  • Like
Likes jim mcnamara, Auto-Didact and WWGD
  • #118
Dale said:
That was actually the first thing that drew my attention and interest in Bayesian statistics. The outcome of Bayesian tests are more aligned with how I personally think of science and scientific questions. Plus, it naturally and quantitatively incorporates some philosophy of science in a non-philosophical way, specifically Popper’s falsifiability and Ockham’s razor.
Other than Bayes' theorem, does modern probability and mathematical statistics deal with Bayesian stats or just frequentist? EDIT: I mean the type you would study in most grad courses that are not explicitly called frequentist, which includes the CLT, LLN, etc.
 
  • #119
WWGD said:
isn't something similar done in medicine with likelihood ratios, using a database of priors and adjusting?

The results of medical tests for rare conditions are usually much better analyzed using Bayesian methods, yes, because those methods correctly take into account the rarity of the underlying condition, in relation to the accuracy of the test. Roughly speaking, if the condition you are testing for is rarer than a false positive on the test, any given positive result on the test is more likely to be a false positive than a true one. Frequentist methods don't give you the right tools for evaluating this.
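
A toy Bayes'-rule computation along these lines (Python; the prevalence and test numbers are hypothetical, chosen only for illustration):

```python
from fractions import Fraction

prevalence = Fraction(1, 1000)   # hypothetical rare condition: 0.1%
sensitivity = Fraction(99, 100)  # P(positive | disease)
false_pos = Fraction(2, 100)     # P(positive | no disease)

p_positive = sensitivity * prevalence + false_pos * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_positive
print(float(p_disease_given_pos))  # ~0.047: most positives are false positives
```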
 
  • Like
Likes Auto-Didact, Dale and WWGD
  • #120
WWGD said:
The type you would study in most grad courses that are not explicitly called frequentist
My classes were all purely frequentist, but I am an engineer who likes statistics rather than a statistician, and also school was more than a decade ago. (Significantly more, even with a small sample.)
 
Last edited:
  • #121
PeterDonis said:
The results of medical tests for rare conditions are usually much better analyzed using Bayesian methods, yes, because those methods correctly take into account the rarity of the underlying condition, in relation to the accuracy of the test. Roughly speaking, if the condition you are testing for is rarer than a false positive on the test, any given positive result on the test is more likely to be a false positive than a true one. Frequentist methods don't give you the right tools for evaluating this.

Peter, you are fairly harsh in the physics forums when nonsense is posted, so there is no reason not to point out that this is nonsense. The vast majority of medical research has used standard statistical analysis, which is based on frequentist methods.

If what you say were true there would have been a mass conversion to Bayesian methods.

I'd like to see a statistical journal where your claims about standard statistical methods being inadequate simply because a test can yield more false positives than true positives are substantiated.
 
  • #122
WWGD said:
Sorry if this was brought up already, but isn't something similar done in medicine with likelihood ratios, using a database of priors and adjusting? Then you can decide, assuming equal priors I guess, if the likelihood ratio is the same in both cases?
Yes, this is becoming more and more standard practice in medicine. There are not only journals but even undergraduate medical textbooks which directly address such issues as part of the core clinical theory of medicine. This has been this way for at least 20 years and is steadily developing.

However, from my experience of polling undergraduates and graduates, the emphasis on the utility of Bayesian methods is so marginal - both educationally and clinically - that it is practically forgotten by the time rounds begin; older physicians that are not in academia and/or not educators tend to be wholly unfamiliar with these relatively novel methods, so they straight out ignore them.
PeroK said:
Peter, you are fairly harsh in the physics forums when nonsense is posted, so there is no reason not to point out that this is nonsense. The vast majority of medical research has used standard statistical analysis, which is based on frequentist methods.

If what you say were true there would have been a mass conversion to Bayesian methods.

I'd like to see a statistical journal where your claims about standard statistical methods being inadequate simply because a test can yield more false positives than true positives are substantiated.
In medicine, frequentist statistics is only utilized for academic research i.e. generalizing from single instances to entire populations, while Bayesian statistics is used in clinical practice, i.e. specifying from generalities to particular cases. Medicine as clinical practice is purely concerned with the latter, which is why quantitative operationalizations of certain aspects of the medical process such as likelihood ratio analyses have been invented; such purely clinical quantitative methods tend to be Bayesian, i.e. the clinical application of knowledge gained using frequentist statistical methods is Bayesian.

While I get your sentiment you are simply wrong here and your misunderstanding is a widespread one in medicine as well. Moreover, you have misconstrued the actual issue by not qualifying your statement: the vast majority of medical research comparing treatments and demonstrating treatment effectiveness has used standard statistical analysis. To use the actual terminology, most medical research is quantitative research.

This terminology is extremely misleading because it pretends that standard statistical analysis is the only kind of quantitative research - something which some medical researchers will actually tell you! - which is obviously wrong! See e.g. the difference in mathematical sophistication and background required between 'quantitative finance' and 'finance'; in fact, recognizing this early on is what made me realize I had to take a degree in either applied mathematics or physics in order to learn alternative quantitative and mathematical methods for research in medicine which are completely unknown in medicine.

In any case, the fact that most research in medicine has focused only on the type of question 'does A work / is A better than B' is because practically these are the easiest types of questions to research and answer with little to no uncertainty: in fact, the path is so completely straightforward that, with statistical packages already available, all that is practically left to do is collect data and correctly feed it into the computer. This has transformed both the standard MD/PhD programme as well as the typical PhD programme in medicine into a very straightforward path which can be reduced to mastering standard statistical analysis, but I digress.

Apart from the obviously different kinds of research which require different methods - e.g. laboratory work and sociological analysis - there are of course also other types of quantitative questions that are of direct interest in medicine, both in the scientific as well as the clinical context. The problem for medicine with such quantitative questions is that they do not fit the existing mold i.e. they require alternative quantitative methods that simply aren't taught in the standard medical curriculum; Bayesian likelihood ratio analysis is an exception that is taught.

It is generally recognized by clinicians that alternative quantitative methods however are to some extent taught in other sciences. Because of this many of these alternative quantitative questions are simply directly deferred to other sciences (biomedical sciences, pharmacology, physiology and so on). The problem then remains that the purely clinical questions cannot be deferred to other sciences because they are purely practical medical issues and belong to the domain of the clinical physician. How do clinicians deal with this? They simply ignore it and/or leave it as an issue for the next generation to solve.
 
  • #123
Auto-Didact said:
While I get your sentiment you are simply wrong here and your misunderstanding is a widespread one in medicine as well.

Okay, I'm willing to believe this. But, I would like to see some evidence.

I can see the potential for the Bayesian approach. What I don't see is how the standard approach can ultimately fail in general.

Why has everyone (who uses standard statistical analysis) been wrong all along, and how many people know this?
 
  • #124
PeroK said:
Okay, I'm willing to believe this. But, I would like to see some evidence.

I can see the potential for the Bayesian approach. What I don't see is how the standard approach can ultimately fail in general.

Why has everyone (who uses standard statistical analysis) been wrong all along, and how many people know this?
I've been trying to answer this for over a decade now. If you could answer that convincingly, you'd probably get the Nobel Prize in Medicine.
 
  • #125
Auto-Didact said:
I've been trying to answer this for over a decade now. If you could answer that convincingly, you'd probably get the Nobel Prize in Medicine.

Well, I'm not after a Nobel Prize. As far as I can see, it's the traditional camp that is concerned about the reliability of Bayesian methods. Not the other way round.
 
  • #126
Deciding when to stop data collection is an important part of an experimental design to prevent the introduction of bias. My preference is to design experiments from the outset that stop either with a fixed, pre-determined number of data points, or that run for a fixed, pre-determined duration of time. It is hard to introduce a human decision to stop data collection, once it has begun, that is free of bias, especially if the human decision maker(s) are aware of the results so far.
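
As an illustration of why that matters, here is a small simulation sketch (Python/SciPy; the sample sizes, peeking schedule, and threshold are arbitrary choices of mine) of "peeking" at a null experiment and stopping as soon as it looks significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def stops_significant(n_max=200, peek_every=10, alpha=0.05):
    """Fair coin, so the null is true. Peek at a two-sided binomial test
    every `peek_every` flips and stop as soon as p < alpha."""
    flips = rng.integers(0, 2, n_max)
    for n in range(peek_every, n_max + 1, peek_every):
        k = int(flips[:n].sum())
        if stats.binomtest(k, n, 0.5).pvalue < alpha:
            return True
    return False

trials = 1_000
rate = sum(stops_significant() for _ in range(trials)) / trials
print(rate)  # well above the nominal 0.05, purely from data-dependent stopping
```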
 
  • Like
Likes jim mcnamara, Jimster41, Auto-Didact and 1 other person
  • #127
PeroK said:
Well, I'm not after a Nobel Prize. As far as I can see, it's the traditional camp that is concerned about the reliability of Bayesian methods. Not the other way round.
You're of course correct. Apart from the Nobel Prize, it is likely that a solution would go a long way toward solving the replication crisis and the problem of p-value hacking, as these all seem to be symptoms of the same disease, which is precisely why solving it is Prize-worthy in the first place.

I actually have an explanation, but the question is whether or not that explanation is going to be convincing to the traditional camp. In summary, medicine is an extremely traditional discipline: an unspoken principle is 'don't fix what ain't broken'. If one doesn't conform to the traditions of medicine, one is quickly ostracized and cast out; this almost instantly applies once one suggests going beyond the traditional boundaries. If one has to go against the foundational traditions of the medical establishment to prove their point - even if one can demonstrate that what they are doing is in fact correct - this is simply not a path that many people are willing to take.

Notice the striking resemblance between this issue and the arguments regarding the problems in the foundations of QM, which is also split into two camps: those who take the issues seriously as unjustifiable loose ends in physics - i.e. foundationalists - and those arguing that those problems aren't actually real problems and can just be straightforwardly ignored for whatever instrumental or practical reasons, such as personal convenience - i.e. pragmatists.
 
  • #128
Dr. Courtney said:
Deciding when to stop data collection is an important part of an experimental design to prevent the introduction of bias. My preference is to design experiments from the outset that stop either with a fixed, pre-determined number of data points, or that run for a fixed, pre-determined duration of time. It is hard to introduce a human decision to stop data collection, once it has begun, that is free of bias, especially if the human decision maker(s) are aware of the results so far.
This sounds like the conventional methodology to decide necessary sample sizes a priori based on power analysis used in standard statistical clinical research.

On the other hand, in the practice of clinical medicine among experienced practitioners we have a non-explanatory term for limiting data collection only to the bare minimum necessary in order to make a clinical decision: correct practice. To contrast, collecting data which cannot directly be considered to be relevant for the problem at hand is seen as 'incorrect practice'.

Engaging in incorrect practice too frequently, either deliberately or by mistake, is a punishable offense; I reckon implementing something like this would be effective as well to deter such behavior in scientific practice.
 
  • #129
PeterDonis said:
Suppose we are trying to determine whether there is a bias towards boys, i.e., whether the probability p of having a boy is greater than 1/2. Given the information above, is the data from couple #2 stronger evidence in favor of such a bias than the (identical) data from couple #1?

To get a mathematical answer, we would have to define what "evidence" for p > 1/2 means and what procedure will be used to determine that evidence_A is stronger than evidence_B.

In frequentist statistics, the common-language notion of "strength of evidence" suggests comparing "power curves" for statistical tests. To do that, you must pick a particular statistic and define the rejection region for each test. (The number of boys in the data is but one example of a statistic that can be defined as a function of the data.)

In Bayesian statistics, one can compute the probability that p > 1/2 given a prior distribution for p and the data. Suppose the two experiments A and B produce respective data sets ##D_A## and ##D_B##. For particular data sets, it might turn out that ##Pr(p>1/2 | D_A) > Pr(p> 1/2| D_B)##. However, for different particular data sets, the inequality might be reversed. So how shall we phrase your question in order to consider in general whether experiment A or experiment B provides more evidence?

I suppose one way is to consider the expected value of ##Pr(p > 1/2 | D)##, where the expectation is taken over the joint distribution of possible data sets and values of ##p## - do this for each experiment and compare answers. This is a suspicious procedure from the viewpoint of experimental design. It seems to be asking "Which experiment should I pick to give the strongest evidence that p > 1/2?". However, that seems to be the content of your question.

From the point of view of experimental design, a nobler question is "Which experiment gives a better estimate of p?". To translate that into mathematics requires defining what estimators will be used.
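
For experiment A (seven children, binomial data), the expected-posterior comparison described above can be sketched on a grid (Python/NumPy+SciPy; the uniform prior and the grid are my own assumed choices, not Stephen Tashi's):

```python
import numpy as np
from scipy import stats

grid = np.linspace(0.001, 0.999, 999)        # grid of possible values of p
prior = np.full(grid.size, 1.0 / grid.size)  # uniform prior on the grid

expected = 0.0
for k in range(8):                           # possible data sets: k boys out of 7
    joint = stats.binom.pmf(k, 7, grid) * prior
    p_data = joint.sum()                     # marginal probability of D = k
    posterior = joint / p_data
    pr_bias = posterior[grid > 0.5].sum()    # Pr(p > 1/2 | D = k)
    expected += p_data * pr_bias
print(expected)  # ~0.5, by the symmetry of the uniform prior
```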
 
  • Like
Likes Auto-Didact
  • #130
PeroK said:
Okay, I'm willing to believe this. But, I would like to see some evidence.

I can see the potential for the Bayesian approach. What I don't see is how the standard approach can ultimately fail in general.

Why has everyone (who uses standard statistical analysis) been wrong all along, and how many people know this?
Coincidentally, Sabine Hossenfelder just uploaded a video which gives a (simplified) explanation of an aspect of this same topic, which applies to all the sciences more broadly instead of just w.r.t. how statistical methodology is used by scientists in medicine:

An important general lesson to take away from the video is that biases which have not been quantified - perhaps simply because the type of bias was discovered after the statistical methodology was developed - are often ignored by scientists; this also weakens the efficacy of statistical analysis, regardless of how careful the scientists were.
 
  • #131
PeroK said:
The vast majority of medical research has used standard statistical analysis, which is based on frequentist methods.

Yes, and much of that medical research fails to be replicated. The "replication crisis" that was making headlines some time back was not limited to medical research, but it included medical research. One of the key criticisms of research that failed to be replicated, on investigation, was inappropriate use of p-values. That criticism was basically saying the same thing that @Dale and I are saying in this thread: the p-value is the answer to a different question than the question you actually want the answer to.

PeroK said:
standard statistical methods being inadequate simply because a test can yield more false positives than true positives

My point was that the p-value, which is the standard statistical method for hypothesis testing, can't answer this question for you. The p-value tells you the probability that the positive test result would have happened by chance, if you don't have the disease. But the probability you are interested in is the probability that you have the disease, given the positive test result. It's easy to find actual tests and actual rare conditions where the p-value after a positive test result can be well below the 5% "significance" threshold, which under standard statistical methods means you reject the null hypothesis (i.e., you tell the patient they most likely have the disease), but the actual chance that the patient has the disease given a positive test result is small.
 
  • Like
Likes Auto-Didact
  • #133
PeroK said:
Peter, you are fairly harsh in the physics forums when nonsense is posted, so there is no reason not to point out that this is nonsense.
Actually, what he described is pretty standard introductory material for Bayesian probability.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4585185/
 
  • Like
Likes Auto-Didact
  • #134
PeterDonis said:
That criticism was basically saying the same thing that @Dale and I are saying in this thread: the p-value is the answer to a different question than the question you actually want the answer to.
This, as well as basically the entire thread, reminds me of a quote by Cantor:
To ask the right question is harder than to answer it.

This essentially is why science in general (and physics in particular) is difficult; i.e. not because solving technical (mathematical) questions can be somewhat difficult, but instead because the right question has to be identified and then asked first. This means that in any open-ended scientific inquiry one should postpone naively mathematicizing what can easily be mathematicized if it isn't clear what is essential, i.e. prematurely mathematicizing a conceptual issue into a technical issue is a waste of time which should be avoided!

It took me quite a long while to learn this lesson because it goes against both my instincts as well as my training. Moreover, the realization that this lesson is actually useful is a recurring theme when doing applied mathematics in the service of some science, which only comes when one e.g. repeatedly tries to generalize from some particular idealization towards a more realistic description, which then generally turns out to be literally unreachable in any obvious way.
 
  • #135
Stephen Tashi said:
To get a mathematical answer, we would have to define what "evidence" for p > 1/2 means and what procedure will be used to determine that evidence_A is stronger than evidence_B.
In Bayesian statistics this is well defined and straightforward.

https://en.m.wikipedia.org/wiki/Bayes_factor

Of course, there are limitations to any technique
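
For the thread's example, a sketch (Python/SciPy; my own, using the Beta(1,1) prior and the 12-boys/2-girls data from post #104) of the Bayes factor for ##H_1: \lambda > \frac{1}{2}## against ##H_0: \lambda \le \frac{1}{2}##:

```python
from scipy import stats

prior = stats.beta(1, 1)  # flat prior on lambda
post = stats.beta(13, 3)  # posterior after 12 boys, 2 girls

prior_odds = prior.sf(0.5) / prior.cdf(0.5)  # = 1: the prior is symmetric
post_odds = post.sf(0.5) / post.cdf(0.5)
bayes_factor = post_odds / prior_odds
print(bayes_factor)  # ~270: the data strongly favor a bias toward boys
```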
 
  • #136
Auto-Didact said:
medicine is an extremely traditional discipline: an unspoken principle is 'don't fix what ain't broken'
I think there is a growing recognition of the parts of medical science that are broken. I am optimistic in the long term and even in the short term the changes are at least interesting.
 
  • #137
Dale said:
Actually, what he described is pretty standard introductory material for Bayesian probability.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4585185/

@PeterDonis I apologise as I spoke too harshly. I really don't want to get involved in a debate on medical statistics and how they are used. I didn't realize that was what was at the root of all this.

That article seems to me more about the politics of communicating with patients than about actual statistical methods themselves.

If you are all telling me that traditional statistical methods are widely misunderstood and misused in medical science, then I have no grounds to challenge that.
 
  • #138
PeroK said:
That article seems to me more about the politics of communicating with patients than about actual statistical methods themselves.
Yes, the communication with patients is particularly important since they cannot be expected to understand the statistical issues themselves. The article did talk about the fact that for rare diseases the likelihood of having the disease after receiving a positive test result is low. I.e. for rare diseases most positives are false positives.
 
  • #139
Dale said:
Yes, the communication with patients is particularly important since they cannot be expected to understand the statistical issues themselves. The article did talk about the fact that for rare diseases the likelihood of having the disease after receiving a positive test result is low. I.e. for rare diseases most positives are false positives.
Yes, but it doesn't take Bayesian methods to come to that conclusion.
 
  • Like
Likes Dale
  • #140
PeterDonis said:
The results of medical tests for rare conditions are usually much better analyzed using Bayesian methods, yes, because those methods correctly take into account the rarity of the underlying condition, in relation to the accuracy of the test. Roughly speaking, if the condition you are testing for is rarer than a false positive on the test, any given positive result on the test is more likely to be a false positive than a true one. Frequentist methods don't give you the right tools for evaluating this.

As @PeroK has pointed out, this is wrong. You are getting Bayes's rule confused with Bayesian methods. Bayes's rule is part of both Frequentist and Bayesian methods. Frequentist methods and Bayes's rule are perfectly fine for analyzing rare conditions.
 
  • Like
Likes PeroK
  • #141
atyy said:
As @PeroK has pointed out, this is wrong. You are getting Bayes's rule confused with Bayesian methods. Bayes's rule is part of both Frequentist and Bayesian methods. Frequentist methods and Bayes's rule are perfectly fine for analyzing rare conditions.
Bayes' theorem is explicitly not part of the formalism of frequentist probability theory. Any importation of Bayes' theorem into statistical practice using frequentist methods is a transition to statistical practice using Bayesian methods.
 
  • Skeptical
Likes Dale
  • #142
Auto-Didact said:
Bayes' theorem is explicitly not part of the formalism of frequentist probability theory. Any importation of Bayes' theorem into statistical practice using frequentist methods is a transition to statistical practice using Bayesian methods.

Bayes' theorem can be proved with a simple use of a Venn diagram. It must be true. It also falls out of the "probability tree" approach.

You are confusing statistical methods with probability theory. Bayes' theorem is a fundamental part of probability theory that underpins any set of statistical methods.

The Wikipedia page gives the two Bayesian and frequentist interpretations of the theorem:

https://en.wikipedia.org/wiki/Bayes'_theorem#Bayesian_interpretation
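
For reference, the proof is one line from the product rule (standard mathematics, independent of interpretation): since ##P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)##, dividing by ##P(B)## (assumed nonzero) gives ##P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}##.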
 
  • Like
Likes Dale
  • #143
I agree that Bayes' theorem is generally valid, as part of mathematics. It is instead the interpretation of probability theory based on the idea that probabilities are objective relative frequencies which specifically doesn't acknowledge the general validity of Bayes' theorem w.r.t. probabilities. Standard statistical methodology is based on this frequentist interpretation of probability theory.
 
  • Skeptical
Likes Dale
  • #144
Here, Andrew Gelman, a noted Bayesian, explicitly says that one does not need to be a Bayesian to apply Bayes's rule.

http://www.stat.columbia.edu/~gelman/research/published/badbayesmain.pdf
Bayesian statisticians are those who would apply Bayesian methods to all problems. (Everyone would apply Bayesian inference in situations where prior distributions have a physical basis or a plausible scientific model, as in genetics.)

Of course, one should not need Gelman's authority to say this. Bayes's rule is just a basic part of probability.
 
  • Like
Likes Dale
  • #145
Auto-Didact said:
It is instead the interpretation of probability theory based on the idea that probabilities are objective relative frequencies which specifically doesn't acknowledge the general validity of Bayes' theorem w.r.t. probabilities.

That is simply a fundamental misunderstanding on your part.
 
  • #146
PeroK said:
That is simply a fundamental misunderstanding on your part.
This seems to fly in the face of the literature, as well as how statistical methodology is actually practiced.

What do you mean by the term Bayesian methods? It seems that you aren't referring to any statistical methods based on Bayesian probability theory as invented by Laplace, but instead to something else much more limited in scope.
 
  • #147
Auto-Didact said:
This seems to fly in the face of the literature, as well as how statistical methodology is actually practiced.

What do you mean by the term Bayesian methods? It seems that you aren't referring to any statistical methods based on Bayesian probability theory as invented by Laplace, but instead to something else much more limited in scope.

Technically a "statistic" is, by definition, something used to estimate a population parameter. The simplest example is the mean. One of the first things you have to do is decide whether the mean is relevant. If you have some data, no one argues (within reason) over the value of the mean. The debate would be on the relevance of the mean as an appropriate statistic.

Overuse of the mean could be seen as a questionable statistical method. E.g. taking average salary, where perhaps the median is more important. Average house price, likewise.

Testing the null hypothesis and using the p-value is a statistical method. Again, there is probably no argument over the p-value itself, but of its relevance.

These are examples of traditional (aka frequentist) statistical methods.

Examples of Bayesian methods have been given by @Dale in this thread.

The example that started this thread perhaps illustrates the issues. I'll do a variation:

We start, let's say, with a family of six girls and no boys.

1) You could argue that there is no medical evidence or hypothesis that some couples have a predisposition to girls, hence there is no point in looking at this data. Instead you must look at many families and record the distribution in terms of size and sex mixture. This is simply a family with six girls - so what? - that happens.

2) You could suggest a hypothesis that this couple is more likely to have girls than boys and test that. But, with only six children standard statistical methods are unlikely to tell you anything - even if you consider this an undertaking of any purpose.

3) You could analyse the data using Bayesian methods and calculate a posterior mean for that particular couple. Again, you have to decide whether this calculation is of any relevance.

Here a general theme emerges. Bayesians are able to say something about data where traditionalists are silent. That could be good or bad. What's said could be an insight that traditional methods miss; or, it could be a misplaced conclusion.
 
  • #148
Auto-Didact said:
This seems to fly in the face of the literature, as well as how statistical methodology is actually practiced.

What do you mean by the term Bayesian methods? It seems that you aren't referring to any statistical methods based on Bayesian probability theory as invented by Laplace, but instead to something else much more limited in scope.

I found this. It looks good to me:

https://www.probabilisticworld.com/frequentist-bayesian-approaches-inferential-statistics/
 
  • #149
Auto-Didact said:
Bayes' theorem is explicitly not part of the formalism of frequentist probability theory. Any importation of Bayes' theorem into statistical practice using frequentist methods is a transition to statistical practice using Bayesian methods.
I don’t think Rev Bayes signed an exclusive licensing agreement with the Bayesianists for the use of his theorem. Frequentists can still use it.
 
  • Like
Likes PeterDonis and PeroK
  • #150
PeroK said:
The Wikipedia page gives the two Bayesian and frequentist interpretations of the theorem:

https://en.wikipedia.org/
I hope you agree that there is a huge difference between Bayes' theorem appearing as an extratheoretical, purely mathematical consequence of set-theoretic intersections and (the functions in) Bayes' theorem serving as the definition of probability; only the latter is Bayesian probability theory.
PeroK said:
Technically a "statistic" is, by definition, something used to estimate a population parameter. The simplest example is the mean. One of the first things you have to do is decide whether the mean is relevant. If you have some data, no one argues (within reason) over the value of the mean. The debate would be on the relevance of the mean as an appropriate statistic.

Overuse of the mean could be seen as a questionable statistical method. E.g. taking average salary, where perhaps the median is more important. Average house price, likewise.

Testing the null hypothesis and using the p-value is a statistical method. Again, there is probably no argument over the p-value itself, but of its relevance.

These are examples of traditional (aka frequentist) statistical methods.

Examples of Bayesian methods have been given by @Dale in this thread.

The example that started this thread perhaps illustrates the issues. I'll do a variation:

We start, let's say, with a family of six girls and no boys.

1) You could argue that there is no medical evidence or hypothesis that some couples have a predisposition to girls, hence there is no point in looking at this data. Instead you must look at many families and record the distribution in terms of size and sex mixture. This is simply a family with six girls - so what? - that happens.

2) You could suggest a hypothesis that this couple is more likely to have girls than boys and test that. But, with only six children standard statistical methods are unlikely to tell you anything - even if you consider this an undertaking of any purpose.

3) You could analyse the data using Bayesian methods and calculate a posterior mean for that particular couple. Again, you have to decide whether this calculation is of any relevance.

Here a general theme emerges. Bayesians are able to say something about data where traditionalists are silent. That could be good or bad. What's said could be an insight that traditional methods miss; or, it could be a misplaced conclusion.
I basically agree with all of this, but the question is why are Bayesians able to say something when frequentists must be silent: the answer is that they have another definition of probability.
Again, a certain formula appearing as an application when doing mathematics and a certain formula being the central definition of the theory are clearly two different things.
Dale said:
I don’t think Rev Bayes signed an exclusive licensing agreement with the Bayesianists for the use of his theorem. Frequentists can still use it.
Of course frequentists can use it, in the same sense that curved space can be imported into QFT by engaging in semi-classical physics. If they use it as a form of applied mathematics on intersecting sets then there is no foul play, but if they use it for statistical inference in such a manner that Bayes' theorem replaces the frequentist definition of probability, then they are de facto doing Bayesian statistics while merely pretending not to.

The key question is therefore if the given theorem has a fundamental status within their theory as the central definition or principle; clearly for frequentist probability theory and any statistical method of inference based thereon the answer is no.
 
  • Skeptical
Likes Dale