What is the Margin of Error in Polls and How is it Calculated?

  • Thread starter: Vanadium 50
  • Tags: Margin
Summary
The discussion centers on the concept of "margin of error" (MOE) in polling, questioning its statistical basis and implications. The MOE is typically understood as a 95% confidence interval, indicating the range within which the true support for candidates likely falls, but the accuracy of this measure can be affected by factors like sample size and methodology. Participants express concerns about the potential for systematic biases in polling, especially when different methods yield varying results, complicating comparisons between polls. The conversation also touches on the role of Bayesian versus frequentist analyses in interpreting polling data, with some arguing that Bayesian methods may not be more sensitive to prior assumptions than other approaches. Ultimately, the complexities of polling methodologies and their implications for the reliability of MOE are emphasized throughout the discussion.
  • #31
I think we're saying similar things - we can beat √N by replacing counting with pre-existing knowledge. When we buy a dozen eggs, there is no √12 uncertainty.

The problem - or a problem - comes up, as Dale (and Mark Twain before him) pointed out, when these assumptions turn out to be incorrect. "I don't have to poll North Springfield because Jones has it all in the bag". Well, what if she doesn't? And how would you know?

At some point, you are crossing the line between corrected polling and poll-inspired modeling. Which means at some point you are no longer quoting a statistical estimate of uncertainty but an expert's estimate.

Farther along that path and we're into the realm of fortunetelling. I don't think we are there yet, but it would be a pity if we got there someday.
 
  • #32
Vanadium 50 said:
"I don't have to poll North Springfield because Jones has it all in the bag". Well, what if she doesn't? And how would you know?
I don't think that the modeling errors are of this type. It is a lot more subtle. It is more like: in the US census 20% of the population has less than 4 years college in Somewhereville. Only 15% of my Somewhereville sample has less than 4 years of college, so I need to correct for my undersampling of the less than 4 years college population. But what I don't realize is that now it is actually 25%, so my correction isn't large enough.
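The size of this residual bias is easy to see with made-up numbers. In this sketch, the within-stratum support rates (60 % among the under-4-years-college group, 40 % among everyone else) are assumed purely for illustration; only the 15 % / 20 % / 25 % shares come from the example above:

```python
# Hypothetical illustration: weighting to a stale census share corrects
# in the right direction, but not far enough. Support rates are made up.
support = {"lt_4yr_college": 0.60, "other": 0.40}

def weighted_estimate(share_lt4):
    """Overall support if the <4-yr-college share were share_lt4."""
    return share_lt4 * support["lt_4yr_college"] + (1 - share_lt4) * support["other"]

raw    = weighted_estimate(0.15)  # what the unweighted sample shows: 0.43
census = weighted_estimate(0.20)  # after weighting to the census share: 0.44
actual = weighted_estimate(0.25)  # with the true current share: 0.45
```

The census weight closes only part of the gap (0.43 → 0.44, with the truth at 0.45); the remainder is a systematic error that a larger sample does not remove.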
 
  • Like
Likes FactChecker
  • #33
Yes, but the issues are exposed by looking at the limiting cases. I think we can all agree that, for example, in the US Presidential Election of 2024, polling Pennsylvania heavily tells you more than polling Hawaii or Utah.

Further, if your goal is to beat √N, you have to count what you need to count over what you (believe) don't. Otherwise your error doesn't go down.
 
  • #34
Vanadium 50 said:
Yes, but the issues are exposed by looking at the limiting cases.
I don't think that "I don't have to poll North Springfield because Jones has it all in the bag" is a limiting case of anything that high quality pollsters actually do. There is a difference between a limiting case and a strawman.
 
  • Like
Likes FactChecker
  • #35
Vanadium 50 said:
Yes, but the issues are exposed by looking at the limiting cases.
I don't believe that they make naive mistakes. Their analysis is far more sophisticated than I will ever understand.
Vanadium 50 said:
I think we can all agree that, for example, in the US Presidential Election of 2024, polling Pennsylvania heavily tells you more than polling Hawaii or Utah.

Further, if your goal is to beat √N, you have to count what you need to count over what you (believe) don't. Otherwise your error doesn't go down.
Some things can be determined with good accuracy, like the percentage of people of certain age, education, wealth, internet connection, smartphone ownership, living in the country versus the city, income level, etc., in a state. Those have a strong influence on voting trends. They should not be ignored. Stratified sampling can take that into account (within reason and limits). It's good to know the characteristics of your sample. Even then, there is a lot of uncertainty.
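As a minimal sketch of what post-stratification weighting looks like in practice (the strata, population shares, and respondents below are all invented for illustration):

```python
# Post-stratification sketch: reweight respondents so the sample's
# demographic mix matches the population's, then take the weighted mean.
population_share = {"urban": 0.55, "rural": 0.45}  # assumed, e.g. from a census

sample = [  # (stratum, supports_candidate_A) -- made-up respondents
    ("urban", 1), ("urban", 1), ("urban", 0), ("urban", 1), ("urban", 1),
    ("urban", 0), ("rural", 0), ("rural", 1), ("rural", 0), ("rural", 0),
]
n = len(sample)

# Fraction of the sample in each stratum, and the correction weights.
sample_share = {s: sum(1 for g, _ in sample if g == s) / n for s in population_share}
weights = {s: population_share[s] / sample_share[s] for s in population_share}

# Weighted estimate of overall support (equals the stratum means
# recombined with the population shares).
weighted_mean = sum(weights[g] * y for g, y in sample) / n
```

The urban respondents (overrepresented at 60 % of the sample versus 55 % of the population) are weighted down, the rural ones up; the result is exactly the stratum means recombined with the census shares.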
 
  • Like
Likes Klystron and Dale
  • #36
I don't think they make naive mistakes either. A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree, so you can oversample the subset where you don't.

This trades uncertainty in one subsample for uncertainty in another. And this can improve the overall error.

The problems start when the assumptions on the "well-known" sample turn out to be incorrect. They are aggravated if the undersampling is sufficient to hide the discrepancy between what is expected and what is observed.

In the physical sciences we would say that one is reducing the statistical error at the cost of increased systematic error.
 
  • Like
Likes FactChecker
  • #37
Vanadium 50 said:
A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree
I don't think that high quality pollsters do that at all. The corrections, stratifications, and so forth are based on the independent variables that go into your statistical model (demographics), not the dependent variable that comes out of the statistical model (opinion). That is why one is not an edge case of the other. They don't make their sampling decisions based on assumed knowledge of the dependent variables.
 
Last edited:
  • Like
Likes FactChecker
  • #38
Dale said:
I don't think that high quality pollsters do that at all.
Then how do you think they get errors below √N?
 
  • #39
Vanadium 50 said:
A more realistic "mistake" is to deliberately undersample subsets where you think you already know the answer to some degree, so you can oversample the subset where you don't.
Define "undersample". If there is a subset of the population that has a small variation in the dependent variable of interest the data should show that. Is it a mistake to sample that subset less? Why? Spending time and money to drive that subset standard deviation lower than necessary instead of using it on other subsets where the uncertainties are greater would be a mistake. The goal is to get the best estimate for your time and money.
 
  • #40
If you prefer "sample less", I am OK with that.
 
  • #41
Vanadium 50 said:
If you prefer "sample less", I am OK with that.
My point is that it is often the smart thing to do to get the best answer for the time and money. Increasing the sample size is not the only way to improve the result.
 
  • #42
At the risk of being beaten up again for strawmen: if I have a box labeled "10,000 white balls" and a box labeled "10,000 red balls" and a third box labeled "10,000 balls, mixed red and white", I only need to sample the last box.

If the first two boxes say "9000 red (white) and 1000 white (red)" I still only need to sample the third box.

I will get in trouble if the contents of the first two boxes don't match the labels.
 
  • #43
Vanadium 50 said:
Then how do you think they get errors below √N?
Through stratified sampling of the independent variables, if indeed they do actually get errors below ##\sigma/\sqrt{N}##.
 
  • Like
Likes FactChecker
  • #44
Vanadium 50 said:
At the risk of being beat up again for strawmen, if I have a box labeled "10,000 white balls" and a box labeled "10,000 red balls" and a third box labeled "10,000 balls, mixed red and white" I only need to sample the last box.
For what purpose?
Suppose you are studying the emergency stopping distance of car drivers. Suppose that half the cars in the general population have ABS and half do not. Also, suppose that stopping distance of cars with ABS have a standard deviation of 5 feet, but cars without ABS have a standard deviation of 30 feet because some drivers pump the brakes well, some pump brakes too slowly, and others don't pump brakes at all. Every stopping test costs $500, so you can only test 1000 drivers. You should not pick a sample ignoring who has ABS. You will get a more accurate result if you test more without ABS.

Suppose you are polling voters. Older voters are more likely to vote for candidate A and younger voters for candidate B. The general population has 30% over 60 years old, but your poll is on smartphones and your sample only had 15% over 60 years old. You should apply stratified sampling techniques to adjust your sample results to better match the general population.
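The ABS example maps directly onto the textbook Neyman allocation formula, ##n_h \propto W_h \sigma_h## (more budget where the spread is larger). A sketch with the numbers from the example; this is the standard formula, not a claim about any particular pollster's practice:

```python
import math

# Two equal-size strata with the stated standard deviations (feet),
# and a budget of 1000 stopping tests.
shares = {"abs": 0.5, "no_abs": 0.5}
sigmas = {"abs": 5.0, "no_abs": 30.0}
budget = 1000

# Neyman allocation: n_h proportional to W_h * sigma_h.
denom = sum(shares[s] * sigmas[s] for s in shares)
alloc = {s: budget * shares[s] * sigmas[s] / denom for s in shares}

def stratified_se(n_by_stratum):
    """Standard error of the stratified mean for a given allocation."""
    return math.sqrt(sum(shares[s] ** 2 * sigmas[s] ** 2 / n_by_stratum[s]
                         for s in shares))

equal = {s: budget / 2 for s in shares}
# alloc puts roughly 143 tests on ABS cars and 857 on non-ABS cars,
# and stratified_se(alloc) < stratified_se(equal).
```

With these inputs the optimal split is about 143/857 rather than 500/500, which is exactly the "test more without ABS" conclusion above.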
 
Last edited:
  • #45
Vanadium 50 said:
Then what am I to make of two polls that differ by 2x or 3x the margin of error?
That it is not feasible to obtain an unbiased sample from a population of 160 million voters.
 
  • #46
FactChecker said:
You should apply stratified sampling techniques to adjust your sample results to better match the general population.
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
 
  • #47
Vanadium 50 said:
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
It is easy if there are groups that cluster around certain values with small variances within each subsample. The mathematics of it is simple. This example was given in post #24.

FactChecker said:
Suppose you have a sample from two groups of equal sizes, one clustered closely around 100 and the other clustered closely around -100. By grouping the subsamples, you have two small subsample variances. The end result will be smaller than if you ignored the groups and had a lot of large ##(x_i-0)^2 \approx 100^2## terms to sum.
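A quick numerical version of that toy example, with an assumed within-group standard deviation of 1, and assuming the equal group shares are known exactly rather than estimated from the sample:

```python
import math
import random
import statistics

random.seed(1)
# Two equal groups clustered tightly around +100 and -100
# (within-group sd of 1 is an assumed value).
n = 1000
group_a = [random.gauss(+100, 1) for _ in range(n // 2)]
group_b = [random.gauss(-100, 1) for _ in range(n // 2)]

# Ignoring the groups: the pooled spread is ~100, so the naive
# standard error of the mean is ~100/sqrt(n).
pooled_sd = statistics.stdev(group_a + group_b)
naive_se = pooled_sd / math.sqrt(n)

# Using the known 50/50 shares: each subsample mean has a tiny standard
# error, and the stratified mean inherits it (weights W = 0.5 each).
strat_se = math.sqrt(0.25 * statistics.stdev(group_a) ** 2 / (n // 2)
                     + 0.25 * statistics.stdev(group_b) ** 2 / (n // 2))
```

Here the stratified standard error is roughly a hundred times smaller than the naive one; note the whole gain rests on knowing the 50/50 split in advance rather than counting it from the sample.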
 
Last edited:
  • #48
FactChecker said:
It is easy if there are groups that cluster around certain values
I don't see it. Take two delta functions. Your uncertainty on the total mean is not zero. It's driven by your uncertainty in counting how many elements are in each distribution. And you are back to √N.

You can beat it if you don't have to count. But again, now we are moving away from polls.
 
  • #49
Vanadium 50 said:
I don't see it. Take two delta functions. Your uncertainty on the total mean is not zero. It's driven by your uncertainty in counting how many elements are in each distribution.
Those are often well known about the general population. Age distributions, wealth, education levels, home locations, etc. are all known fairly well from the government population census. A pollster will probably not rely on sampling to determine those characteristics about the general population. He has better sources for that information. On the other hand, he will record those characteristics about his sample so that he can adjust his sample results, if necessary, to better reflect the general population.
Vanadium 50 said:
And you are back to √N.
No. The individual variances within the subgroups may be greatly reduced.
 
  • #50
Vanadium 50 said:
Fine. Let me then ask yet again: how do you beat the √N uncertainty, as the CNN poll I mentioned claims to?
Vanadium 50 said:
The latest CNN poll has N=2074 and a stated margin of error of 3.0%. It's already hard to reconcile those two numbers, especially at 2σ. It's certainly not the binomial error.
So I did a brief Monte Carlo simulation with a poll result represented as a draw from a binomial distribution with N=2074 and p=0.5. I simulated 1000 such polls. The mean was 0.5002 with a standard deviation of 0.0107. So a margin of error of 3.0% is greater than twice the standard deviation (0.0214). This is not an example of "beat[ing] √N uncertainty".

Since a lot of polls use Bayesian techniques I also calculated the posterior for a 50/50 split on 2074 responses using a flat Beta distributed prior because the Beta distribution is a conjugate prior for Binomial or Bernoulli data. With that I got a 95% credible interval of plus or minus 0.0215, which is almost identical to the Monte Carlo result above.

This example does not appear to be an example where the margin of error is lower than what can be justified based on the sample size. In fact, it seems about the opposite. It seems that there is about 1% additional statistical uncertainty included beyond the idealized uncertainty. This could include the fact that the result was not 50/50, and also possibly that the weighting that was needed for this sample increased the overall variance.
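For the record, a standard-library sketch reproducing both of those numbers. The Monte Carlo follows the description above; the normal approximation to the Beta(1038, 1038) posterior is my shortcut, since the post doesn't say how the credible interval was computed:

```python
import math
import random
import statistics

random.seed(0)
# 1000 simulated polls, each a binomial draw with N = 2074, p = 0.5,
# summarized as a proportion.
N, p, sims = 2074, 0.5, 1000
draws = [sum(random.random() < p for _ in range(N)) / N for _ in range(sims)]
mc_mean = statistics.mean(draws)   # ~0.50
mc_sd = statistics.stdev(draws)    # ~0.011

# Flat Beta(1, 1) prior + a 50/50 split on 2074 responses gives a
# Beta(1038, 1038) posterior; approximate its 95% interval as
# mean +/- 1.96 sd using the exact Beta variance.
a = b = N // 2 + 1
post_sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
half_width = 1.96 * post_sd        # ~0.0215
```

Both routes land at about ±0.021 at 2σ, comfortably inside the stated 3.0 % margin of error.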
 
Last edited:
  • Like
Likes Klystron, Vanadium 50 and FactChecker
  • #51
There is another theoretical aspect of polling: How is the usual variance equation influenced by the constraint that the total of the percentages must add up to 100%?
I have no experience with this.
 
  • #52
If you have a sample of N, and the fraction voting for Jones is f, the uncertainty on that number is ##\sqrt{Nf (1-f)}##.
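Plugging in the CNN numbers from earlier in the thread (N = 2074, f = 0.5):

```python
import math

# Binomial uncertainty on the count of Jones voters, sqrt(N f (1-f)).
N, f = 2074, 0.5
count_uncertainty = math.sqrt(N * f * (1 - f))  # ~22.8 respondents
fraction_uncertainty = count_uncertainty / N    # ~0.011, i.e. about 1.1 %
```

The count uncertainty is about 22.8 respondents; divided by N it is about 1.1 % on the fraction.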
 
  • Like
Likes hutchphd and FactChecker
  • #53
Vanadium 50 said:
If you have a sample of N, and the fraction voting for Jones is f, the uncertainty on that number is ##\sqrt{Nf (1-f)}##.
Yes. But what we are talking about is the uncertainty on that number divided by ##N## (times 100 %). For the CNN poll that works out to 1.1 %, which is right in line with my Monte Carlo simulation and the Bayesian posterior.
 
Last edited:
  • #54
I trust @FactChecker 's ability to do algebra to convert that formula to whichever one he is most interested in. (You probably want to divide by Nf and not N in most cases.)
 
  • #55
Vanadium 50 said:
I trust @FactChecker 's ability to do algebra to convert that formula to whatever one he is most interested in.
OK, but you are claiming that the CNN poll "beat the ##\sqrt{N}## uncertainty", which it didn't.
 
  • #56
So, I repeated Dale's Monte Carlo, and got a 1σ variation of 1.2%. I did some things slightly differently (e.g., a 48-47-5 true distribution), but would say we agree. There are also a couple of things I did that I didn't like, for expedience's sake. Did you know Excel doesn't have a POISSON.INV function?

So I am convinced.

Even so, I think this number is, if not questionable, at least discussable. It implies that the 1σ uncertainty on the sample correction is 0.9%, which is nine respondents in each column. I'll leave it to people to decide for themselves if they believe that a gigantic poll would get the correct answer to better than 1%.

Insofar as the betting odds people are rational actors, they believe the poll errors are underestimated, or equivalently that this race is even closer than the polls suggest. I'm not saying they are right and I am not saying they are wrong - just that that is what they are betting their own money on.
 
  • #57
Vanadium 50 said:
Insofar as the betting odds people are rational actors, they believe the poll errors are underestimated
I think that is accurate. The betting odds people are making a prediction on behavior, while the pollsters are (in the best case) making a measurement of opinion. So the uncertainty in the behavior prediction is much greater than the uncertainty in the measurement of opinion. And the uncertainty in the measurement of opinion is also greater than just the margin of error.
 
  • #58
Dale said:
making a prediction on behavior, while the pollsters are (in the best case) making a measurement of opinion
That would have sent one of my social sciences professors into a tizzy. He always argued that polls measure behavior - they measure what people say they think, not what they actually think. :smile:

He was also rumored to make his own moonshine. FWIW.

However, I think you're still dealing with a difference in behavior - what people will say and how people will vote.
 
  • Like
Likes Bystander and Dale
  • #59
The actual results depend on the weather, job demands, attitude regarding whether their vote matters, etc.
Those are sources of variability that are hard to factor in, and I am not sure we would want pollsters to try.
 
  • #60
While people did argue "no, this is just opinion", it's pretty clearly really an attempt to prognosticate. Otherwise, why use likely voters? Why not include everyone - resident aliens, illegal aliens, those under 18, and so on. They have opinions as well.

The bigger issue is, of course, that US presidents are elected by the states, not the populace at large. Changing opinions in California or Wyoming makes no difference. So what is being measured is correlated with electoral outcome but not the same.

A must-win district for the Democrats is NE-2, Omaha. This is what got me thinking about this. Harris is polling 11 points ahead of Biden in the latest poll. That's well above the margin of error, and well above the national shift. Maybe they just really dig her in Omaha. But in an election where both candidates have high floors and low ceilings, an 11 point swing cries out for explanation.

BTW, this is also a CNN poll, contemporaneous with the 2074-subject national poll. I am hoping these are two completely separate polls, and not that a third of the people surveyed in the national poll are from Omaha.
 
