# Average of unweighted data

1. Jul 26, 2014

### rwf

I have some data from gas emission samples. Part of the data was sampled in a region with characteristic A and the rest in a region with characteristic B. However, each region corresponds to a different percentage of the total area.

I want to calculate a central value (mean or median) for unbalanced data whose groups carry different weights. The data have the following characteristics:
- I have 35 observations in total, with 15 observations in group A and 20 in group B.
- The 20 observations of group B have a much higher average (approx. 40x higher) than group A.
- Group A corresponds to 58% and group B to 42% of my study area.

I used a weighted average to calculate the overall average:
((Average Group A * 0.58) + (0.42 * Average Group B)) / 2.
Am I right to do this?
And how can I calculate the overall median?

Thanks for your attention.

2. Jul 27, 2014

I think you are referring to the mean? The median is defined as the central data point, found by excluding the extremes.

Where do the percentages come from?

3. Jul 27, 2014

### rwf

I still do not know what best describes a central value of my data.
My data distribution is not normal. Group A is composed primarily of low values and corresponds to 58% of the study area. Group B is composed mainly of high values and corresponds to 42% of the study area. I think the best calculation to describe a central value here is the median. However, I can't simply take the median of the 35 values, because 15 of them belong to a group with one weight and 20 of them belong to another group with a different weight.
The percentages come from the study area; I just converted them from m² to %.

4. Jul 27, 2014

### Stephen Tashi

What determines whether one way is better than another? If you had two different numbers for "central value", could you yourself say which one was best? Or are you relying on some outside authority (like the editor of a scientific journal) to approve the number you use?

5. Jul 27, 2014

### FactChecker

You have the right idea but don't divide by two. The weights of .58 and .42 do all the averaging you need since they total 1.0 without dividing by 2.
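To make the correction concrete, here is a minimal sketch of the weighted mean. The group means `1.0` and `40.0` are hypothetical placeholders (only the roughly 40x ratio comes from the thread), and the function name is made up for illustration.

```python
def stratified_mean(strata_means, strata_weights):
    """Weighted average of per-stratum means; the weights must sum to 1."""
    if abs(sum(strata_weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # No extra division: the weights already do the averaging.
    return sum(m * w for m, w in zip(strata_means, strata_weights))


# Hypothetical group means (assumption): group B is 40x higher than group A.
mean_a, mean_b = 1.0, 40.0
overall = stratified_mean([mean_a, mean_b], [0.58, 0.42])
print(overall)
```

Dividing this result by 2 would halve the estimate for no reason, since 0.58 + 0.42 already equals 1.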

6. Jul 28, 2014

### gill1109

If you are interested in the mean then you should weight the two subsample means using the weights .58 and .42. If you are interested in the median then the procedure is a bit more tricky. You could imagine doing it like this: you have 15 observations in one group and 20 in the other. Duplicate each observation of the first group 20 times and each of the second group 15 times. Now you have two equal-sized "samples" (300 values each). Now duplicate each value in those groups 58 or 42 times, depending on whether it belongs to the first group or the second group. Merge them together and take the median. This is a sensible estimate of the median of the original stratified population (it has two strata). You took two samples, one from each stratum, but the sizes of those two samples are not in the same proportion as the sizes of the two strata. Hence you must do a bit of juggling to figure out where the median of the whole population would be, once you have, as it were, projected the two samples onto the two complete population strata.
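The duplication recipe can be sketched literally in a few lines. This is only an illustration under stated assumptions: the sample values are invented (15 low values for A, 20 high values for B), and `replicated_median` is a hypothetical helper name.

```python
from statistics import median


def replicated_median(sample_a, sample_b, pct_a, pct_b):
    """Median of the pooled data after duplicating each value so that
    group A makes up pct_a% and group B pct_b% of the pooled list."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = []
    # Each A value is copied n_b * pct_a times: A contributes n_a*n_b*pct_a copies.
    for x in sample_a:
        pooled.extend([x] * (n_b * pct_a))
    # Each B value is copied n_a * pct_b times: B contributes n_a*n_b*pct_b copies.
    for x in sample_b:
        pooled.extend([x] * (n_a * pct_b))
    return median(pooled)


# Invented data (assumption): 15 low values in A, 20 high values in B.
sample_a = list(range(1, 16))
sample_b = [40 * k for k in range(1, 21)]

weighted = replicated_median(sample_a, sample_b, 58, 42)
naive = median(sample_a + sample_b)  # ignores the area percentages
print(weighted, naive)
```

With these invented numbers the weighted median lands inside group A (because A covers more than half of the area), while the naive median of all 35 values lands inside group B.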

7. Jul 31, 2014

### rwf

Yeah, okay I think I got a result that makes sense.
Thanks.

Can I calculate a deviation for these data using this mean?
I cannot understand how to compute a standard deviation with weighted data.

8. Jul 31, 2014

### rwf

Wow, this was tricky! I'll try it and then post the results.
Thanks, by the way.

9. Jul 31, 2014

### rwf

I cannot, but with very low values accumulated at one end and very high values accumulated at the other end, I believe the median would be ideal, right?

10. Jul 31, 2014

### FactChecker

Yes, you can use this mean to calculate the sum-squared-deviations and then apply weights to those sums for each type. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, where M is the number of nonzero weights, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
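A rough sketch of that weighted standard deviation, including the (M-1)/M factor, might look like the following. The per-observation weights (stratum share divided by stratum sample size) and the data values are assumptions made for illustration, not something stated in the post.

```python
import math


def weighted_mean_sd(values, weights):
    """Weighted mean and weighted standard deviation.

    Variance = sum(w * (x - mean)^2) / (((M - 1) / M) * sum(w)),
    where M is the number of nonzero weights.
    """
    total_w = sum(weights)
    mean = sum(w * x for x, w in zip(values, weights)) / total_w
    m_nonzero = sum(1 for w in weights if w != 0)
    ss = sum(w * (x - mean) ** 2 for x, w in zip(values, weights))
    variance = ss / ((m_nonzero - 1) / m_nonzero * total_w)
    return mean, math.sqrt(variance)


# Invented data (assumption): a few low values in A, a few high values in B.
sample_a = [1.0, 2.0, 3.0]
sample_b = [40.0, 80.0]
# Assumed weighting: each observation carries its stratum share / stratum size.
weights = [0.58 / len(sample_a)] * len(sample_a) + [0.42 / len(sample_b)] * len(sample_b)

mean, sd = weighted_mean_sd(sample_a + sample_b, weights)
print(mean, sd)
```

With equal weights this reduces to the ordinary sample standard deviation. Note that it describes the spread of the combined data, which is not the same thing as the standard error of the weighted mean.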

11. Aug 1, 2014

### gill1109

The mean is the average. If data is very skewed then the average might not be a very interesting summary statistic. The median is the number such that half are smaller, half are larger. We always know what it means, at least. Whether or not it is interesting/important/useful depends on what you want to use it for.

Casinos and insurance companies are interested in the mean because they are interested in the ratio of total income to total expenses. Mean equals total divided by number of cases. They have little interest in the median, except perhaps for advertising purposes, e.g. 50% chance of winning a prize! (mainly of winning a prize which is worth less than the fee for joining)

12. Aug 1, 2014

### Stephen Tashi

Your basic problem is that "median" and "mean" are only specific when they refer to a specific random variable. You haven't defined what your random variable is.

You have two batches of emission numbers of different sizes.

If you assume each batch is drawn from the same population of emission numbers then you can define the random variable as "pick an emission number at random from the population".

If you assume the two batches are from different populations of emission numbers, you could define a random variable as "Pick a population at random, with probability of 1/2 of selecting a given population. Then pick a number at random from the population that was selected."

If you assume the two batches are drawn from different populations of emission numbers, you could also define a different random variable as "Pick a population at random, letting the probability of picking the population be proportional to how many samples are in its corresponding batch. Then pick a number at random from the population that was selected".

As a matter of correct speech, we can't "find" the median of a population by doing a computation on sample data. What we are doing is "estimating" the population median. If you have a sample, we can "find" the sample median.

13. Aug 1, 2014

### gill1109

You haven't covered the actual case of interest. We are talking about two samples, one from each of two strata (sub populations) of one population. We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different size. The two samples are of different sizes. The ratio between sample sizes is not the same as the ratio of strata-sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* these ratios.

In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples one from each sub-population of size n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of each subpopulation by using the empirical distributions. Then combine the empirical distributions in the correct ratio p1 to p2. Then figure out the mean or the median or whatever you are interested in, of your final estimate of the population distribution.

14. Aug 1, 2014

### Stephen Tashi

OK, "we" are, but I don't know about the original poster.

I agree.

I agree that the frequentist approach is to consider p1 and p2 "fixed but unknown" numbers. I like that better than saying they are known.

I agree that it has intuitive appeal to use the batch sizes to estimate p1 and p2. But what are the properties of the estimator that you have suggested? Most nice estimation theory deals with estimating means. I'm not saying your estimator is bad, just that I'm unfamiliar with any results about it. For example, is it unbiased?

15. Aug 1, 2014

### FactChecker

This is a problem studied in stratified population sampling. The book "Sampling Techniques" by Cochran is a standard reference on the subject. But I can not find the equation in this link. The equation seems to make sense but I can not verify it. I am struggling with the notation in the book. If I can find the equation in Cochran, I will post a follow-up.

16. Aug 2, 2014

### gill1109

I refer back to the original problem of the original poster!

And he tells us that p1 and p2 are known!! Do not use the batch sizes to estimate them. The batch sizes are arbitrary. Please read original post.

Modern statistical theory is about estimating probability distributions. And estimating functionals of probability distributions. The sample average is the same functional of the empirical probability distribution, as the population mean is of the true probability distribution. The sample median idem. Read a modern book on statistics such as J. A. Rice "Introduction to Mathematical Statistics and Data Analysis". I see that present day physicists are not taught any statistics or probability (or hardly any). They know less about 20th century probability and statistics than students of psychology or of various social sciences. Not to mention 21st century probability and statistics. It's a very weird situation (sorry for rant).

17. Aug 2, 2014

### Stephen Tashi

The original post mentions percentages of areas. I agree we can use those for p1 and p2 provided the random variable of interest involves picking an area to sample with a probability that is proportional to that area. We really should be clear about the definition of the random variable.

That's good. Hypothesis testing doesn't capture my imagination!

I grasp an intuitive meaning for that statement - the pre-modern definition of a "consistent" estimator was that it used a form of calculation on the sample data that was similar to the form of calculation used on population distribution to compute the parameter being estimated.

In the case of a continuous population distribution, we could assume the "empirical probability distribution" is a continuous distribution. "The mean" is a functional whose domain is a set of continuous probability distributions. Hence "the sample mean" and "the population mean" are names denoting a functional evaluated at two different places in its domain. That's a modern approach, but it lacks punch. It just says "I defined a functional. Hence a reference to the functional evaluated at one place in its domain and a reference to the functional evaluated at another place in its domain are references to the same functional."

I'm unfamiliar with the education of present day physicists, so I'll skip reading an entire book just to see what they are missing!

The description of the current problem would benefit from more physics before any mathematics is applied. As I visualize it, we might be dealing with some sort of cloud of pollution over a city and we want to estimate the spatial mean of the pollution. Were all these emission levels measured at the same time on the clock?

18. Aug 2, 2014

### gill1109

*The* empirical distribution based on a sample of size N from some population is the probability distribution putting probability mass 1/N on each of the N sample points. It's a technical term. A standard term of modern applied statistics. It's obviously not a continuous distribution, whether or not the underlying true distribution is continuous. Despite this many functionals of it are decent estimators of the same functional of the underlying distribution.

19. Aug 2, 2014

### FactChecker

This tells me to take the weighted average of the individual strata-mean estimates. That was the original question. It seems like everyone is saying the same thing.

Last edited: Aug 2, 2014

20. Aug 2, 2014

### FactChecker

The mean is much more stable. The median can be very sensitive to a small change in the percentages. Mode can be very sensitive to a small probability spike. I doubt that those are preferable for most uses. On the other hand, small changes in percentages of the strata or in the probability function will only give small changes in the mean. I would only use the median or mode if there was some special reason that those characteristics were significant for your application.

Last edited: Aug 2, 2014

21. Aug 2, 2014

### Stephen Tashi

It's a digression from the original problem to get into a discussion of what is meant by "the same functional". Perhaps it deserves a separate thread. I'm distracted by the question of how one would define a functional on discrete distributions to be "the same" as a functional defined on continuous distributions. However, as I said, I understand the pre-modern interpretation of what you say.

I repeat my request for rwf to reveal more details about the physics of the problem.

22. Aug 2, 2014

### FactChecker

I want to correct something in the top answer in the referenced link http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. It says to use weights equal to the number of samples in each stratum divided by the total number of samples. That is wrong. If you know them, use the real weights of each stratum in the whole space. There are many examples (e.g. importance sampling) where the sample numbers are intentionally distorted to get a more accurate answer. In those cases, the real proportion of each stratum should still be used.

23. Aug 3, 2014

### gill1109

Mathematics. We all know what a probability distribution on the real line is. We all know what the median of a distribution is. We all know what the mean of a distribution is. In modern statistics, the "empirical distribution" of the data is the probability distribution putting probability mass 1/N on each observed data point.

https://en.wikipedia.org/wiki/Empirical_distribution_function

This is completely standard, it's in all the text-books. This way of thinking motivates and explains so-called bootstrap methods. Very popular in today's computer age.

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Last edited by a moderator: May 6, 2017

24. Aug 3, 2014

### FactChecker

I followed another link from the standard deviation link I gave before. The National Institute of Standards and Technology gives the equation for an (unbiased?) estimator of the weighted standard deviation that appeared in the original link. Here is the link: http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf. I consider this to be authoritative. And the equation makes sense, which is always a bonus.

I could not find an equation in Cochran.

25. Aug 3, 2014

### gill1109

Nowadays we hardly need formulas any more. We can just use the statistical bootstrap

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Let the computer work, much more reliable than us thinking or implementing formulas from dusty textbooks, as long as we understand the basic principles.
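One possible way to let the computer do that work is a stratified bootstrap: resample each group with replacement, keeping the group sizes, and recompute the weighted median every time. This is only a sketch under assumptions. The data are invented, the function names are hypothetical, and the percentile interval is just one of several bootstrap interval choices.

```python
import random


def mixture_median(samples, probs):
    """Median of the combined empirical distribution, where each point of
    stratum h carries probability mass p_h / n_h."""
    points = sorted((x, p / len(s)) for s, p in zip(samples, probs) for x in s)
    cumulative = 0.0
    for x, mass in points:
        cumulative += mass
        if cumulative >= 0.5:
            return x
    return points[-1][0]


def stratified_bootstrap_ci(samples, probs, stat, n_boot=2000, seed=0):
    """Rough 95% percentile interval for stat(samples, probs).
    Each stratum is resampled separately, with replacement, at its own size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled = [rng.choices(s, k=len(s)) for s in samples]
        estimates.append(stat(resampled, probs))
    estimates.sort()
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[int(0.975 * n_boot) - 1]
    return lower, upper


# Invented data (assumption): 15 low values in A, 20 high values in B.
sample_a = [float(k) for k in range(1, 16)]
sample_b = [40.0 * k for k in range(1, 21)]
low, high = stratified_bootstrap_ci([sample_a, sample_b], [0.58, 0.42], mixture_median)
print(low, high)
```

The strata shares 0.58 and 0.42 stay fixed in every replicate because they are known, while the observations within each stratum are what gets resampled.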

Last edited by a moderator: May 6, 2017