Average of unweighted data

In summary: Given the distribution of the data, the median may be a better measure of the central value than the mean in this case. As for calculating the standard deviation of weighted data, there is a specific formula that applies the weights to the sum of squared deviations. You can find this formula and more information on how to calculate a weighted standard deviation here: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. Keep in mind that there is also an adjustment factor that needs to be applied to the equation to account for the fact that the mean is estimated from a sample rather than being the true mean.
  • #1
rwf
I have some data from gas emission samples. Part of the data was sampled in a region with characteristic A, and the rest in a region with characteristic B. However, each region corresponds to a different percentage of the total area.


I want to calculate the central value (average or median) of unbalanced data with different weights, with the following characteristics:
- I have 35 observations in total, with 15 observations in group A and 20 in group B.
- The 20 observations of group B have a much higher average (approx. 40x higher) than group A.
- Group A corresponds to 58% and group B to 42% of my study area.


I used a weighted average to calculate the overall average:
((Average Group A * 0.58) + (0.42 * Average Group B)) / 2.
Am I right in doing this?
And how can I calculate the overall median?

Thanks for your attention. :biggrin:
 
  • #2
I think you are referring to the mean? The median is defined as the central data point via the exclusion of extrema.

Where do the percentages come from?
 
  • #3
adrianmitrea said:
I think you are referring to the mean? The median is defined as the central data point via the exclusion of extrema.

Where do the percentages come from?

I still do not know what best describes the central value of my data.
My data distribution is not normal. Group A is composed primarily of low values and corresponds to 58% of the study area. Group B is composed mainly of high values and corresponds to 42% of the study area. I think the median is the best measure of a central value here; however, I can't simply take the median of the 35 values, because 15 of them correspond to a group with one weight and 20 of them correspond to another group with a different weight.
The percentages come from the study area; I just converted them from m² to %.
 
  • #4
rwf said:
I still do not know what best describes the central value of my data.

What determines whether one way is better than another? If you had two different numbers for "central value", could you yourself say which one was best? Or are you relying on some outside authority (like the editor of a scientific journal) to approve the number you use?
 
  • #5
You have the right idea, but don't divide by two. The weights of .58 and .42 do all the averaging you need, since they total 1.0; no extra division by 2 is required.
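In code form, a minimal sketch of that corrected calculation (the group means below are invented placeholders, not the poster's data):

```python
# Weighted mean of the two group means. The area weights already
# sum to 1.0, so no further division is needed.
mean_A, mean_B = 2.0, 80.0   # hypothetical group means (B roughly 40x A)
w_A, w_B = 0.58, 0.42        # area fractions from the original post

overall_mean = w_A * mean_A + w_B * mean_B
print(overall_mean)          # 34.76
```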
 
  • #6
If you are interested in the mean, then you should weight the two subsample means using the weights .58 and .42. If you are interested in the median, then the procedure is a bit more tricky. You could imagine doing it like this: you have 15 observations in one group and 20 in the other. Duplicate each observation in the first group 20 times and each in the second group 15 times. Now you have two equal-sized "samples" of 300 points each. Now duplicate each point in those groups 58 or 42 times, depending on whether it is in the first group or the second group. Merge them together and take the median. This is a sensible estimate of the median of the original stratified population (it has two strata). You took two samples, one from each stratum, but the sizes of those two samples are not in the same proportion as the sizes of the two strata. Hence you must do a bit of juggling to figure out where the median of the whole population would be, once you have, as it were, projected the two samples onto the two complete population strata.
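For what it's worth, here is a short Python sketch of this duplication trick, with invented lognormal data standing in for the real observations:

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.lognormal(size=15)        # 15 low-valued observations (invented)
group_b = 40 * rng.lognormal(size=20)   # 20 high-valued observations (invented)

# Equalize the sample sizes: 15 * 20 = 20 * 15 = 300 points each.
a_eq = np.repeat(group_a, 20)
b_eq = np.repeat(group_b, 15)

# Re-weight by the strata sizes: 58% vs. 42% of the study area.
pooled = np.concatenate([np.repeat(a_eq, 58), np.repeat(b_eq, 42)])
print(np.median(pooled))   # estimate of the stratified population median
```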
 
  • #7
FactChecker said:
You have the right idea, but don't divide by two. The weights of .58 and .42 do all the averaging you need, since they total 1.0; no extra division by 2 is required.

Yeah, okay I think I got a result that makes sense.
Thanks.

Can I calculate a deviation with those data using this mean?
I cannot understand how to do a standard deviation with weighted data.
 
  • #8
gill1109 said:
If you are interested in the mean, then you should weight the two subsample means using the weights .58 and .42. If you are interested in the median, then the procedure is a bit more tricky. You could imagine doing it like this: you have 15 observations in one group and 20 in the other. Duplicate each observation in the first group 20 times and each in the second group 15 times. Now you have two equal-sized "samples" of 300 points each. Now duplicate each point in those groups 58 or 42 times, depending on whether it is in the first group or the second group. Merge them together and take the median. This is a sensible estimate of the median of the original stratified population (it has two strata). You took two samples, one from each stratum, but the sizes of those two samples are not in the same proportion as the sizes of the two strata. Hence you must do a bit of juggling to figure out where the median of the whole population would be, once you have, as it were, projected the two samples onto the two complete population strata.

Wow, this was tricky! I'll try it and then post the results.
Thanks by the way.
 
  • #9
Stephen Tashi said:
What determines whether one way is better than another? If you had two different numbers for "central value", could you yourself say which one was best? Or are you relying on some outside authority (like the editor of a scientific journal) to approve the number you use?

I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?
 
  • #10
rwf said:
Can I calculate a deviation with those data using this mean?
I cannot understand how to do a standard deviation with weighted data.
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
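A sketch of that calculation in Python, following the formula in the linked reference (the function name and data layout are my own):

```python
import numpy as np

def weighted_std(x, w):
    """Weighted standard deviation with the (M-1)/M adjustment,
    where M is the number of nonzero weights."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.count_nonzero(w)
    xbar = np.average(x, weights=w)              # weighted mean
    var = (w * (x - xbar) ** 2).sum() / (((m - 1) / m) * w.sum())
    return np.sqrt(var)
```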
 
  • #11
rwf said:
I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?

The mean is the average. If the data are very skewed, then the average might not be a very interesting summary statistic. The median is the number such that half the values are smaller and half are larger. We always know what it means, at least. Whether or not it is interesting/important/useful depends on what you want to use it for.

Casinos and insurance companies are interested in the mean because they are interested in the ratio of total income to total expenses. The mean equals the total divided by the number of cases. They have little interest in the median, except perhaps for advertising purposes, e.g. "50% chance of winning a prize!" (mainly a prize worth less than the fee for entering).
 
  • #12
rwf said:
I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?

Your basic problem is that "median" and "mean" are only specific when they refer to a specific random variable. You haven't defined what your random variable is.

You have two batches of emission numbers of different sizes.

If you assume each batch is drawn from the same population of emission numbers, then you can define the random variable as "pick an emission number at random from the population".

If you assume the two batches are from different populations of emission numbers, you could define a random variable as "Pick a population at random, with probability 1/2 of selecting a given population. Then pick a number at random from the population that was selected."

If you assume the two batches are drawn from different populations of emission numbers, you could also define a different random variable as "Pick a population at random, letting the probability of picking a population be proportional to how many samples are in its corresponding batch. Then pick a number at random from the population that was selected."

As a matter of correct speech, we can't "find" the median of a population by doing a computation on sample data. What we are doing is "estimating" the population median. If we have a sample, we can "find" the sample median.
 
  • #13
Stephen Tashi said:
Your basic problem is that "median" and "mean" are only specific when they refer to a specific random variable. You haven't defined what your random variable is.

You have two batches of emission numbers of different sizes.

If you assume each batch is drawn from the same population of emission numbers, then you can define the random variable as "pick an emission number at random from the population".

If you assume the two batches are from different populations of emission numbers, you could define a random variable as "Pick a population at random, with probability 1/2 of selecting a given population. Then pick a number at random from the population that was selected."

If you assume the two batches are drawn from different populations of emission numbers, you could also define a different random variable as "Pick a population at random, letting the probability of picking a population be proportional to how many samples are in its corresponding batch. Then pick a number at random from the population that was selected."

As a matter of correct speech, we can't "find" the median of a population by doing a computation on sample data. What we are doing is "estimating" the population median. If we have a sample, we can "find" the sample median.
You haven't covered the actual case of interest. We are talking about two samples, one from each of two strata (subpopulations) of one population. We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.
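As a sketch of that recipe in Python (the function name and interface are mine, and it assumes equal-probability sampling within each stratum): give each point of sample i probability mass p_i / n_i, merge, and read off whatever quantile you want from the combined empirical distribution.

```python
import numpy as np

def combined_quantile(sample1, sample2, p1, p2, q=0.5):
    """Quantile of the mixture p1*F1 + p2*F2 of the two empirical
    distributions; q=0.5 gives the estimated population median."""
    values = np.concatenate([sample1, sample2])
    weights = np.concatenate([np.full(len(sample1), p1 / len(sample1)),
                              np.full(len(sample2), p2 / len(sample2))])
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    return values[order][np.searchsorted(cdf, q)]
```

This reproduces the duplication trick from post #6 without materializing any of the copies.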
 
  • #14
gill1109 said:
We are talking about two samples, one from each of two strata (subpopulations) of one population.

OK, "we" are, but I don't know about the original poster.

We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

I agree.


In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

I agree that the frequentist approach is to consider p1 and p2 "fixed but unknown" numbers. I like that better than saying they are known.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.

I agree that it has intuitive appeal to use the batch sizes to estimate p1 and p2. But what are the properties of the estimator that you have suggested? Most nice estimation theory deals with estimating means. I'm not saying your estimator is bad, just that I'm unfamiliar with any results about it. For example, is it unbiased?
 
  • #15
FactChecker said:
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
This is a problem studied in stratified population sampling. The book "Sampling Techniques" by Cochran is a standard reference on the subject. But I cannot find the equation from that link in the book. The equation seems to make sense, but I cannot verify it; I am struggling with the notation in the book. If I can find the equation in Cochran, I will post a follow-up.
 
  • #16
Stephen Tashi said:
OK, "we" are, but I don't know about the original poster.



I agree.




I agree that the frequentist approach is to consider p1 and p2 "fixed but unknown" numbers. I like that better than saying they are known.



I agree that it has intuitive appeal to use the batch sizes to estimate p1 and p2. But what are the properties of the estimator that you have suggested? Most nice estimation theory deals with estimating means. I'm not saying your estimator is bad, just that I'm unfamiliar with any results about it. For example, is it unbiased?
I refer back to the original problem of the original poster!

And he tells us that p1 and p2 are known! Do not use the batch sizes to estimate them. The batch sizes are arbitrary. Please read the original post.

Modern statistical theory is about estimating probability distributions. And estimating functionals of probability distributions. The sample average is the same functional of the empirical probability distribution as the population mean is of the true probability distribution. The sample median likewise. Read a modern book on statistics such as J. A. Rice, "Mathematical Statistics and Data Analysis". I see that present-day physicists are taught hardly any statistics or probability. They know less about 20th-century probability and statistics than students of psychology or of various social sciences. Not to mention 21st-century probability and statistics. It's a very weird situation (sorry for the rant).
 
  • #17
gill1109 said:
I refer back to the original problem of the original poster!

And he tells us that p1 and p2 are known! Do not use the batch sizes to estimate them. The batch sizes are arbitrary. Please read the original post.

The original post mentions percentages of areas. I agree we can use those for p1 and p2, provided the random variable of interest involves picking an area to sample with a probability that is proportional to that area. We really should be clear about the definition of the random variable.

Modern statistical theory is about estimating probability distributions. And estimating functionals of probability distributions.

That's good. Hypothesis testing doesn't capture my imagination!

The sample average is the same functional of the empirical probability distribution as the population mean is of the true probability distribution. The sample median likewise.

I grasp an intuitive meaning for that statement - the pre-modern definition of a "consistent" estimator was that it used a form of calculation on the sample data that was similar to the form of calculation used on the population distribution to compute the parameter being estimated.

In the case of a continuous population distribution, we could assume the "empirical probability distribution" is a continuous distribution. "The mean" is a functional whose domain is a set of continuous probability distributions. Hence "the sample mean" and "the population mean" are names denoting a functional evaluated at two different places in its domain. That's a modern approach, but it lacks punch. It just says: "I defined a functional. Hence a reference to the functional evaluated at one place in its domain and a reference to the functional evaluated at another place in its domain are references to the same functional."



Read a modern book on statistics such as J. A. Rice, "Mathematical Statistics and Data Analysis". I see that present-day physicists are taught hardly any statistics or probability. They know less about 20th-century probability and statistics than students of psychology or of various social sciences. Not to mention 21st-century probability and statistics. It's a very weird situation (sorry for the rant).

I'm unfamiliar with the education of present-day physicists, so I'll skip reading an entire book just to see what they are missing!

The description of the current problem would benefit from more physics before any mathematics is applied. As I visualize it, we might be dealing with some sort of cloud of pollution over a city and we want to estimate the spatial mean of the pollution. Were all these emission levels measured at the same time on the clock?
 
  • #18
Stephen Tashi said:
The original post mentions percentages of areas. I agree we can use those for p1 and p2, provided the random variable of interest involves picking an area to sample with a probability that is proportional to that area. We really should be clear about the definition of the random variable.



That's good. Hypothesis testing doesn't capture my imagination!



I grasp an intuitive meaning for that statement - the pre-modern definition of a "consistent" estimator was that it used a form of calculation on the sample data that was similar to the form of calculation used on the population distribution to compute the parameter being estimated.

In the case of a continuous population distribution, we could assume the "empirical probability distribution" is a continuous distribution. "The mean" is a functional whose domain is a set of continuous probability distributions. Hence "the sample mean" and "the population mean" are names denoting a functional evaluated at two different places in its domain. That's a modern approach, but it lacks punch. It just says: "I defined a functional. Hence a reference to the functional evaluated at one place in its domain and a reference to the functional evaluated at another place in its domain are references to the same functional."





I'm unfamiliar with the education of present-day physicists, so I'll skip reading an entire book just to see what they are missing!

The description of the current problem would benefit from more physics before any mathematics is applied. As I visualize it, we might be dealing with some sort of cloud of pollution over a city and we want to estimate the spatial mean of the pollution. Were all these emission levels measured at the same time on the clock?
*The* empirical distribution based on a sample of size N from some population is the probability distribution putting probability mass 1/N on each of the N sample points. It's a technical term, a standard term of modern applied statistics. It's obviously not a continuous distribution, whether or not the underlying true distribution is continuous. Despite this, many functionals of it are decent estimators of the same functional of the underlying distribution.
 
  • #19
gill1109 said:
You haven't covered the actual case of interest. We are talking about two samples, one from each of two strata (subpopulations) of one population. We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.
This tells me to take the weighted average of the individual strata-mean estimates. That was the original question. It seems like everyone is saying the same thing.
 
  • #20
rwf said:
I still do not know what best describes the central value of my data.
The mean is much more stable. The median can be very sensitive to a small change in the percentages. The mode can be very sensitive to a small probability spike. I doubt that those are preferable for most uses. On the other hand, small changes in the percentages of the strata or in the probability function will only give small changes in the mean. I would only use the median or mode if there were some special reason that those characteristics were significant for your application.
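A toy illustration of that sensitivity, with two well-separated strata and weights near 50/50 (all numbers invented):

```python
import numpy as np

low = np.full(100, 1.0)     # stratum A: values near 1
high = np.full(100, 40.0)   # stratum B: values near 40

for p_a in (0.49, 0.51):
    x = np.concatenate([low, high])
    w = np.concatenate([np.full(100, p_a / 100),
                        np.full(100, (1 - p_a) / 100)])
    order = np.argsort(x)
    median = x[order][np.searchsorted(np.cumsum(w[order]), 0.5)]
    print(p_a, np.average(x, weights=w), median)
# The weighted mean moves by less than 1; the median jumps from 40 to 1.
```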
 
  • #21
gill1109 said:
Despite this, many functionals of it are decent estimators of the same functional of the underlying distribution.

It's a digression from the original problem to get into a discussion of what is meant by "the same functional". Perhaps it deserves a separate thread. I'm distracted by the question of how one would define a functional on discrete distributions to be "the same" as a functional defined on continuous distributions. However, as I said, I understand the pre-modern interpretation of what you say.

I repeat my request for rwf to reveal more details about the physics of the problem.
 
  • #22
FactChecker said:
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
I want to correct something in the top answer at the referenced link http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. It says to use weights equal to the number of samples in each stratum divided by the total number of samples. That is wrong. If you know them, use the real weights of each stratum in the whole space. There are many examples (e.g. importance sampling) where the sample numbers are intentionally distorted to get a more accurate answer. In those cases, the real proportion of each stratum should still be used.
 
  • #23
Stephen Tashi said:
It's a digression from the original problem to get into a discussion of what is meant by "the same functional". Perhaps it deserves a separate thread. I'm distracted by the question of how one would define a functional on discrete distributions to be "the same" as a functional defined on continuous distributions. However, as I said, I understand the pre-modern interpretation of what you say.

I repeat my request for rwf to reveal more details about the physics of the problem.

Mathematics. We all know what a probability distribution on the real line is. We all know what the median of a distribution is. We all know what the mean of a distribution is. In modern statistics, the "empirical distribution" of the data is the probability distribution putting probability mass 1/N on each observed data point.

https://en.wikipedia.org/wiki/Empirical_distribution_function

This is completely standard; it's in all the textbooks. This way of thinking motivates and explains so-called bootstrap methods, which are very popular in today's computer age.

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
 
  • #24
I followed another link from the standard deviation link I gave before. The National Institute of Standards and Technology gives the equation for an (unbiased?) estimator of the weighted standard deviation that appeared in the original link. Here is the link http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf I consider this to be authoritative. And the equation makes sense, which is always a bonus.
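For reference, the equation in that NIST note, as I read it, is the following, with $N'$ the number of nonzero weights and $\bar{x}_w$ the weighted mean:

$$s_w = \sqrt{\frac{\sum_{i=1}^{N} w_i \,(x_i - \bar{x}_w)^2}{\frac{N'-1}{N'} \sum_{i=1}^{N} w_i}}$$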

I could not find an equation in Cochran.
 
  • #25
FactChecker said:
I followed another link from the standard deviation link I gave before. The National Institute of Standards and Technology gives the equation for an (unbiased?) estimator of the weighted standard deviation that appeared in the original link. Here is the link http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf I consider this to be authoritative. And the equation makes sense, which is always a bonus.

I could not find an equation in Cochran.

Nowadays we hardly need formulas any more. We can just use the statistical bootstrap

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Let the computer do the work; that is much more reliable than us thinking through or implementing formulas from dusty textbooks, as long as we understand the basic principles.
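A minimal sketch of that bootstrap for the stratified median, resampling within each stratum (the data are invented, and the helper mirrors the quantile sketch earlier in the thread):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.lognormal(size=15)        # stand-ins for the stratum A sample
b = 40 * rng.lognormal(size=20)   # stand-ins for the stratum B sample

def stratified_median(a, b, p_a=0.58, p_b=0.42):
    x = np.concatenate([a, b])
    w = np.concatenate([np.full(a.size, p_a / a.size),
                        np.full(b.size, p_b / b.size)])
    order = np.argsort(x)
    return x[order][np.searchsorted(np.cumsum(w[order]), 0.5)]

# Resample within each stratum and look at the spread of the estimates.
reps = [stratified_median(rng.choice(a, a.size), rng.choice(b, b.size))
        for _ in range(2000)]
print(np.percentile(reps, [2.5, 97.5]))   # rough 95% interval
```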
 
  • #26
gill1109 said:
Nowadays we hardly need formulas any more. We can just use the statistical bootstrap

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Let the computer do the work; that is much more reliable than us thinking through or implementing formulas from dusty textbooks, as long as we understand the basic principles.
1) How do you think those computer methods get programmed? The equation at the government website is the equation their computer program currently implements -- not "dusty old", just common sense.
2) Suppose I wanted to optimize a study where there are several strata with significantly different standard deviations, and the cost of each experiment varies across the strata (a very common occurrence in sampling). The number of samples in each stratum should be adjusted according to cost and stratum population variation. I would use an equation and linear/nonlinear optimization techniques.
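For example, the classical optimal-allocation rule from Cochran's book, n_h proportional to W_h * S_h / sqrt(c_h), can be coded directly; the strata figures below are invented:

```python
import numpy as np

def optimal_allocation(n_total, weights, sds, costs):
    """Cochran's optimal allocation: sample stratum h in proportion
    to W_h * S_h / sqrt(c_h)."""
    share = (np.asarray(weights, float) * np.asarray(sds, float)
             / np.sqrt(np.asarray(costs, float)))
    return n_total * share / share.sum()

# Two invented strata: the second is far more variable but costlier.
print(optimal_allocation(35, [0.58, 0.42], [1.0, 40.0], [1.0, 4.0]))
```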
 

1. What is the definition of average in unweighted data?

The average in unweighted data is a measure of central tendency that represents the typical value of a dataset without considering the relative importance or frequency of each data point. It is calculated by summing all the data points and dividing by the total number of data points.

2. How is the average in unweighted data different from the weighted average?

The average in unweighted data is calculated by giving equal weight to each data point, whereas the weighted average takes into account the importance or frequency of each data point. This means that the weighted average can be more accurate in representing the overall value of a dataset, especially when there are significant variations in the importance or frequency of data points.
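For instance, with numpy (values invented):

```python
import numpy as np

x = np.array([2.0, 3.0, 80.0])
print(np.mean(x))                              # unweighted: 28.33...
print(np.average(x, weights=[0.5, 0.3, 0.2]))  # weighted: 17.9
```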

3. Can the average in unweighted data be affected by outliers?

Yes, the average in unweighted data can be greatly influenced by outliers, which are extreme values that are significantly different from the rest of the data. This is because the average is calculated by summing all the data points, including outliers, which can greatly skew the overall value.

4. How do you interpret the average in unweighted data?

The average in unweighted data represents the central value of a dataset and can be used to compare different datasets or track changes over time. However, it is important to also consider the range and distribution of the data to get a more complete understanding of the dataset.

5. What are some limitations of using the average in unweighted data?

One limitation of using the average in unweighted data is that it does not take into account the variability or spread of the data. This means that datasets with similar averages can have very different distributions and ranges. Additionally, the average may not accurately represent the dataset if it is heavily skewed or if there are outliers present.
