Average of unweighted data

In summary: Given the distribution of the data, the median may be a better measure of the central value than the mean in this case. As for calculating the standard deviation of weighted data, there is a specific formula that applies the weights to the sum of squared deviations. You can find this formula and more information on how to calculate a weighted standard deviation here: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. Keep in mind that there is also an adjustment factor that needs to be applied to the equation to account for the fact that the mean is estimated from a sample rather than being the true mean.
  • #1
rwf
I have some data from gas emission samples. Part of the data was sampled in a region with characteristic A, and the rest in a region with characteristic B. However, each region corresponds to a different percentage of the total area.


I want to calculate the central value (average or median) of unbalanced data with different weights, with the following characteristics:
- I have 35 observations in total, with 15 observations in group A and 20 in group B.
- The 20 observations of group B have a much higher average (approx. 40x higher) than group A.
- Group A corresponds to 58% and group B to 42% of my study area.


I used a weighted average to calculate the overall average:
((Average Group A * 0.58) + (0.42 * Average Group B)) / 2.
Am I right in doing this?
And how can I calculate the overall median?

Thanks for your attention. :biggrin:
 
  • #2
I think you are referring to the mean? The median is defined as the central data point via the exclusion of extrema.

Where do the percentages come from?
 
  • #3
adrianmitrea said:
I think you are referring to the mean? The median is defined as the central data point via the exclusion of extrema.

Where do the percentages come from?

I still do not know what best describes the central value of my data.
My data distribution is not normal. Group A is composed primarily of low values and corresponds to 58% of the study area. Group B is composed mainly of high values and corresponds to 42% of the study area. I think the median is the best measure of a central value here; however, I can't simply take the median of the 35 values, because 15 of them correspond to a group with one weight and 20 of them correspond to another group with a different weight.
The percentages come from the study area; I just converted them from m² to %.
 
  • #4
rwf said:
I still do not know what best describes the central value of my data.

What determines whether one way is better than another? If you had two different numbers for "central value", could you yourself say which one was best? Or are you relying on some outside authority (like the editor of a scientific journal) to approve the number you use?
 
  • #5
You have the right idea, but don't divide by two. The weights of .58 and .42 do all the averaging you need, since they total 1.0; no extra division by 2 is required.
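In code form, a minimal sketch of that corrected calculation (the group means below are invented placeholders, not the poster's data):

```python
# Weighted mean of the two group means. The area weights already
# sum to 1.0, so no further division is needed.
mean_A, mean_B = 2.0, 80.0   # hypothetical group means (B roughly 40x A)
w_A, w_B = 0.58, 0.42        # area fractions from the original post

overall_mean = w_A * mean_A + w_B * mean_B
print(overall_mean)          # 34.76
```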
 
  • #6
If you are interested in the mean, then you should weight the two subsample means using the weights .58 and .42. If you are interested in the median, then the procedure is a bit more tricky. You could imagine doing it like this: you have 15 observations in one group and 20 in the other. Duplicate each observation in the first group 20 times and each in the second group 15 times. Now you have two equal-sized "samples" of 300 points each. Now duplicate each point in those groups 58 or 42 times, depending on whether it is in the first group or the second group. Merge them together and take the median. This is a sensible estimate of the median of the original stratified population (it has two strata). You took two samples, one from each stratum, but the sizes of those two samples are not in the same proportion as the sizes of the two strata. Hence you must do a bit of juggling to figure out where the median of the whole population would be, once you have, as it were, projected the two samples onto the two complete population strata.
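For what it's worth, here is a short Python sketch of this duplication trick, with invented lognormal data standing in for the real observations:

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.lognormal(size=15)        # 15 low-valued observations (invented)
group_b = 40 * rng.lognormal(size=20)   # 20 high-valued observations (invented)

# Equalize the sample sizes: 15 * 20 = 20 * 15 = 300 points each.
a_eq = np.repeat(group_a, 20)
b_eq = np.repeat(group_b, 15)

# Re-weight by the strata sizes: 58% vs. 42% of the study area.
pooled = np.concatenate([np.repeat(a_eq, 58), np.repeat(b_eq, 42)])
print(np.median(pooled))   # estimate of the stratified population median
```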
 
  • #7
FactChecker said:
You have the right idea, but don't divide by two. The weights of .58 and .42 do all the averaging you need, since they total 1.0; no extra division by 2 is required.

Yeah, okay I think I got a result that makes sense.
Thanks.

Can I calculate a deviation with those data using this mean?
I cannot understand how to do a standard deviation with weighted data.
 
  • #8
gill1109 said:
If you are interested in the mean, then you should weight the two subsample means using the weights .58 and .42. If you are interested in the median, then the procedure is a bit more tricky. You could imagine doing it like this: you have 15 observations in one group and 20 in the other. Duplicate each observation in the first group 20 times and each in the second group 15 times. Now you have two equal-sized "samples" of 300 points each. Now duplicate each point in those groups 58 or 42 times, depending on whether it is in the first group or the second group. Merge them together and take the median. This is a sensible estimate of the median of the original stratified population (it has two strata). You took two samples, one from each stratum, but the sizes of those two samples are not in the same proportion as the sizes of the two strata. Hence you must do a bit of juggling to figure out where the median of the whole population would be, once you have, as it were, projected the two samples onto the two complete population strata.

Wow, this was tricky! I'll try it and then post the results.
Thanks by the way.
 
  • #9
Stephen Tashi said:
What determines whether one way is better than another? If you had two different numbers for "central value", could you yourself say which one was best? Or are you relying on some outside authority (like the editor of a scientific journal) to approve the number you use?

I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?
 
  • #10
rwf said:
Can I calculate a deviation with those data using this mean?
I cannot understand how to do a standard deviation with weighted data.
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
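A sketch of that calculation in Python, following the formula in the linked reference (the function name and data layout are my own):

```python
import numpy as np

def weighted_std(x, w):
    """Weighted standard deviation with the (M-1)/M adjustment,
    where M is the number of nonzero weights."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.count_nonzero(w)
    xbar = np.average(x, weights=w)              # weighted mean
    var = (w * (x - xbar) ** 2).sum() / (((m - 1) / m) * w.sum())
    return np.sqrt(var)
```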
 
  • #11
rwf said:
I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?

The mean is the average. If the data are very skewed, then the average might not be a very interesting summary statistic. The median is the number such that half the values are smaller and half are larger. We always know what it means, at least. Whether or not it is interesting/important/useful depends on what you want to use it for.

Casinos and insurance companies are interested in the mean because they are interested in the ratio of total income to total expenses. The mean equals the total divided by the number of cases. They have little interest in the median, except perhaps for advertising purposes, e.g. "50% chance of winning a prize!" (mainly a prize worth less than the fee for entering).
 
  • #12
rwf said:
I cannot, but with very low values accumulated at one end and very high values accumulated at the other, I believe that the median would be ideal, right?

Your basic problem is that "median" and "mean" are only specific when they refer to a specific random variable. You haven't defined what your random variable is.

You have two batches of emission numbers of different sizes.

If you assume each batch is drawn from the same population of emission numbers, then you can define the random variable as "pick an emission number at random from the population".

If you assume the two batches are from different populations of emission numbers, you could define a random variable as "Pick a population at random, with probability 1/2 of selecting a given population. Then pick a number at random from the population that was selected."

If you assume the two batches are drawn from different populations of emission numbers, you could also define a different random variable as "Pick a population at random, letting the probability of picking a population be proportional to how many samples are in its corresponding batch. Then pick a number at random from the population that was selected."

As a matter of correct speech, we can't "find" the median of a population by doing a computation on sample data. What we are doing is "estimating" the population median. If we have a sample, we can "find" the sample median.
 
  • #13
Stephen Tashi said:
Your basic problem is that "median" and "mean" are only specific when they refer to a specific random variable. You haven't defined what your random variable is.

You have two batches of emission numbers of different sizes.

If you assume each batch is drawn from the same population of emission numbers, then you can define the random variable as "pick an emission number at random from the population".

If you assume the two batches are from different populations of emission numbers, you could define a random variable as "Pick a population at random, with probability 1/2 of selecting a given population. Then pick a number at random from the population that was selected."

If you assume the two batches are drawn from different populations of emission numbers, you could also define a different random variable as "Pick a population at random, letting the probability of picking a population be proportional to how many samples are in its corresponding batch. Then pick a number at random from the population that was selected."

As a matter of correct speech, we can't "find" the median of a population by doing a computation on sample data. What we are doing is "estimating" the population median. If we have a sample, we can "find" the sample median.
You haven't covered the actual case of interest. We are talking about two samples, one from each of two strata (subpopulations) of one population. We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.
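As a sketch of that recipe in Python (the function name and interface are mine, and it assumes equal-probability sampling within each stratum): give each point of sample i probability mass p_i / n_i, merge, and read off whatever quantile you want from the combined empirical distribution.

```python
import numpy as np

def combined_quantile(sample1, sample2, p1, p2, q=0.5):
    """Quantile of the mixture p1*F1 + p2*F2 of the two empirical
    distributions; q=0.5 gives the estimated population median."""
    values = np.concatenate([sample1, sample2])
    weights = np.concatenate([np.full(len(sample1), p1 / len(sample1)),
                              np.full(len(sample2), p2 / len(sample2))])
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    return values[order][np.searchsorted(cdf, q)]
```

This reproduces the duplication trick from post #6 without materializing any of the copies.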
 
  • #14
gill1109 said:
We are talking about two samples, one from each of two strata (subpopulations) of one population.

OK, "we" are, but I don't know about the original poster.

We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

I agree.


In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

I agree that the frequentist approach is to consider p1 and p2 "fixed but unknown" numbers. I like that better than saying they are known.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.

I agree that it has intuitive appeal to use the batch sizes to estimate p1 and p2. But what are the properties of the estimator that you have suggested? Most nice estimation theory deals with estimating means. I'm not saying your estimator is bad, just that I'm unfamiliar with any results about it. For example, is it unbiased?
 
  • #15
FactChecker said:
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
This is a problem studied in stratified population sampling. The book "Sampling Techniques" by Cochran is a standard reference on the subject. But I cannot find the equation from that link in the book. The equation seems to make sense, but I cannot verify it; I am struggling with the notation in the book. If I can find the equation in Cochran, I will post a follow-up.
 
  • #16
Stephen Tashi said:
OK, "we" are, but I don't know about the original poster.



I agree.




I agree that the frequentist approach is to consider p1 and p2 "fixed but unknown" numbers. I like that better than saying they are known.



I agree that it has intuitive appeal to use the batch sizes to estimate p1 and p2. But what are the properties of the estimator that you have suggested? Most nice estimation theory deals with estimating means. I'm not saying your estimator is bad, just that I'm unfamiliar with any results about it. For example, is it unbiased?
I refer back to the original problem of the original poster!

And he tells us that p1 and p2 are known! Do not use the batch sizes to estimate them. The batch sizes are arbitrary. Please read the original post.

Modern statistical theory is about estimating probability distributions. And estimating functionals of probability distributions. The sample average is the same functional of the empirical probability distribution as the population mean is of the true probability distribution. The sample median likewise. Read a modern book on statistics such as J. A. Rice, "Mathematical Statistics and Data Analysis". I see that present-day physicists are taught hardly any statistics or probability. They know less about 20th-century probability and statistics than students of psychology or of various social sciences. Not to mention 21st-century probability and statistics. It's a very weird situation (sorry for the rant).
 
  • #17
gill1109 said:
I refer back to the original problem of the original poster!

And he tells us that p1 and p2 are known! Do not use the batch sizes to estimate them. The batch sizes are arbitrary. Please read the original post.

The original post mentions percentages of areas. I agree we can use those for p1 and p2, provided the random variable of interest involves picking an area to sample with a probability that is proportional to that area. We really should be clear about the definition of the random variable.

Modern statistical theory is about estimating probability distributions. And estimating functionals of probability distributions.

That's good. Hypothesis testing doesn't capture my imagination!

The sample average is the same functional of the empirical probability distribution as the population mean is of the true probability distribution. The sample median likewise.

I grasp an intuitive meaning for that statement - the pre-modern definition of a "consistent" estimator was that it used a form of calculation on the sample data that was similar to the form of calculation used on the population distribution to compute the parameter being estimated.

In the case of a continuous population distribution, we could assume the "empirical probability distribution" is a continuous distribution. "The mean" is a functional whose domain is a set of continuous probability distributions. Hence "the sample mean" and "the population mean" are names denoting a functional evaluated at two different places in its domain. That's a modern approach, but it lacks punch. It just says: "I defined a functional. Hence a reference to the functional evaluated at one place in its domain and a reference to the functional evaluated at another place in its domain are references to the same functional."



Read a modern book on statistics such as J. A. Rice, "Mathematical Statistics and Data Analysis". I see that present-day physicists are taught hardly any statistics or probability. They know less about 20th-century probability and statistics than students of psychology or of various social sciences. Not to mention 21st-century probability and statistics. It's a very weird situation (sorry for the rant).

I'm unfamiliar with the education of present-day physicists, so I'll skip reading an entire book just to see what they are missing!

The description of the current problem would benefit from more physics before any mathematics is applied. As I visualize it, we might be dealing with some sort of cloud of pollution over a city and we want to estimate the spatial mean of the pollution. Were all these emission levels measured at the same time on the clock?
 
  • #18
Stephen Tashi said:
The original post mentions percentages of areas. I agree we can use those for p1 and p2, provided the random variable of interest involves picking an area to sample with a probability that is proportional to that area. We really should be clear about the definition of the random variable.



That's good. Hypothesis testing doesn't capture my imagination!



I grasp an intuitive meaning for that statement - the pre-modern definition of a "consistent" estimator was that it used a form of calculation on the sample data that was similar to the form of calculation used on the population distribution to compute the parameter being estimated.

In the case of a continuous population distribution, we could assume the "empirical probability distribution" is a continuous distribution. "The mean" is a functional whose domain is a set of continuous probability distributions. Hence "the sample mean" and "the population mean" are names denoting a functional evaluated at two different places in its domain. That's a modern approach, but it lacks punch. It just says: "I defined a functional. Hence a reference to the functional evaluated at one place in its domain and a reference to the functional evaluated at another place in its domain are references to the same functional."





I'm unfamiliar with the education of present-day physicists, so I'll skip reading an entire book just to see what they are missing!

The description of the current problem would benefit from more physics before any mathematics is applied. As I visualize it, we might be dealing with some sort of cloud of pollution over a city and we want to estimate the spatial mean of the pollution. Were all these emission levels measured at the same time on the clock?
*The* empirical distribution based on a sample of size N from some population is the probability distribution putting probability mass 1/N on each of the N sample points. It's a technical term, a standard term of modern applied statistics. It's obviously not a continuous distribution, whether or not the underlying true distribution is continuous. Despite this, many functionals of it are decent estimators of the same functional of the underlying distribution.
 
  • #19
gill1109 said:
You haven't covered the actual case of interest. We are talking about two samples, one from each of two strata (subpopulations) of one population. We are interested in properties of the whole population (perhaps the median, perhaps the mean, perhaps something else). The two strata are of different sizes. The two samples are of different sizes. The ratio between the sample sizes is not the same as the ratio of the strata sizes. So in order to "project" the sample findings to the population, you have to take careful note of *both* of these ratios.

In other words, the probability distribution of interest is the distribution obtained by picking completely at random one element either from sub-population 1 or from sub-population 2 with (known) probabilities p1 and p2. He has two samples, one from each sub-population, of sizes n1 and n2. The ratio n1:n2 is quite different from the ratio p1:p2.

No problem. I've explained what to do. Use each sample to estimate the unknown probability distribution of its subpopulation via the empirical distribution. Then combine the empirical distributions in the correct ratio p1 to p2. Then compute the mean or the median or whatever you are interested in from your final estimate of the population distribution.
This tells me to take the weighted average of the individual strata-mean estimates. That was the original question. It seems like everyone is saying the same thing.
 
  • #20
rwf said:
I still do not know what best describes the central value of my data.
The mean is much more stable. The median can be very sensitive to a small change in the percentages. The mode can be very sensitive to a small probability spike. I doubt that those are preferable for most uses. On the other hand, small changes in the percentages of the strata or in the probability function will only give small changes in the mean. I would only use the median or mode if there were some special reason that those characteristics were significant for your application.
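A toy illustration of that sensitivity, with two well-separated strata and weights near 50/50 (all numbers invented):

```python
import numpy as np

low = np.full(100, 1.0)     # stratum A: values near 1
high = np.full(100, 40.0)   # stratum B: values near 40

for p_a in (0.49, 0.51):
    x = np.concatenate([low, high])
    w = np.concatenate([np.full(100, p_a / 100),
                        np.full(100, (1 - p_a) / 100)])
    order = np.argsort(x)
    median = x[order][np.searchsorted(np.cumsum(w[order]), 0.5)]
    print(p_a, np.average(x, weights=w), median)
# The weighted mean moves by less than 1; the median jumps from 40 to 1.
```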
 
  • #21
gill1109 said:
Despite this, many functionals of it are decent estimators of the same functional of the underlying distribution.

It's a digression from the original problem to get into a discussion of what is meant by "the same functional". Perhaps it deserves a separate thread. I'm distracted by the question of how one would define a functional on discrete distributions to be "the same" as a functional defined on continuous distributions. However, as I said, I understand the pre-modern interpretation of what you say.

I repeat my request for rwf to reveal more details about the physics of the problem.
 
  • #22
FactChecker said:
Yes, you can use this mean to calculate the sum of squared deviations and then apply the weights to those sums for each group. Here is a reference for the formula: http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. There is also a factor of (M-1)/M in the equation, which adjusts for the fact that the mean is estimated from the sample rather than being the true mean.
I want to correct something in the top answer at the referenced link http://stats.stackexchange.com/questions/6534/how-do-i-calculate-a-weighted-standard-deviation-in-excel. It says to use weights equal to the number of samples in each stratum divided by the total number of samples. That is wrong. If you know them, use the real weights of each stratum in the whole space. There are many examples (e.g. importance sampling) where the sample numbers are intentionally distorted to get a more accurate answer. In those cases, the real proportion of each stratum should still be used.
 
  • #23
Stephen Tashi said:
It's a digression from the original problem to get into a discussion of what is meant by "the same functional". Perhaps it deserves a separate thread. I'm distracted by the question of how one would define a functional on discrete distributions to be "the same" as a functional defined on continuous distributions. However, as I said, I understand the pre-modern interpretation of what you say.

I repeat my request for rwf to reveal more details about the physics of the problem.

Mathematics. We all know what a probability distribution on the real line is. We all know what the median of a distribution is. We all know what the mean of a distribution is. In modern statistics, the "empirical distribution" of the data is the probability distribution putting probability mass 1/N on each observed data point.

https://en.wikipedia.org/wiki/Empirical_distribution_function

This is completely standard; it's in all the textbooks. This way of thinking motivates and explains so-called bootstrap methods, which are very popular in today's computer age.

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
 
  • #24
I followed another link from the standard deviation link I gave before. The National Institute of Standards and Technology gives the equation for an (unbiased?) estimator of the weighted standard deviation that appeared in the original link. Here is the link http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf I consider this to be authoritative. And the equation makes sense, which is always a bonus.
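For reference, the equation in that NIST note, as I read it, is the following, with $N'$ the number of nonzero weights and $\bar{x}_w$ the weighted mean:

$$s_w = \sqrt{\frac{\sum_{i=1}^{N} w_i \,(x_i - \bar{x}_w)^2}{\frac{N'-1}{N'} \sum_{i=1}^{N} w_i}}$$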

I could not find an equation in Cochran.
 
  • #25
FactChecker said:
I followed another link from the standard deviation link I gave before. The National Institute of Standards and Technology gives the equation for an (unbiased?) estimator of the weighted standard deviation that appeared in the original link. Here is the link http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf I consider this to be authoritative. And the equation makes sense, which is always a bonus.

I could not find an equation in Cochran.

Nowadays we hardly need formulas any more. We can just use the statistical bootstrap

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Let the computer do the work; that is much more reliable than us thinking through or implementing formulas from dusty textbooks, as long as we understand the basic principles.
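A minimal sketch of that bootstrap for the stratified median, resampling within each stratum (the data are invented, and the helper mirrors the quantile sketch earlier in the thread):

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.lognormal(size=15)        # stand-ins for the stratum A sample
b = 40 * rng.lognormal(size=20)   # stand-ins for the stratum B sample

def stratified_median(a, b, p_a=0.58, p_b=0.42):
    x = np.concatenate([a, b])
    w = np.concatenate([np.full(a.size, p_a / a.size),
                        np.full(b.size, p_b / b.size)])
    order = np.argsort(x)
    return x[order][np.searchsorted(np.cumsum(w[order]), 0.5)]

# Resample within each stratum and look at the spread of the estimates.
reps = [stratified_median(rng.choice(a, a.size), rng.choice(b, b.size))
        for _ in range(2000)]
print(np.percentile(reps, [2.5, 97.5]))   # rough 95% interval
```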
 
  • #26
gill1109 said:
Nowadays we hardly need formulas any more. We can just use the statistical bootstrap

https://en.wikipedia.org/wiki/Bootstrapping_(statistics)

Let the computer do the work; that is much more reliable than us thinking through or implementing formulas from dusty textbooks, as long as we understand the basic principles.
1) How do you think those computer methods get programmed? The equation at the government website is the equation their computer program currently implements -- not "dusty old", just common sense.
2) Suppose I wanted to optimize a study where there are several strata with significantly different standard deviations, and the cost of each experiment varies across the strata (a very common occurrence in sampling). The number of samples in each stratum should be adjusted according to cost and stratum population variation. I would use an equation and linear/nonlinear optimization techniques.
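For example, the classical optimal-allocation rule from Cochran's book, n_h proportional to W_h * S_h / sqrt(c_h), can be coded directly; the strata figures below are invented:

```python
import numpy as np

def optimal_allocation(n_total, weights, sds, costs):
    """Cochran's optimal allocation: sample stratum h in proportion
    to W_h * S_h / sqrt(c_h)."""
    share = (np.asarray(weights, float) * np.asarray(sds, float)
             / np.sqrt(np.asarray(costs, float)))
    return n_total * share / share.sum()

# Two invented strata: the second is far more variable but costlier.
print(optimal_allocation(35, [0.58, 0.42], [1.0, 40.0], [1.0, 4.0]))
```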
 

1. What is the definition of average in unweighted data?

The average in unweighted data is a measure of central tendency that represents the typical value of a dataset without considering the relative importance or frequency of each data point. It is calculated by summing all the data points and dividing by the total number of data points.

2. How is the average in unweighted data different from the weighted average?

The average in unweighted data is calculated by giving equal weight to each data point, whereas the weighted average takes into account the importance or frequency of each data point. This means that the weighted average can be more accurate in representing the overall value of a dataset, especially when there are significant variations in the importance or frequency of data points.
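For instance, with numpy (values invented):

```python
import numpy as np

x = np.array([2.0, 3.0, 80.0])
print(np.mean(x))                              # unweighted: 28.33...
print(np.average(x, weights=[0.5, 0.3, 0.2]))  # weighted: 17.9
```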

3. Can the average in unweighted data be affected by outliers?

Yes, the average in unweighted data can be greatly influenced by outliers, which are extreme values that are significantly different from the rest of the data. This is because the average is calculated by summing all the data points, including outliers, which can greatly skew the overall value.

4. How do you interpret the average in unweighted data?

The average in unweighted data represents the central value of a dataset and can be used to compare different datasets or track changes over time. However, it is important to also consider the range and distribution of the data to get a more complete understanding of the dataset.

5. What are some limitations of using the average in unweighted data?

One limitation of using the average in unweighted data is that it does not take into account the variability or spread of the data. This means that datasets with similar averages can have very different distributions and ranges. Additionally, the average may not accurately represent the dataset if it is heavily skewed or if there are outliers present.
