Should I be treating the data I have as a Population or Sample?

SumDood_ · Mar 26, 2024

A study on strength properties of high-performance concrete obtained by using super-plasticizers and certain binders recorded the following data on flexural strength (in mega-pascals, MPa) from 28 tests:
6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6, 6.8, 7.6, 9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0,
11.6, 11.3, 11.9, 10.3.
a) Find the mean and standard deviation of these 28 strengths.
Mean = 8.04 MPa

b) Discuss which central tendency measures are appropriate for this data set, and which are inappropriate.
Mean and median are appropriate measures, but mode is not. Is this correct? I don't know how to justify my answer.

FactChecker · Mar 26, 2024

SumDood_ said:

TL;DR Summary: The question is fairly easy. What I am not sure about is whether I am supposed to treat this data as a sample or a population.

The data you have is not the entire set of every possible result. It's a sample.
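For part (a), treating the 28 values as a sample means the standard deviation divides by n - 1 rather than n. Here is a minimal sketch with Python's standard `statistics` module; the variable names are my own, and it computes both versions so you can see how little they differ at this sample size:

```python
import statistics

# Flexural strengths (MPa) from the 28 tests in the problem statement
strengths = [
    6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6, 6.8, 7.6,
    9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0, 11.6, 11.3, 11.9, 10.3,
]

n = len(strengths)
mean = statistics.mean(strengths)
sample_sd = statistics.stdev(strengths)       # divides by n - 1
population_sd = statistics.pstdev(strengths)  # divides by n

print(f"n = {n}, mean = {mean:.2f} MPa")
print(f"sample SD = {sample_sd:.3f} MPa, population SD = {population_sd:.3f} MPa")
```

The sample value is always the larger of the two, by a factor of sqrt(n / (n - 1)), which is under 2% for n = 28.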

SumDood_ said:

b) Discuss which central tendency measures are appropriate for this data set, and which are inappropriate.
Mean and median are appropriate measures, but mode is not. Is this correct? I don't know how to justify my answer.

A sample can have its mode occur at two widely separated values. It could also have several values that are closely tied, so that taking more samples can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?

WWGD · Mar 26, 2024

That's sample data. If you knew the population parameters you wouldn't need to conduct tests.

SumDood_ · Mar 27, 2024

FactChecker said:

The data you have is not the entire set of every possible result. It's a sample.

A sample can have its mode occur at two widely separated values. It could also have several values that are closely tied, so that taking more samples can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?

When taking a sample from a uniform distribution, every real number within 0 and 1 has equal probability of being drawn. I would think that there would be no mode, as no 2 values would be the same.

PeroK · Mar 27, 2024

SumDood_ said:

When taking a sample from a uniform distribution, every real number within 0 and 1 has equal probability of being drawn. I would think that there would be no mode, as no 2 values would be the same.

Technically the mode is only defined for a discrete distribution. You could consider the mode on a uniform distribution of random numbers to a fixed number of decimal places.
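The rounding idea can be tried out directly. The sketch below is only an illustration under my own assumptions (200 draws per sample, rounding to one decimal place, five arbitrary seeds, and a hypothetical helper `rounded_mode`); it is not part of the original problem:

```python
import random
from collections import Counter

def rounded_mode(sample, decimals=1):
    """Most frequent value after rounding; ties go to the first value seen."""
    counts = Counter(round(x, decimals) for x in sample)
    return counts.most_common(1)[0][0]

# Hypothetical experiment: five independent samples of 200 draws from U(0, 1)
modes = []
for seed in range(5):
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(200)]
    # With unrounded floats, repeated values are practically impossible
    distinct = len(set(sample))
    modes.append(rounded_mode(sample))
    print(f"seed {seed}: {distinct} distinct raw values, rounded mode = {modes[-1]}")
```

Since every rounded value is roughly equally likely, the winning value is decided by chance fluctuations and will typically change from one sample to the next.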

PeroK · Mar 27, 2024

WWGD said:

That's sample data. If you knew the population parameters you wouldn't need to conduct tests.

It's possible to test an entire population.

SumDood_ · Mar 27, 2024

PeroK said:

Technically the mode is only defined for a discrete distribution. You could consider the mode on a uniform distribution of random numbers to a fixed number of decimal places.

Then, does that mean that the mode becomes a relevant central tendency measure? I would say it isn't because it does not provide much information about the sample. If we already know the probability of each sample being drawn is equal, then what benefit does the mode value provide?

PeroK · Mar 27, 2024

SumDood_ said:

Then, does that mean that the mode becomes a relevant central tendency measure? I would say it isn't because it does not provide much information about the sample. If we already know the probability of each sample being drawn is equal, then what benefit does the mode value provide?

The question is whether the mode is relevant to this concrete sample?

Alternatively, you could consider the criteria for which the mode is relevant, and whether your sample meets those criteria?

SumDood_ · Mar 27, 2024

PeroK said:

The question is whether the mode is relevant to this concrete sample?

Alternatively, you could consider the criteria for which the mode is relevant, and whether your sample meets those criteria?

Well, yes. That is what I am trying to find out. How does one determine whether any value of central tendency is appropriate or not for a given sample?

PeroK · Mar 27, 2024

SumDood_ said:

How does one determine whether any value of central tendency is appropriate or not for a given sample?

Perhaps that's a good question. Ultimately, it's a matter of experience and intelligence. In general, if the data is spread out over a relatively large number of possible values, each of which occurs at most a couple of times, then the mode is not very useful.

More generally, I would say the mode is not often that useful. Although, you may want to think of an example where it is.
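To check that criterion against the concrete data, you can simply count how often each value occurs. A quick sketch using `collections.Counter` (the variable names are mine):

```python
from collections import Counter

# Flexural strengths (MPa) from the 28 tests in the problem statement
strengths = [
    6.1, 5.6, 7.1, 7.3, 6.6, 8.0, 6.8, 6.6, 7.6, 6.8, 6.7, 6.6, 6.8, 7.6,
    9.3, 8.2, 8.7, 7.7, 9.3, 6.9, 8.1, 10.0, 7.5, 8.0, 11.6, 11.3, 11.9, 10.3,
]

counts = Counter(strengths)
top = max(counts.values())
# Every value that attains the highest frequency
modes = sorted(value for value, count in counts.items() if count == top)

print(f"{len(counts)} distinct values among {len(strengths)} tests")
print(f"highest frequency = {top}, attained by {modes}")
```

There are 21 distinct values among the 28 tests, and the highest frequency is only 3, shared by 6.6 and 6.8 MPa. Both sit well below the mean of 8.04 MPa, and one more test could easily change which value "wins".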

PeroK · Mar 27, 2024

PS the point @FactChecker made was that the median and the mean should be similar for any significant sample. But, in this case, the mode is fairly random. And, therefore, tells you very little.

WWGD · Mar 27, 2024

Median is used when you have outliers, as it's not greatly affected by them, unlike the mean, which is. As in, "Bill Gates, you and I have an average net worth of 500 billion"*

* Not literally.
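A tiny numerical version of the same point, with made-up net worths (the dollar figures are purely illustrative, not real data):

```python
import statistics

# Hypothetical net worths in dollars; the figures are illustrative only
ordinary = [30_000, 45_000]
with_outlier = ordinary + [100_000_000_000]

for label, data in [("without outlier", ordinary), ("with outlier", with_outlier)]:
    print(f"{label}: mean = {statistics.mean(data):,.0f}, "
          f"median = {statistics.median(data):,.0f}")
```

Adding the single huge value moves the mean by many orders of magnitude, while the median only shifts from 37,500 to 45,000.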

FactChecker · Mar 27, 2024

WWGD said:

Median is used when you have outliers, as it's not greatly affected by them, unlike the mean, which is. As in, "Bill Gates, you and I have an average net worth of 500 billion"*

* Not literally.

Good point! That is also an advantage of the median. I see the median used often in exactly the situation of your example. For instance, this.

EDIT: Sorry, I read your "median" and thought "mode". I just repeated what you said about median.
This post can be deleted if you want.

WWGD · Mar 27, 2024

FactChecker said:

Good point! That is also an advantage of the median. I see the median used often in exactly the situation of your example. For instance, this.

EDIT: Sorry, I read your "median" and thought "mode". I just repeated what you said about median.
This post can be deleted if you want.

On average, I am on track: Bill Gates and some 50,000 of us. By contrast, if the median between FactChecker and WWGD is, say, $30,000 (the value with half the data on each side, IIRC), then it won't be much different if we include William Gates III.

Sargon38 · Mar 28, 2024

FactChecker said:

A sample can have its mode occur at two widely separated values. It could also have several values that are closely tied, so that taking more samples can make the mode jump around significantly.
As an extreme example, consider the uniform distribution on the real line between 0 and 1. What is its mode? What kind of behavior could you expect for the mode of a sample from that distribution?

This is true. However, that can, to a certain point, also be the case of the median when you have few entries in your sample.

Essentially, there are two different issues with "central tendency" measures. The first is whether the concept of central tendency itself has any significance for the given distribution, and the second is the effect of statistical errors due to a small sample.

To address the first, if one even talks about a "central tendency", most of the time, one ASSUMES that the measured quantity is somehow "lumped" around a central value. This usually comes down to assuming that the distribution is a "single bump". If your distribution is made up of several "bumps", then the very notion of central tendency is questionable. For instance, if you're talking about body size of a mixed population of rats and dogs, where you have two or more bumps, namely one "around the average size of a rat" and then "several around the average sizes of different dog breeds, from a Chihuahua to a Saint Bernard", what conceptual value could a central tendency measure actually have?

So, even before considering WHAT central tendency measure could possibly be useful, the notion itself of central tendency must have a meaning, which includes the hypothesis that values are somehow "lumped around a central value", which comes down to assuming that the distribution is a "lump".

Once we make that hypothesis, three different estimators, namely sample mean, sample median, and sample mode, have different behaviours according to different properties of the original distribution and the sampling method.

If we have a small sample, this means two things:
1) we will have big statistical errors
2) the probability of getting "outliers" is small

In that case, the mean is the most reliable estimator, because it filters the statistical noise best. On a small sample, the heights of the different bins in a histogram are noisy, and the "highest one" could be relatively far away from the "central one" because of these fluctuations, so the mode is not appropriate. Also, the median is one of the sample values (or, for an even sample size, the midpoint of the two middle ones), and if you don't have many samples, the chance that it lies close to the "good" value is not very high either.

The bigger your sample gets, the smaller the statistical noise and the better the other two estimators, mode and median, become. The mean, on the other hand, can get worse if there are outliers (that is, if the original distribution has "long tails"). See the Bill Gates example. Mode and median are not affected by rare outliers.
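The noise part of this argument can be illustrated with a small simulation. This is only a sketch under my own assumptions: a standard normal population (a single symmetric bump with no long tails, so it says nothing about outliers), a histogram mode with a fixed bin width of 0.5, and 2000 repetitions per sample size. The helper names are hypothetical.

```python
import random
import statistics
from collections import Counter

def histogram_mode(sample, bin_width=0.5):
    """Centre of the fullest histogram bin (ties go to the first bin seen)."""
    bins = Counter(int(x // bin_width) for x in sample)
    fullest = bins.most_common(1)[0][0]
    return (fullest + 0.5) * bin_width

def estimator_spread(n, repeats=2000, seed=0):
    """Standard deviation of each estimator over repeated samples of size n."""
    rng = random.Random(seed)
    results = {"mean": [], "median": [], "mode": []}
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        results["mean"].append(statistics.fmean(sample))
        results["median"].append(statistics.median(sample))
        results["mode"].append(histogram_mode(sample))
    return {name: statistics.stdev(values) for name, values in results.items()}

for n in (10, 200):
    spread = estimator_spread(n)
    print(n, {name: round(s, 3) for name, s in spread.items()})
```

A smaller spread means a more stable estimator. All three should tighten as n grows, with the mean the most stable for this tail-free population. The mode cannot get more precise than its bin width allows.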

As to the median versus the mode, this will depend on the actual shape of the "bump". If the shape of the bump is "well-peaked", the mode may be a very good estimator. If however the bump is "flat-topped", then the median will do better. See the "uniform number distribution" example.

The median has the extra advantage of being a sample value (or the midpoint of two sample values when the sample size is even), while the precision of the mode depends on the bin size you choose for your histogram. If the distribution is rather symmetric, then both are good estimators. If your distribution is asymmetric, then you should think about why you need a central tendency. The mode will be closer to what has the highest probability of happening; the median will be closer to "half has more, half has less".
