Sampling theory and random sample

  • Context: Undergrad
  • Thread starter: fog37
  • Tags: Sampling

Discussion Overview

The discussion revolves around the interpretation of random samples in inferential statistics, particularly focusing on whether a random sample can be viewed as realizations of a single random variable or as realizations of multiple independent random variables. The conversation touches on concepts relevant to regression analysis and the implications of these interpretations in statistical methods.

Discussion Character

  • Debate/contested
  • Conceptual clarification
  • Technical explanation

Main Points Raised

  • Some participants propose that a random sample can be interpreted as realizations of a single random variable, while others argue that it is more accurately described as realizations of multiple independent random variables.
  • A participant questions whether the two interpretations are equivalent and seeks clarification on the implications of each interpretation.
  • Another participant notes that the first interpretation assumes identical population distributions, while the second allows for variations in distributions, particularly in contexts like cluster analysis or stratified sampling.
  • One participant highlights that the distinction between a sample and the random variables that generated it is significant, especially in scenarios involving data collection stages or Bayesian methods.
  • There is mention of a specific source that discusses how the second interpretation allows for sample statistics to also be treated as random variables.

Areas of Agreement / Disagreement

Participants express differing views on the equivalence of the two interpretations of random samples. Some see no practical difference, while others believe the distinctions are important in certain statistical contexts. The discussion remains unresolved regarding which interpretation is more technically correct.

Contextual Notes

Participants acknowledge that the interpretations depend on specific statistical contexts and assumptions, such as the independence of random variables and the nature of the population distribution. There is also mention of potential implications for various statistical methods, including Bayesian approaches and bootstrap methods.

fog37
TL;DR: Sampling theory and inference
In inferential statistics, we have a large population, collect data from it to get a random sample of size ##n##, and infer the population parameters from that single sample.

I read that the random sample can be interpreted as the collection of ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable, and the height of each individual in the random sample is a realization of that r.v. However, a more correct interpretation of a random sample is the following: each element of the random sample, for example each of the 5 heights ##[6, 5.4, 6.1, 5.5, 6.4]##, is the realization of a different random variable. So the random sample is the realization of a random vector, a sequence of i.i.d. random variables ##[X_1, X_2, X_3, X_4, X_5]## with a joint probability distribution ##f(x_1, x_2, x_3, x_4, x_5)##. Why is this the correct interpretation of the random sample and not the first one with a single r.v.? Are the two interpretations somehow equivalent to each other? How?
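
One way to make the link between the two descriptions explicit is to write out the joint distribution. This is a sketch under the i.i.d. assumption stated above; the symbol ##f_X## for the common marginal density of ##X## is introduced here only for illustration. Independence lets the joint density factorize, and identical distribution makes every factor the same function:

$$f(x_1, x_2, x_3, x_4, x_5) = \prod_{i=1}^{5} f_X(x_i)$$

Under that assumption the distribution of the whole vector is determined by the distribution of the single random variable ##X##, so both descriptions assign the same probabilities to any observed sample.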

When we perform regression analysis on a random sample of data, are we dealing with a pair of random variables, ##X## and ##Y##, i.e. a 2D random vector ##Z=(X,Y)##? Or with two random vectors, ##X=[X_1, X_2, X_3, X_4, X_5]## and ##Y=[Y_1, Y_2, Y_3, Y_4, Y_5]##, where each ##x## value is the realization of a different random variable ##X_i## and each ##y## value is the realization of a different random variable ##Y_i##?
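
A small sketch may help with the regression question. It assumes a made-up linear model (slope 2, intercept 1, normal noise), chosen purely for illustration, and shows that the same ##n \times 2## array of data can be read by row, as ##n## realizations of the pair ##Z=(X,Y)##, or by column, as one realization each of the vectors ##[X_1, \dots, X_n]## and ##[Y_1, \dots, Y_n]##.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Hypothetical data-generating model (an assumption for illustration only):
# X ~ Normal(0, 1) and Y = 2*X + 1 + noise
x = rng.normal(0.0, 1.0, size=n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=n)

data = np.column_stack([x, y])   # shape (n, 2)

# Reading by row: n realizations z_i = (x_i, y_i) of the pair Z = (X, Y)
rows = [tuple(row) for row in data]

# Reading by column: one realization of (X_1..X_n) and one of (Y_1..Y_n)
x_vec, y_vec = data[:, 0], data[:, 1]

# The least-squares fit only sees the numbers, so both readings give the same line
slope, intercept = np.polyfit(x_vec, y_vec, deg=1)
print(slope, intercept)
```

The fitted line does not depend on which reading is used; the choice only matters for how the probability model behind the numbers is written down.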

Thank you as always for any comment and correction.
 
fog37 said:
I read that the random sample can be interpreted as the collection of ##n## realizations of a single random variable ##X##. For example, the height ##H## of individuals in a population can be defined as a random variable, and the height of each individual in the random sample is a realization of that r.v. However, a more correct
Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every observation is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that more than one distribution is involved in the sample.

CORRECTION: I missed the IID part of the description of the second interpretation. I see no practical difference between the two interpretations.
 
FactChecker said:
Is "more correct" your phrase or theirs? A restriction of the first interpretation is that every observation is assumed to come from the same population distribution. If the intent is to study things like cluster analysis, importance sampling, or stratified sampling, then there is some freedom to say that more than one distribution is involved in the sample.
Well, I have found this interpretation in several places. For example:
[Attached image: 1704940604794.png]

The population is an infinite set of values drawn from a random variable ##X##. Sampling from a population is the same as repeatedly drawing new values from ##X##. A random sample of size ##n## is a collection of individual draws from ##X##.

The point seems to be that ##n## independent draws from a random variable ##X## are equivalent to one draw of ##n## i.i.d. random variables ##X_1, X_2, \dots, X_n##. Is that really the case? Can you help me appreciate why the two scenarios are equivalent?
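
The claimed equivalence can be checked numerically. The following is only a sketch, assuming a normal population with made-up parameters (mean 5.8, standard deviation 0.4): it builds many samples of size 5 in both ways and compares the coordinate-wise means, standard deviations, and correlations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                    # sample size
reps = 5_000             # number of repeated samples
mu, sigma = 5.8, 0.4     # hypothetical population parameters (assumed)

# View 1: n successive independent draws from the single random variable X
samples1 = np.array([[rng.normal(mu, sigma) for _ in range(n)]
                     for _ in range(reps)])

# View 2: one draw of the random vector (X_1, ..., X_n), whose components are
# independent (diagonal covariance) and identically distributed
samples2 = rng.multivariate_normal(mean=np.full(n, mu),
                                   cov=sigma**2 * np.eye(n),
                                   size=reps)

# Each coordinate has (approximately) the same mean and spread under both views
print(samples1.mean(axis=0), samples2.mean(axis=0))
print(samples1.std(axis=0), samples2.std(axis=0))

# Off-diagonal correlations are near zero in both cases (independence)
corr1 = np.corrcoef(samples1, rowvar=False)
corr2 = np.corrcoef(samples2, rowvar=False)
off_diag = ~np.eye(n, dtype=bool)
print(np.abs(corr1[off_diag]).max(), np.abs(corr2[off_diag]).max())
```

Up to simulation noise the two arrays are statistically indistinguishable, which is what the i.i.d. assumption implies: the two views differ in bookkeeping, not in the probability law of the sample.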
 
Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.
 
FactChecker said:
Sorry. I missed the IID part of the second interpretation. I see no practical difference between the two. So I wonder where you read that the second interpretation was better.
Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the second interpretation allows the sample statistics to also be random variables, as they are...

So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality described in two different ways. Is one more technically correct than the other? As mentioned, in regression analysis it seems more natural to treat each pair of ##x## and ##y## values in the random sample as a realization of the two random variables ##X## and ##Y##, rather than as realizations of two sequences of random variables, one for the ##x## values and one for the ##y## values...

For example, when a die is tossed multiple times, is the outcome of each toss a realization of a single random variable, or are the outcomes realizations of different random variables?

Thank you!

 
fog37 said:
Thank you FactChecker for your support. Let me share with you this stats.stackexchange.com answer:
https://stats.stackexchange.com/questions/368492/about-sampling-and-random-variables/368517#368517

The response by shadowtalker discusses how the second interpretation allows the sample statistics to also be random variables, as they are...
I agree. It is a distinction that I have probably been careless about in the past. There is a difference between a sample, which is an already collected set of data, and the random variables that gave you that sample. I think it is standard to use lower case (##x_i##) for the data and upper case (##X_i##) for the random variables.
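
The lower case versus upper case distinction can be sketched in code. The realized heights are the ones quoted earlier in the thread; the population parameters (mean 5.9, standard deviation 0.4) are assumed for illustration and are not estimated from anything.

```python
import numpy as np

rng = np.random.default_rng(2)

# Lower case: an already collected sample x_1, ..., x_5
x = np.array([6.0, 5.4, 6.1, 5.5, 6.4])
x_bar = x.mean()            # a fixed number once the data are in hand

# Upper case: the sample mean as a function of X_1, ..., X_5, before observation.
# Hypothetical population Normal(5.9, 0.4); these parameters are an assumption.
def draw_sample_mean(n=5):
    return rng.normal(5.9, 0.4, size=n).mean()

x_bar_draws = np.array([draw_sample_mean() for _ in range(10_000)])

print(x_bar)                # the realized value of the statistic
print(x_bar_draws.std())    # spread of the statistic, close to 0.4 / sqrt(5)
```

The number `x_bar` has no distribution, while `x_bar_draws` approximates the sampling distribution of the statistic; talking about a standard error only makes sense for the upper case version.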
fog37 said:
So why are the two interpretations really identical? Would you mind sharing your thought process? It is the same random reality described in two different ways. Is one more technically correct than the other?
IMO, one situation where the distinction is significant is when data is collected in stages, so that some data has been collected while other data has not yet been collected and is still a random variable. You might see this in stopping problems. Suppose that you were doing an experiment where collecting data was expensive or difficult and you needed to decide whether to collect more. I also think that the distinction would be significant in many Bayesian methods with prior and posterior distributions, and in bootstrap methods.
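
As a rough illustration of the staged-collection case, here is a minimal sketch with one simple stopping rule (stop once the estimated standard error of the mean drops below a target). The rule, the target, and the Normal(5.9, 0.4) population are all assumptions made for the example, not a recommended procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

target_se = 0.1     # hypothetical precision target
max_n = 200         # hypothetical budget on the number of observations

collected = []      # realized values x_1, ..., x_k
while len(collected) < max_n:
    # Before this line X_{k+1} is still a random variable;
    # after it, x_{k+1} is just a number in the list.
    collected.append(rng.normal(5.9, 0.4))
    k = len(collected)
    if k >= 2:
        se = np.std(collected, ddof=1) / np.sqrt(k)
        if se <= target_se:
            break

print(len(collected), np.mean(collected))
```

At every stage the decision uses the realized values collected so far, while the not yet collected observations are still random. The final sample size is itself random, which is hard to express if the sample is thought of only as a fixed list of numbers.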
I have no real experience with these types of problems and will have to leave this discussion to others.
 
