Statistical modeling and relationship between random variables

  • #1
fog37
TL;DR Summary
Statistical modeling and relationship between 3 random variables and 2 random variables
In statistical modeling, the goal is to come up with a model that describes the relationship between random variables. A function of random variables is also a random variable.
We could have three random variables, ##Y##, ##X##, ##\epsilon## with the r.v. ##Y## given by ##Y=b_1 X + b_2 + \epsilon## where ##b_1, b_2## are constants. The expectation value of ##Y## is simply ##E[Y|X] = b_1 E[X]+ b_2 + E[\epsilon]## with ##E[\epsilon]=0##. This is what simple linear regression is about. A note: an author wrote ##E[Y;X]## instead of ##E[Y|X]##, stating that it is not really a conditional expectation value, but I am not sure about the difference...

But in most textbooks, the variable ##X## is generally said to not be a random variable but a deterministic one...Why? Clearly, that would simplify the expectation value of ##Y## to ##E[Y|X] = b_1 X+ b_2##.

On the other hand, when ##X## is also a r.v., we need to know its expectation value ##E[X]## in order to proceed. How would we get ##E[X]## from the sample data?

For example, in practice, if we asked 50 random people, out of a population, their height ##Y## and age ##X##, both age and height would be r.v., correct? That seems the most common scenario for linear regression. What kind of situation would instead have ##X## be deterministic? Maybe if we search from the beginning for people of specific ages and then ask them their height? In that case, we planned what the values of the variable ##X## would be... But in many other cases, it seems that both variables would commonly be random. How would we then handle the analysis?
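On the last question, the usual answer is that ##E[X]## is estimated by the sample mean of the observed ##x_i## values. A minimal Python sketch, with made-up ages standing in for real survey data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: ages of 50 randomly chosen people (made-up data).
ages = rng.uniform(18, 80, size=50)

# The standard estimator of E[X] is the sample mean of the observed values.
mean_age = ages.mean()
```

The same sample would also give the sample variance of ##X##, which is what the usual regression formulas actually consume.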
 
  • #2
fog37 said:
The expectation value of ##Y## is simply ##E[Y|X] = b_1 E[X]+ b_2 + E[\epsilon]## with ##E[\epsilon]=0##.
Be careful about this. ##E[Y|X]## is a function of ##X## whereas the right side is just a single number.
fog37 said:
This is what simple linear regression is about. A note: an author wrote ##E[Y;X]## instead of ##E[Y|X]##, stating that it is not really a conditional expectation value, but I am not sure about the difference...
What author? If I am going to say something that is contradicted by the author, then I would like to know what all the surrounding text said.
fog37 said:
But in most textbooks, the variable ##X## is generally said to not be a random variable but a deterministic one...Why?
Clearly, that would simplify the expectation value of ##Y## to ##E[Y|X] = b_1 X+ b_2##.
That equation is to be used with a particular value of ##X##. How ##X## got to have that value, whether deterministically or randomly, does not matter.
fog37 said:
On the other hand, when ##X## is also a r.v., we need to know its expectation value ##E[X]## in order to proceed.
Suppose that ##X_1, X_2, \dots, X_n## are random variables, and more statistical analysis needs to be done with them. That is a more complicated situation. Some variables might be correlated, others might be independent. Added: (actually you need to do this even if the ##X_i##s are all deterministic.)
fog37 said:
How would we get ##E[X]## from the sample data?

For example, in practice, if we asked 50 random people, out of a population, their height ##Y## and age ##X##, both age and height would be r.v. , correct? That seems the most common scenario for linear regression. What kind of situation would instead have ##X## to be deterministic?
You would normally collect a sample, ##(x_1,y_1), (x_2,y_2), \dots, (x_m,y_m)##. Would it matter whether the ##x_i##s came from a random variable? If you have several input variables ##X_i## and want to do a more detailed analysis of their relationship with ##Y##, then you would first need to address the issue of correlated ##X## variables. Added: (actually you need to do this even if the ##X_i##s are all deterministic.)
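The point that it does not matter how the ##x_i## got their values can be checked numerically. The sketch below uses hypothetical data with true line ##y = 2x + 5## and fits the same least-squares line twice: once to a fixed design and once to randomly drawn ##x## values.

```python
import numpy as np

rng = np.random.default_rng(1)

b1_true, b0_true = 2.0, 5.0  # assumed "true" coefficients for the simulation

# Case 1: x values chosen deterministically (a fixed design).
x_fixed = np.linspace(0, 10, 50)
# Case 2: x values drawn at random from a population.
x_random = rng.uniform(0, 10, size=50)

def fit(x):
    """Simulate y = b1*x + b0 + noise and fit a degree-1 least-squares line."""
    y = b1_true * x + b0_true + rng.normal(0, 0.5, size=x.size)
    # np.polyfit returns coefficients highest degree first: [slope, intercept].
    return np.polyfit(x, y, 1)

slope_f, intercept_f = fit(x_fixed)
slope_r, intercept_r = fit(x_random)
```

In both cases the estimates recover the same line; the fitting procedure only sees the realized pairs, not the mechanism that produced the ##x_i##.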
 
  • #3
I think this is more a matter of convention than anything. Bayesian statistics tends to treat everything as a random variable and assign it prior probability distributions. So you would model $$y\sim \mathcal N (b_1 X + b_0, \sigma)$$ You would not just treat ##y## as a random variable, but everything else too. Each would have its own prior distribution.
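As a rough illustration of that idea (a simplified sketch, not the exact model above): give the coefficients a Gaussian prior, assume the noise variance is known, and the posterior over ##(b_1, b_0)## is then available in closed form. All numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from y = 2*x + 5 + noise (made-up values).
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 5.0 + rng.normal(0, 1.0, size=100)

# Design matrix with an intercept column: columns are (x, 1).
X = np.column_stack([x, np.ones_like(x)])

sigma2 = 1.0   # noise variance, assumed known for this sketch
tau2 = 100.0   # prior variance for the weights: (b1, b0) ~ N(0, tau2 * I)

# Conjugate Gaussian posterior over the weights, in closed form.
A = X.T @ X / sigma2 + np.eye(2) / tau2
post_cov = np.linalg.inv(A)
post_mean = post_cov @ X.T @ y / sigma2

b1_hat, b0_hat = post_mean
```

With a broad prior the posterior mean lands close to the least-squares estimates; the difference is that here the coefficients come with a full posterior distribution (`post_cov`), not just point values.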
 
  • #4
I believe a "deterministic" variable is just one that is not random, such as the date, say year-wise. You can control your choice of dates when plotting, say, inflation vs. year, or record high jump vs. year. Notice you can regress a random variable against a deterministic one, but correlation is not defined in that case.
 
  • #5
Dale said:
I think this is more a matter of convention than anything. Bayesian statistics tends to treat everything as a random variable and assign it prior probability distributions. So you would model $$y\sim \mathcal N (b_1 X + b_0, \sigma)$$ You would not just treat ##y## as a random variable, but everything else too. Each one would have their own prior distribution.
Some authors, maybe frequentists, define correlation in terms of conditional expectation (##E(Y|X)##, when regressing Y on X). Is this done with the Bayesian approach?
 
  • #6
WWGD said:
Some authors, maybe frequentists, define correlation in terms of conditional expectation (##E(Y|X)##, when regressing Y on X). Is this done with the Bayesian approach?
I haven’t seen that as a definition, but Bayesians use the same computations to actually calculate correlations as frequentists do.
 
  • #7
Dale said:
I haven’t seen that as a definition, but Bayesians use the same computations to actually calculate correlations as frequentists do.
Thanks, I was also wondering whether regressing Y on X is seen, or described, as the conditional expectation of Y given X, i.e., ##E[Y|X]##.
 
  • #8
WWGD said:
Thanks, I was also wondering whether regressing Y on X is seen, or described, as the conditional expectation of Y given X, i.e., ##E[Y|X]##.
Sort of. It is not just the conditional expectation; you get the entire conditional distribution. So you can get the expectation of the conditional distribution, but you can also get any other measure, such as the variance or anything else you like.
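A small numerical sketch of that point: under the linear-Gaussian model ##Y = b_1 X + b_0 + \epsilon## with ##\epsilon \sim N(0, \sigma)##, the conditional distribution ##Y \mid X = x## is ##N(b_1 x + b_0, \sigma)##, so both its mean and its spread are recoverable. The parameter values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

b1, b0, sigma = 2.0, 5.0, 1.5  # illustrative parameter values

# Under the linear-Gaussian model, the full conditional distribution is
#   Y | X = x  ~  Normal(b1*x + b0, sigma).
x = 4.0
draws = b1 * x + b0 + rng.normal(0, sigma, size=200_000)

cond_mean = draws.mean()   # approaches b1*x + b0 = 13
cond_sd = draws.std()      # approaches sigma = 1.5
```

The conditional expectation is just one summary of this distribution; its standard deviation, quantiles, or tail probabilities come from the same object.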
 

1. What is statistical modeling?

Statistical modeling is a mathematical framework used to understand the relationships between variables and to make predictions or informed decisions based on data. It involves constructing models that represent the process by which data is generated, allowing for an understanding of the structure and dynamics of the data. These models can be used to summarize patterns, make predictions, and simulate potential outcomes.

2. What are random variables in statistics?

In statistics, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables: discrete and continuous. Discrete random variables can take on a countable number of distinct values, such as the number of heads in a series of coin tosses. Continuous random variables, on the other hand, can take on an infinite number of different values within a given range, such as the height of individuals in a population.

3. How do you determine the relationship between random variables?

The relationship between random variables can be determined using various statistical methods, including correlation and regression analysis. Correlation measures the strength and direction of a linear relationship between two random variables. Regression analysis, on the other hand, describes how one variable depends on one or more other variables, providing a more detailed model of relationships, including the ability to predict values of one variable based on the others.
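The two tools can be compared on the same made-up data set: correlation returns a single number in ##[-1, 1]##, while regression returns a fitted line that can be used for prediction.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up paired data with a linear relationship plus noise.
x = rng.uniform(0, 10, size=500)
y = 3.0 * x + 1.0 + rng.normal(0, 2.0, size=500)

# Correlation: a single number measuring the strength and direction
# of the linear association between x and y.
r = np.corrcoef(x, y)[0, 1]

# Regression: a fitted line (slope and intercept) that predicts y from x.
slope, intercept = np.polyfit(x, y, 1)
```

With a strong linear signal, ##r## is close to 1, while the regression additionally recovers how much ##y## changes per unit of ##x##.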

4. What is the difference between correlation and causation?

Correlation and causation are often confused, but they describe very different concepts. Correlation between two variables simply means that the variables tend to change together, but it does not imply that one variable causes the change in the other. Causation, on the other hand, means that one variable actually causes the change in the other. Establishing causation requires more rigorous experimental or observational evidence where confounding factors are controlled.

5. What are the common statistical models used to analyze the relationship between variables?

Common statistical models used to analyze relationships between variables include linear regression, logistic regression, and multivariate regression. Linear regression is used for predicting a continuous dependent variable based on one or more independent variables. Logistic regression is used when the dependent variable is categorical, commonly for binary outcomes. Multivariate regression involves multiple dependent variables being predicted simultaneously. Each of these models helps in understanding how variables are interconnected and can be used to predict outcomes.
