Probability observed value not in range for prediction

SUMMARY

The discussion centers on calculating the probability that a future observed value (F) falls within a specific range (270 to 284) based on a predictive model developed from two normally distributed datasets. The model, a 5th order polynomial fitted using Excel, shows a correlation of 92% between GDP and a Risk score derived from hazardous substance laws. The user calculated the probabilities for the observed values at the boundaries (P(A) for 270 and P(B) for 284) and derived a probability of 89% for F being within the specified range. The conversation emphasizes the importance of understanding the model's equation and the distribution of the error term.

PREREQUISITES
  • Understanding of linear regression and correlation coefficients
  • Knowledge of normal distribution and probability calculations
  • Familiarity with polynomial regression models, specifically 5th order
  • Experience using Excel for trendline fitting and residual analysis
NEXT STEPS
  • Learn about calculating probabilities using normal distribution functions in Python or R
  • Study the implications of error terms in predictive modeling
  • Explore advanced regression techniques, including polynomial regression
  • Investigate the relationship between GDP and regulatory compliance metrics
USEFUL FOR

Data scientists, statisticians, and analysts involved in predictive modeling, particularly those working with economic indicators and regulatory compliance metrics.

Math33

Homework Statement


Hello all, I created a predictive model from a data set of observed values and am looking for a way to express its accuracy as a probability. Data set A (observed) and data set B (predictive model) have a correlation of 84% using linear regression. Data sets A and B are both normally distributed, and every predicted B value is mapped to an assigned A value. Example: the B model produces a score of 420 for a data point, and the closest A score to that is 410. Now, let's say that in the future this data point can be observed (let's call the observed value F). What is the probability that F is between 410 and 420?

Homework Equations


##P(410 < F < 420)##. A and B are two separate normal distributions with different means and standard deviations.

The Attempt at a Solution


I found the probability of A for 410 in the first normal distribution (let's say P(A)=0.56), then I found the probability of B for 420 in the second normal distribution (let's say P(B)=0.67). I then subtracted P(B) - P(A) to get 0.11. Then I subtracted 1 - 0.11 to get 0.89. So the probability that F is going to be in the range of 410 to 420 is 89%. I am not sure if I'm doing this right. Thanks in advance.
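
For comparison, here is a minimal Python sketch of the standard interval calculation. It assumes F follows one single normal distribution, in which case P(lo < F < hi) is the difference of that one distribution's CDF at the two bounds; the CDF values are not taken from two different distributions, and the result is not subtracted from 1. The mean and standard deviation below are hypothetical placeholders, not values from this thread.

```python
import math

def normal_cdf(x, mean, sd):
    """P(X <= x) for X ~ Normal(mean, sd**2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def interval_probability(lo, hi, mean, sd):
    """P(lo < X < hi) for a single normal variable X."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

# Hypothetical parameters for F, for illustration only (not from the thread).
mean_f, sd_f = 400.0, 50.0
p = interval_probability(410.0, 420.0, mean_f, sd_f)
print(f"P(410 < F < 420) = {p:.4f}")
```

Which distribution F should follow is exactly what the model's error term has to settle, so the numbers this prints mean nothing until that is known.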
 
To comment usefully on this we'd need more information. A predictive model usually takes the form of an equation with an error term, like
$$X_{B,i}=X_{A,i}+\varepsilon_i$$
where ##X_{A,i}## and ##X_{B,i}## are the ##i##th ##A## and ##B## values respectively and ##\varepsilon_i## is a random variable called the 'error term', usually independent between different values of ##i##. ##\varepsilon_i## has a known distribution - usually, but not always, normal - which is usually the same for all ##i##.

What is the equation for your model?
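
As an illustration of the error term in the equation above, the sketch below estimates its mean and standard deviation from paired values by taking ##\varepsilon_i = X_{B,i} - X_{A,i}##. The data and the names `observed_a`, `predicted_b` and `residuals` are hypothetical, chosen only to make the example runnable; they are not the poster's data.

```python
import statistics

# Hypothetical paired values, for illustration only (not the poster's data).
observed_a = [410.0, 395.0, 433.0, 402.0, 418.0, 389.0]   # X_A,i
predicted_b = [420.0, 390.0, 428.0, 411.0, 415.0, 396.0]  # X_B,i

def residuals(predicted, observed):
    """Return eps_i = X_B,i - X_A,i for each pair."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return [b - a for a, b in zip(observed, predicted)]

eps = residuals(predicted_b, observed_a)
eps_mean = statistics.mean(eps)   # near zero if the model is unbiased
eps_sd = statistics.stdev(eps)    # sample standard deviation (n - 1)

print(f"mean of residuals = {eps_mean:.3f}")
print(f"sd of residuals   = {eps_sd:.3f}")
```

If the residuals look roughly normal with a mean near zero, their standard deviation is a reasonable first estimate of the spread of the error term.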
 
andrewkirk said:
To comment usefully on this we'd need more information. A predictive model usually takes the form of an equation with an error term, like
$$X_{B,i}=X_{A,i}+\varepsilon_i$$
where ##X_{A,i}## and ##X_{B,i}## are the ##i##th ##A## and ##B## values respectively and ##\varepsilon_i## is a random variable called the 'error term', usually independent between different values of ##i##. ##\varepsilon_i## has a known distribution - usually, but not always, normal - which is usually the same for all ##i##.

What is the equation for your model?

Hi Andrew, thank you for the response. I just got back from work so I don't have the equation in front of me, but it is a 5th order polynomial, so a non-linear model. I fitted the model using the Excel trendline and actually found a correlation of 92%. Yes, I already plotted the error term (residuals) for each predicted value, and for N=40 data points they showed a good random pattern with virtually no correlation.

What my model is trying to do is find the relationship between the adoption of a specific hazardous substance law that exists around the globe and the GDP of any given country. As background, I created a metric that assigns a Risk score based on how much of this hazardous substance law a country adopts. I did this for many countries that already have the law and generated a Risk score for each of them. My 92% correlation is between GDP and Risk score for existing countries that have the law. What I want to do is predict what the hazardous substance law is going to be for a country that might adopt it in the future. So let's say a country F that doesn't have this law gets a Risk score of 284 from the model, and the country with the closest score that already has the law has a Risk score of 270. Both the predicted and actual data sets are normally distributed. So I am trying to find out, once country F adopts the law, what the probability is that the final Risk score will be between 270 and 284. Thank you for your time.
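
One way to frame this question is sketched below in Python, under an assumption the thread does not confirm: that the final Risk score equals the model's prediction (284) plus a zero-mean normal error. The error standard deviations tried are hypothetical; in practice the value would come from the model's residuals. The function name `prob_observed_in_range` is also just illustrative.

```python
from statistics import NormalDist

def prob_observed_in_range(prediction, lo, hi, error_sd):
    """P(lo < F < hi), assuming F ~ Normal(prediction, error_sd**2)."""
    f = NormalDist(mu=prediction, sigma=error_sd)
    return f.cdf(hi) - f.cdf(lo)

# Hypothetical residual standard deviations, for illustration only.
for sd in (5.0, 14.0, 40.0):
    p = prob_observed_in_range(284.0, 270.0, 284.0, sd)
    print(f"error sd = {sd:5.1f}  ->  P(270 < F < 284) = {p:.4f}")
```

Under this assumption the prediction sits at the upper end of the range, so only the lower half of the error distribution can land inside it, and the probability cannot exceed 0.5 whatever the error spread is.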
 
