Can you combine two transition probability matrices?

  • #51
Bayesian probability is a fancy way of saying that your parameters have a distribution: in other words, your parameters are a random variable. That's all it is.

It just makes the highest generalization possible and it is useful not just in an abstract way, but in a very applied way as well.

I don't know how you can do the Dirichlet prior calculations in Excel, but you could always build the function from its definition, using either a VBA routine or a single formula entry in the spreadsheet.

The Wikipedia page gives the Dirichlet PDF:

http://en.wikipedia.org/wiki/Dirichlet_distribution

If you write a VBA function or some other similar routine to calculate the above, then you can calculate probabilities and moments (like expectation and variance).
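
As a rough illustration of the routine described above, here is a minimal sketch in Python rather than VBA. The function names are made up for this example, and the formulas are the standard Dirichlet density, mean and variance from the Wikipedia page linked above.

```python
import math


def dirichlet_pdf(x, alpha):
    """Density of Dirichlet(alpha) at the point x, which must lie on the simplex."""
    if len(x) != len(alpha):
        raise ValueError("x and alpha must have the same length")
    # Outside the open simplex the density is taken to be zero.
    if any(xi <= 0.0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        return 0.0
    # Work in logs so that large alpha values do not overflow the gamma function.
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return math.exp(log_norm + log_kernel)


def dirichlet_mean(alpha):
    """E[X_i] = alpha_i / alpha_0, where alpha_0 is the sum of all alphas."""
    a0 = sum(alpha)
    return [a / a0 for a in alpha]


def dirichlet_variance(alpha):
    """Var[X_i] = alpha_i (alpha_0 - alpha_i) / (alpha_0^2 (alpha_0 + 1))."""
    a0 = sum(alpha)
    return [a * (a0 - a) / (a0 ** 2 * (a0 + 1.0)) for a in alpha]
```

For instance, `dirichlet_mean([2.0, 3.0, 5.0])` gives the expected probability of each of three outcomes under that prior.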
 
  • #52
Hello Chiro,

Could I ask you a question/get your opinion about modelling with a copula function?

I have 3 variables:

1. Time of departure from the home in the morning
2. Total distance traveled in the day
3. The number of journeys made in the day

This is their correlation matrix

https://dl.dropbox.com/u/54057365/All/matrix.JPG
I've seen technical papers modelling variables with correlations of 0.3-0.4.

Variables 1 and 2 are continuous and variable 3 is discrete (1-11 journeys).

My question is, can you model discrete and continuous data with a copula?

I fitted an empirical copula to the data as I found parametric copulas were not modelling the correlation structure well.

Here is a graph of the fitted copula and raw data for variables 2 and 3. The raw data is red and the fitted copula data is blue.

https://dl.dropbox.com/u/54057365/All/copula.JPG

You'll notice 11 "lines" of red data points corresponding to the 11 discrete journey numbers.

My question is would you consider the correlation structure to be well modeled here?

You'll notice that there are no blue data points in the top left (a good thing), but there are blue data points between the red lines.

Appreciate your comments

Thanks

John
 
  • #53
In terms of your copula function, I think you'll need to repost the specifics for the Copula function and its associated constraints: it's been a little while since we talked about it (in this and similar threads) so if you could post your query with more specific constraints I'll try and address those.
 
  • #54
Hi,

I tried fitting a Normal copula to the data but it was not modelling the correlation structure well, between the discrete variable (number of journeys) and the 2 continuous variables mentioned above. The normal copula did however model well the correlation between the 2 continuous variables.

Here is the output of the Normal copula for the discrete variable (number of journeys) and a continuous variable (total distance traveled in the day). The correlation structure is not well modeled.

Note the red data are the raw data points and blue points are the generated data by the copula.

https://dl.dropbox.com/u/54057365/All/empirical.JPG


So then I fitted an empirical copula, described here
http://www.vosesoftware.com/ModelRiskHelp/index.htm#Help_on_ModelRisk/Fitting/Vose_Empirical_Copula.htm

Here is a graph of the fitted copula and raw data for the same 2 variables as above. The raw data is red and the fitted copula data is blue.

https://dl.dropbox.com/u/54057365/All/copula.JPG

You can see that it models the correlation structure better.

My main question is, can you model discrete and continuous data within the same copula?

My other question is would you consider the correlation structure between the 2 variables to be well modeled here given their Pearson's correlation is 0.38 from the matrix?

Thanks
 
  • #55
You should be able to do this provided it treats the distribution definitions in the right manner in the algorithm.

Mixed distributions occur quite frequently (particularly in insurance statistics), and if you want to model, say, a multivariate distribution where one variable is discrete and the other continuous, then you do the same sort of thing as a normal Cartesian product.

As an example, let's say you have a normal distribution and a discrete uniform distribution: the Cartesian product of these sets would look like a "staircase normal", where you would have five sets of normal distributions side by side, each being one slice for the corresponding discrete event.

Provided your algorithm has treated the data correctly, then this won't be a problem at all.

In fact if this is done correctly, all later statistical techniques should work properly.

You would have to check that the algorithm treats the multivariate distribution function as it should in both cases: mixed distributions (continuous and discrete parts in the same distribution), and joint distributions that combine discrete and continuous random variables in all possible combinations.
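
To make the mixed discrete/continuous case concrete, here is a minimal sketch of sampling one continuous and one discrete variable from a Gaussian copula. It is not the Vose empirical copula discussed above. The correlation value, the normal margin for distance and the journey-count probabilities are all hypothetical, chosen only for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z-scores into uniforms


def discrete_inverse_cdf(u, values, probs):
    """Return the smallest value whose cumulative probability is >= u."""
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u <= cumulative:
            return value
    return values[-1]  # guards against floating-point round-off in the sum


def sample_mixed_pair(rho, distance_dist, journey_values, journey_probs, rng):
    """Draw (distance, journeys) with Gaussian-copula dependence rho."""
    # Two standard normals with correlation rho.
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    # Map to uniforms; clamp so that inv_cdf never receives exactly 0 or 1.
    u1 = min(max(STD_NORMAL.cdf(z1), 1e-12), 1.0 - 1e-12)
    u2 = STD_NORMAL.cdf(z2)
    # Each margin is recovered through its own inverse CDF.
    distance = distance_dist.inv_cdf(u1)
    journeys = discrete_inverse_cdf(u2, journey_values, journey_probs)
    return distance, journeys
```

Because the discrete margin is recovered through its own step-shaped inverse CDF, the simulated journey counts in this sketch are always whole numbers, so simulated points fall on the same "lines" as the raw data.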
 
  • #56
Hello Chiro,

Can I ask you a quick question? It is a bit unusual.

In my previous post I posted a graph of a fitted copula and raw data for 2 variables, number of journeys and distance travelled.

This was it:

https://dl.dropbox.com/u/54057365/All/copula.JPG

This is raw data on its own.
https://dl.dropbox.com/u/54057365/All/data.JPG

Would you be able to explain what makes the copula generate the values between the red "lines"? One might expect the generated blue data to be in "lines" also similar to the red raw data.

I think I know the answer myself, is it because one of the variables is continuous and those points between the lines actually do model the correlation structure? That is not a very good explanation though is it.

Thanks

John
 
  • #57
Hello Chiro,

I have a problem that I'm finding difficult to solve. Hopefully you could offer some insight.

I have a copula function that generates the total distance traveled in a day (e.g. 40 km) and the number of journeys (e.g. 4).

The question is how to calculate the distances of the individual journeys.

Originally, I was doing the following:

Sampling distance x1 from the distribution of journey distances made on days where the total distance traveled was 40 km:

f(x1 | Distances on days where total distance traveled = 40)

Then I sample distance x2 from f(x2 | Distances on days where total distance traveled = 40 - x1)

Then sample x3 from f(x3 | Distances on days where total distance traveled = 40 - x1 - x2)

Then x4 = total distance - x1 - x2 - x3

The problem with this is that journey distances don't "make sense".

For example, if you travel 2 km to the shop or to work, chances are that your next journey will be 2 km in order to return home. But this is not always the case: you could stop off on the way. For example, x1 could be 5 km, x2 could be 2 km and then x3 could be 7 km (5 + 2).

Could you suggest a better approach? Would there be a way to look at the relationships between consecutive journey distances, and some way of sampling them? I have all the data.

Appreciate your comments

J
 
  • #58
For this you will need to consider what the distribution is for an individual journey given by the data.

So you will need to look at the conditional expectation of the journey distance for all possible journeys in a single day (you have mentioned four). This is basically E_y[E_x[X|Y]] = E[X], which is known as the law of total expectation:

http://en.wikipedia.org/wiki/Law_of_total_expectation

So you are trying to find E[X] for all possible conditional information relative to the choice of Y (which is the number of possible journeys in one day given your data). The formulas for this are just the formulas for expectation (and if the data is in an Excel spreadsheet, convert it to a binned PDF and use that formula).
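
A small sketch of that calculation, assuming the data is a list of (number of journeys that day, distance of one journey) records. The record layout and the function name are assumptions made for this example.

```python
from collections import defaultdict


def total_expectation(records):
    """Return E[X | Y = y], P(Y = y) and the recombined E[X].

    Each record is a pair (y, x): y is the number of journeys made that day
    and x is the distance of one individual journey on that day.
    """
    groups = defaultdict(list)
    for y, x in records:
        groups[y].append(x)
    n = len(records)
    # Conditional mean of X within each group of Y.
    conditional_means = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    # Empirical probability of each Y value, counted per journey record.
    weights = {y: len(xs) / n for y, xs in groups.items()}
    # Law of total expectation: E[X] = sum over y of P(Y = y) * E[X | Y = y].
    overall = sum(weights[y] * conditional_means[y] for y in groups)
    return conditional_means, weights, overall
```

The recombined value should match the plain average of all journey distances, which is a quick check that the conditional means were computed consistently.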
 
  • #59
Hello Chiro,

Could I ask you a question? You have been very helpful in the past.

I'm trying to compare the overall similarity of journeys based on some statistics, for example average velocity and acceleration.

Each journey has for example 4 measurable attributes with equal weights.

I have some baseline statistics and some comparative statistics from other journeys.

The objective is to determine a measure of how similar the other journeys are to the baseline journey.

Would you be able to suggest a suitable measure?

Can you take an average of the percentage differences?
Could you use the norm of the difference vector (journey1 - base)?
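
Both ideas can be sketched in a few lines. The example below assumes each journey is a list of equally weighted attributes with non-zero baseline values. Scaling each difference by the baseline value is an assumption added here so that attributes in different units (velocity, acceleration and so on) become comparable; the function names are hypothetical.

```python
import math


def mean_abs_percentage_difference(journey, base):
    """Average of |journey_i - base_i| / |base_i|, expressed in percent."""
    ratios = [abs(j - b) / abs(b) for j, b in zip(journey, base)]
    return 100.0 * sum(ratios) / len(ratios)


def relative_difference_norm(journey, base):
    """Euclidean norm of the difference vector, each entry scaled by the baseline."""
    return math.sqrt(sum(((j - b) / b) ** 2 for j, b in zip(journey, base)))
```

A value of zero from either function means the journey matches the baseline on every attribute; larger values mean less similar.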

Appreciate your comments

https://dl.dropbox.com/u/54057365/All/comp1.JPG
 
  • #60
I would recommend a couple of things in this instance.

The first would involve a two-sample t-test or one of its non-parametric forms to test whether each pair of parameters (i.e. baseline vs other journey) shows a statistically significant difference.

You should look into techniques like Bonferroni or other mechanisms that are used for multiple sets of comparisons: if you run, say, four pairwise tests, the significance level for each would be alpha/4.

The other thing I would recommend is doing a chi-square test (Pearson's chi-square goodness-of-fit) on the parameters by considering that each attribute is a random variable.

I would personally start out looking at 2-sample t-tests and the non-parametric equivalents first.

I would also look at ANOVAs (also check non-parametric versions if you need to) to test whether all groups of journeys have the same parameter as the baseline.

So do the ANOVA first and then do the pair-wise comparisons after that while thinking about whether you should use multiple pair-wise comparisons by applying Bonferroni correction of alpha values (i.e. the probability used to reject or accept the hypothesis that they are the same/different).
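
The Bonferroni step itself is simple arithmetic. In this minimal sketch the p-values are assumed to come from whichever pairwise tests were run (for example the t-tests mentioned above), and the function name is hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Compare each p-value against alpha divided by the number of tests.

    Returns the per-test threshold and a list of booleans, True where the
    null hypothesis of "no difference" is rejected for that comparison.
    """
    m = len(p_values)
    threshold = alpha / m
    return threshold, [p < threshold for p in p_values]
```

With four attributes and alpha = 0.05, each individual comparison is tested at 0.0125.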
 
  • #61
Hello Chiro,

Could I ask you a question? You have been very helpful in the past.

I am trying to quantify the difference between two discrete distributions. I have been reading online and there seem to be a few different ways, such as a Kolmogorov-Smirnov test and a chi-squared test.

My first question is which of these is the correct method for comparing the distributions below?

The distributions are discrete distributions with 24 bins.

My second question is this: it is pretty obvious from looking at the distributions that they will be statistically significantly different, but is there a method to quantify how different they are? I'm not sure, but a percentage or distance perhaps?

I've been told that if you use a two sample Kolmogorov-Smirnov test, a measure of how different the distributions are will be the p-value. Is that correct?

http://www.mathworks.co.uk/help/stats/kstest2.html

I appreciate your help and comments

Kind Regards

https://dl.dropbox.com/u/54057365/All/phy.JPG
 
  • #62
What attribute specifically are you trying to see the difference in?

The chi-square test acts a lot like a 2-norm (think of Pythagoras' theorem) for an n-dimensional vector, in that you get an analog of "distance" between two vectors.

If you know some kind of attribute (even if it's qualitative, you can find a way to give a quantitative description with further clarification), then you can mould a norm or a test statistic in that manner.
 
  • #63
Hi,

Well I developed a model which simulates car journeys. The distribution of the arrival times home in the evening simulated by the model is "different" than the actual distribution of the arrival times home observed in actual real world data. The model appears to be not that accurate.

What I ideally would like to say is that the distribution produced by the model is some percentage different from the real world distribution.

Would a Chi squared or Kolmogorov-Smirnov test quantify the difference?

What would you recommend in this case?

Can these tests be used for discrete data? The times are rounded to the nearest hour.

What would you think of summing the pointwise absolute values of the differences between the two distributions? Would that be a good idea?

abs(Data_bin1_model - Data_bin1_data) + abs(Data_bin2_model - Data_bin2_data) + ... + abs(Data_bin24_model - Data_bin24_data)

I'd prefer to use a statistical test if a suitable one is available.
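
The sum of absolute differences can be sketched as follows, assuming both histograms use the same bins and are first normalised to proportions. Halving the sum gives the total variation distance, which always lies between 0 (identical) and 1 (no overlap) and so can be read as a fraction of probability mass. The function names are hypothetical.

```python
def normalise(counts):
    """Convert raw bin counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def l1_distance(model_counts, data_counts):
    """Sum over bins of |p_model - p_data| after normalising both histograms."""
    p = normalise(model_counts)
    q = normalise(data_counts)
    return sum(abs(a - b) for a, b in zip(p, q))


def total_variation_distance(model_counts, data_counts):
    """Half the L1 distance: 0 for identical histograms, 1 for disjoint ones."""
    return 0.5 * l1_distance(model_counts, data_counts)
```

Multiplying the total variation distance by 100 gives a figure that can be quoted as "the model puts this percentage of its probability in the wrong bins". It is a descriptive distance, not a statistical test.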

Thank you for your help.
 
  • #64
I think you will want to go with something like a Pearson Chi-square Goodness-Of-Fit test given what you have said above.
 
  • #65
Hi,

I'm really struggling with this. Is the p-value from the chi-squared test the percentage difference between the 2 distributions? Why did you choose the chi-squared test over the KS test?

Thank you
 
  • #66
It's not a percentage difference but a probability: p-value = P(chi-square > x), where x is the observed test statistic (i.e. the X^2 test statistic).

Basically, the larger the deviation between the two distributions, the larger the test statistic and the smaller the p-value, so a small p-value is evidence against the two distributions being equal.
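
To show how the statistic and the p-value relate, here is a minimal standard-library sketch. It assumes `observed` holds the bin counts being tested and `expected` holds the counts the reference distribution would predict for the same total; which of the model or the real data plays which role is a modelling choice. The tail probability uses a textbook series for the regularised incomplete gamma function; in practice a library routine such as `scipy.stats.chisquare` is the safer choice.

```python
import math


def chi_square_statistic(observed, expected):
    """Pearson's X^2: sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_sf(x, df):
    """P(chi-square with df degrees of freedom > x), i.e. the p-value."""
    if x <= 0.0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularised lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if abs(term) < 1e-15 * abs(total):
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    # Clamp because round-off can push the result slightly below zero.
    return max(0.0, 1.0 - lower)
```

For 24 hourly bins and no parameters estimated from the data, the degrees of freedom would be 24 - 1 = 23, so the p-value is `chi_square_sf(statistic, 23)`.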
 