# Probability of Normal Distribution Generating a Sample

1. Jan 3, 2012

### verdverm

I would like to know how to calculate the probability that a normal distribution generated a sample.

More specifically, I am clustering lines, so I have several assumed normal distributions. Each cluster has a mean and variance/StdDev of both slope and length.

Given a set of clusters (normal distributions) AND a sample line,
I would like to be able to calculate the probabilities for each cluster.

I think it is something like:
P(L|C_i) = P_len(L.len|C_i) * P_slp(L.slp|C_i)
I don't know how to calculate the two RHS probabilities.

Tony
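
One common way to fill in the two right-hand-side terms is to use the normal density of each feature. This is only a sketch under assumptions not stated in the post: that length and slope are independent within a cluster, and that $\mu$ and $\sigma$ (notation introduced here) are the cluster's mean and standard deviation for each feature. The result is a density value (a likelihood), not a probability in the strict sense.

```latex
P(L \mid C_i) \;\approx\;
  f\!\left(L.\mathrm{len};\ \mu_{\mathrm{len},i},\ \sigma_{\mathrm{len},i}\right)\,
  f\!\left(L.\mathrm{slp};\ \mu_{\mathrm{slp},i},\ \sigma_{\mathrm{slp},i}\right),
\qquad
f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```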

2. Jan 3, 2012

### Stephen Tashi

Unless you adopt a Bayesian approach, you can't calculate "the probability that a normal distribution generated a sample".

Judging by the remainder of your post, you may be able to calculate something that might be loosely interpreted as "the probability of a particular sample, given that we assume a particular normal distribution generated it." If that's what you meant, then we can discuss how to do it. First let's clarify what you are trying to do.

(The type of distinction you must make is between "The probability of A given B" versus "The probability of B given A". They aren't the same. )
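
In symbols, the two directions are linked by Bayes' rule. The prior $P(C_i)$ below is an assumption; nothing in the thread specifies it, and a different choice of prior gives a different answer.

```latex
P(C_i \mid L) \;=\;
  \frac{P(L \mid C_i)\,P(C_i)}{\sum_{j} P(L \mid C_j)\,P(C_j)}
```

Here $P(L \mid C_i)$ is "the probability of the line given the cluster" and $P(C_i \mid L)$ is "the probability of the cluster given the line".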

3. Jan 3, 2012

### verdverm

For a detailed specific reference: research.microsoft.com/pubs/144983/mod480-wang.pdf
( specifically the calculation of b_i(L) from section 3.3 )

I'm a little unclear on how the Bayesian comes into play...
perhaps because of the formula, perhaps because there are several clusters

a little clarification on the objective...

given a time series, I break it into line segments (Piecewise Linear Approximation).
Each line segment has a θ and a length.
Next I group the lines into clusters based on these values.
Then from each group/cluster we can calculate the mean and variance of the θ and length of the lines.

So at this point I have a bunch of clusters with 2 normal distributions each.
(one for θ and one for length) (joint probability???)

Now, given a new line, I want to associate a probability with each cluster.
This probability should encapsulate the likelihood that the cluster generated the new line.

Quote:
(The type of distinction you must make is between "The probability of A given B" versus "The probability of B given A". They aren't the same.)

I will always have the case of "The probability of LINE given CLUSTER"

4. Jan 3, 2012

### Stephen Tashi

Assuming you are attempting to define your goal, don't you see that these are contradictory statements?

Your first statement has the tortuous phrase "the probability should encapsulate the likelihood", but it amounts to saying that you want "the probability that a specific cluster generated the line given the data that defines the line". The second statement obviously refers to "the probability of the data that defines a line given the cluster that generated it".

The paper you mention assumes the reader is familiar with the context of applying "the Viterbi" algorithm. I'm not, but from a few minutes of Wikipedia surfing, this algorithm can be applied to data assumed to be from a Markov model. The Markov model has a vector of probabilities for its initial states. I suppose these might function as "prior probabilities" for a Bayesian analysis. Can you explain the probability model that the paper assumes?

5. Jan 3, 2012

### verdverm

Not contradictory given that it is an iterative algorithm...

a Hidden Markov Model (HMM) has many states, each with:
- initial probability ( to start an observation series )
- transition probabilities ( to move from one state to another state ) { Matrix }
- output probability(ies) ( the probability of generating an observation )

the idea is to determine the hidden states of the model from the observations.

In the paper, instead of the points in time being the observations, the lines that approximate the data are the observations.

so my problem is with calculating the *output probabilities*

To initialize an iterative refinement, we first segment the series using the previously mentioned PLA
Then we cluster the lines created by PLA
Next, each cluster becomes a hidden state in an initial HMM (pHMM in the paper)
The output probabilities are calculated from the cluster of lines that is associated with the state (1-1 correspondence)

The output probabilities of a state are the {mean and variance} of the {angle and length}
of the lines that comprise the cluster. ( 4 values for the output in order to calculate probabilities later)
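
A minimal sketch of how those four values could be computed from the lines in one cluster. The `line` and `outputParams` types and the `fitState` helper are hypothetical names made up for illustration, and the population variance (dividing by n) is an assumption; the paper may define things differently.

```go
package main

import "fmt"

// line is a hypothetical record for one PLA segment.
type line struct {
	theta, length float64
}

// outputParams holds the four output values of one state:
// mean and variance of both angle and length.
type outputParams struct {
	tMean, tVar float64
	lMean, lVar float64
}

// meanVar returns the mean and the population variance (divide by n) of xs.
func meanVar(xs []float64) (mean, variance float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		d := x - mean
		variance += d * d
	}
	variance /= float64(len(xs))
	return mean, variance
}

// fitState summarizes the lines of one cluster as output parameters.
func fitState(lines []line) outputParams {
	ts := make([]float64, len(lines))
	ls := make([]float64, len(lines))
	for i, ln := range lines {
		ts[i] = ln.theta
		ls[i] = ln.length
	}
	var p outputParams
	p.tMean, p.tVar = meanVar(ts)
	p.lMean, p.lVar = meanVar(ls)
	return p
}

func main() {
	// three made-up lines, not data from the thread
	p := fitState([]line{{0.8, 2.0}, {1.0, 3.0}, {1.2, 4.0}})
	fmt.Printf("%+v\n", p)
}
```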

So now we get to the iterative refinement stage after creating an initial HMM...
-- Re-segment the time series under guidance of the initial HMM
( this is where my question arises from )

given a candidate line from the new segmentation,
for each state in the HMM,
*** measure how likely it is that this state generated the candidate line [ b_i(L) in section 3.3 ] ***

this measure is somehow related to the two Gaussian distributions of each state ( angle & length )
and the current candidate line under consideration

The HMM will remain constant through the course of the re-segmentation
The candidate line will always be a different 'sample'

b_i(L) is used as part of a larger computation to find a new, optimal segmentation given the current HMM

the iterative process continues like this:

until the HMM doesn't change
-- resegment with current HMM
-- create new HMM from resegmentation

I could provide sample clusters and a single line if actual numbers are desired

Best,
Tony

6. Jan 3, 2012

### chiro

If this bears any resemblance to a standard Markov modeling problem (which it seems to), then given the initial probabilities and the transition matrix, what you need to do is find the steady-state solutions, which should correspond to the "output probabilities".
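
For reference, a minimal sketch of finding a steady state by power iteration, i.e. repeatedly applying the transition matrix until the state vector stops changing. The 2-state matrix is made up for illustration and is not taken from the paper. Note that this yields long-run state occupancy; whether that matches the "output probabilities" meant in the thread is exactly the open question.

```go
package main

import "fmt"

// steadyState applies pi <- pi*P a fixed number of times, starting from
// a uniform distribution. P[i][j] is the probability of moving from
// state i to state j, so each row of P must sum to 1.
func steadyState(P [][]float64, iters int) []float64 {
	n := len(P)
	pi := make([]float64, n)
	for i := range pi {
		pi[i] = 1 / float64(n)
	}
	for k := 0; k < iters; k++ {
		next := make([]float64, n)
		for i := 0; i < n; i++ {
			for j := 0; j < n; j++ {
				next[j] += pi[i] * P[i][j]
			}
		}
		pi = next
	}
	return pi
}

func main() {
	// hypothetical transition matrix, for illustration only
	P := [][]float64{
		{0.9, 0.1},
		{0.5, 0.5},
	}
	fmt.Println(steadyState(P, 200))
}
```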

Is the above assumption correct or is there something else that we are missing?

7. Jan 3, 2012

### verdverm

okay, I think people are looking too far into this...

the problem I am having is simply this:

given (possibly joint) gaussian probability distributions

pHMM
1 |339|
theta: 1.4544 0.2695
lens: 26.8225 6.2101
2 |24|
theta: 0.8524 0.1335
lens: 2.4693 0.5381
3 |72|
theta: -0.9516 0.2081
lens: 3.7492 0.8248
4 |21|
theta: 0.0000 0.0000
lens: 2.0000 0.0000
5 |24|
theta: -0.1932 0.2335
lens: 3.1475 0.1783
6 |21|
theta: 0.6506 0.3428
lens: 3.3084 0.0837

and given a line:

line
theta: 1.0
lens: 3.0

what is the probability that the line belongs to / was generated by / fits in with / ... each cluster:
1: ?
2: ?
3: ?
4: ?
5: ?
6: ?

I need the probability of the line with each of the clusters in the pHMM

currently I am using what I think is a hack
( a function of the number of standard deviations away from the mean, with domain [-3,3] and range [0,1] )

    // calcLineGenProb scores how well a line fits this state (a hack, not a true probability)
    func (s *pState) calcLineGenProb(length, theta float64) float64 {
        lDiff, tDiff := length-s.lMean, theta-s.tMean // difference from mean
        // normalize to Std Deviations; this assumes lVari and tVari hold standard
        // deviations (if they hold variances, divide by math.Sqrt of them instead)
        lNorm, tNorm := lDiff/s.lVari, tDiff/s.tVari
        ret := calcZscore(lNorm) * calcZscore(tNorm) // calc hack ~= [0,1]*[0,1]
        return ret // close to 1 for a 'probable' line, close to zero for an 'unlikely' line
    }

    // hack helper function
    func calcZscore(X float64) float64 {
        X = math.Abs(X) // only care about magnitude
        if X > 3.0 || math.IsNaN(X) {
            return 0.000001
        }
        d := int(X * 100.0)  // index scaling
        d1 := d / 10         // calc vert axis
        d2 := d % 10         // calc horz axis
        z := ZSCORE[d1][d2]  // [0.5,0.9999] table of zscores from back of probability book
        R := (z - 0.5) * 2.0 // scale to range [0,0.5] then [0,1] so that close to mean is close to 0
        return 1.0 - R       // invert for close to 1 for good lines
    }