Data to normal distribution 
#1
Feb19-12, 09:51 AM

P: 77

Hi
I have a set of observed time intervals recorded as integers, like 4, 5, ... I want to fit this data to a normal distribution. Since the normal distribution is continuous, is there any normalization process I can use to change the data from integers to real numbers? Regards


#2
Feb19-12, 10:38 AM

Sci Advisor
P: 3,252

Have you tried simply converting each integer k to the corresponding "z-score" [itex] z = \frac{k - \mu}{\sigma} [/itex], where [itex] \mu [/itex] is the mean of the data and [itex] \sigma [/itex] is the standard deviation of the data (or the unbiased estimator of the population standard deviation)? If you know that the discrete data comes from measurements of a continuous quantity that are rounded off, you might be able to do a process that essentially "smears out" each discrete data point into a possible distribution of continuous data points. Then you can try to fit a normal distribution to the superposition of these distributions. This is a very sophisticated technique and I don't have the details of how to do it fresh in my mind. I think the method is called "using convolution kernels". Explain more about the data.
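As a minimal sketch of the z-score conversion described above, using only the Python standard library: the function name `z_scores` and the sample values are made up for illustration. `statistics.stdev` uses the n-1 denominator; `statistics.pstdev` would use n instead.

```python
from statistics import fmean, stdev

def z_scores(data):
    """Convert each observation k to z = (k - mu) / sigma."""
    mu = fmean(data)
    sigma = stdev(data)  # sample standard deviation (n - 1 denominator)
    return [(k - mu) / sigma for k in data]

# Hypothetical interarrival times in whole minutes.
observed = [4, 5, 6]
print(z_scores(observed))
```

Note that the z-scores are still a discrete set of values; standardizing rescales the data but does not make it continuous.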


#3
Feb19-12, 10:44 AM

P: 77

The data are just the times between two arrivals of buses.
They are not measured to the second, only to the minute, for example 5 minutes, 4 minutes, etc. I want to fit a normal distribution to this data, but since the values are discrete and not real numbers I think they need some normalization. Any advice, please? Regards


#4
Feb19-12, 11:09 AM

Sci Advisor
P: 3,252

I can't resist observing that the exponential distribution is the one most often used for "interarrival times". What motivates your choice of a normal distribution? Have you done some preliminary plotting of the data that suggests it is normally distributed?
You didn't say whether you had tried to fit a normal distribution by using the z-scores. A simplistic way to represent data with roundoff error to the nearest minute would be to replace each observation of k minutes by a uniform distribution of "fake" observations on the interval [k-0.5, k+0.5] minutes. For example, if you represented the uniform distribution as data to the nearest second, an arrival at time 3 minutes would become a set of fake data points at each second from 180 - 30 to 180 + 29 seconds. Then you would take the mean and standard deviation of the fake data as the parameters of the normal distribution. That is just my primitive oversimplification of a technique that I've seen used. I haven't looked at your other posts. Are you the poster who is writing a thesis? If you use this technique then you need to find the formal, correct, dignified way to go about it and find the proper terminology for it. You also must distinguish between "roundoff" and "truncation" error. For example, if a true time of 3 minutes 48 seconds is truncated to a data point of 3 minutes (instead of being rounded to a data point of 4 minutes) then you should represent a datum of 3 minutes as a uniform distribution from 3 to 4 minutes.
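The "fake observations" idea above can be sketched roughly as follows. This is only an illustration of the simplistic version described in the post, not a formal method; the function name `smear_to_seconds`, the `truncated` flag, and the sample data are assumptions.

```python
from statistics import fmean, pstdev

def smear_to_seconds(minutes, truncated=False):
    """Replace each k-minute observation by 60 one-second 'fake' points.

    Rounded data:   seconds k*60 - 30 ... k*60 + 29
    Truncated data: seconds k*60      ... k*60 + 59
    """
    fake = []
    for k in minutes:
        start = k * 60 if truncated else k * 60 - 30
        fake.extend(range(start, start + 60))
    return fake

# Hypothetical observations in whole minutes.
observed_minutes = [4, 5, 5, 6, 5, 4, 7]
fake_seconds = smear_to_seconds(observed_minutes)

# Mean and standard deviation of the fake data, in seconds.
mu_seconds = fmean(fake_seconds)
sigma_seconds = pstdev(fake_seconds)
print(mu_seconds, sigma_seconds)
```

Calling `smear_to_seconds(observed_minutes, truncated=True)` gives the truncation variant, which shifts every fake point by 30 seconds relative to the rounding variant.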


#5
Feb19-12, 11:48 AM

P: 199

You asked this question in another thread. You really can't model interarrival times as a normal distribution because they are not normally distributed. You really do need to think about using the appropriate distribution. The number of arrivals in a given time is a counting process such as Poisson. Then the interarrival times that result from that counting process are exponentially distributed.
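One rough way to see the link between exponential interarrival times and Poisson counts is a small simulation. This is a sketch under assumed values (a rate of 0.2 arrivals per minute, 10-minute windows); the function names are invented for illustration.

```python
import random
from statistics import fmean, pstdev

def simulate_interarrivals(rate, n, seed=0):
    """Draw n exponential gaps with the given arrival rate."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]

def counts_per_window(gaps, window):
    """Count arrivals in consecutive complete windows of the given length."""
    counts = []
    t = 0.0
    current = 0
    edge = window
    for g in gaps:
        t += g
        # Close every window that ended before this arrival.
        while t >= edge:
            counts.append(current)
            current = 0
            edge += window
        current += 1
    return counts

gaps = simulate_interarrivals(rate=0.2, n=20000, seed=1)
counts = counts_per_window(gaps, window=10.0)

# For an exponential distribution the mean and standard deviation are equal
# (both 1/rate); the counts per window should average about rate * window.
print(fmean(gaps), pstdev(gaps), fmean(counts))
```

A histogram of `gaps` is strongly right-skewed, which is quite different from the symmetric shape of a normal distribution.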



#6
Feb19-12, 12:43 PM

P: 77

Thank you, Stephen and Alan, I appreciate it.
I know from the literature that it should be modeled by a Poisson distribution if counting the number of arrivals, or an exponential distribution if studying interarrival times, but the histogram suggests a normal distribution to me. Anyway, thank you for the advice. Regards


#7
Feb19-12, 11:06 PM

P: 199

Hi Mark,
I was thinking that you might be frustrated that we just keep telling you that your distribution isn't normal. It occurred to me that maybe we're just not understanding what your data represents. Could you elaborate? I ask because you mention bus arrival times which are not random variables at all, they are scheduled events. So I thought maybe you could be looking at the error in actual arrival times or the actual time between two scheduled arrivals given that there is error in the actual arrival times of consecutive buses. These quantities could reasonably be normally distributed. Maybe if we better understood your data we could offer some help. 


#8
Feb19-12, 11:13 PM

P: 4,572

If you absolutely want to test whether your data's histogram has a good fit to a normal distribution then you need to apply what is known as a goodness-of-fit test.
For a normal distribution you can use what is known as the Shapiro-Wilk test, which gives you a statistic that tells you how good the fit is. But again I want to give a note of caution to take in what the above posters have said: you need to understand your data not only from a probabilistic or histogram point of view, but more importantly from a process point of view. Understanding the underlying process and the effect it has on the final distribution is going to be a lot more useful than just trying to fit things to distributions, especially if you are looking at something from the point of view of the process, as opposed to using results for statistical purposes such as testing whether the errors of a regression are normally distributed.


#9
Feb20-12, 01:43 AM

P: 77

Yes, thank you.
Actually I am using the chi-square, Kolmogorov and other tests. The first one seems fine; the others just do not fit. I am attaching one sample of the collected data. Of course, ideally the buses should be scheduled every 5 minutes, but in practice we see that there are errors, and that is what I am working on. Thank you for the suggestions. Regards
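For minute-rounded data, a chi-square statistic can be computed with one bin per recorded minute, so the bins match how the data was collected. The sketch below assumes rounding to the nearest minute (bin [k-0.5, k+0.5) for a recorded value k) and uses a made-up sample; `chi_square_vs_normal` is a hypothetical helper, and the degrees of freedom follow the usual rule of thumb (bins minus 1, minus 2 estimated parameters), which is only approximate here. Tests that assume continuous data without ties, such as Kolmogorov-Smirnov, may reject rounded data partly because of the ties.

```python
from collections import Counter
from statistics import NormalDist, fmean, stdev

def chi_square_vs_normal(data):
    """Chi-square statistic of integer data against a fitted normal."""
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))
    counts = Counter(data)
    lo, hi = min(counts), max(counts)
    stat = 0.0
    for k in range(lo, hi + 1):
        # The two tails are folded into the first and last bins.
        left = 0.0 if k == lo else fitted.cdf(k - 0.5)
        right = 1.0 if k == hi else fitted.cdf(k + 0.5)
        expected = n * (right - left)
        stat += (counts[k] - expected) ** 2 / expected
    dof = (hi - lo + 1) - 1 - 2
    return stat, dof

# Hypothetical sample of interarrival times in whole minutes.
sample = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
stat, dof = chi_square_vs_normal(sample)
print(stat, dof)
```

The statistic would then be compared with a chi-square table value for `dof` degrees of freedom. With a sample this small the expected counts are too low for the approximation to be reliable; bins with small expected counts are usually merged.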


#10
Feb20-12, 12:47 PM

P: 199

I thought that might be what you were actually doing. Now the statisticians can jump in.



#11
Feb20-12, 06:00 PM

P: 300

Please describe exactly what you want. Then we can help. 


#12
Feb29-12, 01:38 PM

P: 77

The idea is that the observed data are interarrival times.
The observers didn't record them to the exact second but kind of rounded them to minutes, for example 7.04, 7.08, etc. Fitting this data to a commonly used distribution, for example the exponential, is impossible because that distribution is continuous. How can I estimate the error for this data, or is there another approach to this situation? Regards


#13
Feb29-12, 03:19 PM

Sci Advisor
P: 3,252

Fit the data to the discrete distribution that is implied by rounding the continuous distribution. You have to define "kind of rounded" precisely. Did they round or truncate? 
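A minimal sketch of that idea, assuming the continuous distribution is normal and the data was rounded to the nearest minute: a recorded value k then has probability F(k+0.5) - F(k-0.5), where F is the normal CDF, and the parameters can be chosen to maximize the resulting likelihood. The function names and the crude grid search are illustrative assumptions, not an established routine. For truncated data the interval would be [k, k+1) instead.

```python
import math
from collections import Counter
from statistics import NormalDist, fmean, stdev

def rounded_normal_loglik(mu, sigma, counts):
    """Log-likelihood of integer counts under a normal rounded to nearest."""
    dist = NormalDist(mu, sigma)
    total = 0.0
    for k, c in counts.items():
        p = dist.cdf(k + 0.5) - dist.cdf(k - 0.5)
        total += c * math.log(max(p, 1e-300))  # guard against log(0)
    return total

def fit_rounded_normal(data, steps=40, span=0.5):
    """Crude grid search around the sample mean and standard deviation."""
    counts = Counter(data)
    m0, s0 = fmean(data), stdev(data)
    best = (rounded_normal_loglik(m0, s0, counts), m0, s0)
    for i in range(-steps, steps + 1):
        mu = m0 + span * i / steps
        for j in range(-steps, steps + 1):
            sigma = s0 * (1 + span * j / steps)  # 0.5*s0 .. 1.5*s0
            ll = rounded_normal_loglik(mu, sigma, counts)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best[1], best[2]

# Hypothetical sample of interarrival times in whole minutes.
sample = [3, 4, 4, 5, 5, 5, 5, 6, 6, 7]
mu_hat, sigma_hat = fit_rounded_normal(sample)
print(mu_hat, sigma_hat)
```

The same likelihood construction works for other continuous distributions, such as the exponential, by swapping in their CDF.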


#14
Mar1-12, 01:18 AM

P: 77

Thank you for your explanation.
The data was rounded to the nearest minute, meaning that 6.7 was taken as 7 and 4.3 was taken as 4. Regards

