How can I normalize integer data for a better fit to a normal distribution?

Mark J. · Feb 19, 2012

Hi,
I have a set of observed time intervals, recorded as integers such as 4, 5, ...
I want to fit this data to a normal distribution.
Is there a normalization process I can apply to convert the integers to real numbers, given that the normal distribution is continuous?

Regards

Stephen Tashi · Feb 19, 2012

Mark J. said:

Hi
Is there a normalization process I can apply to convert the integers to real numbers, given that the normal distribution is continuous?

If the meaning of your question is "Is there a way to fit a normal distribution to discrete data that can be proven correct or optimal by mathematics", the answer is no, not without more information about what the data is.

Have you tried simply converting each integer k to the corresponding "z-score" [itex] z = \frac{ (k - \mu)}{\sigma} [/itex] where [itex] \mu [/itex] is the mean of the data and [itex] \sigma [/itex] is the standard deviation of the data (or the unbiased estimator of the population standard deviation) ?
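The z-score conversion described above can be sketched in a few lines of Python. This is only an illustration: the list `minutes` is made-up data, and the sample standard deviation (n - 1 denominator) is one assumed choice for sigma.

```python
import statistics

# Hypothetical inter-arrival times in whole minutes (illustrative values only).
minutes = [4, 5, 5, 6, 4, 7, 5, 3, 6, 5]

mu = statistics.mean(minutes)      # mean of the data
sigma = statistics.stdev(minutes)  # sample standard deviation (n - 1 denominator)

# z = (k - mu) / sigma for each integer observation k
z_scores = [(k - mu) / sigma for k in minutes]

print(f"mean = {mu:.3f}, stdev = {sigma:.3f}")
print([round(z, 2) for z in z_scores])
```

Note that the z-scores take exactly as many distinct values as the integers did. The conversion shifts and rescales the data but does not make it continuous.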

If you know that the discrete data comes from measurements of a continuous quantity that are rounded off, you might be able to use a process that essentially "smears out" each discrete data point into a possible distribution of continuous data points. Then you can try to fit a normal distribution to the superposition of these distributions. This is a very sophisticated technique and I don't have the details of how to do it fresh in my mind. I think the method is called "using convolution kernels".

Explain more about the data.

Mark J. · Feb 19, 2012

The data are just the times between successive bus arrivals.
They are not measured to the second, only to the minute, for example 5 minutes, 4 minutes, etc.
I want to fit a normal distribution to this data, but since the values are discrete and not real numbers I think they need some normalization.
Any advice, please?
Regards

Stephen Tashi · Feb 19, 2012

I can't resist observing that the exponential distribution is the one most often used for "interarrival times". What motivates your choice of a normal distribution? Have you done some preliminary plotting of the data that suggests it is normally distributed?

You didn't say whether you had tried to fit a normal distribution by using the z-scores.

A simplistic way to represent data with roundoff error to the nearest minute would be to replace each observation of k minutes by a uniform distribution of "fake" observations on the interval [k-0.5, k+0.5] minutes. For example, if you represented the uniform distribution as data to the nearest second, an arrival at time 3 minutes would become a set of fake data points at each second from 180 - 30 to 180 + 29 seconds. Then you would take the mean and standard deviation of the fake data as the parameters of the normal distribution.

That is just my primitive oversimplification of a technique that I've seen used. I haven't looked at your other posts. Are you the poster who is writing a thesis? If you use this technique then you need to find the formal, correct, dignified way to go about it and find the proper terminology for it.

You also must distinguish between "roundoff" and "truncation" error. For example, if a true time of 3 minutes 48 seconds is truncated to a data point of 3 minutes (instead of being rounded to a data point of 4 minutes) then you should represent a datum of 3 minutes as a uniform distribution from 3 to 4 minutes.
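A minimal Python sketch of this "fake observations" idea, covering both the rounding and the truncation case. The function name `smear` and the list `minutes` are hypothetical, introduced only for illustration; this is the primitive version described above, not a formal method.

```python
import statistics

def smear(minutes, mode="round"):
    """Replace each whole-minute value by 60 fake one-second observations."""
    if mode == "round":
        offsets = range(-30, 30)   # k*60 - 30 ... k*60 + 29 seconds
    elif mode == "truncate":
        offsets = range(0, 60)     # k*60 ... k*60 + 59 seconds
    else:
        raise ValueError("mode must be 'round' or 'truncate'")
    return [k * 60 + d for k in minutes for d in offsets]

# Hypothetical inter-arrival times in whole minutes (illustrative values only).
minutes = [4, 5, 5, 6, 4, 7, 5, 3, 6, 5]

for mode in ("round", "truncate"):
    fake = smear(minutes, mode)
    mu = statistics.mean(fake) / 60       # back to minutes
    sigma = statistics.pstdev(fake) / 60  # back to minutes
    print(f"{mode:9s} mean = {mu:.3f} min, sd = {sigma:.3f} min")
```

Smearing leaves the mean almost unchanged in the rounding case and shifts it up by half a minute in the truncation case. In both cases it adds roughly 1/12 of a squared minute to the variance, which is the variance of a uniform distribution one minute wide.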

alan2 · Feb 19, 2012

You asked this question in another thread. You really can't model inter-arrival times as a normal distribution because they are not normally distributed. You really do need to think about using the appropriate distribution. The number of arrivals in a given time is a counting process such as Poisson. Then the inter-arrival times that result from that counting process are exponentially distributed.
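The link between a Poisson counting process and exponential gaps can be checked with a small simulation. This is a sketch under stated assumptions: arrivals are scattered uniformly at random over a time window (which is how a Poisson process behaves once the number of arrivals is fixed), and `simulate_gaps`, the 5-minute mean gap, and the seed are all hypothetical choices.

```python
import math
import random
import statistics

def simulate_gaps(n_arrivals=20000, mean_gap=5.0, seed=0):
    """Scatter arrivals uniformly over a window and return the gaps between them."""
    rng = random.Random(seed)
    horizon = n_arrivals * mean_gap
    arrivals = sorted(rng.uniform(0, horizon) for _ in range(n_arrivals))
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

gaps = simulate_gaps()
mean = statistics.mean(gaps)
cv = statistics.pstdev(gaps) / mean  # coefficient of variation
frac_below_mean = sum(g < mean for g in gaps) / len(gaps)

print(f"mean gap = {mean:.2f}")
print(f"sd / mean = {cv:.3f} (an exponential distribution has 1)")
print(f"share of gaps below the mean = {frac_below_mean:.3f} "
      f"(exponential: {1 - math.exp(-1):.3f}, normal: 0.500)")
```

A standard deviation close to the mean, and well over half of the gaps falling below the mean, are both signs of an exponential rather than a normal shape.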

Mark J. · Feb 19, 2012

Thank you, Stephen and Alan, I appreciate it.
I know from the literature that the data should be modeled as Poisson when counting the number of arrivals, or as exponential when studying inter-arrival times, but the histogram suggests a normal distribution to me.
Anyway, thank you for the advice.
Regards

alan2 · Feb 19, 2012

Hi Mark,

I was thinking that you might be frustrated that we just keep telling you that your distribution isn't normal. It occurred to me that maybe we're just not understanding what your data represents. Could you elaborate? I ask because you mention bus arrival times, which are not random variables at all; they are scheduled events. So I thought you might be looking at the error in actual arrival times, or at the actual time between two scheduled arrivals given that there is error in the actual arrival times of consecutive buses. These quantities could reasonably be normally distributed. Maybe if we better understood your data we could offer some help.

chiro · Feb 19, 2012

If you absolutely want to test whether your data's histogram has a good fit to a normal distribution then you need to apply what is known as a Goodness-Of-Fit test.

For a normal distribution you can use what is known as a Shapiro-Wilk test, which will give you a statistic that tells you how good the fit is.

But again I want to give a note of caution to take in what the above posters have said: you need to understand your data not only from a probabilistic or histogram point of view, but more importantly from a process point of view.

Understanding the underlying process, and the effect it has on the final distribution, is going to be a lot more useful than just trying to fit things to distributions. That is especially true if you are looking at something from the point of view of the process, as opposed to using results for statistical purposes such as testing whether the errors of a regression are normally distributed.
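As a rough illustration of a goodness-of-fit calculation, here is a sketch of Pearson's chi-square statistic for whole-minute data against a fitted normal. This is a different test from Shapiro-Wilk, chosen here only because it needs nothing beyond the Python standard library. The function name, the sample data, and the assumption that values were rounded to the nearest minute are all hypothetical.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev

def chi_square_statistic(data, dist):
    """Pearson chi-square statistic for integer data rounded to the nearest unit."""
    counts = Counter(data)
    lo, hi = min(data), max(data)
    n = len(data)
    ks = range(lo, hi + 1)
    # One bin [k - 0.5, k + 0.5) per integer k between the smallest and largest value.
    observed = [counts.get(k, 0) for k in ks]
    expected = [n * (dist.cdf(k + 0.5) - dist.cdf(k - 0.5)) for k in ks]
    # The two tail bins hold no observations but still carry expected mass.
    observed = [0] + observed + [0]
    expected = [n * dist.cdf(lo - 0.5)] + expected + [n * (1 - dist.cdf(hi + 0.5))]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, observed, expected

# Hypothetical inter-arrival times in whole minutes (illustrative values only).
minutes = [4, 5, 5, 6, 4, 7, 5, 3, 6, 5]
fitted = NormalDist(mean(minutes), stdev(minutes))
stat, observed, expected = chi_square_statistic(minutes, fitted)
print(f"chi-square statistic = {stat:.3f} over {len(observed)} bins")
```

The statistic would then be compared with a chi-square table, with degrees of freedom reduced by the two estimated parameters. With so few observations the expected counts are too small for the comparison to be reliable, so in practice bins are usually merged first.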

Mark J. · Feb 20, 2012

Yes, thank you.
Actually I am using the chi-square, Kolmogorov and other tests.
The first one seems fine; the others just do not fit.
I am attaching a sample of the collected data with the email.
Of course, ideally the buses should be scheduled every 5 minutes, but in practice we see that there are errors, and that is what I am working on.
Thank you for the suggestions.
Regards

alan2 · Feb 20, 2012

I thought that might be what you were actually doing. Now the statisticians can jump in.

moonman239 · Feb 20, 2012

Mark J. said:

Yes, thank you.
Actually I am using the chi-square, Kolmogorov and other tests.
The first one seems fine; the others just do not fit.
I am attaching a sample of the collected data with the email.
Of course, ideally the buses should be scheduled every 5 minutes, but in practice we see that there are errors, and that is what I am working on.
Thank you for the suggestions.
Regards

I still don't get it. Am I the only one that doesn't?

Please describe exactly what you want. Then we can help.

Mark J. · Feb 29, 2012

The idea is that the observed data are inter-arrival times.
The observers did not record them to the exact second but kind of rounded them to minutes, for example 7.04, 7.08, etc.
Fitting this data to a commonly used distribution, for example the exponential, seems impossible because that distribution is continuous.
How should I estimate the error for this data, or is there another approach to this situation?
Regards

Stephen Tashi · Feb 29, 2012

Mark J. said:

Fitting this data to a commonly used distribution, for example the exponential, seems impossible because that distribution is continuous.

It should be straightforward to fit a continuous distribution to the data once you have decided what criteria to use for a fit. A continuous density function f(x) implies a discrete distribution if the data was rounded. For example, the probability of observing x = 7 minutes given that data was rounded to the nearest minute is [itex] \int_{6.5}^{7.5} {f(x) dx} [/itex].

Fit the data to the discrete distribution that is implied by rounding the continuous distribution.
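One way to carry this out is sketched below, assuming maximum likelihood as the fit criterion, a normal density for f(x), and rounding to the nearest minute, so that P(k) = F(k + 0.5) - F(k - 0.5). None of these choices comes from the thread, and the function names, the crude grid search, and the sample data are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist

def rounded_log_likelihood(counts, mu, sigma):
    """Log-likelihood of integer counts under a normal rounded to the nearest unit."""
    dist = NormalDist(mu, sigma)
    total = 0.0
    for k, n in counts.items():
        p = dist.cdf(k + 0.5) - dist.cdf(k - 0.5)  # integral of f over [k-0.5, k+0.5]
        total += n * math.log(max(p, 1e-300))      # guard against log(0)
    return total

def fit_rounded_normal(data, steps=60):
    """Crude grid search for (mu, sigma); needs at least two distinct values."""
    counts = Counter(data)
    lo, hi = min(data), max(data)
    best = None
    for i in range(steps + 1):
        mu = lo + (hi - lo) * i / steps
        for j in range(1, steps + 1):
            sigma = (hi - lo) * j / steps
            ll = rounded_log_likelihood(counts, mu, sigma)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best

# Hypothetical inter-arrival times in whole minutes (illustrative values only).
minutes = [4, 5, 5, 6, 4, 7, 5, 3, 6, 5]
ll, mu, sigma = fit_rounded_normal(minutes)
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}, log-likelihood = {ll:.3f}")
```

For truncated data the interval would be [k, k + 1) instead, and an exponential or any other continuous distribution could be substituted by swapping in its cumulative distribution function.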

You have to define "kind of rounded" precisely. Did they round or truncate?

Mark J. · Mar 1, 2012

Thank you for your explanation.
The data was rounded to the upper floor, meaning that:

6.7 was taken as 7
4.3 was taken as 4

Regards
