Finding an approximate CDF/PDF from a large data set

woodssnoop · Aug 5, 2011

I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system, and the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.

mathman · Aug 5, 2011

If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.

woodssnoop · Aug 5, 2011

mathman said:

If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.

I know I should be able to do a least-squares regression on my data to smooth it out once I get it plotted, but I feel like my logic for plotting the CDF and finding the analytical CDF is flawed. Right now I am plotting [itex]P(x \leq a)[/itex] = CDF = (number of data points with a value of "a" or less)/1000, and the only function that seems to fit is a high-order polynomial (e.g. of order [itex]\geq[/itex] 20). A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].
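For reference, the empirical CDF described here can be computed directly from the sorted sample, before any curve fitting. A minimal sketch in Python, using only the standard library; the `energies` list is synthetic stand-in data, not the data from this thread, and `empirical_cdf` is a name chosen for illustration:

```python
import bisect
import random

def empirical_cdf(samples):
    """Return F(a) = (number of samples <= a) / n as a callable."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(a):
        # bisect_right gives the count of samples with value <= a
        return bisect.bisect_right(ordered, a) / n

    return F

# Hypothetical stand-in for the 1000 computed energies
random.seed(0)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]

F = empirical_cdf(energies)
print(F(min(energies) - 1.0))  # 0.0, below every sample
print(F(max(energies)))        # 1.0, at or above every sample
print(F(0.0))                  # roughly 0.5 for this symmetric synthetic sample
```

By construction this step function is non-decreasing and runs from 0 to 1, which a freely fitted high-order polynomial does not guarantee.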

Stephen Tashi · Aug 5, 2011

woodssnoop said:

A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].

If you fit a function f(x) to a histogram of data whose range is [itex] A \leq x \leq B [/itex], then you must decide whether you believe the curve fit applies to values of x outside the range [A,B]. If you feel values outside that range are impossible, then you define f(x) to be zero for [itex] x < A [/itex] and for [itex] x > B [/itex], so the integral of f(x) "from minus infinity to infinity" is only the integral of f(x) "from A to B".

Similarly, if you fit the function F(x) to a cumulative histogram of the data, you define F(x) = 0 for [itex] x < A [/itex] and F(x) = 1 for [itex] x > B [/itex]. You must use a curve fitting method that produces an F(x) that is non-decreasing and never exceeds 1.

Unless you are using a curve fit that passes a curve f(x) exactly through each point on the histogram, you will still have to "normalize" the function f(x) in order to have a PDF. If you decide the possible data values are in the range from A to B then divide f(x) by [itex] \int_A^B f(x) dx [/itex] to get a PDF.
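The normalization step can be sketched as follows. This is only an illustration under stated assumptions: the fitted curve `g` is a made-up non-negative parabola standing in for whatever curve fit is actually used, the integral is approximated with a simple trapezoid rule, and `normalize_on_interval` is a hypothetical helper name.

```python
def normalize_on_interval(g, A, B, steps=10_000):
    """Turn a non-negative fitted curve g into a PDF supported on [A, B]."""
    h = (B - A) / steps
    # Composite trapezoid rule for the integral of g over [A, B]
    area = 0.5 * (g(A) + g(B))
    for i in range(1, steps):
        area += g(A + i * h)
    area *= h
    if area <= 0:
        raise ValueError("fitted curve must have positive area on [A, B]")

    def pdf(x):
        # Zero outside the range the data are assumed to cover
        return g(x) / area if A <= x <= B else 0.0

    return pdf

# Hypothetical fitted curve: an unnormalized parabola on [0, 2]
def g(x):
    return x * (2.0 - x)

pdf = normalize_on_interval(g, 0.0, 2.0)
print(pdf(1.0))  # close to 0.75, since the integral of g over [0, 2] is 4/3
print(pdf(3.0))  # 0.0, outside [A, B]
```

Note that dividing by the integral only fixes the total area; it does not repair a fit that dips below zero somewhere in [A, B].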

woodssnoop · Aug 10, 2011

Alright, thanks for the help. I think I got it.

epik · Aug 10, 2011

woodssnoop said:

I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system, and the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.

I might be wrong, but I think you need to do a chi-squared test. First plot the CDF to figure out approximately what distribution you have. You might have a common distribution (normal, exponential), in which case there is no point in fitting a function and creating a distribution of your own. You can check Wikipedia to see what the CDFs of these distributions look like.

If the CDF looks like a known distribution, find the mean and variance and get the PDF from those parameters. Then do a chi-squared test at a chosen confidence level to verify it.

Again, I could be wrong.
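One way this check might look for a normal candidate, as a rough sketch rather than a definitive procedure. The assumptions are mine: the data are synthetic, the bins are chosen to have equal probability under the fitted normal, and the 5% critical value 14.067 for 7 degrees of freedom is taken from a standard chi-squared table. `chi_squared_normal` is a hypothetical helper name.

```python
import bisect
import random
from statistics import NormalDist, fmean, stdev

def chi_squared_normal(samples, bins=10):
    """Chi-squared statistic for a normal fit, using equal-probability bins."""
    fitted = NormalDist(fmean(samples), stdev(samples))
    # Interior bin edges chosen so each bin has probability 1/bins under the fit
    edges = [fitted.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for x in samples:
        observed[bisect.bisect_right(edges, x)] += 1
    expected = len(samples) / bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = bins - 1 - 2  # two parameters (mean, std. dev.) were estimated
    return stat, dof

random.seed(1)
energies = [random.gauss(-3.0, 0.5) for _ in range(1000)]  # synthetic stand-in
stat, dof = chi_squared_normal(energies)

# 14.067 is the tabulated 5% critical value for 7 degrees of freedom
verdict = "reject normal fit" if stat > 14.067 else "no evidence against normal fit"
print(stat, dof, verdict)
```

Because the mean and standard deviation are estimated from the raw (unbinned) data, the degrees of freedom used here are only approximate.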

bpet · Aug 10, 2011

If none of the standard distributions fit your data then another option is kernel density estimation.
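A minimal sketch of a Gaussian kernel density estimate, using only the Python standard library. The data are synthetic, and the default bandwidth is a common rule of thumb that assumes roughly normal data; it is an assumption to be tuned, not a recommendation from this thread.

```python
import math
import random
from statistics import stdev

def gaussian_kde(samples, bandwidth=None):
    """Return a Gaussian kernel density estimate as a callable PDF."""
    n = len(samples)
    if bandwidth is None:
        # Rule-of-thumb bandwidth for roughly normal data (an assumption)
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Average of Gaussian bumps centred on each sample
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf

random.seed(2)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]  # synthetic stand-in
pdf = gaussian_kde(energies)

# Sanity check: a Riemann sum over [-8, 8] should come out close to 1
step = 0.02
total = sum(pdf(-8.0 + i * step) for i in range(801)) * step
print(round(total, 3))
```

Unlike a polynomial fit to the CDF, this estimate is non-negative everywhere and integrates to 1 by construction, so no separate normalization is needed.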
