Finding an approximate CDF/PDF from a large data set

woodssnoop · Aug 5, 2011

I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system: the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.

mathman · Aug 5, 2011

If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.

woodssnoop · Aug 5, 2011

mathman said:

If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.

I know I should be able to do a least-squares regression on my data to smooth it out once I get it plotted, but I feel like my logic for plotting the CDF and finding the analytical CDF is flawed. Right now I am plotting [itex]P(x \leq a)[/itex] = CDF = (number of data points with a value of "a" or less)/1000, and the only function that seems to fit is a high-order polynomial (e.g. degree [itex]\geq 20[/itex]). A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].
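For reference, the counting rule described here is the empirical CDF. A minimal Python sketch follows, assuming the energies are available as a plain list; the data below are randomly generated stand-ins, not the real energies, and `empirical_cdf` is a name made up for this illustration.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return a function F with F(a) = (number of samples <= a) / n."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(a):
        # bisect_right gives the count of sorted values that are <= a
        return bisect.bisect_right(ordered, a) / n

    return F

# Hypothetical stand-in for the 1000 computed energies
random.seed(0)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]

F = empirical_cdf(energies)
print(F(min(energies) - 1.0))  # 0.0, below every data point
print(F(max(energies)))        # 1.0, at or above every data point
```

By construction this step function is non-decreasing and stays between 0 and 1, which a freely fitted high-order polynomial does not guarantee.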

Stephen Tashi · Aug 5, 2011

woodssnoop said:

A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].

If you fit a function f(x) to a histogram of data whose range is [itex]A \leq x \leq B[/itex] then you must decide whether you believe the curve fit applies to values of x outside the range [A,B]. If you feel values outside that range are impossible then you define f(x) = 0 for [itex]x < A[/itex] and for [itex]x > B[/itex], so the integral of f(x) "from minus infinity to infinity" is only the integral of f(x) "from A to B".

Similarly, if you fit the function F(x) to a cumulative histogram of the data, you define F(x) = 0 for [itex]x < A[/itex] and F(x) = 1 for [itex]x > B[/itex]. You must use a curve-fitting method that produces an F(x) that is a non-decreasing function, and it must never exceed 1.

Unless you are using a curve fit that passes a curve f(x) exactly through each point on the histogram, you will still have to "normalize" the function f(x) in order to have a PDF. If you decide the possible data values are in the range from A to B then divide f(x) by [itex]\int_A^B f(x) dx[/itex] to get a PDF.
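The normalization step can be sketched in Python as follows. The curve `fitted` is a hypothetical stand-in for a histogram fit, the helper names `trapezoid` and `make_pdf` are made up for illustration, and the trapezoid rule is only one convenient way to approximate the integral.

```python
def trapezoid(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

def make_pdf(f, a, b):
    """Turn a non-negative fitted curve f into a PDF supported on [a, b]."""
    area = trapezoid(f, a, b)

    def pdf(x):
        if x < a or x > b:
            return 0.0  # values outside [a, b] are treated as impossible
        return f(x) / area

    return pdf

# Hypothetical fitted curve, non-negative on [0, 2]
def fitted(x):
    return 3.0 * x * (2.0 - x)

A, B = 0.0, 2.0
pdf = make_pdf(fitted, A, B)
print(trapezoid(pdf, A, B))  # close to 1 after normalization
```

The sketch assumes the fitted curve is non-negative on [A, B]; a fit that dips below zero would need to be repaired before it could be divided into a PDF.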

woodssnoop · Aug 10, 2011

Alright, thanks for the help. I think I got it.

epik · Aug 10, 2011

woodssnoop said:

I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system: the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.

I might be wrong, but you need to do a chi-squared test. First plot the CDF to figure out approximately what distribution you have. You might have a common distribution (normal or exponential), in which case there is no point fitting a function and creating a distribution of your own. You can check Wikipedia to see what the CDFs of these distributions look like.

In case the CDF looks like a known distribution, find the mean and variance and get the PDF according to those parameters. Then do a chi-squared test at a chosen significance level to verify it.

Again, I could be wrong.
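A rough Python sketch of that procedure for a normal candidate is below. The data are randomly generated stand-ins, `chi_squared_statistic` is a name made up for this illustration, and the choices of 10 equal-probability bins and a 5% significance level are assumptions rather than anything specified in the thread.

```python
import random
from statistics import NormalDist, fmean, stdev

def chi_squared_statistic(samples, dist, n_bins=10):
    """Pearson chi-squared statistic of samples against dist, using equal-probability bins."""
    n = len(samples)
    # Interior bin edges at the 1/n_bins, 2/n_bins, ... quantiles of the candidate
    edges = [dist.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in samples:
        # The bin index of x is the number of edges lying strictly below x
        observed[sum(1 for e in edges if e < x)] += 1
    expected = n / n_bins  # every bin has the same expected count
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical stand-in for the 1000 computed energies
random.seed(1)
energies = [random.gauss(5.0, 2.0) for _ in range(1000)]

# Candidate distribution built from the sample mean and standard deviation
candidate = NormalDist(fmean(energies), stdev(energies))
stat = chi_squared_statistic(energies, candidate)

# 10 bins, minus 1, minus 2 estimated parameters gives 7 degrees of freedom;
# 14.07 is the tabulated 5% critical value for 7 degrees of freedom.
CRITICAL_5_PERCENT_DF7 = 14.07
verdict = "reject normal" if stat > CRITICAL_5_PERCENT_DF7 else "consistent with normal"
print(stat, verdict)
```

The degrees-of-freedom count is the conventional one and is only approximate when the parameters are estimated from the raw data, so a statistic near the critical value deserves a closer look.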

bpet · Aug 10, 2011

If none of the standard distributions fit your data then another option is kernel density estimation.
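As a minimal sketch of kernel density estimation in Python: the estimate is an average of small Gaussian bumps, one centred on each data point. The Gaussian kernel and the normal-reference rule-of-thumb bandwidth used here are assumptions (other kernels and bandwidth rules are common), the data are randomly generated stand-ins, and `gaussian_kde` is just a local helper name.

```python
import math
import random
from statistics import stdev

def gaussian_kde(samples, bandwidth=None):
    """Return a density estimate: the average of Gaussian bumps centred on each sample."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb, one common default among several
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return pdf

# Hypothetical stand-in for the 1000 computed energies
random.seed(2)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]
pdf = gaussian_kde(energies)

# Riemann-sum check that the estimate integrates to about 1 over a wide grid
step = 0.05
area = sum(pdf(-8.0 + i * step) for i in range(321)) * step
print(round(area, 3))
```

Unlike a polynomial fit to the CDF, this estimate is non-negative everywhere and integrates to 1 by construction; the bandwidth controls how much the result is smoothed.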
