Finding an approximate CDF/PDF from a large data set

  • Context: Graduate
  • Thread starter: woodssnoop
  • Tags: Approximate Data Set
SUMMARY

This discussion focuses on the process of deriving a Probability Density Function (PDF) from a dataset of 1000 energy calculations for random orientations of a molecule in a chemical system. The user seeks to plot the Cumulative Distribution Function (CDF) and fit it to a function, f(x), to obtain the PDF through differentiation. Key points include the necessity of smoothing the CDF for accurate PDF representation, the importance of normalization, and the potential use of Chi-squared tests to validate distribution assumptions. Kernel density estimation is also suggested as an alternative if standard distributions do not fit the data.

PREREQUISITES
  • Understanding of Cumulative Distribution Functions (CDF) and Probability Density Functions (PDF)
  • Familiarity with curve fitting techniques and least-squares regression
  • Knowledge of statistical tests, specifically the Chi-squared test
  • Experience with kernel density estimation methods
NEXT STEPS
  • Learn about kernel density estimation techniques for non-parametric PDF estimation
  • Study the principles of least-squares regression for curve fitting
  • Research Chi-squared tests and their application in statistical analysis
  • Explore the characteristics of common distributions such as normal and exponential
USEFUL FOR

Data scientists, statisticians, and researchers working with statistical modeling and analysis of chemical systems will benefit from this discussion.

woodssnoop
I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find the PDF of a data set. The calculations were performed on a chemical system: the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx = PDF of the data set?

Thank you in advance for the help.
 
If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.
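To make that construction concrete, here is a minimal Python sketch of the empirical CDF using only the standard library. The `energies` list is a made-up stand-in for the 1000 computed energies (the real data is not given in this thread), and the name `empirical_cdf` is hypothetical.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return a function F(a) = (number of samples <= a) / n."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(a):
        # bisect_right counts how many sorted values are <= a
        return bisect.bisect_right(ordered, a) / n

    return F

# Hypothetical stand-in for the 1000 computed energies (arbitrary units).
random.seed(0)
energies = [random.gauss(-5.0, 1.2) for _ in range(1000)]

F = empirical_cdf(energies)
print(F(min(energies) - 1.0))  # 0.0, below every sample
print(F(max(energies)))        # 1.0, at or above every sample
```

Evaluated this way the estimate is a step function that jumps by 1/n at each data point; joining the points (x_i, F(x_i)) gives the straight line segments described above. Either way, differentiating it directly does not give a usable PDF, which is why some smoothing is needed.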
 
I know I should be able to do a least-squares regression on my data to smooth it out once I get it plotted, but I feel like my logic for plotting the CDF and finding the analytical CDF is flawed. Right now I am plotting P(x \leq a) = CDF = (number of data points with the value "a" or less)/1000, and the only function that seems to fit is a high-order polynomial (e.g. order \geq 20). A CDF of this type will never have the property that \int_{-\infty}^{\infty} f'(x) dx = 1.
 
woodssnoop said:
A CDF of this type will never have the property that \int_{-\infty}^{\infty} f'(x) dx = 1.

If you fit a function f(x) to a histogram of data whose range is A \leq x \leq B, then you must decide whether you believe the curve fit applies to values of x outside the range [A,B]. If you feel values outside that range are impossible, then you define f(x) = 0 for x < A and f(x) = 0 for x > B, so the integral of f(x) "from minus infinity to infinity" is just the integral of f(x) "from A to B".

Similarly, if you fit a function F(x) to a cumulative histogram of the data, you define F(x) = 0 for x < A and F(x) = 1 for x > B. You must use a curve-fitting method that produces an F(x) that is non-decreasing and never exceeds 1.

Unless you are using a curve fit that passes a curve f(x) exactly through each point on the histogram, you will still have to "normalize" the function f(x) in order to have a PDF. If you decide the possible data values are in the range from A to B then divide f(x) by \int_A^B f(x) dx to get a PDF.
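As a rough illustration of that normalization step, the sketch below divides a fitted curve by its integral over [A, B] and sets it to zero outside that range. The curve `fitted`, the limits `A` and `B`, and the helper names `trapezoid` and `normalize` are all hypothetical, chosen only so the example runs; the integral is approximated with a plain trapezoid rule.

```python
import math

def trapezoid(func, a, b, steps=2000):
    """Approximate the integral of func over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * h)
    return total * h

def normalize(func, a, b):
    """Return a PDF equal to func/area on [a, b] and zero outside."""
    area = trapezoid(func, a, b)

    def pdf(x):
        return func(x) / area if a <= x <= b else 0.0

    return pdf

# Hypothetical unnormalized curve fit and data range, for illustration only.
A, B = -9.0, -1.0

def fitted(x):
    return 3.7 * math.exp(-0.5 * ((x + 5.0) / 1.2) ** 2)

pdf = normalize(fitted, A, B)
print(trapezoid(pdf, A, B))  # close to 1 after normalization
```

This only fixes the total area. It does not repair a fit that goes negative somewhere in [A, B], so the fitted f(x) should be checked for that separately.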
 
Alright, thanks for the help. I think I got it.
 
I might be wrong, but I think you need to do a chi-squared test. First plot the CDF to figure out approximately what distribution you have. You might have a common distribution (normal or exponential), in which case there is no point fitting a function and creating a distribution of your own. You can check Wikipedia to see what the CDFs of these distributions look like.

If the CDF looks like a known distribution, find the mean and variance and write down the PDF with those parameters. Then do a chi-squared test at a chosen significance level to check the fit.

Again, I could be wrong.
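A minimal sketch of that check for the normal case, using only the Python standard library, might look like the following. The data are synthetic stand-ins for the real energies, the function name `chi_squared_normal` is hypothetical, and the choice of 10 equal-probability bins is an assumption made for the example. The critical value 14.067 is the tabulated 5% chi-squared value for 7 degrees of freedom.

```python
import random
import statistics

def chi_squared_normal(samples, bins=10):
    """Chi-squared statistic for a normal fit, using equal-probability bins."""
    n = len(samples)
    # Fit the normal by matching the sample mean and standard deviation.
    fit = statistics.NormalDist(statistics.fmean(samples),
                                statistics.stdev(samples))
    # Interior bin edges chosen so each bin has probability 1/bins under the fit.
    edges = [fit.inv_cdf(k / bins) for k in range(1, bins)]
    observed = [0] * bins
    for x in samples:
        # The bin index is the number of edges lying below x.
        observed[sum(x > e for e in edges)] += 1
    expected = n / bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = bins - 1 - 2  # two parameters (mean, standard deviation) were estimated
    return stat, dof

# Hypothetical stand-in for the 1000 computed energies (arbitrary units).
random.seed(1)
energies = [random.gauss(-5.0, 1.2) for _ in range(1000)]

stat, dof = chi_squared_normal(energies)
CRITICAL_5_PERCENT = 14.067  # tabulated value for 7 degrees of freedom
verdict = "reject normal" if stat > CRITICAL_5_PERCENT else "no evidence against normal"
print(stat, dof, verdict)
```

A statistic below the critical value does not prove the data are normal; it only means this test found no evidence against it at the 5% level.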
 
If none of the standard distributions fit your data then another option is kernel density estimation.
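For reference, a bare-bones Gaussian kernel density estimate can be written in a few lines of standard-library Python. This is a sketch under assumptions: the data are synthetic stand-ins for the real energies, the name `gaussian_kde` is hypothetical here, and the default bandwidth uses the normal-reference rule of thumb 1.06 * s * n^(-1/5), which is only one of several common choices.

```python
import math
import random
import statistics

def gaussian_kde(samples, bandwidth=None):
    """Return a PDF estimate built from one Gaussian bump per data point."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not the only option.
        bandwidth = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return pdf

# Hypothetical stand-in for the 1000 computed energies (arbitrary units).
random.seed(0)
energies = [random.gauss(-5.0, 1.2) for _ in range(1000)]

pdf = gaussian_kde(energies)
for x in (-8.0, -5.0, -2.0):
    print(x, pdf(x))
```

Because each kernel integrates to 1 and the sum is divided by n, the estimate is non-negative and integrates to 1 by construction, so no separate normalization step is needed. The result does depend noticeably on the bandwidth, so it is worth trying a few values.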
 
