Finding an approximate CDF/PDF from a large data set

  • Context: Graduate 
  • Thread starter: woodssnoop
  • Tags: Approximate Data Set

Discussion Overview

The discussion revolves around methods for approximating the cumulative distribution function (CDF) and probability density function (PDF) from a large data set, specifically in the context of a chemical system with energy calculations for random molecular orientations. Participants explore various techniques for fitting and smoothing data to derive these functions.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant questions whether fitting a function to the plotted CDF and differentiating it will yield the PDF, expressing uncertainty about their approach.
  • Another participant suggests that a straightforward plot of the CDF will consist of straight line segments and emphasizes the need for smoothing to obtain a decent PDF.
  • A participant expresses concern that fitting a high-order polynomial to the CDF may not satisfy the normalization condition required for a PDF.
  • There is a discussion about the necessity of defining the fitted function to be zero outside the range of the data, which impacts the normalization of the PDF.
  • One participant proposes using a Chi-squared test to determine if the data fits a known distribution before attempting to create a custom distribution.
  • Another participant mentions kernel density estimation as an alternative if standard distributions do not fit the data.

Areas of Agreement / Disagreement

Participants express differing views on the best approach to fitting the CDF and deriving the PDF, with no consensus reached on a single method. There is recognition of the need for normalization and the potential use of statistical tests, but opinions vary on the specifics of implementation.

Contextual Notes

Participants highlight limitations regarding the assumptions made in fitting functions to the data, the potential for high-order polynomial fits to violate properties of PDFs, and the need for careful consideration of the data range when defining fitted functions.

Who May Find This Useful

This discussion may be useful for researchers and practitioners in fields involving statistical analysis of data, particularly those working with empirical data sets in chemistry or related disciplines.

woodssnoop
I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system, and the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx = PDF of the data set?

Thank you in advance for the help.
 
If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.
 

I know I should be able to do a least-squares regression on my data to smooth it out once I get it plotted, but I feel like my logic for plotting the CDF and finding the analytical CDF is flawed. Right now I am plotting [itex]P(x \leq a)[/itex] = CDF = (number of data points with a value of a or less)/1000, and the only function that seems to fit is a high-order polynomial (e.g. order(f(x)) [itex]\geq[/itex] 20). A CDF of this type will never have the property such that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].
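The counting rule described above (number of data points with a value of a or less, divided by the sample size) can be sketched in a few lines of Python. This is only an illustration: the `energies` list is a hypothetical stand-in generated from a normal distribution, not the actual computed energies, and `make_ecdf` is a name chosen here.

```python
import bisect
import random


def make_ecdf(samples):
    """Return a function a -> fraction of samples with value <= a."""
    ordered = sorted(samples)
    n = len(ordered)

    def ecdf(a):
        # bisect_right returns how many sorted values are <= a
        return bisect.bisect_right(ordered, a) / n

    return ecdf


# Hypothetical stand-in for the 1000 computed energies (assumption)
rng = random.Random(0)
energies = [rng.gauss(-5.0, 1.2) for _ in range(1000)]

ecdf = make_ecdf(energies)
print(ecdf(min(energies) - 1.0))  # 0.0, below every data point
print(ecdf(max(energies)))        # 1.0, at or above every data point
```

The result is a step function that is non-decreasing and stays between 0 and 1 by construction, so any smooth curve fitted to it should be checked against the same two properties.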
 
woodssnoop said:
A CDF of this type will never have the property such that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].

If you fit a function f(x) to a histogram of data whose range is [itex]A \leq x \leq B[/itex] then you must decide whether you believe the curve fit applies to values of x outside the range [A,B]. If you feel values outside that range are impossible then you define f(x) = 0 for [itex]x < A[/itex] and for [itex]x > B[/itex], so the integral of f(x) "from minus infinity to infinity" is only the integral of f(x) "from A to B".

Similarly, if you fit the function F(x) to a cumulative histogram of the data, you define F(x) = 0 for [itex]x < A[/itex] and F(x) = 1 for [itex]x > B[/itex]. You must use a curve fitting method that produces an F(x) that is a non-decreasing function, and it must never exceed 1.

Unless you are using a curve fit that passes a curve f(x) exactly through each point on the histogram, you will still have to "normalize" the function f(x) in order to have a PDF. If you decide the possible data values are in the range from A to B then divide f(x) by [itex]\int_A^B f(x) dx[/itex] to get a PDF.
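A minimal sketch of that normalization step, assuming the fitted curve is nonnegative on [A, B] and that values outside the range are treated as impossible. The curve `raw_fit`, the range [0, 2], and the helper names `trapezoid` and `normalize_on_range` are all hypothetical choices for illustration; the integral is approximated with a simple trapezoid rule.

```python
def trapezoid(func, a, b, steps=2000):
    """Approximate the integral of func over [a, b] with the trapezoid rule."""
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for i in range(1, steps):
        total += func(a + i * h)
    return total * h


def normalize_on_range(raw_fit, a, b):
    """Turn a nonnegative curve on [a, b] into a PDF that is zero outside [a, b]."""
    area = trapezoid(raw_fit, a, b)
    if area <= 0:
        raise ValueError("fitted curve must have positive area on [a, b]")

    def pdf(x):
        if x < a or x > b:
            return 0.0           # values outside the data range are ruled out
        return raw_fit(x) / area  # rescale so the area on [a, b] is 1

    return pdf


# Hypothetical fitted curve on [A, B] = [0, 2]; its area is 4/3, so it is not yet a PDF
A, B = 0.0, 2.0


def raw_fit(x):
    return x * (2.0 - x)


pdf = normalize_on_range(raw_fit, A, B)
print(round(trapezoid(pdf, A, B), 6))  # 1.0
```

The same rescaling applies whatever curve-fitting method produced `raw_fit`, as long as the fit does not go negative anywhere on the range.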
 
Alright, thanks for the help. I think I got it.
 
woodssnoop said:
I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system, and the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx = PDF of the data set?

Thank you in advance for the help.

I might be wrong, but you need to do a chi-squared test. First plot the CDF to figure out approximately what distro you have. You might have a common distro (normal, exponential), in which case there is no point fitting a function and creating a distro of your own. You can check the wiki to see what the CDFs of these distros look like.

If the CDF looks like a known distro, find the mean and variance of your data and get the PDF from those parameters. Then do a chi-squared goodness-of-fit test at a chosen significance level to verify it.

Again, I could be wrong.
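As a rough sketch of that procedure, the Python below fits a normal distribution by mean and standard deviation, bins the data into ten equal-probability bins, and computes the chi-squared statistic. Several things here are assumptions made for illustration: the candidate distribution is normal, the bin count is 10, the degrees of freedom are taken as bins minus 1 minus the 2 estimated parameters (a common convention), and both data sets are synthetic. The threshold 14.07 is the tabulated 5% critical value for 7 degrees of freedom.

```python
import bisect
import random
from statistics import NormalDist, fmean, stdev


def chi_squared_vs_normal(samples, bins=10):
    """Chi-squared statistic for a fitted normal, using equal-probability bins."""
    fitted = NormalDist(fmean(samples), stdev(samples))
    # Interior bin edges at the fitted quantiles 1/bins, 2/bins, ...
    edges = [fitted.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for x in samples:
        observed[bisect.bisect_right(edges, x)] += 1
    expected = len(samples) / bins  # same expected count in every bin
    statistic = sum((count - expected) ** 2 / expected for count in observed)
    dof = bins - 1 - 2  # two parameters were estimated from the data
    return statistic, dof


# Synthetic data sets (assumption): one bell-shaped, one strongly skewed
rng = random.Random(1)
bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(1000)]
skewed = [rng.expovariate(1.0) for _ in range(1000)]

CRITICAL_5PCT_7DOF = 14.07  # tabulated chi-squared value, 7 degrees of freedom

for name, data in (("bell_shaped", bell_shaped), ("skewed", skewed)):
    statistic, dof = chi_squared_vs_normal(data)
    verdict = "rejected" if statistic > CRITICAL_5PCT_7DOF else "not rejected"
    print(f"{name}: chi2 = {statistic:.1f}, dof = {dof}, normal fit {verdict}")
```

A statistic above the critical value suggests the candidate distribution is a poor description of the data; a statistic below it only means the test found no evidence against the fit.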
 
If none of the standard distributions fit your data then another option is kernel density estimation.
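For reference, a bare-bones Gaussian kernel density estimate can be written with the standard library alone. This is a sketch under stated assumptions: the kernel is Gaussian, the default bandwidth uses the normal-reference rule of thumb (1.06 times the sample standard deviation times n^(-1/5)), the data are synthetic, and `make_gaussian_kde` is a name chosen here. Libraries such as SciPy provide a ready-made version (`scipy.stats.gaussian_kde`).

```python
import math
import random
from statistics import stdev


def make_gaussian_kde(samples, bandwidth=None):
    """Return a PDF estimate built from one Gaussian bump per data point."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not a tuned choice
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


# Synthetic two-peaked data (assumption), which no single standard distro fits well
rng = random.Random(2)
data = [rng.gauss(-2.0, 0.5) for _ in range(500)]
data += [rng.gauss(1.5, 0.8) for _ in range(500)]

kde = make_gaussian_kde(data)
for x in (-2.0, 0.0, 1.5):
    print(f"estimated density at {x:+.1f}: {kde(x):.3f}")
```

Because each kernel integrates to 1 and the sum is divided by n, the estimate is nonnegative and integrates to 1 without a separate normalization step. The trade-off is the bandwidth: too small gives a spiky estimate, too large blurs real structure.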
 
