Finding an approximate CDF/PDF from a large data set

In summary: The thread discusses how to obtain a PDF from a data set by plotting its empirical CDF and fitting a function to it. The raw plot has to be smoothed, for example with a least-squares regression or with kernel density estimation, which places a kernel function (such as a Gaussian or triangular kernel) over each data point and sums the normalized kernels to give a smoothed PDF. It is also suggested to first check whether the data follow a known distribution, for instance with a chi-squared test, before constructing a custom PDF.
  • #1
woodssnoop
I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system: the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.
 
  • #2
If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.
 
  • #3
If I understand you correctly, you have 1000 data points, which can be used to approximate a CDF. A straightforward plot will give you a graph consisting of a series of straight line segments. To get a decent PDF you will need to smooth it out.

I know I should be able to do a least-squares regression on my data to smooth it out once I get it plotted, but I feel like my logic for plotting the CDF and finding the analytical CDF is flawed. Right now I am plotting [itex]P(x \leq a)[/itex] = CDF = (number of data points with a value of "a" or less)/1000, and the only function that seems to fit is a high-order polynomial (e.g. of order [itex]\geq 20[/itex]). A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].
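The counting rule described in this post, P(x ≤ a) = (number of points ≤ a)/N, can be sketched in a few lines. This is only an illustration using the Python standard library; the name `empirical_cdf` and the random stand-in data are assumptions, not the poster's actual code or energies.

```python
import bisect
import random

def empirical_cdf(samples):
    """Return a function F with F(a) = (number of samples <= a) / n."""
    ordered = sorted(samples)
    n = len(ordered)

    def F(a):
        # bisect_right gives the count of sorted values that are <= a
        return bisect.bisect_right(ordered, a) / n

    return F

# Hypothetical stand-in for the 1000 computed energies
random.seed(0)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]

F = empirical_cdf(energies)
print(F(min(energies) - 1.0))  # 0.0, below every data point
print(F(max(energies)))        # 1.0, at or above every data point
```

The result is a step function that is 0 below the smallest point and 1 from the largest point onward, which is why an unconstrained polynomial fitted over the whole real line cannot reproduce it.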
 
  • #4
woodssnoop said:
A CDF of this type will never have the property that [itex]\int_{-\infty}^{\infty} f'(x)\,dx = 1[/itex].

If you fit a function f(x) to a histogram of data whose range is [itex] A \leq x \leq B [/itex] then you must decide whether you believe the curve fit applies to values of x outside the range [A,B]. If you feel values outside that range are impossible then you define f(x) = 0 for [itex] x < A [/itex] and for [itex] x > B [/itex], so the integral of f(x) "from minus infinity to infinity" is only the integral of f(x) "from A to B".

Similarly, if you fit a function F(x) to a cumulative histogram of the data, you define F(x) = 0 for [itex] x < A [/itex] and F(x) = 1 for [itex] x > B [/itex]. You must use a curve-fitting method that produces an F(x) that is non-decreasing and never exceeds 1.

Unless you are using a curve fit that passes a curve f(x) exactly through each point on the histogram, you will still have to "normalize" the function f(x) in order to have a PDF. If you decide the possible data values are in the range from A to B then divide f(x) by [itex] \int_A^B f(x) dx [/itex] to get a PDF.
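A minimal sketch of the normalization step described above, assuming the fitted curve is non-negative on [A, B]. The helper name `normalize_on_interval`, the trapezoidal integration, and the example curve `fitted` are illustrative choices and do not come from the thread.

```python
def normalize_on_interval(f, A, B, steps=10_000):
    """Return a PDF proportional to f on [A, B] and zero elsewhere."""
    h = (B - A) / steps
    # Trapezoidal rule for the integral of f from A to B
    interior = sum(f(A + i * h) for i in range(1, steps))
    area = h * (0.5 * f(A) + interior + 0.5 * f(B))
    if area <= 0:
        raise ValueError("fitted curve must have positive area on [A, B]")

    def pdf(x):
        # Values outside [A, B] are treated as impossible
        return f(x) / area if A <= x <= B else 0.0

    return pdf

# Hypothetical fitted curve: an unnormalized bump on [0, 2]
def fitted(x):
    return x * (2.0 - x)

pdf = normalize_on_interval(fitted, 0.0, 2.0)
print(pdf(1.0), pdf(-1.0), pdf(3.0))
```

For this example the exact integral of x(2 - x) over [0, 2] is 4/3, so the normalized density at x = 1 should come out close to 0.75.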
 
  • #5
Alright, thanks for the help. I think I got it.
 
  • #6
woodssnoop said:
I am trying to reproduce the results of a colleague and I am having difficulty understanding how to find a PDF of a data set. The calculations were performed on a chemical system: the energy was calculated for 1000 random orientations of the molecule. Am I right in thinking that if I plot the CDF of the data set and fit the plotted CDF to a function, f(x), then df(x)/dx is the PDF of the data set?

Thank you in advance for the help.

I might be wrong, but I think you need to do a chi-squared test. First plot the CDF to get a rough idea of which distribution you have. You might have a common distribution (normal, exponential), in which case there is no point in fitting a function and creating a distribution of your own. You can check Wikipedia to see what the CDFs of these distributions look like.

If the CDF looks like that of a known distribution, estimate the mean and variance and write down the PDF with those parameters. Then do a chi-squared test at a chosen significance level to verify the fit.

Again, I could be wrong.
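The procedure suggested in this post (estimate mean and variance, then run a chi-squared test) might look roughly like the sketch below for the normal case. It uses only the Python standard library; the function name, the choice of 10 equally probable bins, the 5% level, and the two random test data sets are all assumptions made for illustration. The degrees of freedom are taken as bins - 1 - 2 because two parameters are estimated, which is the usual approximation.

```python
import random
from statistics import NormalDist, fmean, stdev

def chi_squared_vs_normal(samples, n_bins=10):
    """Pearson chi-squared statistic of samples against a fitted normal.

    Bins are equally probable under the fitted normal, so every bin has
    the same expected count n / n_bins.
    """
    n = len(samples)
    fitted = NormalDist(fmean(samples), stdev(samples))
    # Interior bin edges at the 1/n_bins, 2/n_bins, ... quantiles
    edges = [fitted.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in samples:
        # Bin index = number of interior edges lying below x
        observed[sum(1 for e in edges if e < x)] += 1
    expected = n / n_bins
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 1 - 2  # two parameters (mean, std) were estimated
    return statistic, dof

random.seed(1)
normal_like = [random.gauss(5.0, 2.0) for _ in range(1000)]  # hypothetical data
skewed = [random.expovariate(1.0) for _ in range(1000)]      # hypothetical data

CRITICAL_5PCT_DOF7 = 14.07  # tabulated chi-squared critical value, 7 dof, 5% level
for name, data in [("normal-like", normal_like), ("skewed", skewed)]:
    stat, dof = chi_squared_vs_normal(data)
    verdict = "reject" if stat > CRITICAL_5PCT_DOF7 else "do not reject"
    print(f"{name}: chi2 = {stat:.1f}, dof = {dof}, {verdict} normality at 5%")
```

A statistic above the critical value means the fitted normal is a poor description of the data at that level; a statistic below it only means the test found no evidence against the fit.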
 
  • #7
If none of the standard distributions fit your data then another option is kernel density estimation.
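As a rough sketch of kernel density estimation with a Gaussian kernel: each data point contributes a small normal bump, and the estimate is the average of those bumps. The function name `kde_gaussian` and the default bandwidth (a normal-reference rule of thumb, 1.06 · s · n^(-1/5)) are assumptions for illustration; the thread does not say which kernel or bandwidth to use, and the bandwidth choice strongly affects how smooth the result is.

```python
import math
import random
from statistics import stdev

def kde_gaussian(samples, bandwidth=None):
    """Return a density estimate built from one Gaussian bump per sample."""
    n = len(samples)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not from the thread
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf

random.seed(2)
energies = [random.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical data
pdf = kde_gaussian(energies)

# Numerical check (midpoint rule) that the estimate integrates to roughly 1
lo, hi, steps = min(energies) - 5.0, max(energies) + 5.0, 1000
h = (hi - lo) / steps
area = h * sum(pdf(lo + (i + 0.5) * h) for i in range(steps))
print(f"area under the estimate: {area:.4f}")
```

Because every kernel integrates to 1 and the bumps are averaged, the estimate is automatically non-negative and normalized, so no separate normalization step is needed. Library versions exist as well, for example `scipy.stats.gaussian_kde`.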
 

Related to Finding an approximate CDF/PDF from a large data set

1. How do you find the approximate CDF/PDF from a large data set?

To find the approximate CDF/PDF from a large data set, you can follow these steps:
1. Sort the data in ascending order.
2. Assign ranks to each data point, starting from 1 for the smallest value.
3. Divide each rank by the total number of data points to get the fraction of points at or below that value (the empirical CDF).
4. Plot the CDF against the corresponding data points to get an approximation of the CDF curve.
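5. To get the PDF, smooth or bin the CDF curve and then take its derivative; the raw empirical CDF is a step function, so differentiating it directly is not useful.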
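The steps above can be sketched as follows. Differencing the empirical CDF over bins rather than point by point is one simple way to make step 5 workable, and it is equivalent to a normalized histogram. The function name `binned_pdf`, the bin count, and the random data are assumptions for illustration.

```python
import bisect
import random

def binned_pdf(samples, n_bins=20):
    """Sort, build the empirical CDF at bin edges, and difference it."""
    ordered = sorted(samples)                      # step 1
    n = len(ordered)
    lo, hi = ordered[0], ordered[-1]
    if hi == lo:
        raise ValueError("all samples are identical; no density to estimate")
    width = (hi - lo) / n_bins
    edges = [lo + k * width for k in range(n_bins + 1)]
    edges[-1] = hi                                 # guard against rounding
    # steps 2-3: fraction of points at or below each edge
    cdf = [bisect.bisect_right(ordered, e) / n for e in edges]
    cdf[0] = 0.0                                   # put the minimum in the first bin
    # step 5: finite-difference derivative of the CDF over each bin
    density = [(cdf[k + 1] - cdf[k]) / width for k in range(n_bins)]
    return edges, density

random.seed(3)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical data
edges, density = binned_pdf(data)
bin_width = edges[1] - edges[0]
print(f"total probability: {sum(d * bin_width for d in density):.4f}")
```

Fewer bins give a smoother but coarser estimate, and more bins give a noisier one, which is the same trade-off as the bandwidth in kernel density estimation.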

2. Why is it important to find the CDF/PDF from a large data set?

Finding the CDF/PDF from a large data set allows us to understand the distribution of the data and make predictions about future data points. It also helps in identifying outliers and understanding the overall pattern of the data.

3. What are the assumptions made when finding an approximate CDF/PDF from a large data set?

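The empirical CDF itself assumes only that the data points are independent samples from the same distribution. Fitting a parametric form additionally assumes that the data follow that distribution, such as a normal distribution. If this assumption is not met, the resulting CDF/PDF may not accurately represent the data.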

4. Can you use any statistical software to find the approximate CDF/PDF?

Yes, there are many statistical software programs that have built-in functions for finding the CDF/PDF of a data set. Some examples include R, Python, and MATLAB.

5. How does the sample size affect the accuracy of the approximate CDF/PDF?

The larger the sample size, the more accurate the approximate CDF/PDF will be. This is because as the sample size increases, the data points are more representative of the entire population, leading to a more accurate estimation of the CDF/PDF curve.
