# Kernel density estimation for non-I.I.D. data

In summary, the conversation discusses the use of Kernel Density Estimation (KDE) for analyzing data on the spatial distribution of debris pieces. The speaker is unsure if KDE is appropriate due to the possible non-identical distribution of the debris pieces. Other methods, such as using a grid of volume cells and calculating mean number density, are also considered. The conversation concludes with a suggestion to treat the "something of value" as a standard particle for analysis.
belliott4488
I am wondering if Kernel Density Estimation (KDE) is appropriate for some data analysis I'm working on. I have a simulated process that produces a large number N of pieces of debris, and I want to know how these objects are distributed spatially. In other words, I'd like to estimate a density function for the number of debris pieces per unit volume.

I can treat the debris pieces as being non-interacting, so I think it's safe to treat each one's location at any time as independent of all the others. If it were also true that all the pieces were identically distributed, then I think it would be okay to treat their positions at a time t as independent trials of one random variable and to apply the techniques of KDE to estimate a density function at each time t (?).

My concern is that the debris pieces are possibly (probably?) not identically distributed; in other words, they will have varying physical characteristics that will most likely affect the way they are distributed, e.g. heavy fragments might follow different trajectories than lighter ones.

If that is the case, is it valid to apply KDE to a single debris sample? Would I have to use a different kernel function for different fragments (in principle)?

In fact, I will have multiple samples (the output of a Monte Carlo simulation), so I believe I can combine the results from different MC realizations using techniques suitable for I.I.D. data, but I'm still not sure how to treat the results from each individual run. Different numbers of debris pieces will typically be produced by different MC runs, so there's no way to identify the same fragment from run to run or to treat its positions as multiple trials of the same variable.

Thanks for any suggestions.

SW VandeCarr
As far as I know, kernel density estimation is a non-parametric method which generally assumes i.i.d. samples of a random variable. What exactly is your random variable: particle weight, distribution of particles over an area or both? Are you assuming a Gaussian distribution? You will need to specify some kind of weighted categorical classification schema for the debris particles before you can apply a smoothing function.

SW VandeCarr said:
As far as I know, kernel density estimation is a non-parametric method which generally assumes i.i.d. samples of a random variable. What exactly is your random variable: particle weight, distribution of particles over an area or both? Are you assuming a Gaussian distribution? You will need to specify some kind of weighted categorical classification schema for the debris particles before you can apply a smoothing function.
The only variable I'm concerned with is particle number density as a function of position and time. This is for safety analysis, and the safety requirements are specified in terms only of the probability of collision (between a debris fragment and something of value) with debris of any size over some minimum threshold. Size and mass of the debris fragments don't matter as far as the probability calculation is concerned.

I would probably assume a Gaussian distribution in the absence of any better knowledge. Perhaps I should find out if I can characterize the distributions of fragments as a function of some parameter I don't otherwise care about, like mass. The Monte Carlo simulation I used draws velocity vectors from known distributions for the debris fragments, and the distributions are mass-dependent, so maybe that would help.

What I'm completely unsure about is whether this is even do-able using KDE, since I would now be using different kernel distributions for different fragments. I could possibly divide my population into some number of mass bins and estimate the distributions separately ... but I think I'd be gaining more in complexity than I would in accuracy.

The brute-force alternative I've been planning to use is just to divide the volume of space in question into a grid of volume cells and simply to calculate a mean number density for each cell as a function of time - basically a 3-D histogram. I could then pass an object of known cross-sectional area through this grid and calculate the probability of collision. I'll have the usual problems of picking an optimal grid size, but since I'm potentially dealing with a large number of fragments (10^7), I think I can find one that will work.
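
The grid-of-cells idea above can be sketched as follows. This is only a minimal illustration, assuming NumPy and synthetic positions at a single time step; the function names `number_density` and `mean_number_density`, the box size, and the cell count are all hypothetical choices, not part of the simulation described in the post.

```python
import numpy as np

def number_density(positions, edges):
    """Fragments per unit volume in each cell of a 3-D grid.

    positions: (N, 3) array of fragment coordinates at one time step.
    edges: list of three 1-D arrays of cell edges along x, y, z.
    """
    counts, _ = np.histogramdd(positions, bins=edges)
    dx, dy, dz = (np.diff(e) for e in edges)
    # Volume of every cell, broadcast to the shape of `counts`.
    cell_volume = dx[:, None, None] * dy[None, :, None] * dz[None, None, :]
    return counts / cell_volume

def mean_number_density(runs, edges):
    """Average the per-run density grids over Monte Carlo runs.

    Runs may contain different numbers of fragments; only the grid is shared.
    """
    return np.mean([number_density(p, edges) for p in runs], axis=0)

# Synthetic stand-in for simulation output (assumption: not real debris data).
rng = np.random.default_rng(0)
runs = [rng.normal(0.0, 1.0, size=(n, 3)) for n in (100_000, 80_000, 120_000)]
edges = [np.linspace(-5.0, 5.0, 21)] * 3   # 20 x 20 x 20 cells, 0.5 units wide

density = number_density(runs[0], edges)
mean_density = mean_number_density(runs, edges)
```

Repeating this for each stored time step gives the time-dependent density field; the cell size plays the same role that the bandwidth plays in KDE.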

Any ideas for a better approach?

Thanks,
Bruce

belliott4488 said:
This is for safety analysis, and the safety requirements are specified in terms only of the probability of collision (between a debris fragment and something of value) with debris of any size over some minimum threshold. Size and mass of the debris fragments don't matter as far as the probability calculation is concerned.

What I'm completely unsure about is whether this is even do-able using KDE.

Then you're dealing with a dynamical system in 3 space, a classical Hamiltonian system. IMHO KDE is not the way to go.
The brute-force alternative I've been planning to use is just to divide the volume of space in question into a grid of volume cells and simply to calculate a mean number density for each cell as a function of time - basically a 3-D histogram. I could then pass an object of known cross-sectional area through this grid and calculate the probability of collision. I'll have the usual problems of picking an optimal grid size, but since I'm potentially dealing with a large number of fragments (10^7), I think I can find one that will work.
Thanks,
Bruce
If you assume the position-momentum vectors are random with a uniform (not Gaussian) distribution and assign unit mass to all particles, the problem would seem fairly straightforward. If you want the probability of collision per unit time, you will need to specify particle scalar speed and particle density. The probability of collision (any overlap of particle cross-sections up to and including 2r) is a direct function of particle density and the cross sectional radius.
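
One common way to make this dependence explicit is the standard kinetic-theory estimate. It is sketched here under assumptions that are not stated in the post: a locally uniform number density $n$, a constant relative speed $v$, spherical particles of radii $r_1$ and $r_2$, and collisions treated as a Poisson process over a time interval $\Delta t$.

$$
\sigma = \pi (r_1 + r_2)^2, \qquad
P(\text{at least one collision in } \Delta t) = 1 - e^{-n \sigma v \Delta t} \approx n\,\sigma\,v\,\Delta t
\quad \text{when } n \sigma v \Delta t \ll 1 .
$$

For two particles of equal radius $r$, the cross section becomes $\sigma = \pi (2r)^2$, which corresponds to the "up to and including 2r" overlap condition.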

EDIT: I'm not sure what parameters you want to assign to "something of value", but possibly you could initially treat it as just another standard particle of unit mass and cross section (I'm assuming all particles are spherical in the general case) and then modify your simulation according to some specific specifications.

SW VandeCarr said:
If you assume the position-momentum vectors are random with a uniform (not Gaussian) distribution and assign unit mass to all particles, the problem would seem fairly straightforward. If you want the probability of collision per unit time, you will need to specify particle scalar speed and particle density. The probability of collision (any overlap of particle cross-sections up to and including 2r) is a direct function of particle density and the cross sectional radius.

EDIT: I'm not sure what parameters you want to assign to "something of value", but possibly you could initially treat it as just another standard particle of unit mass and cross section (I'm assuming all particles are spherical in the general case) and then modify your simulation according to some specific specifications.
Okay, I need to provide a little more detail. I'm analyzing the results of collisions in space that produce a large number of debris fragments, and the risk imposed on orbiting satellites. I have a model for the debris production, which is what determines the distributions of mass, velocity vectors, size, etc. for the fragments (it's an empirical model that is tuned to match observations from ground tests, so I have no control over those distributions). The specifics of the collisions I'm investigating are such that the debris will re-enter the atmosphere after less than an orbit. For that reason, I can get away with a simple force model - just a simple gravity model.

I can propagate all the fragments until they have re-entered, but storing all of their position data gets to be very resource-intensive, especially when I look at a few hundred Monte Carlo simulations of the collision dynamics. That's why I want to reduce the problem to a time-dependent number density field that statistically summarizes the outcome of the simulation runs. I can then take a satellite of known cross-sectional area and orbit, and propagate it through the debris field and calculate the probability of collision.
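
The last step described above, propagating a satellite through the density field, might look like the following sketch. It assumes the number density has already been looked up at each sampled point of the satellite's path, that the relative speed between satellite and debris is known at those points, and that hits can be treated as a Poisson process. The function name `collision_probability` and all numerical values are hypothetical.

```python
import numpy as np

def collision_probability(times, densities, rel_speeds, area):
    """Probability of at least one hit along a sampled trajectory.

    times:      (T,) sample times in seconds.
    densities:  (T,) debris number density at the satellite position [1/m^3].
    rel_speeds: (T,) relative speed between satellite and debris [m/s].
    area:       satellite cross-sectional area [m^2].
    """
    times = np.asarray(times, dtype=float)
    # Instantaneous hit rate: density * swept area per second.
    rate = np.asarray(densities, dtype=float) * area * np.asarray(rel_speeds, dtype=float)
    # Trapezoidal integration of the rate gives the expected number of hits.
    expected_hits = np.sum(0.5 * (rate[1:] + rate[:-1]) * np.diff(times))
    # Poisson assumption: P(at least one hit) = 1 - exp(-expected_hits).
    return -np.expm1(-expected_hits)

# Illustrative numbers only: constant density over a 100 s passage.
times = np.linspace(0.0, 100.0, 101)
densities = np.full_like(times, 1e-8)    # fragments per cubic metre
rel_speeds = np.full_like(times, 1e4)    # metres per second
p_hit = collision_probability(times, densities, rel_speeds, area=10.0)
```

With these constant inputs the expected number of hits is 1e-8 × 10 × 1e4 × 100 = 0.1, so the result agrees with the closed form 1 - exp(-0.1).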

The question is whether there's a better way to characterize the density than the basic 3-D histogram approach of storing densities for a grid of volume cells.

SW VandeCarr said:
I still don't believe that KDE is useful here. I'm aware of its application with planar histograms, but not "3D histograms".
I'm coming to the same conclusion. I've seen papers suggesting that KDE approaches become more difficult in more than one dimension because of the need to choose some kind of "directionality" for both the kernel function and the associated bandwidth selection. In my case that's very likely to be an issue since my debris fragments are not uniformly distributed in space but rather move radially out from the center of mass.

SW VandeCarr said:
Thanks for the link; it certainly looks relevant, so I'll give it a good look.

belliott4488 said:
I can propagate all the fragments until they have re-entered, but storing all of their position data gets to be very resource-intensive, especially when I look at a few hundred Monte Carlo simulations of the collision dynamics.

If the fragments all originate approximately from a single point then perhaps you could precompute the subset of initial velocity vectors for which the trajectory crosses the orbit of the satellite - that could reduce the number of fragments you have to track to test for actual collision with the satellite.

belliott4488 said:
In my case that's very likely to be an issue since my debris fragments are not uniformly distributed in space but rather move radially out from the center of mass.

That's not the way fragments from a breakup of an orbiting object behave. They will follow in the orbit of the original object. As far as the calculation of individual vectors, it seems very labor intensive for little gain. The approach outlined in the linked paper is much simpler and is based on real data (which is probably much better in 2009). The most important variables are mean particle density and mean cross sectional radius for a given orbit level.

This assumes your main interest is the probability of collision over a given time interval for some orbital level.

EDIT: In my earlier posts, I did not know you were dealing with orbiting objects. Nevertheless the similarity to gas kinetics is still generally true for a given orbital level since any orbit must only lie in a plane which contains the orbited object's (ie: the Earth) center of mass. Otherwise an orbit is largely unconstrained (polar, equatorial, or anything in between.)

SW VandeCarr said:
That's not the way fragments from a breakup of an orbiting object behave. They will follow in the orbit of the original object.
I'm not sure what kind of breakup you have in mind, but I am talking about debris from collisions between two objects. In such cases, then yes, up to variations in momentum transfer, the center of mass of the cloud of debris from each object tends to follow the original trajectory of the object. As each cloud expands, however, the debris fragments move radially out from the C-O-M, as seen in the C-O-M frame of reference. That's all I was referring to when I said that the fragments are not distributed uniformly in space; there's a point of origin (where the collision occurred), and two sets of debris that move away from it, with lighter fragments typically moving faster than heavier ones.
SW VandeCarr said:
As far as the calculation of individual vectors, it seems very labor intensive for little gain. The approach outlined in the linked paper is much simpler and is based on real data (which is probably much better in 2009). The most important variables are mean particle density and mean cross sectional radius for a given orbit level.
The spatial density used in that paper is for the background debris density, that is, the steady-state density of long-lived orbital debris. I am dealing with a very different situation, namely very short-lived debris that is concentrated in a specific location at a specific time. In short, when planning for a specific event involving a collision of two objects, I want to know the risk posed to orbiting satellites. The background debris density is relevant only as a reference level that tells me at what point I may cease consideration of the additional debris created by my event.
SW VandeCarr said:
This assumes your main interest is the probability of collision over a given time interval for some orbital level.

EDIT: In my earlier posts, I did not know you were dealing with orbiting objects. Nevertheless the similarity to gas kinetics is still generally true for a given orbital level since any orbit must only lie in a plane which contains the orbited object's (ie: the Earth) center of mass. Otherwise an orbit is largely unconstrained (polar, equatorial, or anything in between.)
I doubt gas kinetics would apply in my case for the reasons I stated above. Specifically, the debris fragments typically will not collide with each other (all having originated at the same point), and thus will not behave like a gas at all. They're essentially following independent trajectories.

This is all somewhat off-topic, however. All I really wanted to know was if there is a better way to capture the time-dependent spatial density than simply to define a grid of volume cells and to count the number of fragments in each cell at any given time.

bpet said:
If the fragments all originate approximately from a single point then perhaps you could precompute the subset of initial velocity vectors for which the trajectory crosses the orbit of the satellite - that could reduce the number of fragments you have to track to test for actual collision with the satellite.
Unfortunately, the problem is a little more complicated than that. I need to consider all orbiting objects currently being tracked and to select a time for which a specific event that produces debris will not put any of those objects at risk of damage. Thus, the time at which my debris is produced is not known a priori (it's sort of what I'm ultimately seeking to find), nor can I say with certainty which satellite trajectories will be relevant.

Now, once I have identified the satellites that might pass through my debris cloud and propagate them through it, you're right - I need consider only those parts of the cloud that the satellite passes through. This is in the second part, however, i.e. the calculation of the probability of collision. I still need to have characterized the spatial density function before this step.

## 1. What is kernel density estimation for non-I.I.D. data?

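Kernel density estimation (KDE) is a non-parametric method used to estimate the probability density function of a random variable. In its standard form it assumes that the data points are independent and identically distributed (I.I.D.), that is, independent draws from one common distribution. Non-I.I.D. data violate that assumption, for example when the points are independent but drawn from different distributions. KDE can still be computed in that case, since it simply places a kernel function on each data point and sums them, but the result is better read as an estimate of the average density of the pooled points than as an estimate of any single underlying distribution.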
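
A short calculation shows what KDE targets when the points are independent but not identically distributed. The notation is introduced here for illustration and is not from the thread: $X_1, \dots, X_N$ are the data points in one dimension, $X_i$ has density $f_i$, $K$ is the kernel, and $h$ is the bandwidth.

$$
\hat f_h(x) = \frac{1}{N h} \sum_{i=1}^{N} K\!\left(\frac{x - X_i}{h}\right),
\qquad
\mathbb{E}\big[\hat f_h(x)\big]
= \frac{1}{N} \sum_{i=1}^{N} \int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) f_i(y)\, dy
= \int \frac{1}{h} K\!\left(\frac{x - y}{h}\right) \bar f(y)\, dy,
\qquad
\bar f = \frac{1}{N} \sum_{i=1}^{N} f_i .
$$

Under these assumptions the estimator is, on average, a smoothed version of the mean density $\bar f$. Multiplying by $N$ gives a smoothed expected number density, which may be all that is needed when only the total count per unit volume matters and not which point came from which distribution.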

## 2. How does KDE differ from other density estimation methods?

KDE differs from parametric methods in that it does not assume a particular functional form for the underlying distribution. This makes it flexible, as it can be applied to a wide range of data types and distributions without prior knowledge of their shape. A histogram is also non-parametric, but KDE produces a smooth estimate that does not depend on where the bin edges are placed; it depends instead on the choice of kernel and bandwidth.

## 3. What is a kernel function and how does it work in KDE?

A kernel function is a mathematical function that is used to smooth out the data in KDE. It acts as a weighting function, assigning greater weight to data points that are closer to the point being estimated and lower weight to points that are further away. This helps to create a smooth estimate of the underlying probability density function.
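
The weighting described above can be written out directly. The following is a minimal one-dimensional sketch with a Gaussian kernel, assuming NumPy; the function name `gaussian_kde_1d`, the synthetic sample, and the bandwidth value are illustrative choices only.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, h):
    """Evaluate a Gaussian kernel density estimate on `grid`.

    samples: (N,) data points.
    grid:    (M,) points at which to evaluate the estimate.
    h:       bandwidth (standard deviation of each Gaussian kernel).
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every sample.
    u = (grid[:, None] - samples[None, :]) / h
    # Gaussian kernel: nearby samples get large weight, distant ones little.
    weights = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return weights.sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, 500)        # synthetic data
grid = np.linspace(-6.0, 6.0, 601)
density = gaussian_kde_1d(samples, grid, h=0.4)

# The estimate should integrate to approximately one over the grid.
area = np.sum(density) * (grid[1] - grid[0])
```

Each sample contributes one small Gaussian bump; the estimate is the average of those bumps.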

## 4. Can KDE be used for high-dimensional data?

Yes, KDE can be used for high-dimensional data, but with limitations. As the dimensionality increases, the data become sparse relative to the volume they occupy (the curse of dimensionality), so far more samples are needed to reach the same accuracy. Bandwidth selection also becomes harder, because the kernel may need a different scale, or a different orientation, in each direction. In these cases, alternative methods may be more suitable.

## 5. How do I choose the appropriate bandwidth for KDE?

The bandwidth in KDE controls the amount of smoothing applied to the data. A larger bandwidth will result in a smoother estimate, while a smaller bandwidth will capture more of the variability in the data. The choice of bandwidth can greatly affect the performance of KDE, so it is important to choose an appropriate value. This can be done through trial and error or using cross-validation techniques to find the optimal bandwidth for the data.
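
One cross-validation approach is to score each candidate bandwidth by the leave-one-out log-likelihood and keep the best. The sketch below assumes a one-dimensional Gaussian kernel and NumPy; the function names, the candidate range, and the synthetic data are hypothetical, and other criteria (such as least-squares cross-validation) are equally possible.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of a Gaussian KDE with bandwidth h."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    u = (x[:, None] - x[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)
    # Exclude each point's own kernel so it cannot vouch for itself.
    np.fill_diagonal(k, 0.0)
    loo_density = k.sum(axis=1) / (n - 1)
    # Small floor avoids log(0) for isolated points at tiny bandwidths.
    return np.sum(np.log(loo_density + 1e-300))

def select_bandwidth(samples, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    scores = [loo_log_likelihood(samples, h) for h in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 400)            # synthetic data
candidates = np.geomspace(0.02, 3.0, 30)       # bandwidths to try
h_best = select_bandwidth(samples, candidates)
```

Very small bandwidths score poorly because held-out points fall between the narrow kernels, and very large ones score poorly because the estimate is oversmoothed, so the selected value lies between the extremes.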
