How to calculate the bandwidth for a Kernel density estimation

Click For Summary

Discussion Overview

The discussion centers around the calculation of bandwidth for kernel density estimation (KDE), particularly in the context of estimating the density of distances obtained from a laser scanner. Participants explore various methods for estimating bandwidth, including the plug-in method and the K-nearest neighbor approach, while also discussing the implications of bandwidth on density estimation accuracy.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant describes their struggle with estimating bandwidth for KDE and mentions methods like the mean integrated squared error (MISE) and the plug-in method.
  • Another participant references a paper detailing the "Solve the Equation" plug-in method, suggesting it as a potential resource.
  • Concerns are raised about the computational intensity of the plug-in method and the choice of software, with suggestions to avoid Excel for such analyses.
  • Discussion includes the potential use of maximum likelihood estimation (MLE) as an alternative to KDE, with some participants questioning the appropriateness of MLE for their specific data characteristics.
  • A participant proposes using the K-nearest neighbor approach to determine bandwidth, suggesting that the variance of selected distances could serve as the bandwidth for each distance.
  • There is mention of the importance of automatic bandwidth estimation and its significant influence on the resulting density estimation.

Areas of Agreement / Disagreement

Participants express differing views on the best methods for bandwidth estimation, with no consensus reached on the superiority of one approach over another. Some favor the plug-in method while others consider K-nearest neighbors or MLE as alternatives. The discussion remains unresolved regarding the optimal method for the specific application described.

Contextual Notes

Participants note the complexity of the bandwidth estimation process and the need for clarity on specific steps within the methods discussed, such as the computation of certain parameters in the plug-in method. There is also an acknowledgment of the limitations of various software tools in handling KDE.

Who May Find This Useful

Researchers and practitioners involved in statistical analysis, particularly those working with kernel density estimation in fields such as computer vision, data analysis, and applied mathematics.

Jack Miller
Messages
4
Reaction score
4
Hello everybody!

Since some time I am trying the estimate the density of a set of numbers (in my case the numbers are distances to some object from a laser scanner).

As I read, that the kernel density estimation technique is a basic approach for that kind of problem. Different Kernels can be applied, e.g. a Gaussian kernel.
However the Variance/Bandwidth must be estimated for a correct project of the density from the given data(-->here the set of distances)

There are some measures like the "mean integrated squared error". In that case a set of test bandwidth are calculated and the one with the lowest MISE is a good choice for the bandwidth.
However there are other methods like the "plug in" method which results to an optimal bandwidth. Does anybody knows how to use that "plug in" method or knows some other method? I just stucked and that point of estimated the optimal bandwidth.


Any help or comment, I would be really glad!
 
  • Like
Likes   Reactions: Gail Higginbottom
Physics news on Phys.org
This paper describes the "Solve the Equation" plug in method which differs from the direct plug in approach in that the pilot bandwidth is written as a function of the kernel bandwidth h. (see section 5.2).

http://www.umiacs.umd.edu/~vikas/publications/CS-TR-4774.pdf
 
  • Like
Likes   Reactions: Gail Higginbottom
Thanks a lot for your reply. I have read about the plug - in method. To be honest I couldn't completely figure out how to use it. For instance in the section you mentioned -- 5.2. -- on page 11. I am not sure how to compute step (1), how to estimate σ^. or what is H(.) in step 3?

By the way... Currently I am using a Gaussian kernel. As input I have a vector of about 100-150 double numbers.

Is there maybe some easier approximation available?
Or maybe some example?

I apologize, for those questions. I m trying since a week to figure out to calculate the bandwidth, even I tried with histogram smoothing which was not really successful.
 
  • Like
Likes   Reactions: Gail Higginbottom
From your original question, I got the impression that you were more familiar with the method. It's computationally intensive. What software are you running? My experience a while back, working in a multidisciplinary group, was with Minitab. I'm sure there are other good packages for this (SAS), but stay away from Excel. It has a bad reputation according to some experienced participants in this forum.

I assume you have some reasons for using KDE instead of parametric methods. However, if all you want to do is find the most likely distribution for a data set, MLE works pretty well unless your data has a lot of noise.

One motivation for KDE is for smoothing noisy data. A forerunner equation for determining the distribution of some random variable is:

dX_t=\mu (X_t) dt + \sigma(X_t)dW_t where W_t is Standard Brownian Noise.

You should have seen this equation if you are attempting KDE:

f_h(x)=\frac{1}{nh}\sum_{i=1}^{n} K \frac{(x-x_i)}{h}

where K is your kernel density function which integrates to 1. In your case it is Gaussian.

h is your bandwidth which is chosen to optimize between smoothness (larger) and reduced bias (smaller). Sigma is your standard deviation parameter. If I remember correctly, your software should let you vary the bandwidth and display the resulting plot with the MSE.

My google searches have revealed a lot material on this. We can't give tutorials on a subject like KDE but we might be able to help with specific questions. However, you don't need to understand the details of the computations to use the software.
 
Last edited:
  • Like
Likes   Reactions: scottdave and (deleted member)
Thanks a lot for the answer.

There are many toolboxes in Matlab or in Python. Even I tried to use some applets.
It has shown that the bandwidth is a major factor that influences the estimated density of a distribution. Moreover the bandwidth estimation must run automatically.

I am suppose to use C++. I would like to give a short description about the project:
It is about object recognition in 3d. For each object I generate a 3d mesh which consists a set of nodes.
The nodes are connected by edges. Afterwards I compute for each node the shortest distances to all other nodes (all-pair-shortest-distance).
Then I compute for each node the mean distance (a kind of centrality measure for that node). This set of mean distances from all nodes
characterizes the object.
Then it is possible to compare sets of distances from different object to differentiate them.
I am follwing the approach which is described here: http://graphics.stanford.edu/courses/cs468-08-fall/pdf/hamza-krim.pdf .

In order to compare the set from distances from different objects, the density of distances from each object is estimated and then compared.
I am wanted to use KDE as also proposed in the paper(link).
However the optimal bandwidth in order to compute the a density distribution of the distances is a crucial parameter.

But you mentioned MLE. Does it mean, that I just compute the expected value and variance from the given distances of an object?
For smoothing you mentioned to use forerunner equation (Brownian noise).

I read that there is K-nearest neighbor approach (http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver2_5.html). As I understood for each distance, I should compute the K-nearest neighbors and then compute the variance of these selected distances.
This variance is the bandwidth for this distance. So in your KDE equation above: there will be not h, but a h_i which corresponds to the variance for K-nearest neighbors of x_i.
Do you agree?I am glad for the feedback you already gave me!
 
  • Like
Likes   Reactions: Gail Higginbottom
Jack Miller said:
Thanks a lot for the answer.

There are many toolboxes in Matlab or in Python. Even I tried to use some applets.
It has shown that the bandwidth is a major factor that influences the estimated density of a distribution. Moreover the bandwidth estimation must run automatically.

Good. This would seem to solve your problem in terms of optimizing h.


.
Then it is possible to compare sets of distances from different object to differentiate them.
I am follwing the approach which is described here: http://graphics.stanford.edu/courses/cs468-08-fall/pdf/hamza-krim.pdf .

This is a fascinating application. Various 3D objects will have distinctive geodesic shape distributions.

In order to compare the set from distances from different objects, the density of distances from each object is estimated and then compared.
I am wanted to use KDE as also proposed in the paper(link).
However the optimal bandwidth in order to compute the a density distribution of the distances is a crucial parameter.

Of course.

But you mentioned MLE. Does it mean, that I just compute the expected value and variance from the given distances of an object?

I don't think so. After reading your link, it's clear you should use KDE.

For smoothing you mentioned to use forerunner equation (Brownian noise).

No. That equation is not very useful for direct application of KDE. With a Gaussian kernel the KDE equation (also given in both the Raykar paper and in your link) becomes:

\hat p(x)=\frac{1}{N\sqrt{2\pi h^2}} \sum_{i=1}^{N}e^{-(X-x_{i})^{2}/2h^2}

I read that there is K-nearest neighbor approach (http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver2_5.html). As I understood for each distance, I should compute the K-nearest neighbors and then compute the variance of these selected distances.
This variance is the bandwidth for this distance. So in your KDE equation above: there will be not h, but a h_i which corresponds to the variance for K-nearest neighbors of x_i.
Do you agree?

I don't follow. You're summing over all i in order to obtain an estimation. You are going to have only one selected bandwidth for a given estimation \hat p(x). Your aim is to minimize the integrated MSE for selections of h optimized for smoothness. From what you said above, your program should do this for you, or at least present you with choices.

intMSE(\hat p,p)=E[\int_R (\hat p(x)-p(x))^2 dx].

The Raykar link I posted is very good IMO. You should take the time to try to read it all even if you may not have the background to follow all of it. For example, do you know what a characteristic function (or Fourier transform of a PDF) is? If you have the right software for doing KDE you really don't need to worry about the computational details. The programs I remember displayed plots in evolution and calculated MSEs for you.

EDIT: Part of the problem I think you're having is evident in in your question. How do you "calculate" bandwidth? There is no necessarily "correct" bandwidth; only more or less optimal bandwidths given the objective.
 
Last edited:
  • Like
Likes   Reactions: Gail Higginbottom
Hi,

I read the paper and tried to understand, especially section 5.2. . But all details of the bandwidth estimation are not completely clear for me. However this is a requirement if I need to estimate the bandwidth by myself.

As mentioned in the previous post, I found some sources which are computing the bandwidth. Unfortunately, they are applets or libraries in python, Matlab or R. I didn't find anything in C/C++ which is necessary for me, since the system which I have to integrate my part(object recogniton) is in C++.

I found on the Raykar website (http://www.umiacs.umd.edu/~vikas/Software/optimal_bw/optimal_bw_code.htm) some code in C++(http://www.umiacs.umd.edu/~vikas/Software/optimal_bw/source_code.zip).

It seems very promising! : "FAST OPTIMAL BANDWIDTH ESTIMATION FOR UNIVARIATE KDE - CODE" :-)

This is the header which provides a single function (which I think implements the approach of the paper you posted!):"class UnivariateDensityDerivative{

public:

//constructor

UnivariateDensityDerivative(int NSources,

int MTargets,

double *pSources,

double *pTargets,

double Bandwidth,

int Order,

double *pDensityDerivative);
//destructor

~UnivariateDensityDerivative();
//function to evaluate the Density Derivative

void Evaluate();
//function to evaluate the Hermite polynomial.

double hermite(double x, int r);
private:

int N; //number of sources.

int M; //number of targets.

double *px; //pointer to sources, (N).

double *py; //pointer to the targets, (M).

double h; //the source bandwidth.

int r; //the rth density derivative.

double *pD; //pointer the the evaluated Density Derivative, (M).



}; "

I am not 100% sure how to use that function.
The function requires the sources points and targets points. I suppose that these points are used to evaluate the accuracy of the density estimation.
Moreover I think, these points represent the "distances" in my context(see previouse posts).
Also sources and targets points are from the same distribution; maybe I can even use the same points for source and target in the function.

There is a bandwidth parameter. It seems that I need to provide a bandwidth, so this function will not provide an optimal bandwidth, rather I think it will return a kind of fitness of the provided bandwidth.
But how is the fitness represented? Maybe the last parameter "pDensityDerivative" does this. It is are kind of "return" parameter as I understood when I had a look at the code. But how can I interprete it? I couldn't find the answer for it.

I am so thankful and appreciate it, that you are trying to help me and I mean it!
 
Last edited by a moderator:
  • Like
Likes   Reactions: Gail Higginbottom
Jack Miller said:
Hi,

I read the paper and tried to understand, especially section 5.2. . But all details of the bandwidth estimation are not completely clear for me. However this is a requirement if I need to estimate the bandwidth by myself.


It seems very promising! : "FAST OPTIMAL BANDWIDTH ESTIMATION FOR UNIVARIATE KDE - CODE" :-)


There is a bandwidth parameter. It seems that I need to provide a bandwidth, so this function will not provide an optimal bandwidth, rather I think it will return a kind of fitness of the provided bandwidth.
But how is the fitness represented? Maybe the last parameter "pDensityDerivative" does this. It is are kind of "return" parameter as I understood when I had a look at the code. But how can I interprete it? I couldn't find the answer for it.

I am so thankful and appreciate it, that you are trying to help me and I mean it!

You need not have copied the program because I really can't help you with it. I'm not programmer. It's obvious that you cannot do this kind of asymptotic estimation by hand, so you will need to find a way to compute optimal h. All I can tell you is that Rykar defines a series of estimators beginning with equation 32 that lead to a solution with zero bias. However you may not want zero bias at the cost of smoothing depending on your data set and requirements for applications of your solution for related problems. I'm sorry I couldn't help you more, but this is a computing problem for you, not so much a math problem.

I might suggest you send a private message to one of the moderators to move your question to the Computer Forum. DO NOT REPOST this question there yourself however. That's a violation of PF rules and you may be cited with an infraction. The moderators' names appear at the bottom of the page in each forum (click on the name and then on "Contact Info'). I any case, I can't promise that they can do any better for you than you have done for yourself by finding the program you have but someone might be able to answer your questions regarding it.
 
Last edited:
  • Like
Likes   Reactions: Gail Higginbottom

Similar threads

  • · Replies 23 ·
Replies
23
Views
4K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 10 ·
Replies
10
Views
6K
  • · Replies 2 ·
Replies
2
Views
2K
  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 4 ·
Replies
4
Views
2K
Replies
2
Views
2K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 0 ·
Replies
0
Views
3K
  • · Replies 12 ·
Replies
12
Views
3K