# Data smoothing/filtering

Hi everyone,

The task is to process some 1D data which represents a sum of a useful signal and
a background (I've attached a few typical samples).
The data are in fact the amplitudes of the Fourier spectra of some signal.
The goal is to separate the background from the useful signal at each frequency.
The common feature in these data is that in each case the amplitude of the background
signal is a smooth function of frequency (n).
The red lines show an approximate boundary between the two contributions in each case.
As one can see from the plots, the useful signal exists only at low
frequencies, whereas at higher frequencies only the background exists.
Can anyone give good advice on how to separate the background from this particular type of data?
Accuracy is of great importance, and processing speed is not a crucial parameter;
any method (data smoothing, extrapolation, filtering, etc.) can be used.


chiro
Hey SuchMuch and welcome to the forums.

For this problem I'm assuming that you have only used your lower frequencies for transmission of data and that the noise is 'mixed in' with the data at these frequencies.

Usually in data transmission we assume that we have a particular noise model and for many purposes, the model used to describe noise is the White-Noise model.

http://en.wikipedia.org/wiki/White_noise

Now of course with your model, you have only used a finite frequency band for your data, so we will assume then that you have a flat probability distribution across the low band of the spectrum which is where your data is located.

For one I need to know if there is any kind of internal structuring of your data.

To put this into context, what is usually done when data is sent across noisy channels is that it is structured in a certain way so that it can be corrected. The actual way of structuring the data depends on the structure of the data itself and the noise properties of the channel.

For example, if you know that the noise has a maximum amplitude in each band, you can scale the signal amplitude by some factor; then, if noise is added, you subtract the right amount from your signal and divide by the constant, and that will give you the right amplitude. I'm assuming in this model that you have allocated your data in a 'quantum' manner, where each interval corresponds to one value.
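A minimal sketch of the 'quantum' allocation idea above, under the assumption that values sit on a grid with known spacing and the noise amplitude stays below half that spacing (all numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical illustration: data values are allocated on a "quantum" grid with
# spacing `step`, and the channel noise amplitude is known to stay below step/2.
# Rounding each received value to the nearest grid level then removes the noise.
step = 0.5                                    # assumed quantisation step
sent = np.array([0.0, 1.5, 1.0, 2.5])         # true values, multiples of step
noise = np.array([0.2, -0.15, 0.1, -0.2])     # bounded noise, |noise| < step/2
received = sent + noise

recovered = np.round(received / step) * step  # snap back to the nearest level
```

This only works when the noise bound is known and respected, which is exactly the kind of channel knowledge being asked about.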

The other way is that you can create redundancy between the different frequencies; if there is an error, then depending on its nature, you can use an error-correction protocol to recover not only what the error is, but also how to correct it and get the uncorrupted data back.

These protocols are built depending on the data itself and the noise model for the channel. You need to tell us this, because if you have not taken it into account and have just sent the data in a limited band with no structural engineering to account for the noise, then you will be at a loss when you try to recover the noiseless form of the data in a general situation.

The other thing to consider, besides the structure of the data, is whether the data is conditionally structured: if it is, like for example a sentence in a human language, then you can incorporate Markovian models that help recover errors using the nature of the data itself.

Thanks, chiro!

I'm not really a specialist in this area, so I'm not sure if I understood everything properly.

The data I'm dealing with are the result of a simulation of light scattering from different objects.
The data I analyze (the plots I've shown) are the Fourier spectra of the scattered signal.
These are statistically averaged data that have converged to their average values.

In the general case, I have no a priori information about the objects from which the scattered signal was obtained.
Without going into detail, this means that no a priori information about the amplitudes of the
useful signal and the background is available. However, due to the specific model I use, the useful signal normally exists only at low frequencies,
and this seems to be the only information available.

For some cases I was able to introduce a statistical model for estimating the background signal,
but not for the general case (too complicated, and perhaps not worth the effort). I also do not think that the white-noise
model can be adopted in this case.

Empirically, looking at different spectra obtained for different objects, I can definitely say the following:

1) One thing about the structure of the data: normally, a single dataset contains a few hundred spectra.
They are ordered, and looking at subsequent spectra one can say that the magnitude of the background varies very slowly
from one spectrum to the next, compared to the useful signal (I've attached a few more plots). Maybe this information can be used somehow.
As I said, there is no other information about internal structuring within each separate spectrum. In addition, I cannot introduce any structuring,
since in reality this will be an unknown measured signal.

2) The amplitudes of both the background and the useful signal are different for different objects (different data sets). Since in reality
I would not have any a priori information about the object, I cannot estimate the level of the background or useful signal in advance.

3) Looking at different spectra, I can see that the background signal is a smooth function of frequency. This is always true.
The red curve on my plots approximately indicates the level of the background. However, the shape (or envelope) of this curve can
differ between spectra (as you can see from the plots, sometimes it has a Gaussian-like shape, sometimes it is almost linear).
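A minimal sketch of how observation (1) might be exploited: since the background changes slowly from one spectrum to the next while the useful signal changes quickly, a running median across neighbouring spectra tends toward the background. The window size and the random placeholder data below are assumptions, not part of the real dataset:

```python
import numpy as np

# Axis 0 = spectrum index (ordered, background varies slowly along it),
# axis 1 = frequency index n. A sliding median over adjacent spectra
# suppresses the rapidly varying useful component at each frequency.
rng = np.random.default_rng(0)
n_spectra, n_freq, half = 200, 100, 5       # assumed dataset dimensions / window
spectra = rng.random((n_spectra, n_freq))   # placeholder for the real data

background_est = np.empty_like(spectra)
for i in range(n_spectra):
    lo, hi = max(0, i - half), min(n_spectra, i + half + 1)
    background_est[i] = np.median(spectra[lo:hi], axis=0)
```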

Hope this can help.

If you know the useful data is below some cut-off frequency and that the background is a smooth curve you can model the background and fit it to all data points above that cut-off frequency. Then you could subtract this best-fit curve from all your data points to get some idea of what the spectrum looks like with the background subtracted.

The question is, what curve can you use? With no a priori data, you might have to test a few different models and use the best one.
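A minimal sketch of this fit-and-subtract idea, with a low-order polynomial standing in for "some smooth curve"; the synthetic spectrum and the cut-off index are assumptions, only there to make the example runnable:

```python
import numpy as np

# Fit a smooth model to the points above an (assumed, manually chosen)
# cut-off frequency, then subtract the fitted curve from the whole spectrum.
n = np.arange(200)
background = 50.0 * np.exp(-n / 80.0)                       # smooth, slowly decaying
signal = np.where(n < 40, 30.0 * np.sin(n / 3.0) ** 2, 0.0) # low-frequency only
spectrum = background + signal

cutoff = 40                                          # assumed cut-off index
coeffs = np.polyfit(n[cutoff:], spectrum[cutoff:], deg=3)
background_fit = np.polyval(coeffs, n)
cleaned = spectrum - background_fit                  # background-subtracted
```

The residual above the cut-off shows how well the chosen model matches the background; below the cut-off the quality depends on how well the model extrapolates.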

Thanks, MikeyW. I'm doing something similar at the moment. For many spectra in the set even a linear approximation gives good results. However, it looks like in the general case one has to use different curves for different spectra. I tried to fit polynomials of different orders, which did not work very well; still, it might be a good idea to try some other curves as well. I forgot to mention that the cut-off frequency is also not the same for every spectrum in the set. Therefore, this approach requires some "manual" work, i.e. approximate determination of the cut-off frequency for each spectrum. Luckily (as I mentioned before), the spectra are ordered and the cut-off frequency slowly increases with the spectrum number, allowing its values to be specified "half-automatically".

I will try to do more in this direction, however it would be nice to use some approach for background subtraction with automated functionality...

One method I've seen is to find the cumulative sum under this spectrum, and then define a cut-off as the frequency below which you have, for example, 30% of the area, so 70% of the area under the curve lies to the right of the cut-off. It might work nicely, or maybe not.
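A minimal sketch of that cumulative-area rule; the 30% fraction and the synthetic spectrum below are assumed values, not recommendations:

```python
import numpy as np

# Pick the cut-off as the first frequency index at which the cumulative
# area under the spectrum reaches a fixed fraction (30% here) of the total,
# so 70% of the area lies to the right of the cut-off.
spectrum = np.exp(-np.arange(100) / 25.0)      # placeholder amplitudes

area = np.cumsum(spectrum)
fraction = area / area[-1]                     # fraction of area up to each n
cutoff = int(np.searchsorted(fraction, 0.30))  # first n with >= 30% of area
```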

I can see why a polynomial would fail, as polynomials tend to go to infinity as your n increases. Yours appears to tend to zero from some smooth global maximum. I would suggest that the function is fairly tricky to guess because it is "hidden" near n=0 by the useful data. Not knowing exactly what sort of scattering is going on or what n signifies, it's hard to guess, but that curve doesn't look too dissimilar to a blackbody curve, which would be given by Planck's law. Maybe that's a reasonable curve to test.

Thank you again, MikeyW. That's an interesting observation about Planck's law. I was just thinking about another curve, the Rayleigh probability density function, which is somewhat similar to the one you mentioned above.
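A minimal sketch of fitting a Rayleigh-shaped background, A·n·exp(-n²/(2s²)), to the points above an assumed cut-off. Taking log(y/n) makes the model linear in n², so an ordinary least-squares line fit recovers A and s; the noise-free synthetic spectrum here is only a placeholder for real data, where the fit would be approximate:

```python
import numpy as np

# Synthetic Rayleigh-shaped background (no useful signal, no noise).
A_true, s_true = 40.0, 60.0
n = np.arange(1, 200, dtype=float)
spectrum = A_true * n * np.exp(-n**2 / (2 * s_true**2))

# log(y/n) = log(A) - n^2 / (2 s^2): linear in n^2, so a degree-1
# polynomial fit in n^2 yields both parameters.
cutoff = 40                                   # assumed cut-off index
nt, yt = n[n >= cutoff], spectrum[n >= cutoff]
slope, intercept = np.polyfit(nt**2, np.log(yt / nt), deg=1)
A_fit = np.exp(intercept)
s_fit = np.sqrt(-1.0 / (2 * slope))
background_fit = A_fit * n * np.exp(-n**2 / (2 * s_fit**2))
```

The same linearisation trick works for several other smooth candidate shapes (e.g. Gaussian-like envelopes), which could make testing a few models cheap.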