# Stats Related Question

1. Jul 22, 2012

### Jarven

Hey, I have never taken any stats course but I desperately need the answers to my questions checked out.

We have a dataset with 5 independent and dozens of observational dependent variables, including location. The independent and dependent variables are sampled asynchronously. (The variables are logs of activities, location, type of language being used, voice samples, and some survey data.) Some data points are better than others, but we don't know which ones. Observations took place over 3 months at more or less regular intervals. If it were a continuous signal, we'd find that the sampling rate was below the Nyquist rate.

1. What techniques would you use to determine whether this data set contains some signal or is all noise? Note, you are free to explore statistical approaches in the frequency (Fourier or other transform) domain as well.

The first technique I would use to determine whether the data set contains a signal is ensemble averaging. This technique assumes that the noise is random with zero mean and that the source(s) of the signal produce consistent data points. If a sufficient number of data sets were collected over the 3-month period, the ensemble average would significantly reduce the noise and make the signal apparent, assuming a signal exists.
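
To make the ensemble-averaging idea concrete, here is a minimal sketch using only the Python standard library. It assumes something the dataset described above may not offer: repeated trials that are already aligned on a common time grid, with independent zero-mean Gaussian noise. The sine wave, the noise level, and the function names are all hypothetical choices for illustration.

```python
import math
import random
import statistics

def ensemble_average(trials):
    """Average aligned trials point by point (all trials must have equal length)."""
    n_points = len(trials[0])
    return [statistics.fmean(trial[i] for trial in trials) for i in range(n_points)]

def rms_error(estimate, truth):
    """Root-mean-square difference between an estimate and the true signal."""
    return math.sqrt(statistics.fmean((e - t) ** 2 for e, t in zip(estimate, truth)))

rng = random.Random(0)

# Hypothetical underlying signal: a sine wave sampled at 200 points.
truth = [math.sin(2 * math.pi * i / 50) for i in range(200)]

# 100 noisy repetitions of the same signal (noise standard deviation 1.0).
trials = [[s + rng.gauss(0.0, 1.0) for s in truth] for _ in range(100)]

single_err = rms_error(trials[0], truth)
avg_err = rms_error(ensemble_average(trials), truth)
print(f"RMS error, single trial:     {single_err:.3f}")
print(f"RMS error, ensemble average: {avg_err:.3f}")
```

With independent noise, the error of the average is expected to shrink roughly as one over the square root of the number of trials, so 100 trials should cut it by about a factor of 10. Asynchronously sampled variables would first have to be resampled or binned onto a shared grid before this applies.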

Second, computing a frequency spectrum of the data set with a Fourier transform should be useful in identifying white noise. If the amplitude is roughly equal across a range of frequencies, that range can be dismissed as noise. The remaining frequencies, which do not exhibit the flat spectrum of white noise, can be passed through an inverse Fourier transform to reconstruct the signal, which can then be processed further, for example by smoothing. If the flat, white-noise spectrum spans the entire frequency range, we can assume that no signal exists.
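
The spectral check can be sketched as follows, again with only the standard library. This is an illustration under assumptions, not a definitive test: it assumes evenly spaced samples (for uneven sampling, a Lomb-Scargle periodogram is a commonly used alternative), and the peak-to-median ratio is a hypothetical ad hoc flatness measure rather than a formal hypothesis test.

```python
import cmath
import math
import random
import statistics

def power_spectrum(x):
    """Naive O(n^2) DFT power at the positive-frequency bins 1 .. n/2 - 1."""
    n = len(x)
    mean = statistics.fmean(x)
    centered = [v - mean for v in x]  # remove the DC offset first
    powers = []
    for k in range(1, n // 2):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        powers.append(abs(coeff) ** 2)
    return powers

def peak_to_median(x):
    """Largest spectral power divided by the median power.

    White noise has a roughly flat spectrum, so this ratio stays small;
    a periodic component produces a peak that stands far above the median.
    """
    powers = power_spectrum(x)
    return max(powers) / statistics.median(powers)

rng = random.Random(42)
n = 256
noise_only = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Hypothetical signal: 16 full cycles across the record, buried in the same kind of noise.
with_signal = [math.sin(2 * math.pi * 16 * t / n) + rng.gauss(0.0, 1.0)
               for t in range(n)]

ratio_noise = peak_to_median(noise_only)
ratio_signal = peak_to_median(with_signal)
print(f"peak/median, noise only:  {ratio_noise:.1f}")
print(f"peak/median, with signal: {ratio_signal:.1f}")
```

A ratio far above what pure noise typically produces points to a narrow-band component; a proper decision threshold would have to come from the noise statistics of the actual data.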

2. How can you use the data even though it is sampled below the Nyquist rate?

Assuming the signal's bandwidth (the difference between its highest and lowest frequencies) is smaller than its lowest frequency, it is possible to use this data. The data does not need to be sampled at twice the highest frequency of the signal; it can be sampled at a rate as low as twice the bandwidth without detrimental effects from aliasing, provided the rate is chosen so that the aliased copies of the band do not overlap.
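
The band-pass (under)sampling condition can be made concrete with a short sketch. For a real signal confined to the band from `f_low` to `f_high`, the standard textbook result is that a uniform rate `fs` avoids aliasing when `2*f_high/n <= fs <= 2*f_low/(n-1)` for some integer `n` between 1 and `floor(f_high / bandwidth)`. The function name and the example bands below are hypothetical.

```python
import math

def valid_sampling_ranges(f_low, f_high):
    """Return (n, min_rate, max_rate) tuples of alias-free uniform sampling
    rates for a real signal confined to the band [f_low, f_high]."""
    if not 0 < f_low < f_high:
        raise ValueError("need 0 < f_low < f_high")
    bandwidth = f_high - f_low
    n_max = math.floor(f_high / bandwidth)
    ranges = []
    for n in range(1, n_max + 1):
        min_rate = 2 * f_high / n
        # n == 1 is ordinary Nyquist sampling, which has no upper limit.
        max_rate = math.inf if n == 1 else 2 * f_low / (n - 1)
        ranges.append((n, min_rate, max_rate))
    return ranges

# Band edges are a whole multiple of the bandwidth: twice the bandwidth works.
aligned = valid_sampling_ranges(20.0, 25.0)
# Same bandwidth, shifted band: the lowest usable rate is above twice the bandwidth.
shifted = valid_sampling_ranges(22.0, 27.0)

for label, ranges in (("20-25", aligned), ("22-27", shifted)):
    for n, lo, hi in ranges:
        print(f"band {label}, n={n}: {lo:.2f} <= fs <= {hi:.2f}")
```

For the 20 to 25 band the lowest admissible rate is 10, exactly twice the bandwidth. For the 22 to 27 band it is about 10.8, which shows that "twice the bandwidth" is a lower bound that is only reached when the band edges line up with multiples of the bandwidth.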

Is what I wrote right? Am I missing anything? Can you point me in the right direction?

I have never studied any of the topics covered by these questions, and currently my knowledge of the answers comes from Wikipedia.

2. Jul 22, 2012

### chiro

Hey Jarven and welcome to the forums.

For the first question, the thing you need to settle first is whether there is a known signal structure or whether you are just trying to establish that any actual signal exists at all.

If you have a specific signal structure, then you can use this known structure to detect noise, especially if the structure was designed with a specific noise characteristic of the channel in mind.

To look at how this is studied you should consider the kind of stuff that Claude Shannon looked at, and what electrical engineers deal with, particularly in the construction of codes over noisy channels.

Also take a look at this:

http://en.wikipedia.org/wiki/Kalman_filter

The frequency domain is a good way to, say, take a signal and remove the high-frequency information to get something smoothed. But again, the best way to approach this IMO (especially if you are constructing a signal structure) is to look at the design of optimal codes that make noise easy to detect and, more importantly, make it possible to correct the errors once they are found.

The field for this is known as Error-Correcting Codes, or ECCs. The codes themselves mean that you often send a lot more information than you strictly have to (i.e. more redundancy), but as you add more redundancy in the right way, you reduce the probability of noise corrupting your actual information to the point where it becomes so small as not to be an issue.
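
As a toy illustration of trading redundancy for reliability, here is a sketch of the simplest error-correcting scheme, a repetition code with majority-vote decoding, over a simulated channel that flips each bit independently. The 10% flip probability, the repetition factor of 5, and all function names are hypothetical; practical ECCs are far more efficient than this.

```python
import random

def encode(bits, repeat):
    """Repetition code: send every bit `repeat` times."""
    return [b for b in bits for _ in range(repeat)]

def transmit(bits, flip_prob, rng):
    """Simulated noisy channel: each bit is flipped independently with flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]

def decode(received, repeat):
    """Majority vote over each group of `repeat` received bits (repeat should be odd)."""
    return [int(sum(received[i:i + repeat]) > repeat // 2)
            for i in range(0, len(received), repeat)]

rng = random.Random(1)
message = [rng.randint(0, 1) for _ in range(2000)]

# Send the message once without protection, and once with 5x redundancy.
raw = transmit(message, 0.1, rng)
coded = decode(transmit(encode(message, 5), 0.1, rng), 5)

raw_errors = sum(a != b for a, b in zip(message, raw))
coded_errors = sum(a != b for a, b in zip(message, coded))
print(f"bit errors without coding:     {raw_errors} / {len(message)}")
print(f"bit errors with 5x repetition: {coded_errors} / {len(message)}")
```

With a 10% flip probability, a decoded bit is wrong only when at least 3 of its 5 copies are flipped, which happens with probability of roughly 0.9%, at the cost of sending five times as many bits.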

In terms of the second question, I would approach it in the above manner with regards to the noise properties of the channel.

The decoding hardware and the capacity will dictate the bandwidth of your channel, but it's important to also keep in mind the structure of the information (if it has a structure) as well as the noise definition for the channel.

Detecting whether noise is present in an unstructured signal (at least in the sense that you don't know the structure) is kind of paradoxical in one sense. However, you could, for example, use entropy as a way to hypothesize whether a signal is just 'noise' or not, since things that are structured often have patterns that give them a lower entropy than they would otherwise have.
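
The entropy idea can be sketched as below. This is a minimal illustration, assuming the data has already been reduced to discrete symbols; the `block` parameter and the example sequences are hypothetical. Entropy is estimated over overlapping blocks because single-symbol frequencies alone can miss structure.

```python
import math
import random
from collections import Counter

def shannon_entropy(symbols, block=1):
    """Empirical Shannon entropy in bits per symbol, estimated from the
    frequencies of overlapping blocks of `block` consecutive symbols."""
    blocks = [tuple(symbols[i:i + block])
              for i in range(len(symbols) - block + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    bits_per_block = -sum(c / total * math.log2(c / total)
                          for c in counts.values())
    return bits_per_block / block

rng = random.Random(7)
random_bits = [rng.randint(0, 1) for _ in range(1000)]
alternating = [0, 1] * 500  # perfectly structured: 0, 1, 0, 1, ...

for name, seq in (("random", random_bits), ("alternating", alternating)):
    h1 = shannon_entropy(seq, block=1)
    h2 = shannon_entropy(seq, block=2)
    print(f"{name:12s} single symbols: {h1:.3f}  pairs: {h2:.3f} bits/symbol")
```

The alternating sequence looks maximally random when symbols are counted one at a time (both values occur equally often), but its entropy per symbol drops to about half a bit once pairs are considered, while the random sequence stays close to one bit. Real measurements would need a sensible discretization and enough data for the block counts to be reliable.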

So if I had to point to some resources, look up the work by Claude Shannon, Error-Correcting Codes, Information Theory, Markovian Probability in both discrete and continuous time spaces, Integral Transforms for Signal Processing including Fourier Analysis and Wavelets, and Probability and Statistical Theory especially for Hypothesis Testing with regards to testing whether a Signal or a Time-Series is considered "random" (and you should get a source on how randomness is defined in different contexts).