Picture worth many words.
Here is a spectrogram of me saying "Eating Pigs Makes Men Fat.":
Several things to note:
1) Predetermined, frequency-band rules for representing the voice are meaningless.
2) This is windowed analysis. I told RX to use 2048 frequency bands and multiple window widths, since higher frequencies resolve in less time. In addition, the windows overlap in time to improve time detail.
3) The waveform is displayed in blue.
4) The weights that make up each (tiny) window at the moment of transformation are coefficients of infinitely repeating periodic functions. In the spectrogram, all other periods are discarded. We pump out enough windows to "image" the whole recording. Each window is a fragment of the overall picture. You must determine how to bin the coefficients to satisfy the exact question you are asking about the data.
For example, in this recording I am speaking with a gravelly, throaty voice. You can see in the waveform that the last vowel is stuttered with peaks: the motor-like rattle of my throat. But the spectrogram window is too long to capture it. The frequencies, on the other hand, are correct and better represent the vowel. Reduce the number of bands and the stutter becomes visible, but the frequency representation is junk.
The function is unknown. It does not become known by transformation. Hopefully you can see, however, how it could be useful to analyze signals in this representation.
There are infinitely many ways to break each recording up into individual weights to fill the 2D image. You can hang yourself letting the transform lead you around by the nose. Instead, define the parameters that make what you want to do possible.
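The trade-off between time and frequency detail can be put into rough numbers. This is only a sketch under assumptions of mine: a 44100 Hz sample rate, "2048 bands" read as a 2048-sample window (RX may define it differently), 75% overlap, and a hypothetical 50 Hz throat rattle. None of these values come from the recording itself.

```python
def resolution(window_size, sample_rate=44100):
    """Return (bin spacing in Hz, window duration in ms) for one analysis window."""
    bin_hz = sample_rate / window_size
    window_ms = 1000.0 * window_size / sample_rate
    return bin_hz, window_ms


def hop_ms(window_size, overlap=0.75, sample_rate=44100):
    """Time step in ms between successive windows for a fractional overlap."""
    return 1000.0 * window_size * (1.0 - overlap) / sample_rate


RATTLE_HZ = 50.0  # hypothetical pulse rate of the throat rattle (an assumption)
rattle_period_ms = 1000.0 / RATTLE_HZ

for n in (2048, 256):
    bin_hz, window_ms = resolution(n)
    # Rule of thumb only: a window longer than one pulse period blurs the pulses.
    verdict = "resolves" if window_ms < rattle_period_ms else "smears"
    print(f"{n:5d} samples: {bin_hz:7.2f} Hz per bin, {window_ms:6.2f} ms window, "
          f"{hop_ms(n):5.2f} ms hop, {verdict} a {rattle_period_ms:.0f} ms rattle")
```

Under these assumptions the long window gives fine frequency bins but spans more than one rattle pulse, while the short window isolates the pulses at the cost of very coarse bins. Overlap shortens the step between windows, but it does not shorten the window itself.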
5) In this case, the function is finite. I opened my mouth, and for a time there was sound. The words did not repeat forever after the recording stopped. But I may have to record many, many versions of each sound, from different speakers, if I want to have enough information to identify one of these words reliably.
6) Sustained horizontal lines are harmonic content. In general, vertical columns of what looks like TV static are noise.
7) We (and not the spectrogram) know what the function is. For example, the letter "t" is almost all noise. However, as far as we are concerned, it is "signal." To identify it, I might look for noise of a particular duration, distribution across the spectrum, relationship to surrounding signal, etc.
8) To call the voice a function, I must both construct the rule sets for identifying each element of speech and collect the core data (e.g. samples of real voices) against which those rules will compare new data. If my model is not both descriptive and predictive, it is not a scientific model.
Can the voice really be represented by one frequency modulating another?
9) Although in general audio software lags behind image processing, from a mathematical and physical point of view the way forward is plain:
Between the waveform (amplitude/time) and a (series of) spectral representation(s) of my choice, the question of properly modeling the voice is a topological one. I hope it is clear from the image just how useful a spectral image can be for defining "similar" signals. The spectrogram can be processed as an image. One giant question is how then to rebin my data as I work, so that image processing entails acceptable audio in the linear waveform.
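One crude way to put a number on "similar" for a single spectrogram column is spectral flatness: the ratio of the geometric mean to the arithmetic mean of the power spectrum. It is near 0 for a column dominated by a few harmonic lines and closer to 1 for a column of static. The sketch below is only an illustration with synthetic test frames, a naive DFT, and a small `eps` floor I added to avoid taking the log of zero. It is not a claim about how RX or any speech recognizer works.

```python
import cmath
import math
import random


def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 of a real-valued frame."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_flatness(frame, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = [p + eps for p in power_spectrum(frame)]  # eps avoids log(0)
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


N = 256
# Synthetic stand-ins (assumptions): a sine landing exactly on bin 8,
# and seeded Gaussian noise in place of a real "t" sound.
tone = [math.sin(2 * math.pi * 8 * t / N) for t in range(N)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]

print("flatness of tone :", spectral_flatness(tone))
print("flatness of noise:", spectral_flatness(noise))
```

A single number like this says nothing about duration or about the surrounding signal, so it could only be one ingredient in a rule for something like the "t" above.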
10) The comment about periodic signals not being like real instruments is misleading. Pure sine waves sound nothing like instruments because they are nothing like natural harmonic (periodic) functions. Vibrating strings and columns of air are the same thing as a series of harmonics. Even if the materials were ideal, these would not sound the way a synthesizer produces them. Then, of course, the signal is in continuous convolution with the amplification apparatus, etc. The light falling on your desk is poorly modeled both by a straight-line vector to the sun and by a static, single radius around the sun of one wavelength, but that does not mean "real light" fails to behave like a particle and a wave.
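To make the difference between a pure sine and a series of harmonics concrete, here is a small sketch. The 100 Hz fundamental, the 8000 Hz sample rate, and the 1/k harmonic amplitudes are arbitrary choices of mine, not measurements of any instrument. It builds both signals and reads back the amplitude at each harmonic frequency with a direct single-frequency DFT sum.

```python
import cmath
import math

SAMPLE_RATE = 8000  # assumed; chosen so every harmonic lands on an exact bin
N = 800             # 0.1 s of signal, so bins are 10 Hz apart


def tone(f0, amplitudes):
    """Sum of sine harmonics: amplitudes[k] scales the (k+1)-th multiple of f0."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / SAMPLE_RATE)
            for k, a in enumerate(amplitudes))
        for t in range(N)
    ]


def bin_amplitude(signal, freq):
    """Amplitude of the sinusoidal component at freq, via one DFT sum."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq * t / SAMPLE_RATE)
              for t, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)


pure = tone(100, [1.0])
rich = tone(100, [1.0, 1 / 2, 1 / 3, 1 / 4])  # hypothetical 1/k roll-off

for freq in (100, 200, 300, 400):
    print(freq, "Hz  pure:", round(bin_amplitude(pure, freq), 6),
          " harmonic series:", round(bin_amplitude(rich, freq), 6))
```

Both signals are perfectly periodic at the same fundamental, yet only the second carries energy at the multiples. Even that one is an idealization: it has no attack, decay, noise, or room in it.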
The point is to model the phenomenon in sufficient detail to distinguish description from storytelling.
I am not attempting to sound authoritative. These are all open questions for me. It's hard to find discussions about the subject where the questions are left open. Everyone seems eager to shut the door on slender assumptions. I'd like to keep them open until we can describe the problems, and not tell little tales about them which shut their mouths prematurely.
Of course, if I have made faulty assumptions myself, I want to know. It's tough going to see straight, and I do it alone, so don't be shy. Corrections are precious.
Cheers.