Stephen Tashi said:
That's a very general description and the only way I can visualize it is as the usual sort of experiment in a very limited environment where you are trying to accomplish a specialized task like finding a dog in a picture. Is there some more fundamental way to approach it? - something that has some hope of being general purpose?
Well, the problem really is classification, and data mining looks at this in general ways.
Because there are so many ways to classify something, and because humans are used to optimizing classifications for particular purposes, the kind of thinking required for a general-purpose approach can feel foreign, and that makes it harder.
You highlighted this yourself: you automatically constrained the criteria to a specific thing like a dog, and you talked about specific concepts like occlusion and a bush.
This is not an attack on you: it's just how humans are geared to think, and it's basically the paradox of analysis in general.
Analyzing something means you have to not only structure it but break it down. It also means you need to create a finite description of something that is otherwise potentially infinite: when you take in information, you break it down so that it's manageable to work with.
All humans do it, and in fact all the standard models of computation require it: you have memory that is divisible, and in the CPU you have registers that act as a kind of "working memory" (I know this isn't 100% accurate, but bear with me). We do computations on things that carry a finite amount of information, and we store the results in something that holds a finite amount of information.
If you want to look at general approaches, you have to resist the urge to instantly classify and instead try to "think more like an autistic person" than "like a non-autistic person". Autistic people have a propensity to take in enormous amounts of raw information, while non-autistic people tend to filter most of it out and end up with something that has had a lot of the detail cleaned away.
Now, to answer your question: the decomposition must be determined by the algorithm itself, and the statistical framework is used to decide whether to accept or reject notions of "similarity" through general estimators of whether coefficients, or functions of coefficients, differ from zero in a statistically significant way.
The point is not to state the decomposition explicitly, but to let the algorithm derive and define it within statistical limits.
The estimators would be non-parametric (say, a median rather than a mean), and one approach to this is via what you earlier termed a "neural network": when the network itself changes, it changes both the structure it imposes on the data and how it uses the data for analysis.
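If it helps to make that concrete, here is one way such an accept/reject step could look. This is a rough Python/NumPy sketch of my own; the bootstrap on the median, the 5% level, and the toy coefficient vectors are all assumptions, not anything fixed by the discussion.

```python
import numpy as np

def similar(coeffs_a, coeffs_b, n_boot=10_000, alpha=0.05, seed=0):
    """Non-parametric check: is the median coefficient difference
    statistically indistinguishable from zero? (Bootstrap sketch.)"""
    rng = np.random.default_rng(seed)
    diff = np.asarray(coeffs_a, float) - np.asarray(coeffs_b, float)

    # Resample the paired differences and record the median each time.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    medians = np.median(diff[idx], axis=1)

    # Percentile confidence interval for the median difference.
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])

    # "Similar" if zero is a plausible value for the median difference.
    return lo <= 0.0 <= hi

# Toy usage with made-up coefficient vectors.
a = np.random.normal(0.0, 1.0, 256)
print(similar(a, a + np.random.normal(0.0, 0.05, 256)))  # likely True
print(similar(a, a + 0.8))                               # likely False
```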
Remember that you have to take away the urge to force your own constraints and instead just let go of them, almost like an autistic person can (though autistic people unfortunately have a lot of trouble in scenarios of extreme stimulation, where they are overwhelmed by stimuli that others would simply filter out early).
I agree with the importance of statistics. There are many academic papers relating Fourier analysis to problems of computer vision. However, I'm skeptical of the applicability of the signal-and-noise metaphor, and particularly skeptical of frequency domain analysis. The way I look at it, frequency domain analysis is wonderful when you have a phenomenon where superposition works (like electromagnetic signals). You pick a basis based on frequency (it can be frequency in time or in space or some other dimension of the data) and you represent things in that basis. But in a typical scene of daily life, objects do not obey the superposition principle: objects hide each other. Also, I don't think interpreting an average-quality image from a modern digital camera is really a problem of dealing with "noise", at least not noise in the sense of some additive disturbance superimposed on the image. (The type of image where "noise" is a believable problem would be an image of a store display window where there are reflections of things outside the store on the window.)
The reason I mentioned frequency analysis is that it can take apart information that is entangled, such as things that are spatially entangled (i.e., multiple things that are not topologically simple but overlap in the same region). This is one of the main points of using frequency decompositions, and it's why, for example, you can do amazing things with audio like removing noise and clicks and still get a nice result.
It's not solely for noise: it's a general-purpose way to decompose signal data when the information doesn't come in simply connected pieces confined to a constrained spatial location.
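As a minimal illustration of that "take apart entangled information" point, here is a Python/NumPy sketch of separating a tone from broadband noise in the frequency domain. The test signal, sample rate, and threshold are all invented for the example, not anything from the thread.

```python
import numpy as np

# Invented test signal: a 440 Hz tone buried in broadband noise.
fs = 8000                                    # sample rate (Hz), assumed
t = np.arange(fs) / fs                       # one second of samples
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.8 * np.random.normal(size=t.size)

# In the time domain the tone and the noise occupy the same samples;
# in the frequency domain they separate into a sharp peak vs. a flat floor.
spectrum = np.fft.rfft(noisy)
magnitude = np.abs(spectrum)

# Keep only the frequency bins that stand well above the noise floor.
mask = magnitude > 4 * np.median(magnitude)  # crude, assumed threshold
denoised = np.fft.irfft(spectrum * mask, n=noisy.size)

print("mean squared error before:", np.mean((noisy - clean) ** 2))
print("mean squared error after: ", np.mean((denoised - clean) ** 2))
```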
Unfortunately, though, this is not intuitive for a lot of sensory work, especially when it comes to "forced analysis": while intuition can discern these details, forced analysis tends to reach for simple schemes like "dividing things spatially" or "hierarchical division" instead of a more "natural" approach like a frequency analysis (or some analog of it).
It's not intuitive to use frequency analysis when you are trying to do a forced analysis, and it doesn't help that the brain is filtering out all these details so that you are able to analyze at all (again, the example of the computer).
Perhaps this is too simplistic, but I think of it this way: suppose the picture shows a car parked in front of a bush so that it obscures part of the bush. If you represent the pixels in the image as some sort of superposition of pixels in "component pictures", then the part of the picture that is car-hiding-bush has the "signal" of being a car, not the signal of being part car and part bush. So the car isn't additive noise when you're trying to identify bushes.
Again, you have filtered out most of the actual data in order to make sense of what is going on. That's not a fault of yours; it's just how a lot of people think. You are dividing the image into pixels, and that automatically constrains the analysis in ways that make you miss things: it's the product of a particular choice of analysis.
The other thing you take for granted is what I would call "relative memory". Humans have it without thinking about it, but a computer may not (unless it has been trained extensively).
Humans have relative memory in many instances: not all representations are the same. Relative memory means you can use the relationships among all things in order to classify.
The idea that information exists in a vacuum is not right: all information is relative. In fact, all of language is relative: to define something, you need to define its complement, and if something has no complement it can't be defined, because there is nothing to compare it to. For mathematics to even make sense you need variation: without variation and the ability to compare, you can't have mathematics.
You have all this information that is completely relative to something else, and we take for granted that it exists so that we can differentiate between a "sloppy A" and a "printed A". The point is not what the A is, or even what it means: the point is that the A stands in relation to everything else, and what that relation actually amounts to in terms of analysis.
I think you're making the mistake of assuming the decomposition has to be isomorphic to the elements being spatially divisible (i.e., you look at the superposition in terms of completely isolated areas like groups of pixels rather than the entirety of the image as an undivided entity). The decomposition does produce information that is mutually exclusive in its final form (i.e., the basis vectors are orthonormal and thus independent), but that does not mean it deals with information that is spatially simple.
Suppose we go to the frequency domain and do spatial Fourier transforms. Then the part of the image that is car-hiding-bush does, in a manner of speaking, have a combined effect with the part of the image that is all bush and the part that is car-not-hiding-bush. But everything in the whole image is part of this combined effect. So it's hard to believe this is a good general-purpose method for isolating particular things.
Again, you are analyzing as if the information you consider had to be a topologically simple region with all parts mutually exclusive (i.e., the way we "atomize" information and structures): the Fourier transform instead takes the entangled form of the entire signal and extracts information that describes the nature of those entanglements.
The best way to think about this is to think of what a hologram looks like as a representation, and how it is possible to analyze such a representation.
You can't look at things in a divided way: the underlying fundamental structure is completely one and indivisible.
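A tiny sketch of that hologram-like quality, in Python/NumPy with an invented random "image": every spatial-frequency coefficient is a sum over all pixels, so perturbing one small patch touches the vast majority of the coefficients, and conversely each coefficient carries information about the whole scene.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((64, 64))          # stand-in for a real image

# Each spatial-frequency coefficient is a weighted sum over ALL pixels.
F = np.fft.fft2(image)

# Change a single small patch ("move the car a little") ...
perturbed = image.copy()
perturbed[10:14, 20:24] += 0.5
F2 = np.fft.fft2(perturbed)

# ... and see how many coefficients changed: the vast majority of them.
changed = np.mean(~np.isclose(F, F2))
print(f"fraction of Fourier coefficients affected: {changed:.2f}")
```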
Wavelets are a more plausible approach since they limit the spatial influence of a given area of the image. However, the garden-variety wavelet is sensitive to changes in images that human vision regards as minor. For example, if you move the car a foot forward, you change a boundary between car and bush. Fourier analysis is also sensitive to such changes. In an effort to combat sensitivity, one can resort to "steerable filters", where the basis, in a manner of speaking, tries to adjust its position to fit the image. (At least this involves somewhat new mathematics.) I'm not aware of any breakthroughs in the applicability of the above techniques to image recognition, or even to the problem of similarity detection mentioned in the original post.
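For what it's worth, that shift sensitivity is easy to see numerically. Here is a rough Python/NumPy sketch of my own, with a one-level Haar transform written out by hand and an invented test signal standing in for one row of an image; both the wavelet detail coefficients and the Fourier coefficients change substantially under a one-sample shift.

```python
import numpy as np

def haar_detail(x):
    """One level of the Haar wavelet transform: detail coefficients only."""
    x = np.asarray(x, float)
    return (x[0::2] - x[1::2]) / np.sqrt(2)

def rel_change(a, b):
    """Relative L2 change between two coefficient arrays."""
    return np.linalg.norm(a - b) / np.linalg.norm(a)

rng = np.random.default_rng(0)
signal = rng.random(256)                 # stand-in for one image row
shifted = np.roll(signal, 1)             # "move the car a foot forward"

# Both representations move a lot under a tiny spatial shift.
print("Haar detail change:  ", rel_change(haar_detail(signal), haar_detail(shifted)))
print("Fourier coeff change:", rel_change(np.fft.rfft(signal), np.fft.rfft(shifted)))
```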
The point again is to consider the relativity of things rather than some hard criteria.
The whole task of comparison is deciding what the language is that describes the comparison. Once a language is found, along with an actual sentence corresponding to the two (or more) things being compared, one then looks at what that sentence corresponds to in terms of the representation of the structure stored in the neural network (or some other analog).
The language itself is created, and the general approach is to consider how you would define the norm between two "representations" and how a non-parametric estimator can be used to set a boundary for rejection or acceptance (i.e., a stimulation or the absence of one).
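To ground that last idea, here is a minimal sketch in Python/NumPy. The Euclidean norm, the 95% quantile, and the toy representation vectors are all assumptions of mine; the point is only the shape of the procedure: a norm between representations plus a non-parametrically estimated boundary that either fires or doesn't.

```python
import numpy as np

def decision_boundary(reference_reps, q=0.95):
    """Non-parametric acceptance boundary: a quantile of the pairwise
    distances observed among known 'same thing' representations."""
    reps = np.asarray(reference_reps, float)
    n = len(reps)
    dists = [np.linalg.norm(reps[i] - reps[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.quantile(dists, q)

def same_thing(rep_a, rep_b, boundary):
    """Fire (accept 'similar') when the norm falls inside the boundary."""
    return np.linalg.norm(np.asarray(rep_a) - np.asarray(rep_b)) <= boundary

# Toy usage: noisy copies of one underlying representation.
rng = np.random.default_rng(2)
proto = rng.normal(size=32)
refs = [proto + rng.normal(scale=0.1, size=32) for _ in range(20)]
b = decision_boundary(refs)

print(same_thing(proto, proto + rng.normal(scale=0.1, size=32), b))  # likely True
print(same_thing(proto, rng.normal(size=32), b))                     # likely False
```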