I would recommend Principles of Nano-Optics, by Novotny and Hecht.
I'll try to explain it well enough to give you at least an idea of how it works:
In near-field optical microscopy, a probe of subwavelength dimensions (which can usually be modeled as a dipole) is placed very close to the sample, so that it couples to the sample via the near field. The near field includes all k-vectors, not just those that satisfy the dispersion relation for propagating photons.
The high-k modes that have interacted with the sample carry information about small ('subwavelength') scale structure. The catch is that your detection window consists only of propagating-wave k-vectors; the higher ones correspond to evanescent modes that don't reach the detector. The key is that you can nevertheless read the information from the high-k interaction out of the photons that do arrive at the detector. The general idea is:
i) You have a source field which propagates freely until it reaches the sample. Sample and source are very close, so the evanescent modes reach the sample.
ii) The interaction is taken into account by multiplying the source field at the position of the sample by a transfer/transmission function; this gives you the field just after the interaction. The transfer function contains the information about the sample, and it is what you want to obtain. Call this function T for now.
iii) Now the field propagates to the detector, and you detect only propagating photons, i.e. the part of the outgoing field that has wavevector k' with -k<k'<+k, where k=\omega/c.
Now consider the Fourier spectrum of the field at ii). The product turns into a convolution. If you consider a single spatial-frequency (k-vector) component of the source field, this introduces a delta function into the convolution integral, and the consequence is a shift. The result is that you have translated the sample spectrum (the FT of the transfer function), and parts of this spectrum which would otherwise lie outside the detection window are now moved into it.
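Written out, the shift looks roughly like this. This is only a sketch under my own simplifying assumptions: one spatial dimension, a scalar field, and the Fourier convention stated in the comment. The symbols E_s (source field at the sample), E_out (field just after the sample), \kappa_0 (the wavevector of the single source component) and q are names I introduce here; T, k and k' are as above.

```latex
% Assumed convention: \hat f(k') = \int f(x)\, e^{-i k' x}\, dx  (1D, scalar field)
\begin{align}
E_{\mathrm{out}}(x) &= T(x)\, E_s(x) \\
\hat E_{\mathrm{out}}(k') &= \frac{1}{2\pi} \int \hat T(k' - \kappa)\, \hat E_s(\kappa)\, d\kappa \\
E_s(x) = e^{i \kappa_0 x} \;&\Rightarrow\; \hat E_s(\kappa) = 2\pi\, \delta(\kappa - \kappa_0) \\
\hat E_{\mathrm{out}}(k') &= \hat T(k' - \kappa_0), \qquad -k < k' < +k
\end{align}
```

So for this one source component the detector sees the sample spectrum at q = k' - \kappa_0, i.e. for -k - \kappa_0 < q < k - \kappa_0. With an evanescent component (|\kappa_0| > k) this reaches sample spatial frequencies up to |q| = k + |\kappa_0|, beyond the free-space limit k.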
In theory, if you do this for all spatial-frequency components of the source field, you 'scan' the entire sample spectrum. After you have summed all the single-component contributions, you know the response of the sample at all(!) wavevectors (meaning: you can construct T).
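If you want to see the shift-and-scan idea numerically, here is a toy sketch in Python with NumPy. Everything in it is my own illustrative assumption rather than something from the book: a 1D periodic grid, a made-up sample T(x) = 1 + 0.5 cos(...) with a period four times smaller than the wavelength, idealised single-component sources, and the function names detected_components and recover_sample_spectrum. Wavevectors are expressed as integer indices in units of 2\pi/L, with the free-space k set to index 10.

```python
import numpy as np


def detected_components(sample_index, source_index, k0_index, n=1024):
    """Components of T(x) * E_s(x) that fall inside the detection window.

    All wavevectors are integer indices in units of 2*pi/L (toy assumption).
    Returns {wavevector index: amplitude} for |index| < k0_index.
    """
    x = np.arange(n) / n  # position in units of the grid length L
    # Hypothetical sample: a weak grating at spatial frequency sample_index.
    transfer = 1.0 + 0.5 * np.cos(2 * np.pi * sample_index * x)
    # Single spatial-frequency component of the source field.
    source = np.exp(2j * np.pi * source_index * x)
    # Field just after the interaction (step ii), then its Fourier spectrum.
    spectrum = np.fft.fft(transfer * source) / n
    indices = np.rint(np.fft.fftfreq(n, d=1.0 / n)).astype(int)
    # Step iii: only propagating components reach the detector.
    in_window = (np.abs(indices) < k0_index) & (np.abs(spectrum) > 1e-9)
    return {
        int(m): float(abs(a))
        for m, a in zip(indices[in_window], spectrum[in_window])
    }


def recover_sample_spectrum(sample_index, k0_index, source_indices):
    """Scan the source components and undo the shift q = k' - kappa_0."""
    recovered = {}
    for m0 in source_indices:
        detected = detected_components(sample_index, m0, k0_index)
        for m_det, amplitude in detected.items():
            recovered[m_det - m0] = amplitude
    return recovered


# Propagating illumination only: the grating (index 40) stays invisible.
plane = detected_components(sample_index=40, source_index=0, k0_index=10)
# Evanescent source component (index 35): the grating is shifted into the window.
evanescent = detected_components(sample_index=40, source_index=35, k0_index=10)
# Scanning many source components rebuilds the sample spectrum.
scan = recover_sample_spectrum(40, 10, range(-50, 51))

print(plane)
print(evanescent)
print(scan)
```

With the propagating component alone, only the index-0 term lands in the window, so the grating leaves no trace. With the evanescent component at index 35, the grating term shows up at index -5 with amplitude about 0.25, and the scan recovers the sample spectrum at indices 0 and ±40. A real measurement is of course much messier than this.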
In practice I think you will have to do some difficult data analysis: all the information is there in principle, but you still have to extract it from your measurements. And of course there are probably practical limitations and approximations involved.
I only know this global outline, but I am sure that Novotny and Hecht treat specific experimental setups, with all the equations included.
Cheers,