When you say we're not even halfway there, it's not clear to me where "there" is, or (with apologies to Gertrude Stein) if there's even a there there. To put it another way, what problem are you really trying to solve?
One can certainly imagine an imager built on top of some sort of neuromorphic substrate with a massively parallel interconnect. This might be very useful for applications like missile guidance or other machine vision problems, or in building something like an artificial retina. For the larger set of applications, though, data acquired by an imager need to be conveyed in fairly raw form to a physically distinct entity, and that degree of parallelism isn't an option. So, we're stuck with some form of serialized transmission. If the addressing sequence is not defined in advance, that is, if the pixels are sent in apparently random order, then the overhead of sending addresses along with the pixel data becomes a heavy burden. If you had a 4k x 4k imager, you'd need 24 bits for each address. If each pixel also carries 24 bits of data, you've just doubled the bandwidth requirement, but to what end?
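To make the arithmetic concrete, here's a toy calculation (just my assumptions: a 4096 x 4096 sensor and 24-bit pixels, i.e. 8 bits per channel):

    import math

    # Hypothetical sensor geometry and pixel depth, purely for illustration.
    width, height = 4096, 4096          # a "4k x 4k" imager
    pixel_bits = 24                     # e.g. 8 bits per RGB channel

    # Bits needed to name any one pixel in the frame.
    address_bits = math.ceil(math.log2(width * height))
    print(address_bits)                 # 24 -> a full address per pixel

    raster_bits_per_pixel = pixel_bits                    # ordering is implicit
    random_bits_per_pixel = pixel_bits + address_bits     # address rides along
    print(random_bits_per_pixel / raster_bits_per_pixel)  # 2.0 -> bandwidth doubles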
One of the benefits of raster scanning is that it preserves locality of reference. This means that you can do on-the-fly processing without assembling full frames, which is advantageous in terms of latency and storage requirements. Running a filter over a rasterized pixel sequence requires storing only a few lines' worth of pixel data. On a random sequence, it would require storing the entire frame to be sure you had all of the pixels for each iteration of the filter.
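To sketch what I mean (illustrative Python only, assuming a 3x3 kernel over a row-major stream; the function and names are mine, not anything standard):

    from collections import deque

    def filter_raster_stream(pixels, width, kernel3x3):
        """Apply a 3x3 filter to a row-major pixel stream, buffering only 3 lines."""
        lines = deque(maxlen=3)   # rolling window of the most recent scan lines
        row = []
        for p in pixels:          # pixels arrive in raster order
            row.append(p)
            if len(row) == width:
                lines.append(row)
                row = []
                if len(lines) == 3:
                    # Enough vertical context to emit one filtered output line.
                    out = []
                    for x in range(1, width - 1):
                        acc = 0
                        for ky in range(3):
                            for kx in range(3):
                                acc += kernel3x3[ky][kx] * lines[ky][x - 1 + kx]
                        out.append(acc)
                    yield out

The deque never holds more than three scan lines. Feed the same filter a randomly ordered stream and you'd have to buffer the entire frame before you could emit a single output pixel.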
Unlike the early days of television, it is a rare case nowadays that images are conveyed from a sensor to a display without alteration. Let's say you have some non-raster sequence that you've determined is optimal for extracting data from the sensor. How would you composite that stream with another one, which I guess would have a completely different address sequence? Even if the two streams had the same address sequence, would that still be the optimal sequence for the composite result? What happens after you've applied scaling or other transforms?
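Again, a rough sketch of the point (illustrative Python, with simple alpha blending standing in for whatever compositing operation you'd actually use):

    def composite_raster(stream_a, stream_b, alpha=0.5):
        """Blend two streams pixel-for-pixel. This works only because both
        arrive in the same (raster) order, so no addresses or frame buffers
        are needed."""
        for a, b in zip(stream_a, stream_b):
            yield alpha * a + (1 - alpha) * b

    def composite_addressed(stream_a, stream_b, num_pixels, alpha=0.5):
        """The same blend when each stream carries (address, value) pairs in
        arbitrary order: both frames must be fully reassembled first."""
        frame_a, frame_b = [0] * num_pixels, [0] * num_pixels
        for addr, val in stream_a:
            frame_a[addr] = val
        for addr, val in stream_b:
            frame_b[addr] = val
        return [alpha * a + (1 - alpha) * b for a, b in zip(frame_a, frame_b)]

The second version has to reassemble both frames before it can produce a single output pixel, which rather defeats the purpose of an exotic readout order.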
Perhaps I'm missing your point. Is your idea to do away with the entire notion of video as a sequence of frames? I can imagine this in some sort of special case with a one-to-one mapping between a sensor and a display, analogous to a coherent fiber optic bundle, for example. How this would work in a more general case is much less clear to me. Disregarding the rather onerous addressing overhead mentioned above, I can sort of see how you might do compositing and spatial transforms, but it would seem to break anything that relies on locality of reference, such as spatial filtering. (I haven't even begun to try to get my head around temporal filtering in such a system.)
Bear in mind that of all the pixels in the universe, a significant (and rapidly increasing, I expect) portion of those captured by imagers are never displayed for human eyes, and likewise many of those displayed for human viewing never originated from real-world image capture. Coming up with an entirely new video paradigm that is optimized for direct sensor-to-display architectures seems like a misdirected effort.