CPUs are designed to do everything "kind of" well. GPUs are designed to do a select few things really well. As it turns out, the workloads that require vast quantities of processing power tend to be exactly the kind of thing a GPU is suited for, so it works out nicely. The supposed limitations of the GPU are largely inconsequential in practice. In fact, modern CPUs are becoming much more like GPUs, using non-uniform memory access and parallel construction. Simply put, CPUs are slow in parallel applications because they are largely today, and were entirely in the past, serial processors. The architecture has very few arithmetic elements and a lot of cache, along with a lot of logic devoted to optimizing serial execution.
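To make the serial/parallel distinction concrete, here is a minimal C++ sketch of the two kinds of work. The function names (`saxpy`, `chase`) are just illustrative, and it assumes `y` is at least as long as `x` and that every entry of `next` is a valid index. The first loop is the kind of thing that maps onto many arithmetic units; the second is the kind of thing that only cache and serial-execution optimizations can help with.

```cpp
#include <cstddef>
#include <vector>

// Data-parallel: each out[i] depends only on x[i] and y[i], so in principle
// every element could be computed at the same time by a separate ALU.
// Assumes y.size() >= x.size().
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}

// Serial: each step needs the result of the previous one (a pointer chase),
// so adding arithmetic units does nothing; only latency matters.
// Assumes every value in `next` is a valid index into `next`.
std::size_t chase(const std::vector<std::size_t>& next, std::size_t start,
                  std::size_t steps) {
    std::size_t idx = start;
    for (std::size_t s = 0; s < steps; ++s) {
        idx = next[idx];
    }
    return idx;
}
```

A GPU-style design spends its transistors on the first pattern; a CPU-style design spends them on the second.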
I would wager a generation or two after Haswell, but I do think it's a largely irrelevant question. If you're looking for floating point performance in CPUs, you're looking in the wrong place. If you want to reliably execute arbitrary threads, get a CPU. If you want to perform huge matrix computations, get a GPU. If a CPU is to become as fast as a GPU, it needs to become a GPU. This is why Intel's Larrabee met its inevitable demise. It was a bunch of CPUs pretending to be GPUs. Slap on an unscalable ring bus and you've got fail. If Intel wants to compete with vector processors, it needs to create a device which is a vector processor by its very nature. This requires non-uniformity and many more arithmetic elements at the expense of other architectural features. Intel will have to drop its fairy-tale programming model. A vector processor will never be programmed like a single-threaded application. I suppose the sooner developers come to grips with this, the sooner Intel will stop trying to push out impractical hardware. Fortunately, from what I've read, their next attempt will be somewhat closer to reality. Unfortunately, this means paying tribute to the devil and doing what Nvidia and ATI have been doing for years. For example, notice that the on-package GPU on their new processors is actually a GPU, not another CPU.
50 AFLOPS (AwesomeFLOPS). Sorry, but that question has no serious answer :)
It's not x86 that makes programming easier. In fact, there's nothing easy about SSE optimization. Intel simply wanted to hide the parallel elements of the hardware from the programmer. They also apparently wanted to avoid making real floating point vector hardware. Both of those goals culminated in inevitable failure.
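As a rough illustration of why SSE optimization is not "easy", here is a minimal sketch comparing a plain loop with a hand-vectorized version of the same element-wise addition. The function names are made up for the example, it assumes an x86 target with SSE available, and it uses unaligned loads and stores to keep things simple. Even for something this trivial, you have to manage the vector width and the leftover tail yourself.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Plain scalar version: one addition per iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hand-written SSE version: four additions per instruction.
void add_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    // Main loop: process full groups of four floats.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    // Tail: handle the remaining 0 to 3 elements one at a time.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The parallelism is right there in the source either way; running on x86 does nothing to make it go away.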
One real advantage Larrabee could have had, had Intel not set unrealistic architectural goals, is unified memory addressing. This does not (and nothing ever will) remove the parallel element from developing parallel applications, but it would have made memory management much easier for the developer. This isn't generally a desirable thing when you're developing performance-critical applications, but it would have been a convenience for certain problems, particularly when prototyping. Then again, Fermi also offers unified addressing, but only in CUDA, not for general programming.