ElliotSmith said:
The scientific limit on how small you can make a functionally viable transistor is fast approaching and should hit a stone wall within the next 10 years or less. How will electronic engineers and computer scientists compensate for this problem? Are there any workarounds on the table being discussed and researched for this issue?
There are four main limits on CPU performance:
(1) Clock speed scaling, related to Dennard Scaling, which plateaued in the mid-2000s. This prevents much faster clock speeds:
https://en.wikipedia.org/wiki/Dennard_scaling
(2) Hardware ILP (Instruction Level Parallelism) limits: A superscalar out-of-order CPU cannot execute more than approximately eight instructions in parallel. The latest CPUs (Haswell, IBM POWER8) are already near this limit. You cannot go much beyond an 8-wide CPU because of several issues: dependency checking, register renaming, etc. The cost of these tasks grows (at least) quadratically with issue width, and there's no way around them for a conventional out-of-order superscalar machine. There will likely never be a 16-wide superscalar out-of-order CPU.
(3) Software ILP limits on existing code: Even given infinite superscalar resources, existing code typically does not have over 8 independent instructions in any group. If the intrinsic parallelism isn't present in a single-threaded code path, nothing can be done. Newly written software and compilers could in theory generate higher-ILP code, but if the hardware is limited to 8-wide, there's no compelling reason to do so.
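A rough way to see this software ILP limit: the parallelism available in a code fragment is bounded by its instruction count divided by the length of its longest dependency chain. Here is a minimal sketch of that calculation; the instruction names and dependency graph are made up for illustration.

```python
# Sketch: intrinsic ILP of a code fragment = instruction count divided by
# the length of its longest dependency chain (the critical path).
# The instructions and dependencies below are hypothetical.

def max_ilp(deps):
    """deps maps each instruction to the instructions it depends on.
    Returns (average parallelism, critical path length)."""
    depth = {}

    def chain(i):
        # Depth of an instruction = 1 + deepest prerequisite chain.
        if i not in depth:
            depth[i] = 1 + max((chain(d) for d in deps[i]), default=0)
        return depth[i]

    longest = max(chain(i) for i in deps)
    return len(deps) / longest, longest

# a-d are independent; e and f each combine two results; g needs both.
deps = {
    "a": [], "b": [], "c": [], "d": [],
    "e": ["a", "b"], "f": ["c", "d"],
    "g": ["e", "f"],
}
avg, path = max_ilp(deps)
print(avg, path)   # 7 instructions over a 3-deep chain -> ILP of ~2.33
```

No amount of hardware width helps here: the 3-instruction chain e -> g (fed by a/b) forces at least 3 cycles regardless of how many execution units are available.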
(4) Multicore CPUs, limited by (a) Heat: The highest-end Intel Xeon E5-2699 v3 has 18 cores, but the clock speed of each core is limited by TDP:
https://en.wikipedia.org/wiki/Thermal_design_power
(b) Amdahl's Law. As core counts increase to 18 and beyond, even a tiny fraction of serialized code will "poison" the speedup and cap improvement:
https://en.wikipedia.org/wiki/Amdahl's_law
(c) Coding practices: It's harder to write effective multi-threaded code, though newer software frameworks help somewhat.
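The Amdahl's Law point is easy to check numerically. A quick sketch, with illustrative numbers (the 95%-parallel figure is an assumption, not a measurement):

```python
# Amdahl's Law: speedup = 1 / (serial_fraction + parallel_fraction / cores).
# Even a small serial fraction caps the benefit of adding cores.

def amdahl_speedup(parallel_fraction, cores):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

for cores in (2, 8, 18, 72, 1_000_000):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
# With code that is 95% parallel, 18 cores give only ~9.7x, and even a
# million cores can never exceed 1 / 0.05 = 20x.
```

That hard 20x ceiling from just 5% serial code is the "poisoning" effect described above.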
While transistor scaling will continue for a while, heat will increasingly limit how much of that functional capacity can be used simultaneously. This is called the "dark silicon" problem: you can have lots of on-chip functionality, but it cannot all be used at the same time. See the paper "Dark Silicon and the End of Multicore Scaling".
What can be done? There are several possibilities along different lines:
(1) Increasingly harness high transistor counts for specialized functional units. E.g., Intel Core CPUs since Sandy Bridge have had Quick Sync, a dedicated video transcoder:
https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video
This is about 4-5x faster than other methods. Intel's Skylake CPU will have a greatly improved Quick Sync which handles many additional codecs. Given sufficient transistor budgets, you can envision similar specialized units for diverse tasks. These could simply sit idle until called on, then deliver great performance in that narrow area. This general direction is integrated heterogeneous processing.
(2) Enhance the existing instruction set with specialized instructions for justifiable cases. E.g., Intel Haswell CPUs have 256-bit vector instructions, and Skylake will have AVX-512 instructions. In part due to these instructions, a Xeon E5-2699 v3 can do about 800 Linpack gigaflops, which is about 10,000x faster than the original Cray-1. Obviously that requires vectorization of code, but that's a well-established practice.
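A back-of-the-envelope check of that 10,000x figure. The Cray-1 number used here (~80 sustained megaflops) is an assumption for illustration, not from the post:

```python
# Rough ratio check: ~800 Linpack gigaflops (Xeon E5-2699 v3, from above)
# versus an assumed ~80 sustained megaflops for the original Cray-1.
cray_1_flops = 80e6    # ~80 MFLOPS, assumed sustained rate
xeon_flops = 800e9     # ~800 GFLOPS, figure quoted above
print(xeon_flops / cray_1_flops)   # -> 10000.0
```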
(3) Use more aggressive architectural methods to squeeze out additional single-thread performance. Although most such techniques have already been exploited, a few remain, such as data speculation. Data speculation differs from control speculation, which is already used to predict branches. In theory, data speculation could provide an additional 2x performance on single-threaded code, but it would require significant added complexity. See "Limits of Instruction Level Parallelism with Data Speculation":
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.47.9196&rep=rep1&type=pdf
(4) Use VLIW (Very Long Instruction Word) methods. This sidesteps the hardware limits on dependency checking, etc., by doing that work at compile time. In theory, a progressively wider CPU could be designed as technology improves, which could run single-threaded code 32 or more instructions wide. This approach was unsuccessfully attempted by Intel with Itanium, and CPU architects still debate whether a fresh approach would work. One group is actively pursuing bringing a VLIW-like CPU, called the Mill, to the commercial market:
http://millcomputing.com/
VLIW approaches require software to be rewritten, but using conventional techniques and languages, not different paradigms like vectorization, multiple threads, etc.
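The core VLIW idea of moving dependency checking to compile time can be sketched with a toy list scheduler: pack instructions into fixed-width bundles so that the hardware can issue each bundle blindly. The instruction names and the 4-wide machine below are hypothetical.

```python
# Toy sketch of VLIW-style compile-time scheduling: greedily pack ready
# instructions into fixed-width bundles, respecting data dependencies,
# so the hardware never needs to check dependencies at run time.

def schedule(instrs, deps, width):
    """instrs: instruction names in program order; deps: name -> prerequisites.
    Returns a list of bundles, each holding at most `width` instructions."""
    done, bundles, pending = set(), [], list(instrs)
    while pending:
        bundle = []
        for i in list(pending):
            # Ready only if all prerequisites issued in an earlier bundle
            # (results within the same bundle aren't visible yet).
            if len(bundle) < width and all(d in done for d in deps[i]):
                bundle.append(i)
                pending.remove(i)
        done.update(bundle)
        bundles.append(bundle)
    return bundles

deps = {"a": [], "b": [], "c": [], "d": [],
        "e": ["a", "b"], "f": ["c", "d"], "g": ["e", "f"]}
print(schedule(list(deps), deps, 4))
# -> [['a', 'b', 'c', 'd'], ['e', 'f'], ['g']]
```

Note that all the hard decisions happen in the compiler; the "hardware" just issues one bundle per cycle. That is exactly the complexity trade-off VLIW makes, and why its success depends so heavily on compiler quality.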