Parallel Programming Languages

by Dissident Dan
Tags: languages, parallel, programming
Dissident Dan
#1
Jan11-05, 07:18 PM
P: 691
I was recently thinking about how GPUs can do enough calculations to make CPUs cry even though they run at lower clock speeds. My understanding is that this is because they can do so many operations in parallel: the workload consists of many repetitions of the same types of operations, all independent of each other.

It occurred to me that it would be very useful if regular, old CPUs could be designed similarly... of course, they would need programming languages designed for that functionality to take full advantage of it. Suppose you had a CPU with many physical processing paths for both integer and floating-point operations. You could blaze through many different instructions in a single clock cycle.

The key thing to remember is that the operations being performed in parallel must be independent of each other: there can be no overlap between the inputs and outputs of the instructions. In other words, the instructions can't operate on the same data, and the output of one operation can't be the input of another.
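
A trivial sketch in C of what I mean (variable names made up purely for illustration):

void example(void)
{
    int a, b = 3, c, d = 4;

    /* Independent: the two statements share no inputs or outputs,
       so they could be issued in the same clock cycle on separate units. */
    a = b + 1;
    c = d * 2;

    /* Dependent: the second statement reads a, which the first one writes,
       so it cannot start until the first one has finished. */
    a = b + 1;
    c = a * 2;
}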

Actually, languages like Java (without pointers) could probably be adapted very easily, since one could write a program that builds dependency trees for the statements in the source program. However, in programming languages that use pointers, this would be impossible to get 100% correct, as far as I can tell, because you cannot tell at compile time whether two instructions are reading/writing the same memory areas.

Of course, pointers are very useful. So, I started thinking about the possibility of the programmer specifying which computations are independent of each other (or conversely, specifying which are dependent, which would probably be easier both for programmer and compiler). Another possibility is assigning statements to separate "parallel groups" sort of similar to threads, but different from threads in that statements in one block of code could be assigned to different parallel groups, whereas threads are used at a much higher level.

Of course, this could work for multi-processor systems as well as for parallel processing within a single CPU core. (The programmer would probably take a different approach in each situation, though, given the bus speed limitations inherent in SMP systems.)


Are there any languages in existence that allow the programmer to specify characteristics like this? If not, what do you think of my ideas for parallel processing?
chroot
#2
Jan11-05, 07:52 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
Your ideas have been around for at least four decades.

The Cray machines, Connection Machines, and other supercomputers extensively used vector processing (now called SIMD, or single-instruction, multiple-data computing).

Even your lowly Pentium has a number of independent functional blocks for things like adding and multiplying. It is the compiler's job to order the instructions so as to make the best use of those functional units, and of the registers and cache on the chip. The processor's control path(s) then handle(s) assigning operands to each functional unit when your program is run.

The language itself is frankly irrelevant. Modern compilers (particularly those designed for large supercomputing machines) will automatically identify parallelizable blocks and run them independently.

- Warren
graphic7
#3
Jan11-05, 07:57 PM
PF Gold
P: 560
SSE, SSE2, and MMX are all SIMD CPU extensions for x86 processors. Altivec is one of the better features of recent G4 processors and, in my opinion, one of the better (if not the best) SIMD implementations. UltraSparcs have a more MMX-like implementation, called VIS.
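
Just to show the flavour, here is a minimal sketch in C of what SSE looks like through compiler intrinsics (these are real SSE intrinsics from <xmmintrin.h>; the function and array names are made up). A single instruction performs four float additions at once:

#include <xmmintrin.h>   /* SSE intrinsics */

void add4(const float *x, const float *y, float *out)
{
    __m128 a = _mm_loadu_ps(x);      /* load four packed floats (unaligned) */
    __m128 b = _mm_loadu_ps(y);
    __m128 sum = _mm_add_ps(a, b);   /* one SIMD add does all four sums */
    _mm_storeu_ps(out, sum);         /* store the four results */
}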

master_coda
#4
Jan11-05, 10:30 PM
P: 678
Quote by chroot
The language itself is frankly irrelevant. Modern compilers (particularly those designed for large supercomputing machines) will automatically identify parallelizable blocks and run them independently.
Note that this isn't always true, even with really good compilers. Compilers still have to fall back onto heuristics a great deal, and the compiler will sometimes make the wrong tradeoffs and optimize the wrong part of the code at the expense of a more important part.

However, once that level of optimization becomes necessary it's generally just easiest to hand-code the instructions, rather than jumping through a bunch of hoops to try to trick the compiler into doing what you want.


Introducing special optimization mechanisms into a language generally isn't helpful when you really, really need the extra speed; you have to step outside the language anyway. And when you don't need that extra speed they just make it easier to prematurely optimize a program.
Dissident Dan
#5
Jan11-05, 11:19 PM
P: 691
Quote by chroot
The language itself is frankly irrelevant. Modern compilers (particularly those designed for large supercomputing machines) will automatically identify parallelizable blocks and run them independently.

- Warren
How would it know whether or not it could parallelize the following?:

int a;
int *pA = &a;    // pA is an alias for a

a = 5;           // writes a directly
*pA = 7;         // writes a again, through the pointer
(This contains useless code, I know, but it's just for demonstration. The two assignments look independent on the surface, yet both write the same memory.)

------------------------------------

Apparently, the P4 has two DDR (double-pumped) integer addition units, an integer multiplication unit, and only one FP arithmetic unit. This allows some possibility for parallelization, but I'm thinking more on the order of dozens of similar units for different types of operations.

The supercomputers that you mentioned seem to be very much what I am thinking of. I'll have to look into SIMD (including SSE, etc., that graphic7 mentioned).

It would be nice to have that kind of functionality in typical home CPUs, as we do with typical home GPUs.
imabug
#6
Jan12-05, 07:27 AM
P: 345
There are a number of extensions for all the more common programming languages that let you do parallel and distributed processing. There's even a whole newsgroup devoted to parallel computing discussions.

Some of the parallel computing links I keep in my bookmarks:
PVM - Parallel Virtual Machine, been around for at least 10 years.
MPI - Message Passing Interface, also been around at least 10 years
ORCA
Supercomputing and Parallel Computing at Yahoo!

Lately most of the focus seems to be on distributed computing applications (spreading the computing workload over separate networked CPUs) rather than conventional parallel computing (spreading the computing workload over multiple CPUs in the same computer).
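
For anyone who hasn't seen MPI, a minimal program looks something like this (standard MPI calls; you would normally build it with an MPI compiler wrapper such as mpicc and launch several copies at once):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* which process am I?          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many processes in total? */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                        /* shut the runtime down cleanly */
    return 0;
}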
master_coda
#7
Jan12-05, 08:48 AM
P: 678
Quote by Dissident Dan
int a;
int *pA = &a;

a = 5;
*pA = 7;
If you want a compiler to optimize something, why would you deliberately introduce aliasing? Wouldn't it just be easier to write the code properly in a form that the compiler can optimize rather than writing it using pointer tricks and then adding optimization hints?
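
(As an aside, this is exactly what C99's 'restrict' qualifier is for: the programmer promises the compiler that two pointers never alias, and the compiler is then free to reorder or vectorize. A minimal sketch follows; note that MSVC spells it __restrict, and breaking the promise is undefined behaviour, so the correctness burden stays with the programmer.)

/* dst and src are promised never to overlap, so the compiler may
   safely vectorize or reorder the loop iterations. */
void scale(float *restrict dst, const float *restrict src, int n, float k)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}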
Dissident Dan
#8
Jan12-05, 07:54 PM
P: 691
I'm not saying that I would write a program like that; it was just for demonstration. A compiler needs to be robust: errors are not acceptable. It needs to be able to deal with any input that fits the specification of the language. If the compiler reads the code and incorrectly applies an optimization technique, it breaks the program.

As I said, that was just a simple example. There could be much more complex programs that introduce aliasing without you even realizing it (and sometimes you are forced into it), especially when you are interacting with different software systems, such as the OS or commercial packages included in programs: physics engines, sound system software, etc.
chroot
#9
Jan12-05, 08:12 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
The most basic kind of compiler parses its input into a tree, then repeatedly applies a set of transformation rules that manipulate the tree, gradually moving it closer to the structure of the target language. Finally, the compiler's back-end walks the intermediate tree and emits the target language.

Since each of the manipulation steps is simple and well-defined (i.e. their effects and side-effects can be mathematically proven to be closed), the entire compilation is mathematically guaranteed. Now, truth be told, real compilers will do more than simple tree-manipulation to achieve high-quality output, but nearly all compilers begin with this kind of processing.
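
Just to give the flavour, here is a toy version of one such rule, constant folding, with the node types invented purely for illustration (real compilers apply hundreds of rules like this, each one meaning-preserving):

#include <stdlib.h>

typedef enum { CONST_NODE, ADD_NODE } Kind;

typedef struct Node {
    Kind kind;
    int value;               /* meaningful when kind == CONST_NODE */
    struct Node *lhs, *rhs;  /* meaningful when kind == ADD_NODE   */
} Node;

/* Rewrite rule: ADD(CONST a, CONST b) -> CONST(a + b), applied bottom-up.
   Assumes nodes were allocated with malloc. */
Node *fold(Node *n)
{
    if (n->kind == ADD_NODE) {
        n->lhs = fold(n->lhs);
        n->rhs = fold(n->rhs);
        if (n->lhs->kind == CONST_NODE && n->rhs->kind == CONST_NODE) {
            n->value = n->lhs->value + n->rhs->value;
            n->kind = CONST_NODE;
            free(n->lhs);
            free(n->rhs);
            n->lhs = n->rhs = NULL;
        }
    }
    return n;
}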

- Warren
chroot
#10
Jan12-05, 08:14 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
And generally, the issues you're talking about are not relevant to compilation. How programs or processes share information is not particularly relevant to compiler-level optimization -- after all, one program is certainly not going to pass another a pointer.

- Warren
Dissident Dan
#11
Jan12-05, 08:57 PM
P: 691
Quote by chroot
The most basic kind of compiler parses its input into a tree, then repeatedly applies a set of transformation rules that manipulate the tree, gradually moving it closer to the structure of the target language. Finally, the compiler's back-end walks the intermediate tree and emits the target language.
Yes, I know this. I don't see how it relates to the issues I'm talking about.

Quote by chroot
And generally, the issues you're talking about are not relevant to compilation. How programs or processes share information is not particularly relevant to compiler-level optimization -- after all, one program is certainly not going to pass another a pointer.
If the compiler is going to apply parallelization optimizations, either on the scale I am talking about or even SSE, it would need to know whether statements are independent.

Suppose you passed a pointer to a function, and the function then used that value plus some offset to access a memory area. Suppose also that there is a loop (either inside or outside the function) that starts at that pointer and walks through some number of memory cells, and that the compiler wants to try to parallelize that loop with the first operation. To be safe, it would have to just not parallelize it. Otherwise, it opens itself up to errors.
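
Concretely, something like this (a made-up function; the point is only the possible overlap):

/* The compiler cannot prove that out and in never overlap, so it has to
   assume they might and keep the iterations strictly in order. */
void shift_add(int *out, const int *in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] + in[i + 1];
}

If the caller happens to pass out == in + 1, iteration i overwrites the value that iteration i + 1 is about to read, so running the iterations in parallel (or even reordering them) would change the result.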
chroot
#12
Jan12-05, 09:09 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
Dissident Dan,

Right... the compiler would not be able to parallelize it.

Which, as master_coda said, means that the programmer must bear the burden of writing code that can be optimized. Some code can't be parallelized, even though it can be shown mathematically to be identical to another representation which can be parallelized.

- Warren
Dissident Dan
#13
Jan12-05, 09:18 PM
P: 691
Thanks, everyone, for all your responses.

I have some questions about compiler optimizations and SIMDs.

Do the /G3, /G4, etc. MSVC++ compiler options (or similar options for other compilers), which tell it to optimize for the P-Pro through PIII and the P4 architectures, respectively, automatically cause the compiler to try to apply MMX, SSE, and SSE2 optimizations?

I know that you can also do something like "arch:SSE" in MSVC++. If the answer to my first question is yes, is either option more advantageous than the other in any regard?

Also, I read a little about explicitly programming SSE instructions. Do programmers generally (or ever) achieve speed advantages by doing this by hand over the compiler optimizations?
master_coda
#14
Jan12-05, 09:30 PM
P: 678
Quote by Dissident Dan
As I said, that was just a simple example. There could be much more complex programs that introduce aliasing without you even realizing it (and sometimes you are forced into it), especially when you are interacting with different software systems, such as the OS or commercial packages included in programs: physics engines, sound system software, etc.
An option to tell the compiler "these operations are independent" won't help you in this case. In fact, it will probably make things a lot worse. The compiler will go ahead and optimize the operations like you told it to, and now you've got a broken program because it turns out that the operations you thought were independent really weren't.

Most of the time, if your code can't easily be rewritten into a form the compiler can optimize, there's a good reason: optimizing it would probably break it.
chroot
#15
Jan12-05, 09:43 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
Dissident Dan,

In my experience the MS compiler is not very good at identifying opportunities to use SSE. I have done quite a bit of hand-coding to write matrix arithmetic and digital filters with SSE, and was never able to get the compiler to generate code that came anywhere close to being as efficient.

And yes, a hand-coded asm program that uses SSE for a FIR digital filter, for example, will run about 6x as fast as a hand-coded asm program that uses the normal integer unit. The SSE3 instructions can do as many as eight integer multiplications in one clock cycle, IIRC.
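
A stripped-down sketch of the kind of inner loop I mean, written with intrinsics rather than raw asm (real SSE intrinsics; the names are made up, and the real filter code also has to deal with alignment and the leftover taps):

#include <xmmintrin.h>

/* Four-wide multiply-accumulate. Assumes n is a multiple of 4 and that
   both arrays are 16-byte aligned; real code must handle the general case. */
float dot4(const float *x, const float *h, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);
        __m128 b = _mm_load_ps(h + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));  /* 4 MACs per iteration */
    }

    /* Add up the four partial sums. */
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}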

- Warren
master_coda
#16
Jan12-05, 09:53 PM
P: 678
Quote by Dissident Dan
Thanks, everyone, for all your responses.

I have some questions about compiler optimizations and SIMDs.

Do the /G3, /G4, etc. MSVC++ compiler options (or similar options for other compilers), which tell it to optimize for the P-Pro through PIII and the P4 architectures, respectively, automatically cause the compiler to try to apply MMX, SSE, and SSE2 optimizations?

I know that you can also do something like "arch:SSE" in MSVC++. If the answer to my first question is yes, is either option more advantageous than the other in any regard?

Also, I read a little about explicitly programming SSE instructions. Do programmers generally (or ever) achieve speed advantages by doing this by hand over the compiler optimizations?
Note that /G3 and /G4 will optimize for 386s and 486s, respectively. The options you're referring to are /G6 and /G7. I don't believe that setting either of those options will generate SIMD instructions.

Setting "arch:SSE" or "arch:SSE2" will override any /Gx flags you specify. The compiler will generate SSE code if you specify them, although it probably won't do it as often as it could. In fact, MSVC++ probably won't use them very often at all; I've heard that it's SIMD optimizations are very poor. Intel's compiler is supposed to be much better, but I don't know anyone personally who has any experience optimizing with it.

I have seen people get better performance hand-optimizing SIMD code (the weak MSVC++ optimizations probably have something to do with that). It's probably not useful as often with a better compiler, but there's almost always extra speed that can be cranked out by hand if you need to.
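
So in practice it's just a compiler switch, something like this (file name made up):

cl /O2 /arch:SSE2 filter.cpp

versus plain cl /O2 /G7 filter.cpp if you want P4-tuned scheduling without the SSE code generation.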
Dissident Dan
#17
Jan12-05, 09:53 PM
P: 691
Quote by master_coda
An option to tell the compiler "these operations are independent" won't help you in this case. In fact, it will probably make things a lot worse. The compiler will go ahead and optimize the operations like you told it to, and now you've got a broken program because it turns out that the operations you thought were independent really weren't.
Well, I would assume that there would be parallelizable situations similar enough to unparallelizable situations that the compiler wouldn't be able to tell the difference, but the programmer could. I would think that a compiler would not be able to verify parallelizability (wow, that had a lot of syllables) for most operations using pointers, yet in some of those cases the programmer would know that they were parallelizable.

As vectors are often treated like pointers, this could limit optimization.

I suppose that features such as SSE offer the best opportunity for relatively easy, comprehensible code.
chroot
#18
Jan12-05, 09:59 PM
Emeritus
Sci Advisor
PF Gold
P: 10,429
The ostensible purpose of the 'const' keyword in C is to inform the compiler of what data does not change, allowing it to perform better optimization. In reality, code that requires multiple threads of execution for acceptable performance should just be written as such to start with, rather than trying to rely on the compiler to figure it out. High-level algorithm design is a much, much easier problem to solve than instruction-level optimization.
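
For instance, OpenMP (one existing mechanism along these lines; the sketch below is minimal and the names are made up) lets you state the parallelism explicitly at the loop level and leaves the independence guarantee to the programmer:

/* The pragma asserts that the iterations are independent, so the runtime
   may split them across processors. If that assertion is wrong, the
   program is broken, exactly as discussed above. */
void saxpy(float *y, const float *x, float a, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}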

- Warren

