Parallel Programming Languages

  1. I was recently thinking about how GPUs can do enough calculations to make CPUs cry even though they run at lower clock speeds. My understanding is that this is because they can do so many operations in parallel because the operations are repetitions of many the same types of operations that are independent of each other.

    I was recently thinking that it would be very useful if regular, old CPUs could be designed similarly...of course, they would need programming languages designed for their functionality to take full advantage of it. Suppose you could have a CPU with many physical processing paths for both integer and floating point operations. You could blaze through many different instructions in a single clock cycle.

    The key thing to remembers is that the operations being performed in parallel must be independent of each other: there can be no overlap between the inputs and outpus of the instructions. In other words, the instructions can't operate on the same data, and the output from one operation can't be the input of another operation.

    Actually, langauges like Java (without pointers) could probably be adapted very easily since one could write a program that builds dependency trees for the statements in the source programs. However, in program languages that use pointers, this would be impossible to get 100% correct, as far as I can tell, because you could not tell if two instructions are reading/writing the same memory areas at compile time.

    Of course, pointers are very useful. So, I started thinking about the possibility of the programmer specifying which computations are independent of each other (or conversely, specifying which are dependent, which would probably be easier both for programmer and compiler). Another possibility is assigning statements to separate "parallel groups" sort of similar to threads, but different from threads in that statements in one block of code could be assigned to different parallel groups, whereas threads are used at a much higher level.

    Of course, this could work for multi-processor systems as well as parallel processing within single CPU cores. (The programmer would probably take a different approach in each situation, though, given bus speed limitations inherent in SMP sytems.)

    Are there any languages out there that allow the programmer to specify characteristics like this in existence? If not, what do you think of my ideas for parallel processing?
  2. jcsd
  3. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    Your ideas have been around for at least four decades.

    The Cray machines, Connection Machines, and other supercomputers extensively used vector processing (now called SIMD, or single-instruction, multiple-data computing).

    Even your lowly Pentium has a number of indepedent functional blocks for things like adding and multiplying. It is the compiler's job to order the instructions so to make the best use of those functional units, and the registers and cache on the chip. The processor's control path(s) then handle(s) assigning operands to each functional unit when your program is run.

    The language itself is frankly irrelevant. Modern compilers (particularly those designed for large supercomputing machines) will automatically identify parallelizable blocks and run them independently.

    - Warren
  4. graphic7

    graphic7 560
    Gold Member

    SSE, SSE2, and MMX are all SIMD CPU extentions for x86 processors. Altivec is one of the better features of recent G4 processors, and in my opinion, one of the better (if not the best) SIMD implementations. UltraSparcs have a more MMX-like implemention, called VIS.
  5. Note that this isn't always true, even with really good compilers. Compilers still have to fall back onto heuristics a great deal, and the compiler will sometimes make the wrong tradeoffs and optimize the wrong part of the code at the expense of a more important part.

    However, once that level of optimization becomes necessary it's generally just easiest to hand-code the instructions, rather than jumping through a bunch of hoops to try to trick the compiler into doing what you want.

    Introducing special optimization mechanisms into a language generally isn't helpful when you really, really need the extra speed; you have to step outside the language anyway. And when you don't need that extra speed they just make it easier to prematurely optimize a program.
  6. How would it know whether or not it could parallelize the following?:

    (This is contains useless code, I know, but it's just for demonstration)


    Apparently, the P4 has 2 DDR integer addition units and an integer multiplication unit and only 1 FP arithmetic unit. This allows some possibility for parallelization, but I'm thinking more on the order of dozens of similar units for different types of operations.

    The supercomputers that you mentioned seem to be very much what I am thinking of. I'll have to look into SIMD (includine SSE, etc. that graphic7 mentioned).

    It would be nice to have that kind of funcationality in typical home CPUs, as we do with typical home GPUs.
  7. There are a number of extensions for all the more common programming languages that let you do parallel and distributed processing. there's even a whole newsgroup devoted to parallel computing discussions.

    Some of the parallel computing links I keep in my bookmarks:
    PVM - Parallel Virtual Machine, been around for at least 10 years.
    MPI - Message Passing Interface, also been around at least 10 years
    Supercomputing and Parallel Computing at Yahoo!

    Lately most of the focus seems to be on distributed computing applications (spreading the computing workload over separate networked CPUs) rather than conventional parallel computing (spreading the computing workload over multple CPUs in the same computer)
    Last edited: Jan 12, 2005
  8. If you want a compiler to optimize something, why would you deliberately introduce aliasing? Wouldn't it just be easier to write the code properly in a form that the compiler can optimize rather than writing it using pointer tricks and then adding optimization hints?
  9. I'm not saying that I would write a program like that, it was just for demonstration. A compiler needs to be robust...errors are not acceptable. It needs to be able to deal with any input that fits the specification of the language. If the compiler tries to read the code and incorrectly applies an optimization technique, it breaks the program.

    As I said, that was just a simple example. There could be much more complex programs that introduce aliasing without you even realizing (and sometimes you are forced into it), especially when you are talking about interacting with different software systems, such as the OS or different commercial packages included in programs like physics engines, sound system software, etc.
  10. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    The most basic kind of compiler parses its input into a tree, then repeatedly applies a set of transformation rules that manipulate the tree, gradually moving it closer to the structure of the target language. Finally, the compiler's back-end walks the intermediate tree and emits the target language.

    Since each of the manipulation steps is simple and well-defined (i.e. their effects and side-effects can be mathematically proven to be closed), the entire compilation is mathematically guaranteed. Now, truth be told, real compilers will do more than simple tree-manipulation to achieve high-quality output, but nearly all compilers begin with this kind of processing.

    - Warren
  11. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    And generally, the issues you're talking about are not relevant to compilation. How programs or processes share information is not particularly relevant to compiler-level optimization -- after all, one program is certainly not going to pass another a pointer.

    - Warren
  12. Yes, I know this. I don't see how it relates to the issues I'm talking about.

    If the compiler is going to apply parallelization optimizations, either on the scale I am talking about or even SSE, it would need to know whether statements are independent.

    Suppose you passed a pointer a function, and the function then used that value plus some offset to access a memory area. Suppose also that there is a loop (either inside or outside the function) that starts with that pointer and goes through some number of memory cells that it wants to try to parallelize with the first operation. To be safe, it would have to just not parallelize it. Otherwise, it opens itself up to errors.
  13. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    Dissident Dan,

    Right... the compiler would not be able to parallelize it.

    Which, as master_coda said, means that the programmer must bear the burden of writing code that can be optimized. Some code can't be parallelized, even though it can be shown mathematically to be identical to another representation which can be parallelized.

    - Warren
  14. Thanks, everyone, for all your responses.

    I have some questions about compiler optimizations and SIMDs.

    Do the /G3,/G4, etc. MSVC++ compiler options (or similar optimizations for other compilers), (which tell it to optimize for P-Pro - PIII and P4 architectures, respectively) automatically cause the compiler to try to apply MMX, SSE, SSE2 optimizations?

    I know that you can also do something like "arch:SSE" in MSVC++. If my first question is true, is either option more advantageous than the other in any regards?

    Also, I read a little about explicitly programming SSE instructions. Do programmers generally or ever achieve speed advantages by doing this by hand over the compiler optimizations?
  15. An option to tell the compiler "these operations are independent" won't help you in this case. In fact, it will probably make things a lot worse. The compiler will go ahead and optimze the operations like you told it to, and now you've got a broken program because it turns out that operations you thought were independant really weren't.

    Most of the time if your code can't be easily rewritten into a form the compiler can optimize then there's a good reason; optimizing it will probably break it.
  16. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    Dissident Dan,

    In my experience the MS compiler is not very good at identifying opportunities to use SSE. I have done quite a bit of hand-coding to write matrix arithmetic and digital filters with SSE, and was never able to get the compiler to generate code that came anywhere close to being as efficient.

    And yes, a hand-coded asm program that uses SSE for a FIR digital filter, for example, will run about 6x as fast as a hand-coded asm program that uses the normal integer unit. The SSE3 instructions can do as many as eight integer multiplications in one clock cycle, IIRC.

    - Warren
  17. Note that /G3 and /G4 will optimize for 386s and 486s, respectively. The options you're refering to are /G6 and /G7. I don't believe that setting either of those options will generate SIMD instructions.

    Setting "arch:SSE" or "arch:SSE2" will override any /Gx flags you specify. The compiler will generate SSE code if you specify them, although it probably won't do it as often as it could. In fact, MSVC++ probably won't use them very often at all; I've heard that it's SIMD optimizations are very poor. Intel's compiler is supposed to be much better, but I don't know anyone personally who has any experience optimizing with it.

    I have seen people get better performance hand-optimizing SIMD code (the weak MSVC++ optimizations probably have something to do with that). It's probably not useful as often with a better compiler, but there's almost always a extra speed that can be cranked out by hand it you need to.
  18. Well, I would assume that there would be parallelizable situations that would be similar enough to unparallelizable situations that the compiler wouldn't be able to tell the difference, but the programmer could. I would think that a compiler would not be able to verify parallelizability (wow, that had a lot of syllables) on most operations using pointers, yet in some of these case, the programmer would know that they were parallelizable.

    As vectors are often treated like pointers, this could be limiting to optimization.

    I suppose that features such as SSE offer the best opportunity for relatively easy, comprehensible code.
  19. chroot

    chroot 10,427
    Staff Emeritus
    Science Advisor
    Gold Member

    The ostensible purpose of the 'const' keyword in C is to inform the compiler of what data does not change, allowing it to perform better optimization. In reality, code that requires multiple threads of execution for acceptable performance should just be written as such to start with, rather than trying to rely on the compiler to figure it out. High-level algorithm design is a much, much easier problem to solve than instruction-level optimization.

    - Warren
  20. you can do parrallel programming now
    both linux and windows support multiple cpus
    it all depends on the hardware architecture and op-system architecture

    The Myth
  21. chroot,

    You've been very helpful. Thanks for putting up with me. :smile:


    The problem with multiple CPUs, as I see it, is that the bus transfers to the different CPUs can severely hamper performance. SMP seems to be best suited to programs where there are rather lengthy independent sets of instructions, so that you can run each for a while before the data computed by one CPU has to be used in conjunction with data computed by the other.

    This doesn't seem to mesh well with something like matrix multiplication, but I may be wrong.
Know someone interested in this topic? Share a link to this question via email, Google+, Twitter, or Facebook

Have something to add?