Why Have Pipelining? Benefits & Advantages

BicycleTree · May 16, 2005

I'm not sure this is the right forum for this.

Anyway, I just finished a course in computer organization and design, and there's one thread that's left hanging. If a pipelined design only increases throughput, and single-cycle datapaths are faster because they don't need temporary registers, why not just have several single-cycle datapaths working in parallel instead of a pipelined multistage design? Five parallel single-cycle datapaths should be faster than a five-stage pipelined datapath, and have better latency too. Is it just because of chip space?

Also, one disadvantage posed for the single-stage datapath was that all instructions would have to have the same temporal length, because the clock cycle has to be long enough for the instruction that takes the most time. But couldn't you have many clock cycles and just not use the majority of them, so you would still have an edge available no matter when the next instruction is ready to start?

russ_watters · May 16, 2005

BicycleTree said:

But couldn't you have many clock cycles and just not use the majority of them, so you would still have an edge available no matter when the next instruction is ready to start?

No: The faster the clock cycles, the more heat is dissipated, since every time you flip a transistor, it releases a little bit of heat. Intel increased their pipeline length (from the P3 to the P4) to allow for faster clock speeds, with shorter pipeline stages. Regardless, though, they are now moving toward more parallel processing with the introduction of dual-core chips.

BicycleTree · May 16, 2005

Heat dissipation wouldn't be a problem--or wouldn't be an unusual problem--because for a single cycle design with fixed instruction duration the clock cycle is much longer anyway. Increasing the clock speed for a single cycle design would basically just be bringing it up to normal.

faust9 · May 17, 2005

The point of pipelining is to limit the number of wasted clock cycles. You can add as much latency as you want if your goal is to cool the processor by using wait cycles, though you are going in the reverse direction of the Intel engineers by doing so.

Read this for a good explanation of why pipelining is used (to reduce toe tapping of the processor):
http://www.hitex.co.uk/arm/lpc2000book/index.html

Now, how can you say heat dissipation wouldn't be a problem? What are you basing this idea on?

BicycleTree · May 17, 2005

I understand the basics of pipelining. My point is that while a pipelined datapath has more throughput than a single-cycle datapath, it also has more latency due to the temporary registers it must use.

A single-cycle datapath with a clock speed that enables the control unit to start the new instruction when the old one ends instead of after a fixed length of time would have less latency than a pipelined datapath. If you put many of these in parallel, they would simulate the effect of pipelines, and have better latency too.

I'm only suggesting increasing the clock rate up to the clock rate of a normal, pipelined chip. In effect, you would be replacing multiple stages with a single stage of variable length, but that only varies in multiples of the length of a pipelined stage (finer variation would require faster clock). So there would not be any heat problem due to clock rate that a normal, pipelined chip doesn't have to deal with.
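The trade-off being argued here can be put into rough numbers. The sketch below is idealized and every figure in it is a made-up assumption (10 ns of logic per instruction, 0.5 ns per pipeline register, five stages or five copies). It ignores hazards, shared register-file and memory ports, and the cost of duplicating the datapath, so it only shows what the proposal claims on paper, not whether it is practical.

```python
# Idealized comparison of a k-stage pipeline with k parallel single-cycle
# datapaths. All constants are illustrative assumptions, not measurements.
LOGIC_NS = 10.0   # assumed combinational delay of one whole instruction
REG_NS = 0.5      # assumed delay added by each pipeline register
STAGES = 5        # pipeline depth, and number of parallel datapaths


def pipelined(stages, logic_ns, reg_ns):
    """Return (latency in ns, throughput in instructions/ns) for a pipeline."""
    cycle = logic_ns / stages + reg_ns   # each stage pays the register overhead
    latency = stages * cycle             # one instruction passes every stage
    throughput = 1.0 / cycle             # one instruction finishes per cycle
    return latency, throughput


def parallel_single_cycle(copies, logic_ns):
    """Return (latency, throughput) for `copies` independent datapaths."""
    latency = logic_ns                   # no pipeline registers in the path
    throughput = copies / logic_ns       # each copy finishes one per logic_ns
    return latency, throughput


pipe_latency, pipe_tp = pipelined(STAGES, LOGIC_NS, REG_NS)
par_latency, par_tp = parallel_single_cycle(STAGES, LOGIC_NS)
print(f"pipeline: latency {pipe_latency} ns, throughput {pipe_tp:.2f} instr/ns")
print(f"parallel: latency {par_latency} ns, throughput {par_tp:.2f} instr/ns")
```

Under these assumptions the parallel design comes out ahead on both latency and throughput, but it pays for that with five full copies of the logic, and nothing here models contention for the shared register file or memory.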

faust9 · May 17, 2005

I don't think you understand why pipelines are used. Read the reference above for a clearer picture of why processors are pipelined. Replace flash with HD or slow RAM (the RAM speed is a fraction of clock speed, isn't it?) and you'll understand why your proposal will either a) slow the processor or b) exceed the limits of your storage media, thus slowing the processor.

Your proposal for using parallel paths adds unnecessary complexity to a system IMHO. You proposed, as I take it, using distinct parallel paths instead of a pipeline; however, doing so would require more transistors, you're not taking away any registers by paralleling, and you create an added need for a register counter (a PC to determine which of the three or so parallel registers to go to next). This also brings up the question of "how do I handle a jump?" Your stack would become more complicated due to the added need to account for the multiple data paths.

BicycleTree · May 17, 2005

Apparently I have to go through a registration process to read the information on that site, so if you want to say something please provide another link.

Pipelines require extra temporary registers--invisible to the programmer--to store the results of operations after each phase of the pipeline. Storing and loading from these registers takes significant time, so I have been told.

Pipelining is just a system of parallel processing anyway; any complexities/conflicts from using parallel single-cycle datapaths would also result from using pipelines.

I don't know what you're talking about with "three or so parallel registers." You could have only one register file that the single-cycle datapaths take turns accessing.

How wouldn't you handle a jump? You just load the PC with the new value and insert a few NOPs to wait until it goes there, just like with a pipeline.

I am sure there are good reasons why my scheme is not used, but you are not providing them.

faust9 · May 17, 2005

BicycleTree said:

Apparently I have to go through a registration process to read the information on that site, so if you want to say something please provide another link.

Pipelines require extra temporary registers--invisible to the programmer--to store the results of operations after each phase of the pipeline. Storing and loading from these registers takes significant time, so I have been told.

Pipelining is just a system of parallel processing anyway; any complexities/conflicts from using parallel single-cycle datapaths would also result from using pipelines.

I don't know what you're talking about with "three or so parallel registers." You could have only one register file that the single-cycle datapaths take turns accessing.

How wouldn't you handle a jump? You just load the PC with the new value and insert a few NOPs to wait until it goes there, just like with a pipeline.

I am sure there are good reasons why my scheme is not used, but you are not providing them.

A three stage pipeline is very common---for speed and performance gains---hence the use of three parallel paths in my discussion.

Just register for the book. Make some stuff up if you don't feel like giving your correct information.

How is it that you bounce around between not adding latency and adding latency? Adding a NOP is adding latency. You are attempting to slow a process down on the processor side instead of addressing the real bottleneck---Data access.

If, as you proposed in the first post of this thread, "why not just have several single-cycle datapaths working in parallel", then you still end up bottlenecked at the single register at the end (the one you just proposed). More importantly, your last description is essentially how a data pipe works anyway. Take the time to read the link. Just skip to the description of how the ARM7 pipeline works.

The problem with your scheme is it doesn't address the slowest part of the process; moreover, you are attempting to add latency with NOPs to satisfy some arbitrary timing. RISC processors tend to complete an entire instruction in a single instruction cycle anyway (few and easy instructions execute much quicker along with pipelining). x86 processors OTOH have various execution times per instruction (from 12 to 48 or something like that---I don't work with x86) so, moving data from location A to an end processor register can easily be accomplished in less than 12 instruction clock cycles (some 8051's use 3 to 12'ish instruction cycles per instruction). Hell, it can be done in a couple of real clock cycles.

Getting the data to the processor from whatever external source is used IS the slowest part of the system. You cannot read/write RAM faster than the RAM is designed for. Adding a faster process (trying to read RAM via 5 parallel data paths) will not help you because you are limited by external resources anyway. Adding multiple paths will add complexity no matter how you look at it, even if they all end up dumping their data into a single register at the end.

Here, a little more info:
http://cse.stanford.edu/class/sophomore-college/projects-00/risc/pipelining/
http://tonywesley.com/pipetop.html

Personally, I don't see your proposal reducing the complexity of a processor in any way; however, I do see a lot of wasted clock cycles.

My 2 cents.

egsmith · May 17, 2005

You should really read faust9's post before reading this. His post is really good.

BicycleTree said:

Pipelines require extra temporary registers--invisible to the programmer--to store the results of operations after each phase of the pipeline. Storing and loading from these registers takes significant time, so I have been told.

You need to state what significant time means and its significance compared to what. Without quantifying these relations the post is not very meaningful. Also, you are not being very clear on what it is you want to maximize.

BicycleTree said:

single-cycle datapaths are faster because they don't need temporary registers

From reading your posts I think this is the statement that is causing you the most problems. This is only true for very particular architectures and only in one sense. Basically, like with "significant," you are leaving out the important details.

The statement should read:
The time for one instruction to complete in single-cycle datapaths is often less than that of a pipelined datapath because pipelined datapaths require temporary registers.

If we are concerned with the time to complete many instructions then we note pipelined archs can reduce the work required per cycle and thus reduce the cycle time. The few additional cycles which are required due to various technical reasons do not overwhelm the time saved by reducing the cycle time. Thus the time to complete many instructions is less for pipelined archs than for single-cycle archs.
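A back-of-the-envelope version of this argument, as a sketch only: the delays below are assumed values chosen for illustration, and the pipeline is assumed to run without stalls or branch penalties.

```python
# Time to finish n instructions back to back, single-cycle vs pipelined.
# LOGIC_NS, REG_NS and STAGES are illustrative assumptions.
LOGIC_NS = 10.0   # assumed single-cycle clock period (whole instruction)
REG_NS = 0.5      # assumed overhead of one pipeline register
STAGES = 5        # assumed pipeline depth


def single_cycle_total_ns(n, logic_ns=LOGIC_NS):
    """Each instruction occupies one long clock cycle."""
    return n * logic_ns


def pipelined_total_ns(n, stages=STAGES, logic_ns=LOGIC_NS, reg_ns=REG_NS):
    """The first instruction needs `stages` cycles; each later one adds a cycle."""
    if n == 0:
        return 0.0
    cycle = logic_ns / stages + reg_ns
    return (stages + n - 1) * cycle


for n in (1, 10, 1000):
    print(n, single_cycle_total_ns(n), pipelined_total_ns(n))
```

With these numbers a lone instruction finishes sooner on the single-cycle design (10 ns against 12.5 ns), while a run of 1000 instructions takes roughly a quarter of the time on the pipeline.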

Now, back to the original question. Which do you think is more important in a typical computing scenario, the time to complete one instruction or the time to complete many many instructions? (note: the more instructions considered the more time is saved in a pipelined arch.)

Once upon a time all CPUs were single cycle. Now you would have to look pretty hard to find one, and I'll bet its performance is only superior in one special case.

BicycleTree · May 18, 2005

Faust:
Where did you get the idea that I was trying to reduce the complexity of the processor? I have been talking from the beginning about reducing latency, and I haven't mentioned reducing complexity.

A pipeline has exactly the same problem with branches as parallel single-cycle datapaths would have. There is also stuff--branch prediction hardware--that helps pipelines work faster in the presence of branches, so they don't need to be supplied with NOPs, and I see no reason why branch-prediction hardware could not be incorporated into a parallel single-cycle design.

RISC processors tend to complete an entire instruction in a single instruction cycle anyway (few and easy instructions execute much quicker along with pipelining).

If an instruction completes its execution within a single cycle, then it cannot be pipelined (cannot derive any gain from pipelining) because pipelining requires that the instruction be broken into multiple stages. Or am I misinterpreting what you mean here?

so, moving data from location A to an end processor register can easily be accomplished in less than 12 instruction clock cycles (some 8051's use 3 to 12'ish instruction cycles per instruction). Hell, it can be done in a couple of real clock cycles.

Is there a difference between an instruction clock cycle and a real clock cycle?

If the registers are the bottleneck, then that would be one good reason why parallel single-cycle datapaths are not used. So would it then be accurate to say that most processor speed improvements depend on improvements to register access time? This would be surprising if true, but I don't know whether it is--is it? You're saying that the speed of the processor is relatively unimportant because of the slowness of memory?

On that site, there is also the problem that the download is 8.8 mb and I am on a 56k modem. A registration process, an 8.8 mb download, and all to get at material that you _claim_ is relevant. It seems pretty design-specific in this case; it is a guide solely for the arm7 based Philips microcontrollers rather than general theory. So going to that site would require that I fill out the registration form, spend two or three hours (or likely more) downloading the guide, then reading through a ton of device-specific material in hopes of some theory somewhere in the middle. For these reasons, I am not going to waste my time on that site. If there's stuff in the guide that I should know that can't be found elsewhere, just post it here, or pm it to me if you're worried about copyright infringement.

The first of your other two links is stuff I already know, and the second one doesn't display.

egsmith:
The only rough figure I have was when I asked the professor about how long register storage and access might take, and he said it would probably be several times the amount of time it takes a signal to go through an ALU. He is apparently very knowledgeable, though I couldn't say for sure.

I know that individual single-cycle datapaths have less throughput than pipelined datapaths. That's why I suggested using several parallel single-cycle datapaths.

Personally, I suspect that the reason parallel single-cycle datapaths are not used is because they take up chip area, and lower latency is usually not so important compared to throughput-per-unit-area on the chip. I'm looking for confirmation of this, and possibly other issues that might be easier to handle with a pipeline.

faust9 · May 18, 2005

BicycleTree said:

Faust:
Where did you get the idea that I was trying to reduce the complexity of the processor? I have been talking from the beginning about reducing latency, and I haven't mentioned reducing complexity.

Adding complexity tends to slow things down and increases costs. The costs associated with transistors on a processor die are exponential in that 1 transistor may cost x while 100 may cost x^n.

A pipeline has exactly the same problem with branches as parallel single-cycle datapaths would have. There is also stuff--branch prediction hardware--that helps pipelines work faster in the presence of branches, so they don't need to be supplied with NOPs, and I see no reason why branch-prediction hardware could not be incorporated into a parallel single-cycle design.

Possibly. In retrospect on this you are probably right.

If an instruction completes its execution within a single cycle, then it cannot be pipelined (cannot derive any gain from pipelining) because pipelining requires that the instruction be broken into multiple stages. Or am I misinterpreting what you mean here?

This is false. I said RISC processors tend to complete an instruction in a single instruction cycle. That is because a single instruction may actually consume 12 clock cycles, but the effect of pipelining is that one instruction completes every clock cycle after the first instruction completes.

Is there a difference between an instruction clock cycle and a real clock cycle?

Clock cycles or machine cycles are the frequency of the system clock used to drive the processor. For instance, the external system clock on most PCs (I'd like to say all, but then there is always an oddball design that gets thrown out, thus making me look buffoonish :) coincides with the clock speed of the RAM. The processor takes that clock signal and multiplies it to 3.4GHz or whatever the latest and greatest processor is (PPC is BTW). This is done via an internal PLL within the processor. The output of the PLL at 3.4GHz is the machine cycle or, as I referred to it, the 'real clock cycle.'

An instruction clock cycle is the number of machine cycles needed to complete a single instruction. An example is the Motorola HC11. The machine clock is 4x faster than the instruction clock, thus a single instruction takes four machine states to complete. Now, if the HC11 had a pipe, then the first instruction would need 4 machine states to complete, but every successive instruction would complete in a single machine cycle thereafter.
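The HC11 comparison can be put into numbers. This is only a sketch of the hypothetical in the post (an HC11 with a pipe), assuming no stalls; the function names are illustrative.

```python
# Machine cycles needed for n instructions, with and without a pipe.
MACHINE_CYCLES_PER_INSTRUCTION = 4   # figure quoted in the post for the HC11


def cycles_without_pipe(n):
    """Every instruction takes the full four machine cycles."""
    return MACHINE_CYCLES_PER_INSTRUCTION * n


def cycles_with_pipe(n):
    """First instruction takes four cycles, each later one completes one cycle later."""
    if n == 0:
        return 0
    return MACHINE_CYCLES_PER_INSTRUCTION + (n - 1)


def average_cycles_per_instruction(n):
    """Average cost per instruction with the pipe; approaches 1 as n grows."""
    return cycles_with_pipe(n) / n


for n in (1, 10, 1000):
    print(n, cycles_without_pipe(n), cycles_with_pipe(n),
          round(average_cycles_per_instruction(n), 3))
```

Each instruction still spends four machine cycles in flight; only the average cost per completed instruction drops toward one cycle.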

If the registers are the bottleneck, then that would be one good reason why parallel single-cycle datapaths are not used. So would it then be accurate to say that most processor speed improvements depend on improvements to register access time? This would be surprising if true, but I don't know whether it is--is it? You're saying that the speed of the processor is relatively unimportant because of the slowness of memory?

The registers are not the bottleneck. Access to RAM/ROM/hard drive is. The system clock on a PC hypothetically runs at 133MHz. The processor hooked up to that RAM hypothetically runs at 1.4GHz. The processor can request data much, much faster (about 10.5 times faster) than the RAM can deliver it, so paralleling the accesses to RAM would do little to improve things unless you bank your RAM and/or add multiple addr:data buses. Banking and multiple addr:data buses have their own problems though. Increasing the access frequency of RAM increases the frequency on the addr bus. The foil runs become antennas and crosstalk increases (data jumping from one bus line to another). Essentially the flow of information into the processor from outside sources is the problem. It is highly impractical to add copious amounts of RAM to a processor because the costs skyrocket (see the cost differences between a Pwhatever and a comparable Celeron).
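The 133MHz and 1.4GHz figures give the ratio quoted above, and a crude model shows why extra datapaths do not help once memory is the limit. The model is an assumption made for illustration: no cache, one instruction per processor cycle per datapath, and at most one memory access per memory-bus cycle.

```python
# Crude memory-bound model using the hypothetical clock figures from the post.
CPU_HZ = 1.4e9    # hypothetical processor clock from the post
MEM_HZ = 133e6    # hypothetical memory/system clock from the post

# Processor cycles that elapse during one memory cycle (about 10.5).
speed_ratio = CPU_HZ / MEM_HZ


def instructions_per_second(datapaths, accesses_per_instruction):
    """Throughput is capped by whichever is slower: compute or memory."""
    compute_limit = datapaths * CPU_HZ              # assumed 1 instr/cycle each
    memory_limit = MEM_HZ / accesses_per_instruction
    return min(compute_limit, memory_limit)


print(round(speed_ratio, 1))
print(instructions_per_second(1, 1.0))   # one datapath, uncached
print(instructions_per_second(5, 1.0))   # five datapaths, same memory
```

In this model one datapath and five datapaths both top out at the memory rate, which is the point being made about external bottlenecks. Real systems soften this with caches, which the model deliberately leaves out.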

On that site, there is also the problem that the download is 8.8 mb and I am on a 56k modem. A registration process, an 8.8 mb download, and all to get at material that you _claim_ is relevant. It seems pretty design-specific in this case; it is a guide solely for the arm7 based Philips microcontrollers rather than general theory. So going to that site would require that I fill out the registration form, spend two or three hours (or likely more) downloading the guide, then reading through a ton of device-specific material in hopes of some theory somewhere in the middle. For these reasons, I am not going to waste my time on that site. If there's stuff in the guide that I should know that can't be found elsewhere, just post it here, or pm it to me if you're worried about copyright infringement.

All ARM7s are essentially the same. The added functions (ADCs, port configurations, etc.) are what make one different from another. I referenced that source because the Philips ARM7 has on-chip flash with a max access frequency of about 20MHz. The ARM7 processor can chug along at 60MHz. In this case flash would be the bottleneck in the system---just like RAM on your PC---but by using a datapipe the processor can overcome the speed difference between the processor and the flash.

I can't cut and paste the section (ebook security), so you'll have to search the net if you're really curious how a fetch-decode-execute pipeline can improve system performance.

The first of your other two links is stuff I already know, and the second one doesn't display.

The first link shows how a RISC processor can execute instructions in a single instruction cycle. The second may have been down or on a slow server. I don't know.

egsmith:
The only rough figure I have was when I asked the professor about how long register storage and access might take, and he said it would probably be several times the amount of time it takes a signal to go through an ALU. He is apparently very knowledgeable, though I couldn't say for sure.

I know that individual single-cycle datapaths have less throughput than pipelined datapaths. That's why I suggested using several parallel single-cycle datapaths.

Personally, I suspect that the reason parallel single-cycle datapaths are not used is because they take up chip area, and lower latency is usually not so important compared to throughput-per-unit-area on the chip. I'm looking for confirmation of this, and possibly other issues that might be easier to handle with a pipeline.

It probably has more to do with nominal gains versus increasing costs and the inability to utilize the gains due to external forces (RAM/ROM/HD/flash).

Hope this helped.

egsmith · May 18, 2005

BicycleTree said:

I'm looking for confirmation of this, and possibly other issues that might be easier to handle with a pipeline.

There are tons of them: some architectural, some electrical and some managerial. I'll give you a key electrical one as this is the hardest design challenge (IMHO) when designing single cycle datapaths.

The various gates and routes that make up the datapath all have different delays. Furthermore, signals passing through the same gates but along different combinational paths also see different delays (let's just ignore power supply and temperature variations, even within the same chip, for now). This makes determining the exact input to a combinational function at any given time extremely difficult. Therefore ensuring that the combinational logic behaves properly is extremely difficult. Adding registers between small groups of combinational logic synchronizes all the outputs to a specific timing event. This means the next stage can be designed by assuming its combinational inputs all arrive at the same time (i.e. the timing event), which reduces the complexity of the design tremendously.

I don't know if you understand that but here it is in a more high level view. (I assume you have some knowledge of basic digital design.) Because the combinational logic is so complex in a single cycle datapath the designer inadvertently creates asynchronous state machines. Getting an asynchronous state machine to function properly is much harder than getting a synchronous one to work.

As one who has designed both single cycle and pipelined datapaths I can assure you the latter is much easier to get right.

Kenneth Mann · May 19, 2005

I don't see what the problem is, but I'll try to give some simple-minded illustrations of why pipelining is used. First, forget clock cycles. This can be pure confusion, because every different machine design has a different definition as to what a clock cycle is, and how many are taken to do what. There's no hard rule. One system may do in one "so called" cycle what another does in maybe five or six, and yet still execute an instruction in comparable time periods. It all depends upon how the designers choose to divide down, distribute, then use the various clock signals to drive the internal logic.

Also, chip (die) area is not a prime concern here (it is with regard to heat generation and die reliability, though). The main concern with respect to execution speed is "cell area".

Also, the access times for registers (especially simple dedicated pipeline registers) are not appreciably different from that of an ALU (which can have a bit of its own pipelining if it is capable of multiplying). A look at the diagram of the AM2901 (register file of the 2900 bit-slice processor family) shows a throughput delay of approximately 7 to 9 gate delay times, very similar to an ALU.

The great throughput advantage of pipelining, is in 'execution overlap'. To illustrate this we show the case of the execution of three sequential instructions in a very simple-minded processor, first for a 'single-cycle' processor, then for one with pipelining.

Single Cycle:

Send first instruction address
Fetch first instruction
execute first instruction
Send second instruction address
Fetch second instruction
execute second instruction
Send third instruction address
Fetch third instruction
execute third instruction

With pipelining:

Phase 1: Send 1st instr address
Phase 2: Fetch 1st instruction | Send 2nd instr address
Phase 3: Execute 1st instr | Fetch 2nd instruction | Send 3rd instr address
Phase 4: Execute 2nd instr | Fetch 3rd instruction
Phase 5: Execute 3rd instr

In the first case, nine 'instruction phases' are used, and in the second case, only five are needed, due to the overlap which pipelining affords. Note that this advantage is largely lost during branch and jump type operations.
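The two listings above can be generated mechanically, which makes the nine-versus-five count easy to check for other instruction counts. This is a minimal sketch of the simple three-phase processor described in the post, with no branches; the function names are illustrative.

```python
# Build the phase-by-phase schedule for n instructions on the simple
# three-phase processor described above (no branches or stalls).
PHASES = ["send address", "fetch", "execute"]


def sequential_schedule(n):
    """One phase of one instruction per time slot, nothing overlapped."""
    return [[(i, phase)] for i in range(1, n + 1) for phase in PHASES]


def pipelined_schedule(n):
    """Instruction i enters the pipe one slot after instruction i - 1."""
    total_slots = len(PHASES) + n - 1 if n > 0 else 0
    schedule = []
    for slot in range(total_slots):
        active = []
        for i in range(1, n + 1):
            stage = slot - (i - 1)          # which phase instruction i is in
            if 0 <= stage < len(PHASES):
                active.append((i, PHASES[stage]))
        schedule.append(active)
    return schedule


for slot, active in enumerate(pipelined_schedule(3), start=1):
    print(f"Phase {slot}:", " | ".join(f"{p} #{i}" for i, p in active))
print(len(sequential_schedule(3)), "phases sequential,",
      len(pipelined_schedule(3)), "phases pipelined")
```

In general the sequential case needs 3n phases and the overlapped case n + 2, so the saving grows with the length of the straight-line run.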

Where pipelining is most effective, however, is with special "pipeline processors" where the same 'sequence' of operations is performed many, many times upon a large array of values. In this case, a single instruction (which represents an entire sequence) is set up to execute over and over, on the array of data, and a pipeline of registers (and possibly execution units) is configured to further parallel the operations. The data is then shifted through this pipeline continuously, using only the one (complex) instruction, until all the operations are processed. This is especially useful for certain 'number crunching' operations, such as digital or (especially) standard Fourier transforms. Here, the operations take place within a "two-dimensional" context, where, for example, you may have 1000 X 1000 operations. An even more extreme example is "Hologram processing", which takes place in a 'three-dimensional' reference, where we may have 1000 X 1000 X 1000 operations to process each point within a 3-D space, and then to project it onto every point within a 2-D space, 1000 X 1000 more operations may be needed. (Total, 1,000,000,000,000,000 operations.) For this type of operation, a special pipeline processor is probably the only logical choice.

KM

Sadmemo · May 21, 2005

I guess OP may also find it useful to consult Computer Architecture by John L. Hennessy and David A. Patterson.
Things are not far away from what people have explained...

BicycleTree · May 21, 2005

I don't need any basic explanations of why pipelining is used. I said, I had a course that covered it. It actually did use a Patterson and Hennessy book (Computer Organization and Design) in the second part of it.

egsmith, it would not be entirely combinational. Only the operations of each single-cycle datapath would be combinational. The operation of each whole single-cycle datapath would still work by the clock.

faust, I know that the average execution time of a large batch of instructions can be close to 1 instruction per cycle. The execution time of each instruction still takes several cycles, and the start of each instruction would still be locked to clock cycles. Would it really be that much more complicated?

The costs associated with transistors on a processor die are exponential in that 1 transistor may cost x while 100 may cost x^n.

That's very interesting--why is that?

The other things you said were informative but I don't have comments on them.

Kenneth: That's interesting, I once read an article about scientists trying to predict the movements of large groups of stars, and it said they used custom hardware to do it. So I guess that was what they did; I never made the connection.

Also, here is another question left unanswered in the course: What occupies the rest of the space on a chip? It seems like a basic datapath, except for the control unit which we didn't cover, breaks down into only a few hundred logic gates besides registers, which wouldn't require more than a few thousand gates. Is all the rest of the space on a chip occupied by fancier operations than the simple ones our model datapath included, or is it all covered by the control unit, or is there something I'm missing?

so-crates · May 27, 2005

BicycleTree said:

Also, here is another question left unanswered in the course: What occupies the rest of the space on a chip? It seems like a basic datapath, except for the control unit which we didn't cover, breaks down into only a few hundred logic gates besides registers, which wouldn't require more than a few thousand gates. Is all the rest of the space on a chip occupied by fancier operations than the simple ones our model datapath included, or is it all covered by the control unit, or is there something I'm missing?

There's lots of memory floating around, including several different types of caches. There also have to be other features such as bus control.

BicycleTree · May 27, 2005

So:

--space on the chip is almost irrelevant for something as small as a datapath, so the advantage of pipelining is not space
--pipelining reduces the cost of the datapath for a given throughput compared to a parallel single-cycle datapath
--pipelining does not have a throughput advantage over parallel single-cycle design
--datapath speed is almost irrelevant after a point
--most of the chip is memory and data movement
