Learning materials for understanding CPUs

  • #1
SchroedingersLion
Greetings,

I am enrolled in a graduate-level course on HPC computing. The lectures, however, are often very superficial. For example, after the session on processors, I was left wondering about the difference between clock cycles and machine cycles, and about the relationship between FLOP count and clock cycles.
The practical exercise then required basic assembly knowledge and consisted of answering questions such as "How many instructions does the assembly code issue per loop?" or "Assuming a fully pipelined architecture running at 1 GHz, which can issue one instruction per cycle, and assuming no stalls, how long does each iteration take to execute?" Of course, assembly was not taught in the lecture.

Since it is a CS course and I don't have a CS background, other students might be better suited for it. I was wondering if anyone here has a nice, comprehensive set of lecture notes or a book that explains these topics. I don't want to become an expert, and I don't have time to learn all the tiny details or subtleties of CPU architectures, but since I will be working with HPC systems during my research over the next few years, I want to obtain a solid grasp of the hardware basics.

SL
 
  • #2
Merely an opinion: given your motivations and other concerns like 'tiny details', I would strongly suggest that you are in the wrong course, assuming you will not be coding in assembler.

If you are going to be writing high-level parallelized code, then learn the language(s) and job control extremely well; the folks who build compilers worry about everything else.
 
  • #3
If you don't have a CS background, why are you taking a class that goes deep into the CPU? Even CS people might not necessarily know what you are asking. I was a hardware designer a long time ago, so I know exactly what you are asking.

You might not want to know the details, but your questions are very detailed. To count the clock cycles of each instruction, you have to look at the instruction. Sometimes trying to take shortcuts and avoid digging deep ends up wasting more time and effort, and you end up not knowing after all.

Besides, there are so many CPUs, and each of them can be very different, so you really have to specify that as well. Back in the day, when I designed CPU boards, I just read the datasheet of the CPU. It should have all the information there...if you can understand it. In my day, I learned the Intel 8080 processor; we just went over the datasheet line by line. It had all the clock timing, the read and write cycles, the wait states, and so on. We just counted the clock cycles of each instruction: for a read cycle or write cycle, when the address comes out, when to latch the address, when the data comes out, when to latch the data. All according to the clock.

It took me two or three weeks to learn it well. I don't know how CPUs are now, as I quit digital design back in the early '80s and moved on to bigger and better things. If you want to understand this, be prepared to invest a few weeks of your time.
 
  • #5
Tom.G said:
If you want to dig deeper, here is a link to the technical reference for the Intel 8080 CPU, circa 1973 (back when things were simpler and explained in more detail), the great-grand-daddy of the Intel CPUs in present-day home computers.
http://bitsavers.trailing-edge.com/...Microcomputer_Systems_Users_Manual_197509.pdf

Cheers,
Tom
Good old days. That's what I studied, along with the Z80 and 8085. Around 2-9 it shows the read, write, instruction, interrupt, and other cycles.
 
  • #6
SchroedingersLion said:
I am enrolled in a graduate-level course on HPC computing. The lectures, however, are often very superficial. For example, after the session on processors, I was left wondering about the difference between clock cycles and machine cycles, and about the relationship between FLOP count and clock cycles.
Are you working out of a textbook? If so, is there a glossary of terms or an index?
I don't believe there is any difference between clock cycles and machine cycles. The CPU operations are governed by a system clock. In each clock cycle an instruction is fetched, decoded, and executed. During one clock cycle memory can be read from or written to.
FLOP stands for floating-point operation. Floating-point operations can often take many clock cycles to execute, more so than integer operations such as add, sub, and so on. Performance tests that compare one machine with another usually have suites of applications that time the execution of both integer and floating-point instructions.
SchroedingersLion said:
The practical exercise then required basic assembly knowledge and consisted of answering questions such as "How many instructions does the assembly code issue per loop?" or "Assuming a fully pipelined architecture running at 1 GHz, which can issue one instruction per cycle, and assuming no stalls, how long does each iteration take to execute?" Of course, assembly was not taught in the lecture.
Does the course have any prerequisites? From your description it appears that you are expected to have at least some knowledge of an assembly language.
To answer the questions posed above, all you need to do is to count the instructions in the loop. It makes a difference, though, if different instructions take different numbers of clock cycles to execute.
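As a hedged illustration of that counting (the loop below and the per-iteration instruction count in the comments are hypothetical, not taken from the course exercise; the real numbers depend on the ISA, the compiler, and the optimization flags):
Code:
#include <cstddef>

// A simple accumulation loop in C++. A compiler might translate the body
// into roughly 5-6 machine instructions per iteration (load, load, multiply,
// add, increment index, compare-and-branch); treat the count as illustrative.
double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Back-of-the-envelope timing in the style of the exercise:
// at 1 GHz one cycle lasts 1 ns; if the pipeline issues one instruction per
// cycle with no stalls, 6 instructions per iteration take about 6 ns.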
SchroedingersLion said:
Since it is a CS course and I don't have a CS background, other students might be better suited for it.
What exactly is your background? Have you done any programming?
SchroedingersLion said:
I don't want to become an expert, and I don't have time to learn all the tiny details or subtleties of CPU architectures, but since I will be working with HPC systems during my research over the next few years, I want to obtain a solid grasp of the hardware basics.
To be able to answer questions like the ones you wrote above, you can't escape a lot of the nitty-gritty details. I'm not aware of any books titled "Assembly Language for Poets" or "Computer Architecture for Idiots."
What do you mean by "working with HPC systems"? You won't have a solid grasp on the hardware basics unless you have a fair idea of what happens at the level of how assembly instructions get executed in a CPU.
 
  • #7
Mark44 said:
Are you working out of a textbook? If so, is there a glossary of terms or an index?
I don't believe there is any difference between clock cycles and machine cycles. The CPU operations are governed by a system clock. In each clock cycle an instruction is fetched, decoded, and executed. During one clock cycle memory can be read from or written to.
FLOP stands for floating-point operation. Floating-point operations can often take many clock cycles to execute, more so than integer operations such as add, sub, and so on. Performance tests that compare one machine with another usually have suites of applications that time the execution of both integer and floating-point instructions.
Does the course have any prerequisites? From your description it appears that you are expected to have at least some knowledge of an assembly language.
To answer the questions posed above, all you need to do is to count the instructions in the loop. It makes a difference, though, if different instructions take different numbers of clock cycles to execute.
What exactly is your background? Have you done any programming?
To be able to answer questions like the ones you wrote above, you can't escape a lot of the nitty-gritty details. I'm not aware of any books titled "Assembly Language for Poets" or "Computer Architecture for Idiots."
What do you mean by "working with HPC systems"? You won't have a solid grasp on the hardware basics unless you have a fair idea of what happens at the level of how assembly instructions get executed in a CPU.
I don't think that is all true; it usually takes a few clock cycles for each instruction, such as a memory fetch. I have not read a datasheet for a long time, but when I was designing with the old parts like the 8080, Z80, and 8085, they all needed a few clock cycles to read, write, and do I/O. You first send out the address (see 1 below) and make sure it is stable, then /WR goes low to write (2 below), then goes back high (3) BEFORE the address starts to change (4), to ensure you have enough hold time so the address doesn't change while /WR is still low and accidentally write to another location. It is a sequential operation. You can see there are two clocks to generate all of that. M1, M2, etc. are the machine cycles. For the 8080, it looks like three clocks per machine cycle, but one cannot speak for any other processor. Then you can also have wait states.
[Attached timing diagram: Clock cycle.jpg]

I don't know whether the term 'system clock' is the same as 'machine cycle'. Usually I don't think so, as you cannot depend on the system input clock directly: there is propagation delay in generating the clock the CPU relies on. Remember that in CMOS circuits in particular, the propagation delay drifts a lot with temperature, so relying on the edge of the input system clock is NOT good practice. They rely on internally generated clocks to compensate for the temperature drift. See the two 'phi' clocks in the timing diagram; those are internally generated, temperature-compensated clocks.

But this is really electronics. I think if the OP is NOT CS, this might be the wrong class to take. You don't dip your toe in to feel the water; you either dive in or stay away. Unless a programmer works with low-level assembly language, most programmers don't even have to know all of this.

This is CPU-specific; I won't say anything more until I see the specific CPU and read the datasheet.
 
  • #8
Thank you guys for your various responses.

Maybe it helps if I clarify my background or what I hoped to gain from such a course.
I studied physics (MSc) and I am now enrolled in a PhD program in applied mathematics. During my physics studies, I used C++ as well as Python and Matlab for numerical simulations and some data science. For my master's thesis I was using a cluster, where I submitted the same simulation thousands of times with different random seeds and then statistically averaged over the results (which means I also know how to deal with submission scripts). In the first year of my PhD I did a small MPI workshop (I have never used OpenMP, though).

My research is going to focus on distributed algorithms for Machine Learning, i.e. Machine Learning on HPC systems. What's more, my supervisor is involved in an interdisciplinary program that researches future exa-scale HPC technologies (from algorithms, to hardware, and possible applications in science and engineering) and he would like me to get involved as well.

Obvious ways for me to improve are certainly getting better at C++ (including MPI and OpenMP) and getting more experience in using clusters or other HPC systems. However, I also thought it might be useful for a guy like me to understand how these systems operate on a hardware level, i.e. how processors and memory work, at least on a basic level. The course did not list any prerequisites, but although the material seems to be covered very superficially, the CS students seem to draw more information out of the lectures than I do (unless they already knew part of the material...).

My standing in the world of CS can be illustrated by this diagram:
[Attached diagram: screen2.png]
I know about the highest level: How to use a computer, how to use high level languages and so on.
I know about the lowest level: The semiconductor physics, how a transistor works, and how logic gates are made.

I know nothing about the middle.
 
  • #9
This is going to date me, but at a similar stage in my physics education I had similar "simple" questions and found the Radio Shack/TI book on microprocessors (revised link) very useful. I haven't viewed it in several decades, so who knows...

Edited
 
  • #10
hutchphd said:
This is going to date me, but at a similar stage in my physics education I had similar "simple" questions and found the Radio Shack/TI book on microprocessors very useful. I haven't viewed it in several decades, so who knows...
Link isn't working for me...
 
  • #11
yungman said:
I don't think that is all true; it usually takes a few clock cycles for each instruction, such as a memory fetch. I have not read a datasheet for a long time, but when I was designing with the old parts like the 8080, Z80, and 8085, they all needed a few clock cycles to read, write, and do I/O.
The 8080, Z80, and 8085 are very old architectures, none of which had instruction pipelines. Some RISC architectures, such as MIPS, can do loads and stores of integer values in a single cycle. Per this resource, MIPS Assembly Language Programming, by Robert Britton (Cal State Univ, Chico) http://index-of.es/Programming/Assembly/MIPS%20Assembly%20Language%20Programming%202003.pdf, in Appendix A, the only integer instructions that take more than one cycle are the divide and multiply instructions.
For ARMv7, the processor can issue an integer load instruction each cycle, with a latency of 3 cycles for the result to be written. Integer store instructions can be issued each cycle, with the result written to memory in 2 cycles. In effect, with the pipeline operating, a load or store can be done each cycle. See https://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/
 
  • #12
SchroedingersLion said:
Thank you guys for your various responses.

Maybe it helps if I clarify my background or what I hoped to gain from such a course.
I studied physics (MSc) and I am now enrolled in a PhD program in applied mathematics. During my physics studies, I used C++ as well as Python and Matlab for numerical simulations and some data science. For my master's thesis I was using a cluster, where I submitted the same simulation thousands of times with different random seeds and then statistically averaged over the results (which means I also know how to deal with submission scripts). In the first year of my PhD I did a small MPI workshop (I have never used OpenMP, though).

My research is going to focus on distributed algorithms for Machine Learning, i.e. Machine Learning on HPC systems. What's more, my supervisor is involved in an interdisciplinary program that researches future exa-scale HPC technologies (from algorithms, to hardware, and possible applications in science and engineering) and he would like me to get involved as well.

Obvious ways for me to improve are certainly getting better at C++ (including MPI and OpenMP) and getting more experience in using clusters or other HPC systems. However, I also thought it might be useful for a guy like me to understand how these systems operate on a hardware level, i.e. how processors and memory work, at least on a basic level. The course did not list any prerequisites, but although the material seems to be covered very superficially, the CS students seem to draw more information out of the lectures than I do (unless they already knew part of the material...).

My standing in the world of CS can be illustrated by this diagram:
I know about the highest level: How to use a computer, how to use high level languages and so on.
I know about the lowest level: The semiconductor physics, how a transistor works, and how logic gates are made.

I know nothing about the middle.
This puts the whole big picture in much better perspective. What you are asking about clocks and such has NOTHING to do with physics and very little to do even with electronics or C++; it's really a field of its own...digital and hardware engineering. It's a huge field with plenty of jobs. It is definitely worth your time to study, whether or not it helps your project; on that I cannot comment, as I am not familiar with it. I am not a programmer; I was an EE, and a manager of EEs, for decades. I just come here begging for help in learning C++. This is my crossword puzzle for retirement; I just need something more challenging for my mind.

Speaking of physics and PhDs: I spent at least a decade working at a place called Charles Evans and Associates. It's a small company, but it was quite famous in its field. They have an analytical division that does sample analysis; I was in the then newly formed instrument division. Our designs were very similar to spectrometers for semiconductors, like those at KLA-Tencor and Applied Materials. 70% of my colleagues were PhDs, and a lot of them were physicists. We worked closely together designing and bringing up the machines. Their jobs were NOT anything like Sheldon Cooper in BBT, sitting behind a desk working out some math formula about dark matter! Those PhDs were very hands-on; they thought nothing of pulling out a scope and probing a circuit board to test out the instruments. In a perfect world, you have technicians and engineers to babysit them and probe things for them to look at. In real life, you are on your own, doing all the dirty work that I wouldn't even want to do myself and would send my technicians over to do! So knowledge of electronics and mechanical engineering is very important.

We built a lot of high-vacuum mechanical components into the system. In theory, again, technicians should build them. BUT in the real world, guess who built them? Those PhDs! My point is that you have to be diverse. Taking a class in digital electronics and CPU design will likely enhance your career even if it might not help your project right now.

Speaking of careers, you might think you know exactly what you want to do, but don't be too sure. That's why so many people are not working in the field of their degree; I am one of them. My degree is in chemistry. I hated chemistry so much by the time I got my degree that I would rather have delivered pizza than find a job in chemistry. I got into my career in electronics ONLY because of one Fortran class I took in college; it taught me how a computer works. I started out as a test tech, then found that electronics was my ultimate passion and started studying. I had to test some circuit boards, so I started writing a few short assembly language programs, and they worked. It was easy for me to learn because it followed the Fortran I had learned: how the computer thinks and behaves. I worked hard and did a good job, and pretty soon they relied on me to write those programs. Assembly and machine language are so close to digital electronics that pretty soon I started fixing their design problems. Before I knew it, they promoted me to junior EE! That's how I started my whole career, even though I got away from digital and firmware design in the mid-80s and went on to the much more challenging and satisfying field of analog and RF. But it was that Fortran class that got me into my dream career. You never know where you will end up. Learning about CPUs is NOT a bad thing.

Keep your eyes on the semiconductor field; many physics PhDs work in it. My former boss was a PhD in physics, the CTO of the company, and he invented and patented a lot of his own ideas. I worked closely with him all the time. One time I came up with an idea for measuring timing accurately to picoseconds, able to recover in less than a few microseconds to read again. That involved generating super-accurate pure sine waves exactly 90 degrees apart. My boss (the PhD) ended up writing the C++ program to generate the two sine waves for me because he did not trust another programmer to do it. It was a big deal back in 1997. He got his hands dirty...the CTO and a director of the company! Those PhDs really learn electronics and programming. I trusted him to design electronics over any of the engineers who worked for me.

Regarding physics: one part of your classical physics, electromagnetics, is very closely related to microwaves, antennas, and transmission lines in EE. If you are good at EM, you are way ahead of the game. A lot of EEs are afraid of RF and microwaves because they take a lot more knowledge than plain electronic circuit design; it's a different animal altogether. I am self-taught: I studied multivariable calculus, differential equations, and PDEs on my own, and then studied EM, RF, and antenna design. This should be easy for you as a physics major too. This might be off the subject; I just wanted to give you some ideas from my experience.
 
  • #13
Mark44 said:
The 8080, Z80, and 8085 are very old architectures, none of which had instruction pipelines. Some RISC architectures, such as MIPS, can do loads and stores of integer values in a single cycle. Per this resource, MIPS Assembly Language Programming, by Robert Britton (Cal State Univ, Chico) http://index-of.es/Programming/Assembly/MIPS%20Assembly%20Language%20Programming%202003.pdf, in Appendix A, the only integer instructions that take more than one cycle are the divide and multiply instructions.
For ARMv7, the processor can issue an integer load instruction each cycle, with a latency of 3 cycles for the result to be written. Integer store instructions can be issued each cycle, with the result written to memory in 2 cycles. In effect, with the pipeline operating, a load or store can be done each cycle. See https://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/
As I said, you have to read the datasheet. Many MPUs are still 8088-based; my last design was a USB controller with an 8088 core. You cannot cover all of them in one sentence. The datasheet. Every one is its own.
 
  • #14
yungman said:
As I said, you have to read the datasheet. Many MPUs are still 8088-based; my last design was a USB controller with an 8088 core. You cannot cover all of them in one sentence. The datasheet.
8088 is also very old technology, dating back to the early 80s.
 
  • #15
Mark44 said:
8088 is also very old technology, dating back to the early 80s.
They are likely still around today. Like I said, until those are all gone, one cannot use a blanket statement.

Besides, speaking as a long-time design engineer, it is not practical to run an instruction in one clock. You need different phases. The CPU might be new, but the timing diagram is still valid: you need address setup time, data setup time, and, more importantly, address hold time and data hold time. This is ageless; it is the most important part of digital design, and somehow they have to generate all of that. I don't care whether it's old or new, it still behaves like that. It makes things a lot more complicated to generate all of this if you keep to one clock per instruction; they likely have to use a PLL to generate a faster clock.
 
  • #16
yungman said:
Besides, speaking as a long-time design engineer, it is not practical to run an instruction in one clock.
And yet they do, as the examples and references I gave testify.
yungman said:
You need different phases. The CPU might be new, but the timing diagram is still valid: you need address setup time, data setup time, and, more importantly, address hold time and data hold time. This is ageless; it is the most important part of digital design, and somehow they have to generate all of that. I don't care whether it's old or new, it still behaves like that.
Sure, some of the 40-year-old processors that you worked on many years ago are still around, but I don't think you have kept up with advances in technology such as pipelines and data parallelism, in which a single register can hold multiple numbers that are processed at the same time -- the processors you mentioned are all CISC (complex instruction set computing) processors.

In addition, the OP mentioned high-performance computing (HPC). Nobody is doing HPC using 8088, 8085, or Z80 processors.
 
  • #17
First, do not listen to anything here written about 40+ year old CPUs. This will be relevant only if people start building HPCs out of these CPUs. (Which they have no plans to) Modern CPUs most certainly can issue more than one instruction per clock.

Second, you should probably talk to your instructor and find out where they are going. There's more to understanding the nuts and bolts of CPUs than can quickly be covered. Is this about understanding memory stalls and caches? How GPUs differ from CPUs? How to write high level code to compile efficiently on a HPC node? You really want to focus in on what you need.
 
  • #18
Let me guess where they are going. Here are two pieces of bad code, that are bad for different reasons. I am assuming execution on a GPU-like architecture with something like OpenMP or OpenACC or even CUDA, but it could be anything.

Bad code 1:
Code:
pi=0
for i=0 to a zillion
  if i is even
    pi = pi + 4/(2i+1)
  otherwise if i is odd
    pi = pi - 4/(2i+1)
end for

Bad code 2:
Code:
pi=0
for i=0 to a zillion, with step 2
    pi = pi + 4/(2i+1)
end for
for i=1 to a zillion, with step 2
    pi = pi - 4/(2i+1)
end for

You should understand why these are inadvisable, and more importantly, why they are differently inadvisable. If you don't, you probably need to deepen your background before taking this class.
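For concreteness, here is one possible serial C++ rendering of the two snippets (a sketch only; the pseudocode above is deliberately language-agnostic and assumes GPU-like parallel execution, and N is just a placeholder for "a zillion"):
Code:
#include <cstdint>

constexpr std::int64_t N = 100000000;  // placeholder for "a zillion"

// Rendering of "Bad code 1": a branch inside the loop body decides the sign.
double pi_branchy() {
    double pi = 0.0;
    for (std::int64_t i = 0; i <= N; ++i) {
        if (i % 2 == 0)
            pi += 4.0 / (2.0 * i + 1.0);
        else
            pi -= 4.0 / (2.0 * i + 1.0);
    }
    return pi;
}

// Rendering of "Bad code 2": positive and negative terms accumulated in two
// separate passes, then combined in the same accumulator.
double pi_two_loops() {
    double pi = 0.0;
    for (std::int64_t i = 0; i <= N; i += 2)
        pi += 4.0 / (2.0 * i + 1.0);
    for (std::int64_t i = 1; i <= N; i += 2)
        pi -= 4.0 / (2.0 * i + 1.0);
    return pi;
}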
 
  • #19
Vanadium 50 said:
First, do not listen to anything here written about 40+ year old CPUs. This will be relevant only if people start building HPCs out of these CPUs. (Which they have no plans to) Modern CPUs most certainly can issue more than one instruction per clock.

Second, you should probably talk to your instructor and find out where they are going. There's more to understanding the nuts and bolts of CPUs than can quickly be covered. Is this about understanding memory stalls and caches? How GPUs differ from CPUs? How to write high level code to compile efficiently on a HPC node? You really want to focus in on what you need.
Exactly; those are the questions I would like to get answers to. I want to understand the hardware "well enough" to write efficient HPC code. Does this make sense?

Vanadium 50 said:
Let me guess where they are going. Here are two pieces of bad code, that are bad for different reasons. I am assuming execution on a GPU-like architecture with something like OpenMP or OpenACC or even CUDA, but it could be anything.

Bad code 1:
Code:
pi=0
for i=0 to a zillion
  if i is even
    pi = pi + 4/(2i+1)
  otherwise if i is odd
    pi = pi - 4/(2i+1)
end for

Bad code 2:
Code:
pi=0
for i=0 to a zillion, with step 2
    pi = pi + 4/(2i+1)
end for
for i=1 to a zillion, with step 2
    pi = pi - 4/(2i+1)
end for

You should understand why these are inadvisable, and more importantly, why they are differently inadvisable. If you don't, you probably need to deepen your background before taking this class.

The first one is bad because it executes a branch in each iteration. What's more, the branching condition reads i%2==0, which takes a few operations to evaluate, I guess.
The second one is bad because it executes two loop structures, i.e. it creates a loop variable and increments it every iteration, twice over. The overhead of the loop structures is twice that of using just one loop.

Best would probably be:
Code:
pi=0
for i=0 to a zillion, with step 2
   pi = pi +4/(2i+1) - 4/(2i+3)
end for
Since zillion is even, the term 4/(2i+3) for the last i needs to be added back at the end.
I can't explain it better at the hardware level.
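A runnable C++ version of this combined loop might look like the sketch below (N stands in for "a zillion" and is assumed even, matching the correction above):
Code:
#include <cstdint>

// Branch-free version: each iteration handles one positive and one negative
// term of the series, so there is no if/else inside the loop.
double pi_paired(std::int64_t N) {      // N assumed even
    double pi = 0.0;
    for (std::int64_t i = 0; i <= N; i += 2) {
        pi += 4.0 / (2.0 * i + 1.0) - 4.0 / (2.0 * i + 3.0);
    }
    // The last pass subtracted one term too many, 4/(2N+3); add it back,
    // as noted above.
    pi += 4.0 / (2.0 * N + 3.0);
    return pi;
}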
 
  • #20
SchroedingersLion said:
I was left wondering about the difference between clock cycles and machine cycles, and about the relationship between FLOP count and clock cycles.
Seems like you need something general, 'popular science'-like, about pipelined CPUs. The actual suggestions so far feel rather more specific and are quite overkill.

I think just a few Google searches will be able to answer most of your questions and give you a not-completely-correct but usable understanding. As you wrote, you don't need much more than that.
 
  • #21
I think Rive's link is a good one, but there are a half-dozen or more ways to write inefficient code (I demonstrated two, neither of which is all that related to pipelining.) Happy code is all the same, unhappy codes are each unhappy in their own ways. I would still have the conversation with the professor I described.

Code snippet 1 is bad because, as you say, there is a branch in the loop. GPUs are not fast - they are wide. They get their speed by executing a large number of operations - like 32 - at once. Branching interferes with this. To be fair, the branch in my snippet is probably simple enough that the compiler can fix this for you.

Code snippet 2 is bad not because it is inefficient, but because it is inaccurate. (Also, for the same number of operations, it doesn't matter very much if it is distributed across two loops or only one) You are taking the difference of two large numbers, so you lose precision. Related to that, the first loop is in the wrong direction: you should start at a zillion and end at zero. That way you are adding small things to other small things and not small things to big things.
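A minimal sketch of the difference-of-two-large-numbers point, in single precision to make the effect easier to see (the exact output varies with the number of terms, the compiler, and the platform; the summation-order point can be demonstrated the same way by reversing a loop):
Code:
#include <cstdio>

// Compare two float evaluations of the same partial sum of the series:
//  (a) "two loops" style: all positive terms summed, all negative terms
//      summed, then subtracted, i.e. the difference of two large numbers;
//  (b) "paired" style: each positive term combined with its neighbouring
//      negative term before it is accumulated.
// A double-precision sum serves as the reference.
int main() {
    const long M = 5000000;                     // number of +/- term pairs

    float pos = 0.0f, neg = 0.0f, paired = 0.0f;
    double ref = 0.0;

    for (long j = 0; j < M; ++j) {
        const double p = 4.0 / (4.0 * j + 1.0); // term with even index
        const double n = 4.0 / (4.0 * j + 3.0); // term with odd index
        pos    += static_cast<float>(p);
        neg    += static_cast<float>(n);
        paired += static_cast<float>(p - n);
        ref    += p - n;
    }

    std::printf("two loops (float) : %.7f\n", pos - neg);
    std::printf("paired    (float) : %.7f\n", paired);
    std::printf("reference (double): %.15f\n", ref);
    return 0;
}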
 
  • #22
Vanadium 50 said:
I think Rive's link is a good one, but there are a half-dozen or more ways to write inefficient code (I demonstrated two, neither of which is all that related to pipelining.) Happy code is all the same, unhappy codes are each unhappy in their own ways. I would still have the conversation with the professor I described.

Code snippet 1 is bad because, as you say, there is a branch in the loop. GPUs are not fast - they are wide. They get their speed by executing a large number of operations - like 32 - at once. Branching interferes with this. To be fair, the branch in my snippet is probably simple enough that the compiler can fix this for you.

Code snippet 2 is bad not because it is inefficient, but because it is inaccurate. (Also, for the same number of operations, it doesn't matter very much if it is distributed across two loops or only one) You are taking the difference of two large numbers, so you lose precision. Related to that, the first loop is in the wrong direction: you should start at a zillion and end at zero. That way you are adding small things to other small things and not small things to big things.
One of the most important things to focus on is whether the code is written so that it can be auto-vectorized (and checking that it actually is). The pattern in this example may cause issues with SIMD (single instruction, multiple data), because the data (or the operations on it) is interleaved, while the compiler would like to issue SIMD instructions (one instruction) on whole blocks of it at a time. I wouldn't be surprised if this is a solved problem, though. A sketch of what the vectorizer is looking for follows the links below.

https://en.wikipedia.org/wiki/Automatic_vectorization
https://dl.acm.org/doi/10.1145/1133981.1133997
https://en.wikipedia.org/wiki/SIMD
https://www.intel.com/content/dam/w...4-ia-32-architectures-optimization-manual.pdf
https://en.wikipedia.org/wiki/Memory_access_pattern
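As mentioned above, here is a minimal sketch of the kind of loop an auto-vectorizer handles well versus the interleaved pattern (the function names are made up for illustration; whether a particular compiler actually vectorizes either loop has to be checked in its vectorization report):
Code:
#include <cstddef>

// A loop the auto-vectorizer handles easily: independent iterations,
// unit-stride access, no branch in the body, the same operation applied to
// every element.
void saxpy(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Interleaving two logical streams in one array (even-indexed elements
// treated differently from odd-indexed ones) may force lane shuffles or
// masking, and on some compilers defeats auto-vectorization entirely.
void interleaved(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 2 == 0)
            y[i] = a * x[i];
        else
            y[i] = -a * x[i];
    }
}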
 
  • #23
Rive said:
Seems like you need something general, 'popular science'-like, about pipelined CPUs. The actual suggestions so far feel rather more specific and are quite overkill.

I think just a few Google searches will be able to answer most of your questions and give you a not-completely-correct but usable understanding. As you wrote, you don't need much more than that.

Thanks for the link! It looks promising, at least for getting to know the processor.

Vanadium 50 said:
I think Rive's link is a good one, but there are a half-dozen or more ways to write inefficient code (I demonstrated two, neither of which is all that related to pipelining.) Happy code is all the same, unhappy codes are each unhappy in their own ways. I would still have the conversation with the professor I described.

Code snippet 1 is bad because, as you say, there is a branch in the loop. GPUs are not fast - they are wide. They get their speed by executing a large number of operations - like 32 - at once. Branching interferes with this. To be fair, the branch in my snippet is probably simple enough that the compiler can fix this for you.

Code snippet 2 is bad not because it is inefficient, but because it is inaccurate. (Also, for the same number of operations, it doesn't matter very much if it is distributed across two loops or only one) You are taking the difference of two large numbers, so you lose precision. Related to that, the first loop is in the wrong direction: you should start at a zillion and end at zero. That way you are adding small things to other small things and not small things to big things.
Thanks for the elaborations. The professor will probably not be able to do much for me, but I can ask him for study material suitable for my level. I am just auditing the course, so there will be no drawbacks if I don't understand it all or if I drop out again.
 
  • #26
Here are a couple more recommendations. Both books feature essentially the same content except for highlighting different RISC processors. Both go into considerable detail on the register set and assembly instructions, how CPU pipelines work and are implemented, parallelism, and other topics. The MIPS edition is somewhat dated, as it was published seven years ago.
Computer Organization and Design, MIPS edition -- publ. 2013
https://www.amazon.com/dp/0124077269/?tag=pfamazon01-20
The book above is the 5th ed. I see that there is a 6th edition that is not yet released.

Computer Organization and Design, ARM edition -- publ. 2016
https://www.amazon.com/dp/0128017333/?tag=pfamazon01-20
 

1. What are the basic components of a CPU?

The basic components of a CPU include the arithmetic logic unit (ALU), control unit, registers, and cache memory.

2. How do CPUs execute instructions?

CPUs execute instructions by following a set of steps known as the instruction cycle, which includes fetching, decoding, and executing instructions from memory.

3. What is the difference between a single-core and multi-core CPU?

A single-core CPU has only one processing unit, while a multi-core CPU has multiple processing units, allowing it to perform multiple tasks simultaneously.

4. What is clock speed and how does it affect CPU performance?

Clock speed refers to the number of clock cycles a CPU completes per second, i.e. its operating frequency. A higher clock speed generally means the CPU can execute more instructions in a given amount of time, resulting in better performance.

5. How does cache memory improve CPU performance?

Cache memory is a small, high-speed memory located on the CPU that stores frequently used data and instructions. This allows the CPU to access them quickly, improving overall performance.
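To make the cache point concrete, a common illustration is the traversal order of a matrix stored row by row (a sketch; the actual speed difference depends on the matrix size and the machine):
Code:
#include <cstddef>
#include <vector>

// Both functions sum the same matrix, stored row-major in a flat vector.
// The row-order loop walks memory contiguously, so most accesses hit the
// cache line that was just loaded; the column-order loop jumps a full row
// ahead on every access and keeps missing the cache once the matrix is
// larger than the cache.
double sum_row_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

double sum_col_order(const std::vector<double>& m,
                     std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}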
