Why not hardwire software onto CPU chips for improved efficiency?

  • Thread starter: nameta9
  • Tags: Software
In summary, the conversation discusses the idea of hardwiring large chunks of software, such as a Linux kernel and office programs, directly onto CPU chips. However, this is not a feasible solution due to the high cost and potential for bugs. Additionally, these programs are constantly evolving, making it impractical to hard-code them. As for the most widely used programming language, it is likely C, given its use in enormous codebases such as Microsoft Windows and Linux, rather than assembly, which is primarily used in smaller systems like microcontrollers.
  • #36
oldtobor said:
In our CPU there are no longer opcodes but direct high level instructions.
And the direct, high level instructions are called 'opcodes.' Apparently, you just don't like the word opcode; but any atomic operation on a processor is called an opcode.

Your hardwired BASIC interpreter will still run a fetch-decode-execute cycle; it just happens that you've built in special opcodes that facilitate BASIC. It doesn't even mean your processor will run any faster; it just means it will be easier to program, and consequently have a larger control path. As I've already explained, this is an engineering trade-off: either it's easy to program by hand, or it runs fast. You really cannot have both! You cannot increase the complexity of the control path and not take a hit in speed.
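A minimal sketch of that point, purely for illustration: even a core with a built-in BASIC-flavoured opcode still runs a fetch-decode-execute loop. The opcode names and this tiny virtual machine are invented for the example and do not describe any real chip.

```c
/* Minimal fetch-decode-execute sketch with a hypothetical BASIC-style
   FOR_STEP opcode. Illustration only; not a real instruction set. */
#include <stdint.h>
#include <stdio.h>

enum { OP_HALT, OP_FOR_STEP, OP_PRINT_I };

int main(void)
{
    /* program: print i, then loop i from 0 up to (but not including) 3 */
    uint8_t program[] = { OP_PRINT_I, OP_FOR_STEP, 0, 3, OP_HALT };
    int pc = 0, i = 0;

    for (;;) {
        uint8_t op = program[pc++];              /* fetch   */
        switch (op) {                            /* decode  */
        case OP_PRINT_I:                         /* execute */
            printf("%d\n", i);
            break;
        case OP_FOR_STEP: {
            uint8_t loop_start = program[pc++];  /* operand: loop target */
            uint8_t limit      = program[pc++];  /* operand: loop bound  */
            if (++i < limit)
                pc = loop_start;                 /* branch back          */
            break;
        }
        case OP_HALT:
            return 0;
        }
    }
}
```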
The logic circuits take care of understanding them and activating registers and counters etc. It is a true IDEAS machine that bypasses all we have always taken for granted in CPU design. With millions of transistors available I think it is feasible. Then we only have ONE FUNKY HIGH LEVEL LANGUAGE that takes care of all; all software is built up starting from a higher level.
The reason processors are getting 'dumber' (i.e. moving from CISC to RISC to VLIW to TTA) is not because programmers enjoy programming dumb chips; the reason is that dumb chips are fast. The bottom line is simple: there are 10,000 users for every programmer. Making the programmer's job a little more difficult is of no consequence; what matters is that the user's machine runs faster. You seem to be missing this.
No more debugging nightmares or incompatible software. Of course industry and academia may not really want to simplify software for "cultural - economical" reasons...
I fail to see how a hardwired BASIC interpreter would eliminate debugging. Are you suggesting I could not write an algorithm that wouldn't work on such a chip? :rofl:

And how would it eliminate incompatible software? Most software incompatibilities lie in data structures. Are you suggesting data structures will no longer exist in your world? Or that the only data structures anyone will ever be able to use are arrays? :rofl:

- Warren
 
  • #37
chroot said:
The reason processors are getting 'dumber' (i.e. moving from CISC to RISC to VLIW to TTA) is not because programmers enjoy programming dumb chips; the reason is that dumb chips are fast. The bottom line is simple: there are 10,000 users for every programmer. Making the programmer's job a little more difficult is of no consequence; what matters is that the user's machine runs faster. You seem to be missing this.

In fact, I don't even agree that chips getting "dumber" makes things harder for programmers. It just makes things (slightly) harder for the compilers/interpreters.

I suppose it might make things harder for compiler/interpreter writers as well; but it's a lot easier to build a smarter compiler than it is to build a smarter chip.
 
  • #38
master_coda said:
I suppose it might make things harder for compiler/interpreter writers as well; but it's a lot easier to build a smarter compiler than it is to build a smarter chip.
Exactly. I just wanted to elucidate the trade-off for those reading the thread. The trade-off, of course, is a no-brainer.

- Warren
 
  • #39
However, as the complexity of your algorithm goes up (say, all the way to an algorithm that will interpret Perl), the advantages disappear.
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.

For example, the original Microsoft BASIC has a floating-point format that is not compatible with any modern CPU. A Pentium 4 would use 50 instructions to do something that could be done in a single clock cycle. The clock frequency would be the same because the job is not more complicated than what the P4 does; it is just different. This speedup you will find everywhere a program does something that is not directly supported by the CPU.
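For readers unfamiliar with that format, here is a rough C sketch of the kind of bit rearrangement being described: converting a Microsoft Binary Format (MBF) single-precision value to IEEE 754, assuming the commonly documented MBF layout with the four bytes loaded as a little-endian 32-bit word. A dedicated circuit could do this rewiring in one cycle; a general-purpose CPU runs it as a sequence of instructions. The function name is invented and edge cases (exponent underflow) are ignored.

```c
/* Sketch: convert a 32-bit Microsoft Binary Format (MBF) float to IEEE 754
   single precision. Assumes the commonly documented MBF layout: top byte is
   the exponent (effective bias 129), bit 23 is the sign, bits 0..22 are the
   mantissa. Ignores exponent underflow; assumes the host float is IEEE. */
#include <stdint.h>
#include <string.h>

float mbf_to_ieee(uint32_t mbf)
{
    if ((mbf >> 24) == 0)                        /* MBF exponent 0 means 0.0 */
        return 0.0f;

    uint32_t exponent = (mbf >> 24) & 0xFF;      /* biased MBF exponent      */
    uint32_t sign     = (mbf >> 23) & 0x1;       /* sign bit                 */
    uint32_t mantissa =  mbf        & 0x7FFFFF;  /* 23 mantissa bits         */

    /* IEEE bias is 127, MBF's effective bias is 129, so subtract 2. */
    uint32_t ieee = (sign << 31) | ((exponent - 2) << 23) | mantissa;

    float out;
    memcpy(&out, &ieee, sizeof out);             /* reinterpret bit pattern  */
    return out;
}
```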
 
  • #40
Bjørn Bæverfjord said:
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.

For example, the original Microsoft BASIC has a floating-point format that is not compatible with any modern CPU. A Pentium 4 would use 50 instructions to do something that could be done in a single clock cycle. The clock frequency would be the same because the job is not more complicated than what the P4 does; it is just different. This speedup you will find everywhere a program does something that is not directly supported by the CPU.

So how will you increase the speed of the split function in Perl by 100 times? That's a relatively simple function in Perl.
 
  • #41
I don't know Perl and I only have some basics of hardware design, but looking at the split function, that doesn't look complicated to do. You could read the string into a bunch of parallel logic circuits, each circuit primed with the delimiter character. Each circuit would get one character of the string and the address of that character. It would look at the input character and see if it is the same as the delimiter character; if it is, it writes a reference to the subsequent character into the memory for the function's output. Also, a reference to the first character in the string would go directly to memory. In case of multiple delimiter characters in a row, it would only take a few levels of combinational circuits to write only the output of the last delimiter processor in the row of those trying to write. The speedup would probably not be 100x but it would be faster.
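A small sequential C model of the circuit described above (a sketch, not Perl's actual split implementation): each loop iteration stands in for one comparator, and a run of delimiters records only one token start, mirroring the "last delimiter in the row" rule. The function name is invented.

```c
/* Sketch: software model of the parallel delimiter-matching idea. One
   comparison per character; a match records the index of the following
   character as a token start. Not the real Perl split. */
#include <stddef.h>

size_t find_token_starts(const char *s, size_t len, char delim,
                         size_t *starts, size_t max_starts)
{
    size_t n = 0;
    if (len > 0 && n < max_starts)
        starts[n++] = 0;                       /* first token starts at 0    */
    for (size_t i = 0; i + 1 < len; i++)       /* each i models one circuit  */
        if (s[i] == delim && s[i + 1] != delim && n < max_starts)
            starts[n++] = i + 1;               /* token begins after the run */
    return n;                                  /* number of starts recorded  */
}
```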
 
Last edited:
  • #42
BicycleTree said:
I don't know Perl and I only have some basics of hardware design, but looking at the split function, that doesn't look complicated to do. You could read the string into a bunch of parallel logic circuits, each circuit primed with the delimiter character. Each circuit would get one character of the string and the address of that character. It would look at the input character and see if it is the same as the delimiter character; if it is, it writes a reference to the subsequent character into the memory for the function's output. Also, a reference to the first character in the string would go directly to memory. In case of multiple delimiter characters in a row, it would only take a few levels of combinational circuits to write only the output of the last delimiter processor in the row of those trying to write. The speedup would probably not be 100x but it would be faster.

This description is no good. All of the important parts of the algorithm are glossed over by "write a reference ... into the memory for the function's output" and "a few levels of combinational circuits..."; these are non-trivial parts of the algorithm. Without more details, I can't decide whether your algorithm even works; I certainly can't determine whether this algorithm is faster, or how much real estate this circuit is going to eat up.
 
  • #43
Well, I said I do not know more than basic hardware design. One course over 2 semesters is all I have. So I don't know exactly how writing the references would operate; it should be in parallel somehow or it would be a bottleneck, but I don't know exactly how that would work. Also the reading of the string would have to be in parallel for each character or there would probably be no advantage. You'd probably have to design a special kind of memory for the reading and writing.

The combinational circuits would be based on the write assert bits from the character processors. I guess the tentative write assert bit for each character processor (processor X) would be XOR'd with the tentative write assert bit from processor X+1, and the result would be the write assert for processor X. So that would only be 1 level of combinational circuits.

Each processor would basically be an adder to test for equality, which is very fast.

The part of this process I do not fully understand is the memory; the other parts definitely could be sped up. But I don't know exactly how modern memory is designed and whether these things I am saying are possible. The only kind of memory I understand is one with a very simple schematic design, which would not work for this.
 
  • #44
Actually, since the processors are going to have a dedicated function, they could just be a level of XNOR gates, one for each bit of the character, followed by two levels of AND gates, with the output of the second level being the write assert bit.
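A bit-level C model of that gate arrangement, assuming 8-bit characters; the single comparison at the end stands in for the two levels of AND gates, and the function name is invented.

```c
/* Sketch: one XNOR per bit of the character, then all eight XNOR outputs
   ANDed together to form the tentative write-assert bit. Illustration only. */
#include <stdint.h>

int tentative_write_assert(uint8_t c, uint8_t delim)
{
    uint8_t xnor = (uint8_t)~(c ^ delim);  /* bit i is 1 iff bit i matches */
    return xnor == 0xFF;                   /* AND-reduce the eight outputs */
}
```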
 
Last edited:
  • #45
Well, if anything, I see that the old robot's proposal is interesting at the least. I think I would approach the problem in a much simpler way by simply implementing a high-level instruction set for the CPU. So just as you have a typical subtract instruction that is composed of an opcode and 2 operands, you could have a FOR NEXT instruction that is made up of an opcode and 2 operands giving the start (I=0) and end (TO 50) (and maybe a 3rd giving the step). Or you could have a string accumulator like Perl's $_ and regular expression instructions, and maybe SQR and RND assembler-level instructions, etc. You could end up having an almost 1 to 1 correspondence between a high-level language and its underlying assembler translation. Anyway, in this research I would start out simple, implementing an equivalent of a 4K ROM BASIC-like instruction set for the CPU and then extend it. Start with VHDL and ASICs. You might want to also look up whether a SOFTWARE TO HARDWARE CONVERTER exists or could be designed. Take any small program and the converter could convert the entire program into a bunch of combinational circuits.
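Purely as an illustration of the kind of encoding being proposed, a single FOR...NEXT instruction carrying its operands might be laid out like this; the field names and widths are invented and do not describe any real ISA.

```c
/* Sketch: a hypothetical encoding for a high-level FOR...NEXT instruction,
   carrying the start, end, and optional step as operands. Field names and
   widths are invented for illustration. */
#include <stdint.h>

typedef struct {
    uint8_t opcode;  /* e.g. a hypothetical OP_FOR_NEXT          */
    int16_t start;   /* loop counter's initial value, e.g. I = 0 */
    int16_t end;     /* loop bound, e.g. TO 50                   */
    int16_t step;    /* optional STEP operand, defaulting to 1   */
} for_next_insn;
```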
 
Last edited:
  • #46
nameta9 said:
Well, if anything, I see that the old robot's proposal is interesting at the least. I think I would approach the problem in a much simpler way by simply implementing a high-level instruction set for the CPU. So just as you have a typical subtract instruction that is composed of an opcode and 2 operands, you could have a FOR NEXT instruction that is made up of an opcode and 2 operands giving the start (I=0) and end (TO 50) (and maybe a 3rd giving the step). Or you could have a string accumulator like Perl's $_ and regular expression instructions, and maybe SQR and RND assembler-level instructions, etc. You could end up having an almost 1 to 1 correspondence between a high-level language and its underlying assembler translation. Anyway, in this research I would start out simple, implementing an equivalent of a 4K ROM BASIC-like instruction set for the CPU and then extend it. Start with VHDL and ASICs. You might want to also look up whether a SOFTWARE TO HARDWARE CONVERTER exists or could be designed. Take any small program and the converter could convert the entire program into a bunch of combinational circuits.

We already have instructions for computing values that can easily be implemented in hardware: square roots, trig functions, etc. Even a FOR NEXT instruction isn't really useful; existing instructions like LOOP provide the same functionality.

However, other instructions just can't be practically implemented. For example, almost any complex string operation would be a waste of time; the algorithm you're implementing will almost certainly spend most of its time loading the string from memory. You can try to keep as much of the string as possible on the CPU (or at least as close to it as possible), which is the whole point of things like cache, but that approach doesn't scale: you can't just stick an arbitrarily large chunk of memory onto a CPU.


Still, if you want to implement a chip with a complete high-level instruction set, go right ahead; if you're into hardware design, you might enjoy it. It's just not a particularly useful idea.
 
  • #47
Ah, there was an error in my earlier suggestion. The tentative write enable bit from processor X should not be XOR'd with the tentative write enable bit from processor X+1; the correct operation is (tentative WE for X) AND NOT(tentative WE for X+1).

The question is whether a single memory can be written to and read from in parallel, practically speaking. You can't just say that it definitely would be a bottleneck. It might be possible and practical, and then a string operation could be performed requiring no more time than an add operation.

Do graphics processors have memory that is accessed 1 word at a time or are they set up in parallel? That would be the most similar application I can think of.
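The corrected rule can be modelled on a bitmask in C (a sketch with an invented function name): bit i of the argument stands for processor i's tentative write enable.

```c
/* Sketch: final write enable for processor i is
   (tentative WE for i) AND NOT (tentative WE for i+1), so in a run of
   consecutive matches only the last processor writes. The neighbour of the
   top processor is treated as 0. Illustration only. */
#include <stdint.h>

uint32_t final_write_enables(uint32_t tentative)
{
    /* bit i of (tentative >> 1) is processor i+1's tentative enable */
    return tentative & ~(tentative >> 1);
}
```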
 
  • #49
BicycleTree said:
Ah, there was an error in my earlier suggestion. The tentative write enable bit from processor X should not be XOR'd with the tentative write enable bit from processor X+1; the correct operation is (tentative WE for X) AND NOT(tentative WE for X+1).

The question is whether a single memory can be written to and read from in parallel, practically speaking. You can't just say that it definitely would be a bottleneck. It might be possible and practical, and then a string operation could be performed requiring no more time than an add operation.

Do graphics processors have memory that is accessed 1 word at a time or are they set up in parallel? That would be the most similar application I can think of.

Memory access definitely would be a bottleneck. You can write blocks of memory in parallel, but there's always an upper bound on the amount you can write at once. And there's also always an upper bound on the amount of a string your chip can process at one time.

Add operations are fast because the numbers you're adding have a fixed size. Working with strings is more like working with arbitrary-precision integers: an operation that you can't perform in constant time.
 
  • #50
How are you equating parallel memory to your parallel process? You could have done a search for parallel processing and come up with more applicable information. Heck, the Amiga I had in 1989 utilized limited parallel processing by offloading most functions to support chips, all coordinated by the main 68K processor. That would be a more analogous situation to your proposition than parallel memory, IMHO.

Now, I have to ask the question again: "Why bother?" Why try to hard-code even something like a programming language? Why spend time and money trying to develop a transistor array on some die to do what computer manufacturers stopped doing in the late 80's? I don't see a way (this is just me, mind you) that one could implement a higher-level language that operates faster than directly accessing the hundreds (or fewer, depending on the processor) of instructions already placed on current chips. Simply saying "Well, remove the opcodes and replace them with a direct HLL" minimizes the fact that current computer technology relies on only two states. You can build computers with more states; you can even build analog computers if you want---but those cost more with minimal improvements. Analog computers used to be commonplace in engine control units; however, a 20 MHz processor and a SAR ADC are a more than sufficient replacement and cost less in the end.

I see a lot of minimizing of difficulties here---sure, BASIC can run from a ROM. BASIC is easy. BASIC didn't include much of the functionality of modern HLLs. BASIC was slow (assembly or C programs STILL ran faster on the Apple II and Commodore 64 than BASIC programs). The use of this as an example is a canard at best, because ROM BASIC went by the wayside due to its cost and relative performance. Software BASIC was and is faster.

So, why bother? Why bother locking yourself into a project that is not easily user-updatable---ever flashed the BIOS on a PC? Most users will never, ever update their BIOS, and that says a lot about this idea. Forcing such an inconvenience to fill a hole in Word or to upgrade one's on-die interpreter is less than plausible, IMHO. Why bother? You won't see a speed improvement (I reiterate the RISC mantra, that more is most definitely not better) from adding an HLL to your die. You might see a speed improvement from placing a program on a Flash chip, but the speed improvement will only show itself as a load-time improvement, not a run-time improvement.

Those are my thoughts.
 
  • #51
Well, you would break the string into blocks of characters, maybe 30 or so characters each. 30 or fewer characters and it would be time 1, 31-60 characters and it would be time 2, and so on. Not constant time except for small strings but much faster than with standard instructions.

The advantage of specialized instructions would be of about the same type as the advantage of many parallel processors.
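A trivial sketch of that timing model, with an invented helper and a hypothetical 30-character block size: the time grows with the number of blocks.

```c
/* Sketch: if the hardware processes the string in fixed-size blocks, the
   time grows as ceil(len / block): 1..30 chars -> 1, 31..60 -> 2, and so on.
   The block size of 30 and the function name are hypothetical. */
#include <stddef.h>

size_t block_times(size_t len, size_t block)   /* block = 30 in the example */
{
    return (len + block - 1) / block;          /* ceiling division */
}
```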
 
  • #52
Bjørn Bæverfjord said:
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.
You apparently did not read my previous posts. You cannot simply stuff a chip with hundreds of instructions dedicated to hundreds of different specialized functions; the resulting chip would be enormous, expensive, and slow.

You do not seem to grasp the engineering trade-off in play here: a more complicated control path cannot run faster than a simpler one.

- Warren
 
  • #53
You apparently did not read my previous posts. You cannot simply stuff a chip with hundreds of instructions dedicated to hundreds of different specialized functions; the resulting chip would be enormous, expensive, and slow.
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless. The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.

As I said before the point was to make an optimal solution, not to be incompetent just to win an argument.
 
  • #54
Bjørn Bæverfjord said:
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless.
:rofl: This is the single stupidest thing I've ever heard. The P4 only has ~250 instructions in the first place, so you're saying 80% of them are "useless"? Do you have any idea how carefully instruction sets are selected, and how much communication there is between compiler designers (i.e. Microsoft) and the teams which design the processors?
The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.
WOW! What a concept! Let's call it, I don't know... RISC! What makes your post particularly entertaining, Bjørn, is that you began by arguing that CPUs should have more instructions for directly supporting languages like BASIC, but have now completely reversed your position and support RISC architectures. Bravo! Great show!

- Warren
 
Last edited:
  • #55
Bjørn Bæverfjord said:
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless. The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.

As I said before the point was to make an optimal solution, not to be incompetent just to win an argument.

You're assuming that the new instructions you add are not going to increase the complexity more than the instructions you just removed. That's a very poor assumption.
 
