Why not hardwire software onto CPU chips for improved efficiency?

  • Thread starter: nameta9
  • Tags: Software

Summary: The discussion centers on the feasibility of hardwiring software, such as operating systems and applications, directly onto CPU chips to enhance efficiency. Participants express skepticism, citing the high costs and technical challenges associated with hardcoding software, especially given the rapid evolution and frequent updates of modern applications. The conversation highlights that while some firmware is already embedded in chips, most user software is too complex and variable to be effectively hardwired. Concerns about security vulnerabilities and the implications of permanently embedding potentially flawed software are also raised. Ultimately, the consensus is that hardcoding software into CPUs is impractical due to the dynamic nature of software development and the significant costs involved.
  • #31
Hurkyl said:
(3) You're not even guaranteed to run faster. :-p CPUs are very well optimized devices -- I would not expect your result to be any better than simply compiling the program to machine language to run on the CPU. If you're using a FPGA for reconfigurability, instead of an ASIC, the speed discrepancy will be even greater!

Actually, if you translate a CPU-bound program into VHDL, you probably will get a speed increase. But translating a program into VHDL can be a very non-trivial task, and it's a waste of time if your program is I/O-bound anyway.
 
  • #32
The goal is not speed but simplifying software. A CPU that can only be programmed directly in a BASIC variant simplifies everything: there are no longer compilers, and it is easy to debug. I would add all those funky features of Perl like associative arrays, regular expressions, etc.
 
  • #33
oldtobor said:
The goal is not speed but simplifying software. A CPU that can only be programmed directly in a BASIC variant simplifies everything: there are no longer compilers, and it is easy to debug. I would add all those funky features of Perl like associative arrays, regular expressions, etc.

No, compilers are still there. You just have to compile into BASIC instead of assembly language.

A BASIC chip wouldn't make it any easier to write programs using BASIC, and it definitely wouldn't make it easier to write programs in other languages. The only possible advantage of using a physical machine instead of a virtual one is speed.
 
  • #34
No way, José. The compiler exists because it has to convert a high-level language into opcodes. In our CPU there are no longer opcodes but direct high-level instructions. The logic circuits take care of understanding them and activating registers and counters, etc. It is a true IDEAS machine that bypasses all we have always taken for granted in CPU design. With millions of transistors available, I think it is feasible. Then we only have ONE FUNKY HIGH-LEVEL LANGUAGE that takes care of everything; all software is built up starting from a higher level.

You have a register group that takes care of the FOR instruction, another for the NEXT, another for the GOTO, etc. You just write the program; the chip reads it from RAM and immediately executes it. No more debugging nightmares or incompatible software. Of course, industry and academia may not really want to simplify software for "cultural-economic" reasons...
 
  • #35
Bjørn Bæverfjord said:
All the functions of BASIC can be translated to VHDL and gain a massive increase in speed. If each function gets a huge speed increase, then the BASIC program will inherit the exact same speed increase without any change to the program. It is assembly that is hard to make faster, not higher-level languages.
Microarchitecture, like all engineering pursuits, is about trade-offs. Sure, you can build a small algorithm like an LFSR into an FPGA and run it at the maximum toggle-rate of the FPGA, and it will likely be faster than the equivalent algorithm running on a general-purpose computer which requires many instructions. However, as the complexity of your algorithm goes up (say, all the way to an algorithm that will interpret Perl), the advantages disappear. At some level of complexity, it will no longer be able to compete with the common opcoded CPU architecture.
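
For a concrete sense of the kind of "small algorithm" that maps well onto an FPGA, here is a minimal C sketch (illustrative only) of one step of a 16-bit Galois LFSR. On a general-purpose CPU this costs a handful of instructions per step; in an FPGA the same update is just a row of flip-flops and a few XOR gates advancing once per clock.

Code:
#include <stdint.h>

/* One step of a 16-bit Galois LFSR (taps 16, 14, 13, 11).
 * In software: shift, test, conditional XOR -- several instructions.
 * In an FPGA: a register plus a few XOR gates, one result per clock. */
uint16_t lfsr_step(uint16_t state)
{
    uint16_t lsb = state & 1u;   /* bit that falls off the end */
    state >>= 1;
    if (lsb)
        state ^= 0xB400u;        /* apply the feedback taps */
    return state;
}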

This is the reason modern microarchitecture is leaning more and more toward simpler hardware. First, there were CISC (complex instruction-set computing) chips, programmed mostly by hand. Today, RISC (reduced instruction-set computing) chips hold center stage; they have simpler control paths and more function units per unit die area. Next, VLIW (very long instruction word) CPUs will take off. VLIW eliminates most of the control path, relying on sophisticated compilers to generate very long continuous runs of instructions that directly control the chip's function units. Eventually TTA (transport-triggered architecture) might take over, which eliminates the control path completely.

What this means, of course, is that complexity is being moved out of the chip, and into the compiler. The advantages of this are numerous: the biggest stumbling block for today's microprocessors is die size; it takes a long time to send signals back and forth across a very large chip. In the early days of microprocessors, the control path used to dominate the chip's area, but why have a control path when you don't actually need one? Why not use that chip area for more function units to actually get things done? It eliminates much of the cross-chip communication that limits clock speeds, and uses the die area more effectively.

Furthermore, putting the complexity in the compiler rather than in the chip means that the arduous task of scheduling instructions and doing branch predictions happens at compile-time rather than at run-time. What that means is simple: it will take longer to compile your program, but much less time to execute it. Since programs are compiled once and run many times, this is certainly the way to go.
On the same silicon process and with the same die size there will be a large performance increase. Even just making a new CPU with special instructions specific to the needs of the language will give a nice speed increase.
This is true, but the answer is not to make a billion specialized instructions for every possible need; the trade-off is that your bloated chip now runs at 100 kHz.

- Warren
 
  • #36
oldtobor said:
In our CPU there are no longer opcodes but direct high-level instructions.
And the direct, high-level instructions are called 'opcodes.' Apparently, you just don't like the word opcode; but any atomic operation on a processor is called an opcode.

Your hardwired BASIC interpreter will still run a fetch-decode-execute cycle; it just happens that you've built in special opcodes that facilitate BASIC. It doesn't even mean your processor will run any faster; it just means it will be easier to program, and consequently have a larger control path. As I've already explained, this is an engineering trade-off: either it's easy to program by hand, or it runs fast. You really cannot have both! You cannot increase the complexity of the control path and not take a hit in speed.
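
To make that concrete, here is a minimal C sketch of such a fetch-decode-execute loop for a hypothetical machine whose opcodes happen to be BASIC-level operations; the opcode names and encoding are invented purely for illustration.

Code:
#include <stdint.h>

/* Hypothetical BASIC-flavored machine: the opcodes are high-level,
 * but the processor still fetches, decodes, and executes them in turn. */
enum opcode { OP_FOR, OP_NEXT, OP_GOTO, OP_PRINT, OP_HALT };

struct instr { enum opcode op; int32_t operand; };

void run(const struct instr *program)
{
    uint32_t pc = 0;                          /* program counter */
    for (;;) {
        struct instr in = program[pc++];      /* fetch */
        switch (in.op) {                      /* decode */
        case OP_GOTO: pc = (uint32_t)in.operand; break;   /* execute */
        case OP_HALT: return;
        default: break;   /* OP_FOR, OP_NEXT, OP_PRINT would dispatch
                             to their own hardware units here */
        }
    }
}
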
The logic circuits take care of understanding them and activating registers and counters, etc. It is a true IDEAS machine that bypasses all we have always taken for granted in CPU design. With millions of transistors available, I think it is feasible. Then we only have ONE FUNKY HIGH-LEVEL LANGUAGE that takes care of everything; all software is built up starting from a higher level.
The reason processors are getting 'dumber' (i.e. moving from CISC to RISC to VLIW to TTA) is not because programmers enjoy programming dumb chips; the reason is that dumb chips are fast. The bottom line is simple: there are 10,000 users for every programmer. Making the programmer's job a little more difficult is of no consequence; what matters is that the user's machine runs faster. You seem to be missing this.
No more debugging nightmares or incompatible software. Of course, industry and academia may not really want to simplify software for "cultural-economic" reasons...
I fail to see how a hardwired BASIC interpreter would eliminate debugging. Are you suggesting I could not write an algorithm that wouldn't work on such a chip? :smile:

And how would it eliminate incompatible software? Most software incompatibilities lie in data structures. Are you suggesting data structures will no longer exist in your world? Or that the only data structures anyone will ever be able to use are arrays? :smile:

- Warren
 
  • #37
chroot said:
The reason processors are getting 'dumber' (i.e. moving from CISC to RISC to VLIW to TTA) is not because programmers enjoy programming dumb chips; the reason is that dumb chips are fast. The bottom line is simple: there are 10,000 users for every programmer. Making the programmer's job a little more difficult is of no consequence; what matters is that the user's machine runs faster. You seem to be missing this.

In fact, I don't even agree that chips getting "dumber" makes things harder for programmers. It just makes things (slightly) harder for the compilers/interpreters.

I suppose it might make things harder for compiler/interpreter writers as well; but it's a lot easier to build a smarter compiler than it is to build a smarter chip.
 
  • #38
master_coda said:
I suppose it might make things harder for compiler/interpreter writers as well; but it's a lot easier to build a smarter compiler than it is to build a smarter chip.
Exactly. I just wanted to elucidate the trade-off for those reading the thread. The trade-off, of course, is a no-brainer.

- Warren
 
  • #39
chroot said:
However, as the complexity of your algorithm goes up (say, all the way to an algorithm that will interpret Perl), the advantages disappear.
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.

For example, the original Microsoft BASIC has a floating-point format that is not compatible with any modern CPU. A Pentium 4 would use 50 instructions to do something that can be done in a single clock cycle. The clock frequency would be the same because the job is not more complicated than what the P4 does; it is just different. You will find this speedup everywhere a program does something that is not directly supported by the CPU.
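
As a rough illustration of the bit-shuffling involved, here is a C sketch that converts a single-precision Microsoft Binary Format (MBF) value to an IEEE 754 float, assuming the classic 4-byte little-endian MBF layout; subnormals and overflow are ignored, so treat it as a sketch rather than production conversion code.

Code:
#include <stdint.h>
#include <string.h>

/* MBF single: [mantissa lo][mantissa mid][sign + mantissa hi][exponent],
 * value = 0.1mmm... * 2^(exp - 128). IEEE single: 1.mmm... * 2^(e - 127),
 * so the exponent field simply shifts down by 2. */
float mbf_to_ieee(const uint8_t mbf[4])
{
    uint8_t exponent = mbf[3];
    if (exponent < 3)            /* 0 means zero in MBF; 1-2 underflow IEEE normals */
        return 0.0f;

    uint32_t sign     = (uint32_t)(mbf[2] & 0x80u) << 24;      /* IEEE bit 31 */
    uint32_t mantissa = ((uint32_t)(mbf[2] & 0x7Fu) << 16)
                      | ((uint32_t)mbf[1] << 8)
                      |  (uint32_t)mbf[0];
    uint32_t bits = sign | ((uint32_t)(exponent - 2) << 23) | mantissa;

    float result;
    memcpy(&result, &bits, sizeof result);   /* reinterpret the bit pattern */
    return result;
}

For example, the MBF bytes {0x00, 0x00, 0x00, 0x81} (the value 1.0) come out as 1.0f.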
 
  • #40
Bjørn Bæverfjord said:
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.

For example, the original Microsoft BASIC has a floating-point format that is not compatible with any modern CPU. A Pentium 4 would use 50 instructions to do something that can be done in a single clock cycle. The clock frequency would be the same because the job is not more complicated than what the P4 does; it is just different. You will find this speedup everywhere a program does something that is not directly supported by the CPU.

So how will you increase the speed of the split function in Perl by 100 times? That's a relatively simple function in Perl.
 
  • #41
I don't know Perl and I only have some basics of hardware design, but looking at the split function, it doesn't look complicated to do. You could read the string into a bunch of parallel logic circuits, each circuit primed with the delimiter character. Each circuit would get one character of the string and the address of that character. It would look at the input character and see if it is the same as the delimiter character; if it is, it writes a reference to the subsequent character into the memory for the function's output. Also, a reference to the first character in the string would go directly to memory. In case of multiple delimiter characters in a row, it would only take a few levels of combinational circuits to write only the output of the last delimiter processor in the row of those trying to write. The speedup would probably not be 100x but it would be faster.
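
A software analogue may make the description easier to judge: the hypothetical helper below records the start offset of each field, writing a reference to the first character and to the character after the last delimiter of each run, as described above. It is only a sketch of the proposed behaviour, not Perl's actual split.

Code:
#include <stddef.h>

/* Record field-start offsets: offset 0, plus i+1 for every position i that
 * holds a delimiter not followed by another delimiter (so a run of delimiters
 * produces a single reference). Returns the number of offsets written. */
size_t split_offsets(const char *s, size_t len, char delim,
                     size_t *out, size_t max_out)
{
    size_t n = 0;
    if (len > 0 && n < max_out)
        out[n++] = 0;                       /* reference to the first character */
    for (size_t i = 0; i + 1 < len; i++)
        if (s[i] == delim && s[i + 1] != delim && n < max_out)
            out[n++] = i + 1;               /* reference to the following character */
    return n;
}

For "a,b,,c" with ',' as the delimiter it records offsets 0, 2, and 5; the hardware version would compute all the comparisons at once instead of looping.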
 
  • #42
BicycleTree said:
I don't know Perl and I only have some basics of hardware design, but looking at the split function, it doesn't look complicated to do. You could read the string into a bunch of parallel logic circuits, each circuit primed with the delimiter character. Each circuit would get one character of the string and the address of that character. It would look at the input character and see if it is the same as the delimiter character; if it is, it writes a reference to the subsequent character into the memory for the function's output. Also, a reference to the first character in the string would go directly to memory. In case of multiple delimiter characters in a row, it would only take a few levels of combinational circuits to write only the output of the last delimiter processor in the row of those trying to write. The speedup would probably not be 100x but it would be faster.

This description is no good. All of the important parts of the algorithm are glossed over by "write a reference ... into the memory for the function's output" and "a few levels of combinational circuits..."; these are the non-trivial parts of the algorithm. Without more details, I can't decide if your algorithm even works; I certainly can't determine whether this algorithm is faster, or how much real estate this circuit is going to eat up.
 
  • #43
Well, I said I do not know more than basic hardware design. One course over 2 semesters is all I have. So I don't know exactly how writing the references would operate; it should be in parallel somehow or it would be a bottleneck, but I don't know exactly how that would work. Also the reading of the string would have to be in parallel for each character or there would probably be no advantage. You'd probably have to design a special kind of memory for the reading and writing.

The combinational circuits would be based on the write assert bits from the character processors. I guess the tentative write assert bit for each character processor (processor X) would be XOR'd with the tentative write assert bit from processor X+1, and the result would be the write assert for processor X. So that would only be 1 level of combinational circuits.

Each processor would basically be an adder to test for equality, which is very fast.

The part of this process I do not fully understand is the memory; the other parts definitely could be sped up. But I don't know exactly how modern memory is designed and whether these things I am saying are possible. The only kind of memory that I understand is a very simple schematic design, which would not work for this.
 
  • #44
Actually, since the processors are going to have a dedicated function, they could just be a level of XNOR gates, one for each bit of the character, followed by two levels of AND gates, with the output of the second level being the write assert bit.
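
In software terms that comparator is just a bitwise XNOR followed by an AND-reduction over the eight result bits; here is a minimal C sketch of the same test.

Code:
#include <stdint.h>
#include <stdbool.h>

/* XNOR each bit of the character against the delimiter, then check that all
 * eight bits agree. In hardware: one level of XNOR gates feeding an AND tree. */
bool is_delimiter(uint8_t ch, uint8_t delim)
{
    uint8_t agree = (uint8_t)~(ch ^ delim);   /* bit i set iff the bits match */
    return agree == 0xFFu;                    /* all bits match -> equal */
}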
 
  • #45
Well, if anything, I see that the old robot's proposal is interesting at the least. I think I would approach the problem in a much simpler way, by simply implementing a high-level instruction set for the CPU. Just as you have a typical subtract instruction composed of an opcode and 2 operands, you could have a FOR...NEXT instruction made up of an opcode and 2 operands giving the start (I=0) and end (TO 50) (and maybe a 3rd giving the step). Or you could have a string accumulator like Perl's $_ and regular-expression instructions, and maybe SQR and RND assembler-level instructions, etc. You could end up having an almost 1-to-1 correspondence between a high-level language and its underlying assembler translation. Anyway, in this research I would start out simple, implementing an instruction set equivalent to a 4K ROM BASIC, and then extend it. Start with VHDL and ASICs. You might also want to look into whether a SOFTWARE-TO-HARDWARE CONVERTER exists or could be designed. Take any small program, and the converter could convert the entire program into a bunch of combinational circuits.
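
Purely to make the encoding idea concrete, here is a hypothetical C sketch of what such a FOR instruction might carry; the names, widths, and opcode values are invented for illustration, not taken from any real chip.

Code:
#include <stdint.h>

/* Hypothetical high-level instruction set: a FOR instruction that carries its
 * loop variable, start, limit, and step directly, roughly 1-to-1 with the
 * BASIC source line "FOR I = 0 TO 50 STEP 1". */
enum hl_opcode { HL_FOR = 0x10, HL_NEXT = 0x11, HL_GOTO = 0x12 };

struct hl_for {
    uint8_t opcode;    /* HL_FOR */
    uint8_t loop_var;  /* index of the loop variable (I) */
    int16_t start;     /* I = 0 */
    int16_t limit;     /* TO 50 */
    int16_t step;      /* STEP 1 (the optional third operand) */
};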
 
  • #46
nameta9 said:
Well, if anything, I see that the old robot's proposal is interesting at the least. I think I would approach the problem in a much simpler way, by simply implementing a high-level instruction set for the CPU. Just as you have a typical subtract instruction composed of an opcode and 2 operands, you could have a FOR...NEXT instruction made up of an opcode and 2 operands giving the start (I=0) and end (TO 50) (and maybe a 3rd giving the step). Or you could have a string accumulator like Perl's $_ and regular-expression instructions, and maybe SQR and RND assembler-level instructions, etc. You could end up having an almost 1-to-1 correspondence between a high-level language and its underlying assembler translation. Anyway, in this research I would start out simple, implementing an instruction set equivalent to a 4K ROM BASIC, and then extend it. Start with VHDL and ASICs. You might also want to look into whether a SOFTWARE-TO-HARDWARE CONVERTER exists or could be designed. Take any small program, and the converter could convert the entire program into a bunch of combinational circuits.

We already have instructions for computing values that can easily be implemented in hardware: square roots, trig functions, etc. Even a FOR...NEXT instruction isn't really useful; existing instructions like the x86 loop instruction provide the same functionality.

However, other instructions just can't be practically implemented. For example, almost any complex string operation would be a waste of time; the algorithm you're implementing will almost certainly spend most of its time loading the string from memory. You can try to keep as much of the string as possible on the CPU (or at least as close to it as possible), which is the whole point of things like caches, but that approach doesn't scale; you can't just stick an arbitrarily large chunk of memory onto a CPU.


Still, if you want to implement a chip with a complete high-level instruction set, go right ahead; if you're into hardware design, you might enjoy it. It's just not a particularly useful idea.
 
  • #47
Ah, there was an error in my earlier suggestion. The tentative write enable bit from processor X should not be XOR'd with the tentative write enable bit from processor X+1; the correct operation is (tentative WE for X) AND NOT(tentative WE for X+1).
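
In software terms the corrected rule collapses to a one-liner: with one mask bit per character position (bit i set if position i matched the delimiter), only the last delimiter of each run keeps its write enable.

Code:
#include <stdint.h>

/* Position i keeps its write enable only if it matched AND position i+1 did
 * not: final = match AND NOT(match shifted down by one position). */
uint32_t final_write_enables(uint32_t match_mask)
{
    return match_mask & ~(match_mask >> 1);
}

For example, a match mask of 0b01101 (positions 0, 2, and 3) yields 0b01001: position 3, the last of the run, keeps its write enable and position 2 is suppressed.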

The question is whether a single memory can be written to and read from in parallel, practically speaking. You can't just say that it definitely would be a bottleneck. It might be possible and practical, and then a string operation could be performed requiring no more time than an add operation.

Do graphics processors have memory that is accessed one word at a time, or are they set up in parallel? That would be the most similar application I can think of.
 
  • #49
BicycleTree said:
Ah, there was an error in my earlier suggestion. The tentative write enable bit from processor X should not be XOR'd with the tentative write enable bit from processor X+1; the correct operation is (tentative WE for X) AND NOT(tentative WE for X+1).

The question is whether a single memory can be written to and read from in parallel, practically speaking. You can't just say that it definitely would be a bottleneck. It might be possible and practical, and then a string operation could be performed requiring no more time than an add operation.

Do graphics processors have memory that is accessed one word at a time, or are they set up in parallel? That would be the most similar application I can think of.

Memory access definitely would be a bottleneck. You can write blocks of memory in parallel, but there's always an upper bound on the amount you can write at once. And there's also always an upper bound on the amount of a string your chip can process at one time.

Add operations are fast because the numbers you're adding have a fixed size. Working with strings is more like working with arbitrary-precision integers: an operation that you can't perform in constant time.
 
  • #50
How are you equating parallel memory to your parallel process? You could have done a search for parallel processing and come up with more applicable information. Heck, the Amiga I had in 1989 utilized limited parallel processing by offloading most functions to support chips, all coordinated by the main 68K processor. That would be a more analogous situation to your proposition than parallel memory, IMHO.

Now, I have to ask the question again: "Why bother?" Why try to hard-code even something like a programming language? Why spend time and money trying to develop a transistor array on some die to do what computer manufacturers stopped doing in the late '80s? I don't see a way (this is just me, mind you) that one could implement a higher-level language that operates faster than directly accessing the hundreds (or fewer, depending on the processor) of instructions already placed on current chips. Simply saying "Well, remove the opcodes and replace them with a direct HLL" minimizes the fact that current computer technology relies on only two states. You can build computers with more states; you can even build analog computers if you want, but those cost more with minimal improvements. Analog computers used to be commonplace in engine control units; however, a 20 MHz processor and a SAR ADC are a more than sufficient replacement and cost less in the end.

I see a lot of minimizing of difficulties here. Sure, BASIC can run from a ROM. BASIC is easy. But BASIC didn't include much of the functionality of modern HLLs, and BASIC was slow (assembly or C programs STILL ran faster on the Apple II and Commodore 64 than BASIC programs). The use of this as an example is a canard at best, because ROM BASIC went by the wayside due to its cost and relative performance. Software BASIC was and is faster.

So, why bother? Why bother locking yourself into a project that is not easily user-updatable? Ever flashed the BIOS on a PC? Most users will never, ever update their BIOS, and that says a lot about this idea. Forcing such an inconvenience just to fill a hole in Word or to upgrade one's on-die interpreter is less than plausible, IMHO. Why bother? You won't see a speed improvement (I reiterate the RISC mantra: more is most definitely not better) from adding an HLL to your die. You might see a speed improvement from placing a program in flash, but the improvement will only show up as a load-time improvement, not a runtime improvement.

Those are my thoughts.
 
  • #51
Well, you would break the string into blocks of characters, maybe 30 or so characters each. A string of 30 or fewer characters would take time 1, 31-60 characters time 2, and so on. Not constant time, except for small strings, but much faster than with standard instructions.

The advantage of specialized instructions would be of roughly the same kind as the advantage of many parallel processors.
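
For a flavour of block-at-a-time processing in software, here is a C sketch that checks eight characters per step for a delimiter using the well-known SWAR zero-byte trick; dedicated hardware could widen the block much further, but the time still grows with the number of blocks rather than being constant.

Code:
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Does this 8-character block contain the delimiter? The XOR turns matching
 * bytes into zero bytes; the subtract/AND trick then detects any zero byte. */
bool block_has_delim(const char block8[8], char delim)
{
    uint64_t v;
    memcpy(&v, block8, 8);                   /* load 8 characters at once */
    uint64_t x = v ^ (0x0101010101010101ULL * (uint8_t)delim);
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}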
 
  • #52
Bjørn Bæverfjord said:
So then we can just divide it into optimal steps that are each 100 times faster than the original. Since each step is 100 times faster, the total result will be 100 times faster. The point is to make it optimal, not to make it as slow as possible to support your view.
You apparently did not read my previous posts. You cannot simply stuff a chip with hundreds of instructions dedicated to hundreds of different specialized functions; the resulting chip would be enormous, expensive, and slow.

You do not seem to grasp the engineering trade-off in play here: a more complicated control path cannot run faster than a simpler one.

- Warren
 
  • #53
chroot said:
You apparently did not read my previous posts. You cannot simply stuff a chip with hundreds of instructions dedicated to hundreds of different specialized functions; the resulting chip would be enormous, expensive, and slow.
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless. The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.

As I said before, the point was to make an optimal solution, not to be incompetent just to win an argument.
 
  • #54
Bjørn Bæverfjord said:
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless.
:smile: This is the single stupidest thing I've ever heard. The P4 only has ~250 instructions in the first place, so you're saying 80% of them are "useless"? Do you have any idea how carefully instruction sets are selected, and how much communication there is between compiler designers (e.g. Microsoft) and the teams which design the processors?
The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.
WOW! What a concept! Let's call it, I don't know... RISC! What makes your post particularly entertaining, Bjørn, is that you began by arguing that CPUs should have more instructions for directly supporting languages like BASIC, but you have now completely reversed your position and support RISC architectures. Bravo! Great show!

- Warren
 
  • #55
Bjørn Bæverfjord said:
I read your post and what you wrote was completely wrong. The Pentium 4 has many hundreds of instructions, many of which are useless. The simple act of removing the things that are never used and changing the remaining instructions so they do the things that a specific program needs most will result in a large speed increase. There will be a simpler control path and each instruction will do more useful work.

As I said before, the point was to make an optimal solution, not to be incompetent just to win an argument.

You're assuming that the new instructions you add are not going to increase the complexity more than the instructions you just removed. That's a very poor assumption.
 
