Circuit design question -- 100% transistor functionality needed?

Summary:

A question for folks designing circuits: is it the case that 100% of transistors must be functional for the circuit to operate, or are a few transistors allowed to fail, with compensation circuits or redundancy put in place to prevent circuit functionality from degrading? I'm wondering about state-of-the-art SoCs, and whether all billion transistors need to work, or whether only a subset of the total transistor count needs to be functional for the circuit to operate appropriately.

 

Answers and Replies

berkeman
Mentor
It is commonly done in memory ICs, especially in ICs like large flash memories, where sectors can get worn out by frequent re-writing. Here is the result of a Google search on redundancy in memory ICs:

https://www.google.com/search?q=redundancy+in+memory+ICs&ie=utf-8&oe=utf-8&client=firefox-b-1
Redundancy is also used at the module level for critical systems like flight computers, where you can have 3 duplicate processing blocks that vote on decisions to be sure at least 2 of them agree...

https://en.wikipedia.org/wiki/Triple_modular_redundancy
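As a rough illustration of the 2-of-3 voting idea, here is a minimal sketch of a bitwise majority voter (the function name and values are just for illustration):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote: each output bit matches
    at least two of the three replica outputs."""
    return (a & b) | (a & c) | (b & c)

# If one replica is corrupted, the other two outvote it:
good = 0b1011
assert tmr_vote(good, good, 0b0000) == good
```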
 
FPGAs and various GPUs are also able to disable (physically or in software) faulty parts of a chip.
 
For high-end components and space applications, this makes sense. But what about mobile SoCs, where millions of parts are mass-produced? If a transistor randomly failed, would it make the SoC inoperable?
 
berkeman
Mentor
Or IoT, for example. The designs are simplistic and cheap. Do they leverage redundancy, or is it the case that if one transistor fails, the entire system is a dud? For SoCs, I was thinking of a random transistor in an A11 chip, for example. Basically, I'm wondering if there's a correlation between circuit functionality and the percentage of transistors in a chip that remain functional.
 
berkeman
Mentor
Or IoT for example. The designs are simplistic and cheap. Do they leverage redundancy?
It depends, but redundancy usually costs something, so the lowest-cost systems may not have any redundancy beyond fault detection and alarm notification (requesting to be replaced).

There are some exceptions for IoT systems that want some redundancy with a modest increase in cost. Some of our wired communication networks have fault detection and the ability to switch to a parallel wired network path to try to avoid the faulty wire segment. And in distributed RF systems that use Mesh communication protocols, single points of failure are routed around as part of the Mesh communication matrix (which can change with time to be fault tolerant)...

https://new.abb.com/network-management/communication-networks/wireless-networks/rf-mesh/architecture
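To illustrate the routing-around-failures idea, here is a toy sketch of finding a path through a mesh while avoiding failed nodes (the graph and names are made up; real mesh protocols are far more involved):

```python
from collections import deque

def route(adjacency, src, dst, failed):
    """Breadth-first search for a path from src to dst that
    avoids nodes in `failed` (single points of failure)."""
    if src in failed or dst in failed:
        return None
    queue, prev = deque([src]), {src: None}
    while queue:
        node = queue.popleft()
        if node == dst:  # reconstruct the path by walking back
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adjacency[node]:
            if nxt not in prev and nxt not in failed:
                prev[nxt] = node
                queue.append(nxt)
    return None  # destination unreachable

# A small ring: traffic reroutes the long way around a failed node.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
assert route(ring, 0, 2, failed={1}) == [0, 3, 2]
```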
For SoCs, I was thinking of a random transistor in an A11 chip, for example.
I don't know for sure, but I'm guessing that the A10...A11 SoCs do not have much in the way of redundancy, unless it's in their Memory blocks. Again, fault detection (via checksums, etc.) would generally be used to identify when a block or device has a problem...

https://en.wikipedia.org/wiki/Apple_A11
 
I think it's important to distinguish between fault detection and handling versus redundancy.

Only very specific situations require actual redundancy, i.e., the system has another unit available to take over in case of failure. Most consumer products don't have any redundancy; they may have fault handling and recovery (reset!), but not redundancy.

Most circuits will not work correctly if even one transistor is not working. If the circuit works without that transistor, then the transistor was superfluous (unnecessary BOM cost!), unless the intent was a redundant system.

I think it's a stretch to call the reduced capacity of flash memory redundancy. If you have 4 GB of memory, some wears out, and now only 3.95 GB is usable, is that redundancy or faulty memory? If the chip was listed as 4 GB and had, say, another 2 GB "hidden" from the user's view that was made available as the other memory failed, then it's redundant.

Even when redundancy is required (e.g., automotive ASIL D systems), there is much debate about what this actually means. Is a dual-core micro "redundant"? If there is a proper epoxy-popping failure in one core, it's unlikely the other one is still running. Do two dies in one package get around this? For example, the faulted core might have grounded out the core power, and unless the power supply lines are also redundant, the second core is also not going to work. Then it very rapidly descends into a discussion about how expensive things are allowed to be.
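The "hidden spare blocks" scheme described above can be sketched roughly like this (a toy model; real flash controllers also do wear leveling, ECC, and much more):

```python
class SparedMemory:
    """Toy flash-style block device: `spares` extra blocks are hidden
    from the user; failed visible blocks are remapped onto them, so the
    advertised capacity holds until the spare pool is exhausted."""
    def __init__(self, visible_blocks: int, spares: int):
        self.capacity = visible_blocks
        self.spare_pool = list(range(visible_blocks, visible_blocks + spares))
        self.remap = {}  # logical block -> physical spare block

    def mark_failed(self, block: int) -> bool:
        """Retire a worn-out block; returns False when no spare is left."""
        if not self.spare_pool:
            return False
        self.remap[block] = self.spare_pool.pop(0)
        return True

    def physical(self, block: int) -> int:
        """Translate a logical block number to its physical location."""
        return self.remap.get(block, block)

mem = SparedMemory(visible_blocks=4, spares=2)
assert mem.mark_failed(1) and mem.physical(1) == 4
assert mem.capacity == 4   # user-visible size is unchanged
```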
 
In the IC electronics world, is there a metric that manufacturers put on their product saying that if a certain % of the transistors fail, the circuit will still be operational? For example, Apple tells TSMC that for their next iPhone chip, they will deliver an SoC that is 100% functional, but if after, say, 1 year of use 2% of the transistors fail, the likelihood of the IC still working is 70%. Is that something that is spec'd?
 
In the IC electronics world, is there a metric that manufacturers put on their product saying that if a certain % of the transistors fail, the circuit will still be operational? For example, Apple tells TSMC that for their next iPhone chip, they will deliver an SoC that is 100% functional, but if after, say, 1 year of use 2% of the transistors fail, the likelihood of the IC still working is 70%. Is that something that is spec'd?
Not all chips. Memory chips can tolerate some failed bits, and they may have some on-chip spares. But many other chips require 100% of the components to work correctly.

In the past, we used to tolerate computer screens with some failed pixels, but it has been many years since I've seen that.
 
Not all chips. Memory chips can tolerate some failed bits, and they may have some on-chip spares. But many other chips require 100% of the components to work correctly.

In the past, we used to tolerate computer screens with some failed pixels, but it has been many years since I've seen that.
If we were to focus on ICs for space (rad-hard designs), then there's certainly redundancy there. However, the spec that companies use is not a count of working transistors, but the ionizing dose, and for how long their ICs can survive at that particular dose. I'm wondering if there's a similar metric used for non-space applications, not necessarily for commercial ICs (as you mentioned, 100% of the components need to work), but for ICs in critical applications.
 
eq1
Gold Member
I'm wondering about state-of-the-art SoCs, and whether all billion transistors need to work, or whether only a subset of the total transistor count needs to be functional for the circuit to operate appropriately
Totally depends on the design, but even within a design not all transistors are equivalent. Using the memory example cited by others, often one can lose a few transistors (wear is a thing too) and just drop some bits, but the memory controller block in a memory IC will often need all of its transistors to function properly. Or, often SoCs are tested and debugged via JTAG. If a single transistor were nonfunctional in that block, the whole part would most likely be useless. The controllers are usually a small fraction of the overall transistors in the die, so the probability they contain the failed device is small.

But yeah, with billions of transistors, one needs an absurdly low defect rate to keep the overall product yield high enough to be economical. It's kind of a modern miracle that it works as well as it does, IMHO.
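For a back-of-the-envelope feel for why defect rates must be so low, the classic Poisson yield model says the fraction of good dies is exp(-area x defect density). A quick sketch (the numbers are illustrative):

```python
import math

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Classic Poisson yield model: the probability that a die has
    zero killer defects is exp(-area * defect_density)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

# A 1 cm^2 die at 0.1 defects/cm^2 yields ~90%; the same defect
# density on a 4x larger die yields only ~67%.
assert round(poisson_yield(1.0, 0.1), 2) == 0.90
assert round(poisson_yield(4.0, 0.1), 2) == 0.67
```

Note how yield falls exponentially with die area, which is one reason huge SoCs are so sensitive to the process defect density.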

ZeroFunGame said:
For example, if Apple tells TSMC that for their next iphone chip, they will deliver an SoC that is 100% function, but if after say 1 year of use and 2% of their transistors fail, the likelihood of the IC still working is 70%? Is that something that is spec'd?
Usually this is spec'ed via aging models and verified with stress (overvoltage, overtemperature, etc.) data. The biggest problem is that transistors slow down with age (or, equivalently, they require more voltage to achieve the same delay), and if they get too slow over time, the digital circuit will fail timing. When one creates a voltage budget for the SoC, this will be one of the line items.

For the SOC though everything needs lifetime models, not just the transistors. Packaging, bond wires, etc. can all fail as they are under mechanical stress too.

But for transistors see:
https://semiengineering.com/taming-nbti-to-improve-device-reliability/
https://en.wikipedia.org/wiki/Negative-bias_temperature_instability
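The "transistors slow down with age until they fail timing" point can be sketched with a toy aging model (the power-law constants below are invented for illustration, not real foundry aging data):

```python
def vth_shift_mV(years: float, a: float = 30.0, n: float = 0.2) -> float:
    """Illustrative NBTI-style power-law drift: delta-Vth ~ a * t^n.
    `a` and `n` are made-up fitting constants, not foundry data."""
    return a * years ** n

def meets_timing(clock_ps: float, path_delay_ps: float,
                 ps_per_mV: float, years: float) -> bool:
    """Does the critical path still fit in the clock period after
    aging slows the transistors down?"""
    aged = path_delay_ps + ps_per_mV * vth_shift_mV(years)
    return aged <= clock_ps

# A path with 50 ps of slack when new can fail timing after aging,
# which is why a voltage/timing budget reserves margin for it.
assert meets_timing(1000.0, 950.0, ps_per_mV=1.0, years=1.0)
assert not meets_timing(1000.0, 950.0, ps_per_mV=2.0, years=10.0)
```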
 
In the IC electronics world, is there a metric that manufacturers put on their product that says if a certain % of the transistors fail, the circuit would still be operational?
The only thing slightly similar is solid-state (flash) storage wear management, which can handle faulty blocks, but in general redundancy and fault tolerance don't really work like that.
I've seen somewhere that in the case of big FPGAs it is possible to bypass faulty logic blocks too, but that's not my world; I don't know how that works exactly.

Rad-hardened parts and high-level redundancy are also a different world from the management of faulty transistors.
 
analogdesign
Science Advisor
In the IC electronics world, is there a metric that manufacturers put on their product that says if a certain % of the transistors fail, the circuit would still be operational? For example, if Apple tells TSMC that for their next iphone chip, they will deliver an SoC that is 100% function, but if after say 1 year of use and 2% of their transistors fail, the likelihood of the IC still working is 70%? Is that something that is spec'd?
I think you're looking at this wrong. In ICs, it is simply not a typical failure mode for transistors to just "stop working". If they work when they are manufactured, they typically keep working for a long, long time (unless you're working in an extreme environment, such as radiation or cold temperature, where stress can cause devices to fail; in that case you have a mean-time-between-failures (MTBF) specification). There are many transistors in a device such that if a single one failed, the chip would fail (a pull-up device in an accumulator, or a power-down switch, for instance).

In memories (especially DRAMs) soft errors (or single-event effects) are common. This means a bit in the RAM was flipped due to a cosmic ray, heavy ion hit, or similar. This doesn't cause the transistors to fail permanently, rather it causes a data error that the extra transistors included for error correction can fix.
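As a concrete example of the error-correction idea, here is a minimal Hamming(7,4) single-error-correcting code, the textbook building block behind memory ECC (a sketch of the principle, not any specific DRAM's ECC implementation):

```python
def hamming74_encode(d):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword
    laid out as [p1, p2, d1, p4, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(c):
    """Recompute parity; the syndrome gives the 1-based position of a
    single flipped bit (0 means no error). Returns the data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = hamming74_encode(data)
word[4] ^= 1                      # a "cosmic ray" flips one bit
assert hamming74_correct(word) == data
```

The extra parity transistors let the data survive a single flipped bit per word, exactly the soft-error scenario described above.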

Memories also typically have a lot of redundancy to handle manufacturing losses, so if a column of your memory is dead on arrival, it can still function fine in practice so you don't have to throw the chip away.

TSMC does not specify lifetime in the way that you suggest. Lifetime is determined (among other metrics) by Iq of the device (current draw) and chips die when they start leaking too much due to lattice damage. This happens faster at high temperature, but at room temperature the MTBF for a typical IC is 1000s of years or more. The packaging and bonding will fail far earlier than the chip. The silicon itself is extremely reliable (once it has been demonstrated to work at all).
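The "dies faster at high temperature" behavior is usually quantified with an Arrhenius acceleration factor between two temperatures. A quick sketch (0.7 eV is a commonly assumed activation energy for illustration, not a universal value):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor: how much faster a thermally
    activated failure mechanism proceeds at the stress temperature
    than at the use temperature."""
    t_use = t_use_c + 273.15      # convert to kelvin
    t_stress = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV * (1 / t_use - 1 / t_stress))

# Stressing at 125 C ages a part dozens of times faster than use at
# 55 C, which is why short burn-in tests can screen for early failures.
af = arrhenius_af(55.0, 125.0)
assert af > 50
```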

Performance (not functionality) can drift over time, especially for analog circuits. In this case, the circuits typically are quite programmable to adjust biases and gains and so on. The additional registers used to tune analog circuits are called chicken bits and are necessary in modern SoC-targeted processes due to the extreme performance variability of the raw devices. It is next to impossible to design a high-performance op amp in a modern nanometer process these days without some type of tuning or calibration.

There is a spec that TSMC will give Apple regarding manufacturing, and that is yield: the percentage of fabricated chips that actually work. It can vary wildly depending on many factors and is critically important to price. Once the nonrecurring engineering costs are paid, the price per wafer is typically fixed. This means higher yield means more chips per wafer, or equivalently, lower cost per chip. Apple and TSMC care very much about yield, believe me.
 
TSMC does not specify lifetime in the way that you suggest. Lifetime is determined (among other metrics) by Iq of the device (current draw) and chips die when they start leaking too much due to lattice damage. This happens faster at high temperature, but at room temperature the MTBF for a typical IC is 1000s of years or more. The packaging and bonding will fail far earlier than the chip. The silicon itself is extremely reliable (once it has been demonstrated to work at all).
Thanks! I have two questions:

Are there mitigation techniques to compensate for frequency drift as a function of temperature? For example, an ADC that needs to operate from -100C to 200C, with a rate of change of 10C/min: is this something that can be compensated for in the design such that, even with wild temperature swings, your acquisition rate remains constant?

Prolonged use at high temp will cause degradation to occur faster. Are there mitigation techniques to prevent the circuit from degrading at prolonged high T usage? Or is this completely out of the designer's hands and solely dependent on the PDK specs?
 
analogdesign
Science Advisor
Thanks! I have two questions:

Are there mitigation techniques to compensate for frequency drift as a function of temperature? For example, if an ADC that needs to operate within -100C to 200C, with a rate of change of 10C/min. Is this something that can be compensated for in the design such that even with wild temperature swings, your acquisition rate remains constant?
If you're asking about ADC sampling rate, that would be determined by a frequency synthesizer, not the ADC itself. There are designs with excellent temperature stability, but there is always a limit to how fast they can track temperature change. 10C a minute seems fast.

If you specify that the PLL in the frequency synthesizer needs to remain stable over -100C to 200C with a delta-T of 10C/min, then you could custom-design to meet that specification. There are absolutely mitigation techniques to compensate for temperature drift. One simple analog technique is using a constant-transconductance circuit in the bias network (it basically ensures transconductance stays constant over temperature, but it can be expensive from a power standpoint). If you have a calibration circuit, you could calibrate as the temperature changes.
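The calibrate-over-temperature idea can be sketched as a simple piecewise-linear lookup of a tuning code from calibration points measured at production test (all numbers here are illustrative):

```python
def interp_cal(cal_points, temp_c):
    """Piecewise-linear lookup of a tuning code from calibration
    points [(temperature, code), ...] measured at test time.
    Clamps to the end points outside the calibrated range."""
    pts = sorted(cal_points)
    if temp_c <= pts[0][0]:
        return pts[0][1]
    if temp_c >= pts[-1][0]:
        return pts[-1][1]
    for (t0, c0), (t1, c1) in zip(pts, pts[1:]):
        if t0 <= temp_c <= t1:
            return c0 + (c1 - c0) * (temp_c - t0) / (t1 - t0)

# Codes measured at -40C, 25C, and 125C during production test:
cal = [(-40.0, 10.0), (25.0, 18.0), (125.0, 30.0)]
assert interp_cal(cal, 25.0) == 18.0
assert interp_cal(cal, 75.0) == 24.0
```

In practice a temperature sensor on the die drives the lookup at runtime; the accuracy of the scheme is limited by how fast the sensor and loop can track the actual junction temperature.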

Prolonged use at high temp will cause degradation to occur faster. Are there mitigation techniques to prevent the circuit from degrading at prolonged high T usage? Or is this completely out of the designer's hands and solely dependent on the PDK specs?
High temperature is challenging because if you get the temperature high enough, the silicon goes into degeneracy (basically, the valence band begins to functionally overlap with the conduction band because there are so many thermally generated free carriers). Assuming it isn't that hot, yes, you can prevent the circuit from degrading at prolonged high T.

The main failure mode is electromigration in the interconnect. Basically, at high temperature there is a lot of thermal motion in the electrons that constitute the DC current, and over time this can lead to open circuits (the metal is literally eaten away). The mitigation here is to use more metal in your wiring for hot operation than you would typically use at room temperature.

Another issue is bulk damage in the transistors themselves that leads to high body current. Typically, at high temperature the current flowing through your device increases for a variety of reasons, which can lead to hot-electron effects (damage to the bulk from high-energy electrons). The key mitigation is to design your circuit to minimize the electric field at the drain of each device. Besides biasing, an easy way to do that is to use long gates in your devices, but this compromises performance (while increasing reliability).
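The "use more metal" electromigration mitigation amounts to sizing wires so current density stays under a limit, with extra derating for hot operation. A toy sketch (the numbers are illustrative, not from any PDK):

```python
def min_wire_width_um(current_mA: float, jmax_mA_per_um: float,
                      derate: float = 0.5) -> float:
    """Size a wire against electromigration: pick a width such that
    current density stays below the rated limit, with extra margin
    (`derate`) for hot operation. `jmax_mA_per_um` is the current
    limit per micron of width for a fixed metal thickness; all
    values here are made up for illustration."""
    return current_mA / (jmax_mA_per_um * derate)

# A 2 mA line on a metal rated 1 mA/um needs 4 um when derated 2x,
# i.e. twice the width it would need at the nominal rating.
assert min_wire_width_um(2.0, 1.0, derate=0.5) == 4.0
```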

For almost every problem there is a solution the designer can figure out. We are almost never solely dependent on the PDK. This is part of what makes the field of IC design so interesting.
 
Are there mitigation techniques to compensate for frequency drift as a function of temperature? For example, if an ADC that needs to operate within -100C to 200C, with a rate of change of 10C/min. Is this something that can be compensated for in the design such that even with wild temperature swings, your acquisition rate remains constant?
-100C to 200C is a very hard thermal-cycling spec. You are very likely to develop failures in the solder joints of the package before the CMOS transistors themselves break. Compensation (from your text, it seems you'd compensate a built-in RC/ring-oscillator clock) is indeed possible, but be ready for poor compensation quality: an 8-20% frequency swing across -100C to 200C is realistic with modern (i.e., third-order plus calibration) compensation circuit designs.
Prolonged use at high temp will cause degradation to occur faster. Are there mitigation techniques to prevent the circuit from degrading at prolonged high T usage? Or is this completely out of the designer's hands and solely dependent on the PDK specs?
Avoid high gate voltages (more than 2/3 of the PDK specs), and try not to rely on PMOS in performance-critical analog parts. Also, take care to avoid forward-biasing parasitic diodes. This way you reduce the impact of the HCI, TDDB, and NBTI degradation mechanisms, and the remaining, relatively small PBTI is something you must just endure.
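The 2/3-of-maximum gate-voltage rule of thumb above is easy to automate as a derating check (a sketch; real reliability signoff is far more thorough):

```python
def derate_ok(vgs_applied: float, vgs_max_pdk: float,
              fraction: float = 2.0 / 3.0) -> bool:
    """Reliability derating rule of thumb from the post: keep the
    applied gate voltage under ~2/3 of the PDK's absolute maximum
    to limit HCI/TDDB/NBTI stress. Values are illustrative."""
    return vgs_applied <= fraction * vgs_max_pdk

assert derate_ok(1.0, 1.8)       # 1.0 V on a 1.8 V device is fine
assert not derate_ok(1.5, 1.8)   # too close to the rated maximum
```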

Electromigration, as analogdesign mentioned, is also an issue, although less so for chips with copper interconnects. Always use double contacts to silicon and add extra wire width. PDKs usually present these methods as "automotive" or "aerospace" design guidelines.
 
