NOR Flash failures at extreme low temperatures

  • Thread starter TheAnalogKid83
  • Tags
    Flash
In summary, an embedded system inside a metal housing runs normally at ambient temperature, but in a temperature chamber at -45 degrees Celsius it hangs, is reset by the watchdog, and then fails to boot. This has been seen on multiple units, and each time the same location in the boot image section of the NOR Flash holds the same wrong value. The system has been in the field, and the problem appeared only after a different PCB manufacturer was used. The unit is being run below the -40 C rating of its components, but that does not explain why the failure started with the new PCBs. Suggestions include monitoring memory bus activity with a logic analyzer and checking for changed parts or changed PCB films.
  • #1
TheAnalogKid83
I am working on a project that has an embedded system inside a metal housing. The unit can be turned on at ambient temperature, running application software or test software, and is then put inside a temperature-controlled environment that goes down to -45 degrees Celsius.

The system is allowed to stay at -45 C, but the inside of the housing is probably at a slightly higher temperature with all of the circuitry being powered.

While at -45C, the system begins to hang, and then the watchdog resets it. When it reboots, it fails to boot. This has happened on multiple units.

When the systems are taken out of -45 C and examined over JTAG, they repeatedly show the exact same memory location in the NOR Flash holding the exact same wrong value. This location is in the boot image section, which would explain why the unit doesn't boot, but it doesn't necessarily explain the hang, because the units run from the image after it has been copied to RAM at startup. This part is very confusing, because it makes the error seem not random at all (what are the odds of the exact same memory location having the exact same wrong value at each failure?). Also, the NOR Flash has built-in lock protection, so a reset should not be able to corrupt the boot image.
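
To pin down exactly which bits differ, a dump read back over JTAG can be compared byte by byte against the known-good boot image. The following is only a minimal sketch: the function name `diff_images`, the `base_addr` parameter, and the sample bytes are all made up for illustration and are not taken from the actual system. It reports each mismatching address and whether the bits moved from 1 to 0, from 0 to 1, or both.

```python
def diff_images(reference: bytes, dump: bytes, base_addr: int = 0):
    """Return (address, expected, actual, direction) for every differing byte."""
    mismatches = []
    for offset, (expected, actual) in enumerate(zip(reference, dump)):
        if expected == actual:
            continue
        cleared = expected & ~actual & 0xFF   # bits that went 1 -> 0
        raised = actual & ~expected & 0xFF    # bits that went 0 -> 1
        if cleared and not raised:
            direction = "1->0 only"
        elif raised and not cleared:
            direction = "0->1 only"
        else:
            direction = "mixed"
        mismatches.append((base_addr + offset, expected, actual, direction))
    return mismatches


if __name__ == "__main__":
    # Placeholder data, NOT from the real unit: one byte differs by one bit.
    reference = bytes([0xFF, 0xA5, 0x3C, 0x00])
    dump = bytes([0xFF, 0xA1, 0x3C, 0x00])
    for addr, exp, act, direction in diff_images(reference, dump, base_addr=0x1000):
        print(f"0x{addr:08X}: expected 0x{exp:02X}, read 0x{act:02X} ({direction})")
```

The direction may be a useful hint. Programming a NOR Flash cell can only change bits from 1 to 0, and only an erase returns them to 1, so a "1->0 only" mismatch would at least be consistent with a stray program operation, while any 0->1 change would point elsewhere.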

This system has been out in the field, and this problem has only been noticed since a different PCB manufacturer was used, so the manufacturing process has changed; however, the layout has remained unchanged, and the boards have been examined to see if there has been any noticeable change without finding anything.

The limiting operating temperature spec on the internal components is -40 C, so the unit is being used outside of its limits, but this still does not explain why this didn't happen before the PCBs were changed, or why this specific, nonrandom failure occurs.

Has anyone had an experience like this or have any helpful ideas what could be the problem?
 
  • #2
Is this flash embedded in a micro, or is it separate, e.g. a CF card?
We have had issues with CF cards changing timing from batch to batch.

The same location is probably not random - it may be the first cell to require power on all address lines or some other 'analog' issue. It's quite common for low power or high speed parts (especially ADCs) to fail in marginal conditions when they need to drive all output pins at the same time.

I don't know if NAND flash is any better temperature wise.
 
  • #3
the NOR Flash is an external BGA IC on the memory bus.
 
  • #4
Is the system writing to the flash at that temperature? What is the minimum operating temperature for the flash (you say -40C above, but are you sure?), and what is the minimum temp for writes (also -40C?).

One possibility is that the RAM accesses are violating timing at -45C, and that is causing the processor to go nuts enough for it to do a write to the flash to corrupt it. I agree it's weird that it seems to go nuts in the same way each time, but that could be pattern dependent (I've seen that in some marginal memory timing cases).

I'd be inclined to instrument a system so that you can watch all memory bus activity on a logic analyzer, while the test system is in an oven. Work with different trigger patterns to try to capture the bad write to flash, and then back up from there to see where execution is getting corrupted.
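
Once a capture has been exported from the analyzer, the search for the bad write can also be done offline. Here is a rough sketch, assuming the capture is exported as CSV with hypothetical columns `time_us`, `addr`, `data`, and an active-low `we_n`; the flash window (base 0, 4 MB) is likewise a placeholder and not taken from this system. It keeps only the bus cycles that are writes landing inside the flash address range.

```python
import csv
import io


def find_flash_writes(rows, flash_base, flash_size):
    """Yield (time_us, addr, data) for captured writes inside the flash window."""
    for row in rows:
        addr = int(row["addr"], 16)
        is_write = row["we_n"] == "0"   # active-low write enable asserted
        in_flash = flash_base <= addr < flash_base + flash_size
        if is_write and in_flash:
            yield float(row["time_us"]), addr, int(row["data"], 16)


if __name__ == "__main__":
    # Placeholder capture, NOT real analyzer output.
    capture = io.StringIO(
        "time_us,addr,data,we_n\n"
        "10.0,0x20000010,0x1234,0\n"   # write outside the flash window, ignored
        "10.5,0x00000554,0x00AA,0\n"   # write inside the flash window
        "11.0,0x00000100,0xFFFF,1\n"   # read from flash, ignored
    )
    for t, addr, data in find_flash_writes(csv.DictReader(capture), 0x0, 0x400000):
        print(f"{t:.1f} us: write 0x{data:04X} to 0x{addr:08X}")
```

Any hit is a starting point for backing up through the capture to see what the processor was executing just before the write.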

You could also instrument a board with high-speed 'scope connectors, so that you can watch the external memory timing change with temperature, to look for any obvious bad trends in the timing margins.
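
If the margin is measured at a few temperatures, a simple straight-line fit gives a first look at the trend. This is a sketch only: `fit_line` and `margin_at` are illustrative names, the numbers are made-up placeholders rather than measurements, and treating margin as linear in temperature is itself an assumption that real parts may not follow.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def margin_at(points, temp_c):
    """Extrapolate the fitted timing margin (ns) to a given temperature (C)."""
    slope, intercept = fit_line(points)
    return slope * temp_c + intercept


if __name__ == "__main__":
    # (temperature in C, measured margin in ns): placeholder values only.
    measured = [(25, 2.0), (0, 1.5), (-25, 1.0)]
    print(f"estimated margin at -45 C: {margin_at(measured, -45):.2f} ns")
```

A margin heading toward zero, or a clear difference in slope between an old board and a new one, would be worth chasing with the scope.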

BTW, you said the change seemed to happen when you switched over your PCB fab house. Could it maybe be instead that you started getting a different date code of the flash or some other memory bus component instead? What is your memory decode logic and controller like?
 
  • #5
Couple other thoughts... Be sure to re-check all the parts on the board, paying particular attention to any that might have been changed on the bill of materials (BOM) or approved vendor list (AVL). Usually such changes will have to be approved by the responsible engineer (RE), but sometimes manufacturing/production will take it upon themselves to make a change if they are having supply chain problems. I've had a case where a 3.3V CPLD was substituted for a 5V CPLD by mistake, and I wasn't on the approval list. The problems this caused were subtle enough to go unnoticed for a few weeks of production, but thankfully surfaced soon enough that we were able to recall the units that were shipped, and ship corrected units.

Even when a part substitution seems okay on the surface (like using a faster speed grade of RAM), you really have to spend the time going item by item in the datasheets, to see if some other obscure spec may have changed. I've seen several instances where going to the next faster speed grade of RAM has also changed a different access time spec in an unexpected direction, or changed the drive strength of some signal.
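
If both the old and the new build have a BOM export, the comparison can be mechanical. A minimal sketch, assuming each BOM has already been loaded into a dict mapping reference designator to manufacturer part number (`diff_bom` and the part numbers below are placeholders, not real parts):

```python
def diff_bom(old, new):
    """Compare two BOMs given as {reference designator: part number}.

    Returns (ref, old_part, new_part) for every designator whose part
    changed; None marks a designator missing from one of the BOMs.
    """
    changes = []
    for ref in sorted(old.keys() | new.keys()):
        before, after = old.get(ref), new.get(ref)
        if before != after:
            changes.append((ref, before, after))
    return changes


if __name__ == "__main__":
    old_bom = {"U1": "FLASH-A", "U2": "RAM-70NS", "U3": "CPLD-5V"}
    new_bom = {"U1": "FLASH-A", "U2": "RAM-55NS", "U3": "CPLD-5V", "C9": "CAP-X"}
    for ref, before, after in diff_bom(old_bom, new_bom):
        print(f"{ref}: {before} -> {after}")
```

This only flags that a part number changed. Each flagged part still needs the item-by-item datasheet comparison, and a BOM diff will not catch a substitution made on the floor without a paperwork change.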

On the PCB, ask to check the films of the new PCB, and compare those to the old films. I'm talking about the actual film layers used in the PCB fabrication, not just some check films that they might have sent you. I have seen SO many instances of changes on the final fab films that were not in the PCB layout review package (*.PCB file or *.PDF plots of the layers, etc.), so I almost always go to the fab house and ask to see the final films before they shoot the first boards. Especially if it is a new fab house for us, or if it is a fab run that is critical for a schedule, then I need to see the final films.

There are a couple reasons for this. First, fab houses will often have design rules for their PCBs that will incline them to "touch up" the final film layers, as compared to the Gerber files that we sent them. I've had instances where they didn't recognize the spark gap structures that we have on some PCBs for ESD protection (little exposed metal points in the layout on the component side), and they've either deleted them or reshaped them ("what are those points for, must be a mistake, I'll fix it for them"), or covered them with soldermask. Bad news.

I've also had one instance where a design rule check in the *.PCB file had thrown a warning and had drawn a little arrow to the offending via on an inner layer, and the layout person had forgotten to suppress markers like that in the final Gerber output files. So when I reviewed the fab films right before shooting some PCBs on a very tight schedule project, I noticed a line on an inner layer that didn't look correct, and in fact would have shorted out some traces if left in there. Bad news.

Finally, one of the most vexing problems came on a PCB where I hadn't checked the final films (the change from the previous PCB was very small). It turned out that because of a build script error by our PCB layout person (who is normally very good), a single ground via to our microcontroller from the inner ground plane was deleted. The uC still had 5 of its 6 ground pins connected correctly to the ground plane, but that one ground connection being open caused very subtle and pattern-dependent memory bus errors (which varied with temperature, etc.). The only way I found the error finally was holding a blank PCB up to the light, and noticing in comparing the old versus new PCB, that there was a via missing. Ouch!
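
A missing via like that one can in principle be caught from the drill data rather than by eye. The sketch below assumes the hole coordinates of the old and new fabrication outputs have already been extracted into lists of (x, y) pairs in the same units and origin; the parsing is not shown, and `missing_holes` and the coordinates are hypothetical. It lists every hole in the old set that has no counterpart within a small tolerance in the new set.

```python
def missing_holes(old_holes, new_holes, tol=0.01):
    """Return holes from old_holes with no hole within tol in new_holes."""
    missing = []
    for ox, oy in old_holes:
        matched = any(
            abs(ox - nx) <= tol and abs(oy - ny) <= tol
            for nx, ny in new_holes
        )
        if not matched:
            missing.append((ox, oy))
    return missing


if __name__ == "__main__":
    # Placeholder coordinates in mm, NOT from a real board.
    old = [(10.00, 12.50), (10.00, 15.00), (22.86, 15.00)]
    new = [(10.00, 12.50), (22.86, 15.00)]
    for x, y in missing_holes(old, new):
        print(f"hole at ({x:.2f}, {y:.2f}) is missing from the new data")
```

This compares only what was sent to the fab house, so it does not replace looking at the final films, where the fab house's own edits show up.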

So double-check all the components, and check the new versus the old PCB in great detail. Hope all that helps. Good luck, and let us know what you find.
 

1. What causes NOR Flash failures at extreme low temperatures?

The thread does not establish a root cause. The components were rated to -40 C and the unit was run at -45 C, so it was operating outside its specified range. The replies suggest three possibilities: marginal memory bus timing at low temperature causing the processor to issue a stray write to the flash, a part on the board being substituted without approval, or a difference in the PCB as fabricated by the new manufacturer.

2. How does extreme low temperature affect the reliability of NOR Flash memory?

Below the rated minimum temperature, the datasheet timing and program/erase parameters are no longer guaranteed, so accesses that work at ambient temperature can become marginal. As the thread points out, such marginal failures can be pattern-dependent, which may explain why the same location shows the same wrong value each time.

3. Can NOR Flash memory be designed to withstand extreme low temperatures?

Flash devices are sold in different temperature grades, so the direct approach is to select parts whose rated operating range covers the environment, and to confirm the minimum temperature for writes as well as for reads. The other components on the memory bus need the same check.

4. How can NOR Flash memory be tested for extreme low temperature reliability?

NOR Flash memory can be tested for extreme low temperature reliability by subjecting it to controlled low temperature environments and monitoring its performance and data integrity. This can help identify any potential failure points and inform design and manufacturing improvements.

5. Are there any strategies for mitigating NOR Flash failures at extreme low temperatures?

Yes, there are several strategies for mitigating NOR Flash failures at extreme low temperatures. These include using specialized materials and designs, implementing error correction and detection algorithms, and incorporating temperature monitoring and control mechanisms. Regular testing and maintenance can also help identify and address potential issues before they become failures.
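
As one example of error detection, a checksum stored alongside the boot image lets a corrupted image be detected before it is used. The following is a host-side sketch using CRC-32 from Python's `zlib` module; the function name `image_is_intact` and the sample image are illustrative, and on the target the same check would normally live in the bootloader.

```python
import zlib


def image_is_intact(image: bytes, expected_crc: int) -> bool:
    """Return True if the CRC-32 of the image matches the stored value."""
    return (zlib.crc32(image) & 0xFFFFFFFF) == expected_crc


if __name__ == "__main__":
    # Placeholder image contents, NOT a real boot image.
    good_image = bytes([0xFF, 0xA5, 0x3C, 0x00])
    stored_crc = zlib.crc32(good_image) & 0xFFFFFFFF

    corrupted = bytes([0xFF, 0xA1, 0x3C, 0x00])   # one bit cleared
    print("good image intact:", image_is_intact(good_image, stored_crc))
    print("corrupted image intact:", image_is_intact(corrupted, stored_crc))
```

A CRC only detects the corruption. Recovering from it needs something more, such as a second copy of the boot image to fall back on.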
