NOR Flash failures at extreme low temperatures

  • Thread starter: TheAnalogKid83
  • Tags: Flash
AI Thread Summary
The discussion centers on issues with NOR Flash memory failures in an embedded system operating at extreme low temperatures, specifically -45 degrees Celsius. The system hangs and fails to boot after being exposed to this temperature, with repeated failures occurring at the same memory location in the NOR Flash, raising concerns about the reliability of the new PCB manufacturer. Potential causes include timing violations during RAM access at low temperatures and possible changes in component specifications due to the new manufacturing process. Participants suggest monitoring memory bus activity and comparing the new PCB's fabrication details with the old to identify any discrepancies. The conversation emphasizes the importance of thorough component checks and understanding the implications of any changes made during manufacturing.
TheAnalogKid83
I am working on a project that has an embedded system inside a metal housing. The unit can be turned on at ambient temperature, running either application software or test software, and is then put inside a temperature-controlled environment that goes down to -45 degrees Celsius.

The system is allowed to stay at -45 C, but the inside of the housing is probably at a slightly higher temperature with all of the circuitry being powered.

While at -45C, the system begins to hang, and then the watchdog resets it. When it reboots, it fails to boot. This has happened on multiple units.

When the systems are taken out of -45 C and read out over JTAG, they repeatedly show the exact same memory location in the NOR Flash holding the exact same wrong value. This location is in the boot image section, which would explain why the unit doesn't boot. It doesn't necessarily explain the hang, though, because the units run from the copy of the image that was loaded into RAM at startup. This part is very confusing, because it makes the error look not random at all (what are the odds of the exact same memory location having the exact same wrong value at each failure?). Also, the NOR Flash has built-in lock protection, so the boot image should not get corrupted when the unit resets.
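For anyone wanting to confirm the "same location, same wrong value" observation across units, here is a minimal sketch that compares JTAG dumps against the known-good boot image. The unit names and the 16-byte image are made-up stand-ins; real dumps would be read from files.

```python
from collections import Counter

def diff_image(reference: bytes, dump: bytes):
    """Return (offset, expected, actual) for every byte that differs."""
    if len(reference) != len(dump):
        raise ValueError("reference and dump must be the same length")
    return [(i, r, d) for i, (r, d) in enumerate(zip(reference, dump)) if r != d]

def common_failures(reference: bytes, dumps: dict) -> Counter:
    """Count how many units show each (offset, expected, actual) mismatch."""
    counts = Counter()
    for dump in dumps.values():
        counts.update(diff_image(reference, dump))
    return counts

# Hypothetical data: a tiny "boot image" and dumps from three units.
reference = bytes(range(16))
bad = bytearray(reference)
bad[5] = 0x01  # assumed corruption: 0x05 read back as 0x01
dumps = {"unit_a": bytes(bad), "unit_b": bytes(bad), "unit_c": reference}

counts = common_failures(reference, dumps)
for (offset, expected, actual), n in counts.most_common():
    print(f"offset 0x{offset:04X}: expected 0x{expected:02X}, "
          f"read 0x{actual:02X} in {n} unit(s)")
```

A mismatch that shows up with the same offset and value in every failed unit points away from random bit rot and toward something systematic.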

This system has been out in the field, and this problem has only been noticed since a different PCB manufacturer was used, so the manufacturing process has changed; however, the layout has remained unchanged, and the boards have been examined to see if there has been any noticeable change without finding anything.

The limiting operating temperature spec on the internal components is -40 C, so the unit is being used outside of its limits, but this still does not explain why this didn't happen before the PCBs were changed, or why this specific, nonrandom failure occurs.

Has anyone had an experience like this or have any helpful ideas what could be the problem?
 
Is this flash embedded in a micro, or is it separate, e.g. a CF card?
We have had issues with CF cards changing timing from batch to batch.

The same location is probably not random - it may be the first cell to require power on all address lines, or some other 'analog' issue. It's quite common for low-power or high-speed parts (especially ADCs) to fail in marginal conditions when they need to drive all output pins at the same time.

I don't know if NAND flash is any better temperature-wise.
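One way to test that idea on paper is to look at how many address lines are high, and how many toggle at once, around the failing location. This is only a rough sketch: the address window below is hypothetical (the thread never gives the real failing address), and it assumes simple sequential accesses.

```python
def lines_high(addr: int) -> int:
    """Number of address lines driven high for this address."""
    return bin(addr).count("1")

def lines_toggling(prev_addr: int, addr: int) -> int:
    """Number of address lines that change state between two accesses."""
    return bin(prev_addr ^ addr).count("1")

def busiest_transitions(start: int, end: int, top: int = 3):
    """Rank sequential accesses in [start, end) by how many lines toggle."""
    ranked = sorted(
        ((lines_toggling(a - 1, a), a) for a in range(start + 1, end)),
        reverse=True,
    )
    return ranked[:top]

# Hypothetical window inside the boot image.
worst = busiest_transitions(0x0FF00, 0x10100)
for toggles, addr in worst:
    print(f"0x{addr - 1:05X} -> 0x{addr:05X}: {toggles} lines toggle, "
          f"{lines_high(addr)} high afterwards")
```

If the real failing address sits on a transition like this, simultaneous switching becomes a more plausible suspect.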
 
The NOR Flash is an external BGA IC on the memory bus.
 
Is the system writing to the flash at that temperature? What is the minimum operating temperature for the flash (you say -40C above, but are you sure?), and what is the minimum temp for writes (also -40C?).

One possibility is that the RAM accesses are violating timing at -45C, and that is causing the processor to go nuts enough for it to do a write to the flash to corrupt it. I agree it's weird that it seems to go nuts in the same way each time, but that could be pattern dependent (I've seen that in some marginal memory timing cases).
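A quick check that can help here: NOR flash programming can only change bits from 1 to 0, and getting a 0 back to 1 takes an erase. So comparing the expected and actual values at the bad location hints at whether a stray program command is even a possible explanation. A minimal sketch, with made-up example values:

```python
def classify_corruption(expected: int, actual: int) -> str:
    """Classify a corrupted flash word by the direction of its bit flips."""
    cleared = expected & ~actual  # bits that went 1 -> 0
    raised = ~expected & actual   # bits that went 0 -> 1
    if not cleared and not raised:
        return "unchanged"
    if raised:
        # A program operation alone cannot raise bits in NOR flash.
        return "not-program-only"
    return "program-like"

# Hypothetical values; the thread does not give the real ones.
for expected, actual in [(0xFF, 0xF7), (0x0F, 0x1F), (0xA5, 0xA5)]:
    print(f"0x{expected:02X} -> 0x{actual:02X}: "
          f"{classify_corruption(expected, actual)}")
```

"program-like" fits an unintended write. "not-program-only" suggests an erase was involved, or that the read itself is wrong.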

I'd be inclined to instrument a system so that you can watch all memory bus activity on a logic analyzer, while the test system is in an oven. Work with different trigger patterns to try to capture the bad write to flash, and then back up from there to see where execution is getting corrupted.

You could also instrument a board with high-speed 'scope connectors, so that you can watch the external memory timing change with temperature, to look for any obvious bad trends in the timing margins.
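Once you have a few margin measurements at different temperatures, a simple straight-line fit gives a rough idea of where the margin runs out. This is a sketch only: the numbers are invented, and real timing margins need not be linear in temperature.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def zero_margin_temperature(temps_c, margins_ns):
    """Temperature at which the fitted margin reaches zero, or None if flat."""
    slope, intercept = fit_line(temps_c, margins_ns)
    if slope == 0:
        return None
    return -intercept / slope

# Hypothetical scope readings: setup margin (ns) at each chamber temperature.
temps_c = [25, 0, -25, -40]
margins_ns = [3.0, 2.2, 1.4, 0.92]

slope, intercept = fit_line(temps_c, margins_ns)
print(f"margin changes by {slope:.3f} ns per degree C")
print(f"fitted margin reaches zero near {zero_margin_temperature(temps_c, margins_ns):.1f} C")
```

Doing the same fit for an old-fab board and a new-fab board would show whether the new boards simply have less margin to begin with.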

BTW, you said the change seemed to happen when you switched over your PCB fab house. Could it maybe be instead that you started getting a different date code of the flash or some other memory bus component instead? What is your memory decode logic and controller like?
 
Couple other thoughts... Be sure to re-check all the parts on the board, paying particular attention to any that might have been changed on the bill of materials (BOM) or approved vendor list (AVL). Usually such changes will have to be approved by the responsible engineer (RE), but sometimes manufacturing/production will take it upon themselves to make a change if they are having supply chain problems. I've had a case where a 3.3V CPLD was substituted for a 5V CPLD by mistake, and I wasn't on the approval list. The problems this caused were subtle enough to go unnoticed for a few weeks of production, but thankfully surfaced soon enough that we were able to recall the units that were shipped, and ship corrected units.
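If you can get the as-built BOMs for an old unit and a new unit, even a trivial diff will flag silent substitutions. A minimal sketch, assuming each BOM has been reduced to a mapping from reference designator to part number (the designators and part numbers below are invented):

```python
def bom_diff(old: dict, new: dict) -> dict:
    """Return {refdes: (old_part, new_part)} for every entry that differs."""
    changes = {}
    for ref in sorted(set(old) | set(new)):
        if old.get(ref) != new.get(ref):
            changes[ref] = (old.get(ref), new.get(ref))
    return changes

# Hypothetical as-built BOMs keyed by reference designator.
old_bom = {"U1": "MCU-A", "U2": "NOR-64M-70NS", "U3": "CPLD-5V"}
new_bom = {"U1": "MCU-A", "U2": "NOR-64M-70NS", "U3": "CPLD-3V3", "C99": "CAP-100N"}

for ref, (was, now) in bom_diff(old_bom, new_bom).items():
    print(f"{ref}: {was} -> {now}")
```

Anything this flags still needs the datasheet comparison by hand; the script only tells you where to look.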

Even when a part substitution seems okay on the surface (like using a faster speed grade of RAM), you really have to spend the time going item by item in the datasheets, to see if some other obscure spec may have changed. I've seen several instances where going to the next faster speed grade of RAM has also changed a different access time spec in an unexpected direction, or changed the drive strength of some signal.

On the PCB, ask to check the films of the new PCB, and compare those to the old films. I'm talking about the actual film layers used in the PCB fabrication, not just some check films that they might have sent you. I have seen SO many instances of changes on the final fab films that were not in the PCB layout review package (*.PCB file or *.PDF plots of the layers, etc.), so I almost always go to the fab house and ask to see the final films before they shoot the first boards. Especially if it is a new fab house for us, or if it is a fab run that is critical for a schedule, then I need to see the final films.

There are a couple reasons for this. First, fab houses will often have design rules for their PCBs that will incline them to "touch up" the final film layers, as compared to the Gerber files that we sent them. I've had instances where they didn't recognize the spark gap structures that we have on some PCBs for ESD protection (little exposed metal points in the layout on the component side), and they've either deleted them or reshaped them ("what are those points for, must be a mistake, I'll fix it for them"), or covered them with soldermask. Bad news.

I've also had one instance where a design rule check in the *.PCB file had thrown a warning and had drawn a little arrow to the offending via on an inner layer, and the layout person had forgotten to suppress markers like that in the final Gerber output files. So when I reviewed the fab films right before shooting some PCBs on a very tight schedule project, I noticed a line on an inner layer that didn't look correct, and in fact would have shorted out some traces if left in there. Bad news.

Finally, one of the most vexing problems came on a PCB where I hadn't checked the final films (the change from the previous PCB was very small). It turned out that because of a build script error by our PCB layout person (who is normally very good), a single ground via to our microcontroller from the inner ground plane was deleted. The uC still had 5 of its 6 ground pins connected correctly to the ground plane, but that one ground connection being open caused very subtle and pattern-dependent memory bus errors (which varied with temperature, etc.). The only way I found the error finally was holding a blank PCB up to the light, and noticing in comparing the old versus new PCB, that there was a via missing. Ouch!
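Holding boards up to the light works, but the same check can be roughed out in software if you have drill data for both builds. This sketch assumes the drill hits have already been extracted as (x, y) coordinates in mm by some other tool; the coordinates and the tolerance are made up.

```python
def missing_hits(old_hits, new_hits, tol_mm=0.01):
    """Return drill hits present in the old board but absent from the new one."""
    def near(p, q):
        return abs(p[0] - q[0]) <= tol_mm and abs(p[1] - q[1]) <= tol_mm
    return [p for p in old_hits if not any(near(p, q) for q in new_hits)]

# Hypothetical via coordinates (mm) around a microcontroller footprint.
old_hits = [(10.0, 5.0), (12.5, 5.0), (12.5, 7.5)]
new_hits = [(10.0, 5.0), (12.5, 7.5)]

for x, y in missing_hits(old_hits, new_hits):
    print(f"via at ({x:.2f}, {y:.2f}) mm is missing from the new board")
```

Swapping the arguments lists anything that was added rather than removed.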

So double-check all the components, and check the new versus the old PCB in great detail. Hope all that helps. Good luck, and let us know what you find.
 