Do soft errors still occur in today's machines?

  • Thread starter: Muhammad Usman
  • Tags: Errors, Machines
SUMMARY

Soft errors, primarily caused by alpha particles and thermal issues, have significantly decreased in modern machines, particularly in DRAM manufactured after 2010. While soft errors were more prevalent in older DRAM technologies, contemporary designs incorporate error correction codes (ECC) and parity checks to mitigate these issues. In terrestrial applications, soft error rates are negligible, but they remain a concern in space applications due to higher radiation exposure. The majority of system crashes are attributed to software bugs rather than soft errors, emphasizing the importance of thorough testing and validation in critical applications.

PREREQUISITES
  • Understanding of soft errors and their causes
  • Familiarity with DRAM technology and error correction methods
  • Knowledge of thermal management in computing systems
  • Awareness of the differences between terrestrial and space radiation environments
NEXT STEPS
  • Research error correction techniques in DRAM, focusing on ECC and parity-checking
  • Explore the impact of thermal issues on computer reliability and performance
  • Investigate the effects of radiation on electronic components in space applications
  • Learn about testing methodologies for critical applications, including Triple Modular Redundancy (TMR)
USEFUL FOR

Engineers, hardware designers, and system architects focused on improving the reliability of computing systems, especially in critical and space applications.

Muhammad Usman
TL;DR
Do soft errors still occur in today's machines, especially laptops and routers, and are they the most probable cause of problems that a reboot fixes?
Hi,

I have been reading about soft errors. I read that they are basically caused by alpha particles and sometimes by thermal issues (too much heat in the machine). I have found a lot of mitigation techniques, but I am curious whether these errors still exist and whether they are the leading cause of issues that are fixed by a reboot. Thanks
 
Most things that are fixed with reboots are software bugs. That is why I banned reset buttons from motherboards that were to be used in critical applications (find the error - do not act as if nothing happened!).
 
The majority of crashes are due to software bugs. Very few crashes are due to soft errors.
If it is important to be free of soft errors, then you must run the program twice and check that the outputs are the same.
Soft errors due to contaminated packaging have been greatly reduced over the last 40 years.
https://en.wikipedia.org/wiki/Soft_error#Designing_around_soft_errors
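
A minimal sketch of the run-twice check, in Python. The names `run_twice_and_compare` and `SoftErrorSuspected` are made up for illustration. It assumes the computation is deterministic, and it cannot catch a bit flip in the input data itself, since both runs would then read the same corrupted value.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SoftErrorSuspected(RuntimeError):
    """Raised when two runs of the same computation disagree (hypothetical name)."""


def run_twice_and_compare(compute: Callable[[], T]) -> T:
    """Run a deterministic computation twice and return the result only if both runs agree."""
    first = compute()
    second = compute()
    if first != second:
        # A mismatch only tells us something went wrong, not which run is right.
        raise SoftErrorSuspected(f"results differ: {first!r} != {second!r}")
    return first


if __name__ == "__main__":
    total = run_twice_and_compare(lambda: sum(range(1000)))
    print("agreed result:", total)
```

Two runs can detect a disagreement but cannot decide which result is correct, so the only safe reaction is to retry or stop.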
 
Baluncore said:
If it is important to be soft error free, then you must run the program twice and check the outputs are the same.
In critical applications (where delay is not acceptable) it is not twice but three times in parallel, and at least two matching results are required.
This is triple modular redundancy (TMR).
I guess this counts as an existing mitigation technique for soft errors.

In everyday computer usage the end result is mostly the same for software bugs and soft errors: for a single event you will never know what happened.
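
The two-out-of-three rule can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and a real TMR system runs three hardware copies in parallel with a hardware voter rather than voting in software.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value that at least two of the three results agree on."""
    if len(results) != 3:
        raise ValueError("TMR voting expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two results agree; the fault cannot be masked")
    return value


def bitwise_vote(a: int, b: int, c: int) -> int:
    """Vote bit by bit: each output bit is the majority of the three input bits."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    # The third copy has one flipped bit; the voter masks it.
    print(bin(bitwise_vote(0b1010, 0b1010, 0b1110)))
    print(majority_vote([42, 42, 17]))
```

Unlike the run-twice check, three copies let the system keep going after a single fault, because the two agreeing copies outvote the faulty one.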
 
Muhammad Usman said:
Summary: Do soft errors still occur in today's machines, especially laptops and routers, and are they the most probable cause of problems that a reboot fixes?

Hi,

I have been reading about soft errors. I read that they are basically caused by alpha particles and sometimes by thermal issues (too much heat in the machine). I have found a lot of mitigation techniques, but I am curious whether these errors still exist and whether they are the leading cause of issues that are fixed by a reboot. Thanks
Soft errors occur mostly in DRAM; other parts of a computer have built-in rejection of low-energy events and should not suffer soft errors if designed properly.

In terrestrial applications, soft error rates are negligible for most DRAM chips manufactured after 2010. Before 2010 the rate was higher because some DRAM makers (I cannot say who because of restrictions in my work contract) used a fault-prone variant of deep trench technology. The faulty technology was common enough to make ECC and parity-checked DRAM modules popular.
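
To illustrate how an error-correcting code repairs a single flipped bit, here is a small Hamming(7,4) sketch in Python. It is a teaching example only: ECC memory modules typically protect 64 data bits with 8 check bits, not 4 with 3, and the function names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def hamming74_encode(data: Sequence[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: Sequence[int]) -> Tuple[List[int], int]:
    """Correct at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[5] ^= 1  # simulate a soft error: flip the bit at position 6
    data, position = hamming74_decode(word)
    print("recovered data:", data, "corrected position:", position)
```

A plain parity bit, by contrast, can only report that an odd number of bits flipped; it cannot say which one, so it detects the error without correcting it.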

In space applications radiation doses are much higher (0.3-10 Sieverts/year) than on the ground (~0.002 Sieverts/year), so soft errors in the form of bit flips and single-event upsets (SEUs) are still common in satellites, despite less sensitive modern DRAM.
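
As a rough sense of scale, the dose figures quoted above can be compared directly. The short Python sketch below only computes the ratio of the quoted doses; treating that ratio as an indicator of relative soft error rate is an assumption, since the actual upset rate depends on particle type, energy, and shielding.

```python
# Dose figures as quoted in the post, in Sieverts per year.
SPACE_DOSE_SV_PER_YEAR = (0.3, 10.0)
GROUND_DOSE_SV_PER_YEAR = 0.002


def dose_ratio(space_dose: float, ground_dose: float = GROUND_DOSE_SV_PER_YEAR) -> float:
    """Return how many times larger the space dose is than the ground-level dose."""
    return space_dose / ground_dose


if __name__ == "__main__":
    low, high = (dose_ratio(dose) for dose in SPACE_DOSE_SV_PER_YEAR)
    print(f"Space dose is roughly {low:.0f}x to {high:.0f}x the ground-level dose")
```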
 