Do soft errors still occur in today's machines?

  • Thread starter: Muhammad Usman
  • Tags: Errors, Machines

Discussion Overview

The discussion revolves around the occurrence of soft errors in modern computing machines, particularly focusing on their causes, mitigation techniques, and relevance in various applications such as laptops and routers. Participants explore the implications of soft errors in both terrestrial and space environments.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • Some participants note that soft errors are caused by alpha particles and thermal issues, questioning whether these errors still exist and if they are a leading cause of problems resolved by rebooting.
  • Others argue that most issues fixed by reboots are due to software bugs rather than soft errors, suggesting a ban on reset buttons in critical applications to address underlying issues.
  • It is mentioned that soft errors have been significantly reduced over the past 40 years, particularly in DRAM, and that running programs multiple times can help ensure error-free results.
  • One participant highlights that in critical applications, running programs in parallel multiple times is a common mitigation technique against soft errors.
  • Another participant states that soft errors are mostly found in DRAM and that modern designs have built-in protections against low-energy events, making soft errors negligible in most DRAM chips manufactured after 2010.
  • In contrast, it is noted that soft error rates remain a concern in space applications due to higher radiation doses, leading to common occurrences of bit-flips and single event upsets (SEUs) in satellites.

Areas of Agreement / Disagreement

Participants express differing views on the prevalence and impact of soft errors in modern machines, with some asserting that they are largely mitigated while others emphasize their ongoing relevance, particularly in specific contexts like space applications. The discussion remains unresolved regarding the extent to which soft errors contribute to reboot-related issues.

Contextual Notes

Participants mention limitations in knowledge regarding specific manufacturers of DRAM and the historical context of soft error rates, as well as the dependence on application environments (terrestrial vs. space) for the occurrence of soft errors.

Muhammad Usman
TL;DR
Do soft errors still occur in today's machines, especially laptops and routers, and are they the most probable cause of the problems that a reboot fixes?
Hi,

I was reading about soft errors. I read that they are caused mainly by alpha particles and sometimes by thermal issues (too much heat in the machine). I have found a lot of mitigation techniques, but I am curious whether these errors still exist and whether they are the leading cause of issues that are fixed by a reboot. Thanks
 
Most things that are fixed with reboots are software bugs. That is why I banned reset buttons from motherboards that were to be used in critical applications (find the error - do not act as if nothing happened!).
 
The majority of crashes are due to software bugs. Very few crashes are due to soft errors.
If it is important to be soft error free, then you must run the program twice and check the outputs are the same.
Soft errors due to contaminated packaging have been greatly reduced over the last 40 years.
https://en.wikipedia.org/wiki/Soft_error#Designing_around_soft_errors
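The run-twice-and-compare idea above can be sketched in a few lines of Python. This is only an illustration: the names `run_twice_and_compare` and `MismatchError` are made up for this example, and the sketch assumes the computation is deterministic and that its result supports `!=`. Note that duplication only detects a disagreement; it cannot tell which of the two runs was the correct one.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class MismatchError(RuntimeError):
    """Raised when two runs of the same computation disagree (hypothetical name)."""


def run_twice_and_compare(compute: Callable[[], T]) -> T:
    """Run a deterministic computation twice and return the result only if both runs agree."""
    first = compute()
    second = compute()
    if first != second:
        # Two results cannot tell us which run was corrupted, only that one was.
        raise MismatchError(f"results differ: {first!r} != {second!r}")
    return first
```

For example, `run_twice_and_compare(lambda: sum(range(1000)))` returns `499500` when both runs agree. If both runs read the same already-corrupted input, they will agree on a wrong answer, so this checks the computation and not the stored data.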
 
Baluncore said:
If it is important to be soft error free, then you must run the program twice and check the outputs are the same.
In critical applications (where delay is not acceptable) it's not twice but three times in parallel, and two matching results are required.
TMR (triple modular redundancy)
I guess this counts as an existing mitigation technique for soft errors.

In everyday computer usage the end result is mostly the same for software bugs and soft errors: for a single event you will never know what happened.
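A minimal sketch of the two-out-of-three voting described above, in Python. The function names `majority_vote` and `run_tmr` are hypothetical, and the replicas are called one after another here purely for simplicity; a real TMR system runs them in parallel on separate hardware.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value that more than half of the results agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError(f"no majority among results: {list(results)!r}")
    return value


def run_tmr(replicas: Sequence[Callable[[], Hashable]]) -> Hashable:
    """Run each replica once (sequentially in this sketch) and vote on the outputs."""
    return majority_vote([replica() for replica in replicas])
```

With three replicas, `majority_vote([7, 7, 9])` returns `7`, so a single corrupted result is outvoted without a rerun. Three different results raise an error, because no two agree.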
 
Soft errors mostly occur in DRAM; other parts of a computer have built-in rejection of low-energy events and should not have soft errors if designed properly.

In terrestrial applications, soft error rates are negligible for most DRAM chips manufactured after 2010. Before 2010 they were higher because some DRAM makers (I cannot say who due to my work contract restrictions) used a fault-prone variant of deep trench technology. The faulty technology was common enough to make ECC and parity-checked DRAM modules popular.
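To illustrate how an error-correcting code can repair a single flipped bit, here is a toy Hamming(7,4) sketch in Python. It is not what any particular ECC module implements; as far as I know, real ECC DIMMs typically use wider SECDED codes over 64-bit words. The function names are hypothetical and the bit layout is the textbook one (parity bits at positions 1, 2 and 4).

```python
from typing import Tuple


def hamming74_encode(data: int) -> int:
    """Encode 4 data bits into a 7-bit codeword; bit (pos - 1) holds position pos."""
    bits = [0] * 8  # index 0 unused, positions 1..7
    bits[3] = data & 1
    bits[5] = (data >> 1) & 1
    bits[6] = (data >> 2) & 1
    bits[7] = (data >> 3) & 1
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    return sum(bits[pos] << (pos - 1) for pos in range(1, 8))


def hamming74_decode(codeword: int) -> Tuple[int, int]:
    """Return (data, syndrome); a non-zero syndrome is the position that was corrected."""
    bits = [0] + [(codeword >> (pos - 1)) & 1 for pos in range(1, 8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    if syndrome:
        bits[syndrome] ^= 1  # flip the single bad bit back
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, syndrome
```

Flipping any one bit of `hamming74_encode(d)` and decoding gives back `d` together with the position of the flipped bit. Two flipped bits in the same codeword would be miscorrected by this toy code, which is why practical ECC adds an extra parity bit to detect double errors.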

In space applications, radiation doses are much higher (0.3-10 sieverts/year) than on the ground (~0.002 sieverts/year), so soft errors in the form of bit flips and SEUs are still common in satellites, despite less sensitive modern DRAM.
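As a quick arithmetic check on the figures quoted above, this Python snippet computes how many times larger the space dose range is than the terrestrial one. The constant and function names are made up for the example. It compares doses only; the sievert is a biological dose unit, and the actual upset rate of a chip also depends on the particle types, energies, and the device itself, so this ratio should not be read as a ratio of error rates.

```python
# Figures as quoted in the post above (sieverts per year).
TERRESTRIAL_SV_PER_YEAR = 0.002
SPACE_SV_PER_YEAR = (0.3, 10.0)  # low and high end of the quoted range


def dose_ratio(space_dose: float, ground_dose: float = TERRESTRIAL_SV_PER_YEAR) -> float:
    """Ratio of a space dose to the terrestrial dose."""
    return space_dose / ground_dose


low, high = (dose_ratio(dose) for dose in SPACE_SV_PER_YEAR)
print(f"{low:.0f}x to {high:.0f}x")  # prints "150x to 5000x"
```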
 