Help with failure statistics

In summary: there is nothing wrong with estimating a specific distribution as long as you are willing to accept some uncertainty around the result. If you bought a lot of sensors and expect many of them to break, you can calculate the joint probability of having that many sensors broken at a given time. If the number of broken sensors differs significantly from the number you expected, the manufacturer may not be living up to their end of the deal.
  • #1
Strum
(Sorry for the terrible title. If anybody has a better idea, post and I will edit. Also, I had no idea of the level, so I put undergraduate since the problem is fairly easy to state.)
Suppose I buy ##N## sensors which the manufacturer tells me will fail at some point, with the failure distribution given by a normal distribution, ## p_{f}(t) = N[\mu_{f},\sigma_{f}^{2}](t) ##. Now, after a time ## t<\mu_{f} ##, I have ## n = |\mathbf{n}| \ll N ## failed sensors, where ## \mathbf{n} ## is an ordered vector with timestamps of when the sensors broke. I would like to quantify whether my manufacturer is living up to his end of the deal. I have made a few attempts, but I would very much like to know if there is some canonical way to do this.

Attempt 1: Calculate the joint probability of having the ## n ## sensors broken at ## t ## in the correct order. This I suppose would be given by
\begin{equation}
P_{\text{joint}} = \prod_{i=1}^{n} \text{cdf}[\mu_{f},\sigma_{f}^{2}](n_{i})
\end{equation}
where ## \text{cdf}(t) ## is the cumulative distribution function of ## p_{f}(t) ##.
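To make this concrete, here is a minimal sketch of evaluating that product (in log space for numerical stability); the claim parameters and failure timestamps are made-up numbers:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical manufacturer claim and observed failure timestamps (made up).
mu_f, sigma_f = 2000.0, 300.0                      # hours
failure_times = np.array([900.0, 1100.0, 1250.0])  # the vector n

# Attempt 1: product of CDF values at each observed failure time,
# accumulated as a sum of logs to avoid underflow for large n.
log_p_joint = norm.logcdf(failure_times, loc=mu_f, scale=sigma_f).sum()
print("P_joint =", np.exp(log_p_joint))
```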

Attempt 2: Make a new random variable given by the sum of the ## n ## failure times, which will follow ## p_{tot}(t) = N[n\mu_{f},n\sigma_{f}^{2}](t) ##, and calculate the probability
\begin{equation}
P_{tot} = p_{tot}(t<\max(\mathbf{n}))
\end{equation}
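A sketch of this attempt as written, reusing the hypothetical numbers above:

```python
import numpy as np
from scipy.stats import norm

mu_f, sigma_f = 2000.0, 300.0                      # hypothetical claim
failure_times = np.array([900.0, 1100.0, 1250.0])  # the vector n
n = len(failure_times)

# The sum of n i.i.d. N(mu_f, sigma_f^2) lifetimes is N(n*mu_f, n*sigma_f^2);
# evaluate its CDF at max(n), exactly as written in the attempt.
p_tot = norm.cdf(failure_times.max(), loc=n * mu_f, scale=np.sqrt(n) * sigma_f)
print("P_tot =", p_tot)
# Evaluating instead at failure_times.sum() would compare the summed
# failure times against their claimed distribution more directly.
```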

Attempt 3:
Make a one-sample Kolmogorov-Smirnov test using ## \mathbf{n} ## and a cut (truncated) normal distribution, ## p_{cut}(t) = N[\mu_{f},\sigma_{f}^{2}](t) / L ## for ## t<\max(\mathbf{n}) ##, where ## L ## is a normalisation constant, and then estimate the significance.
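A sketch of this with scipy, again with hypothetical numbers; `scipy.stats.truncnorm` handles the normalisation constant ## L ## internally:

```python
import numpy as np
from scipy.stats import truncnorm, kstest

mu_f, sigma_f = 2000.0, 300.0                      # hypothetical claim
failure_times = np.array([900.0, 1100.0, 1250.0])  # the vector n
cutoff = failure_times.max()                       # observation cutoff

# Normal truncated to t < cutoff; truncnorm takes standardized bounds.
b = (cutoff - mu_f) / sigma_f
p_cut = truncnorm(a=-np.inf, b=b, loc=mu_f, scale=sigma_f)

stat, p_value = kstest(failure_times, p_cut.cdf)
print("KS statistic:", stat, "p-value:", p_value)
```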

Attempt 4: Make a two-sample Kolmogorov-Smirnov test using simulated data from ## p_{cut} ##, sort of like the answer given here: http://stats.stackexchange.com/questions/126539/testing-whether-data-follows-t-distribution, and then estimate the significance.
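And a sketch of the two-sample variant, simulating a large reference sample from the same hypothetical truncated normal:

```python
import numpy as np
from scipy.stats import truncnorm, ks_2samp

mu_f, sigma_f = 2000.0, 300.0                      # hypothetical claim
failure_times = np.array([900.0, 1100.0, 1250.0])  # the vector n
cutoff = failure_times.max()

b = (cutoff - mu_f) / sigma_f
p_cut = truncnorm(a=-np.inf, b=b, loc=mu_f, scale=sigma_f)

# Draw a large reference sample from p_cut and run the two-sample test.
simulated = p_cut.rvs(size=100_000, random_state=0)
stat, p_value = ks_2samp(failure_times, simulated)
print("two-sample KS statistic:", stat, "p-value:", p_value)
```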

I am not sure which method is best and what the advantages and disadvantages are for each. I also need some help quantifying the uncertainty of my final answer. I understand how to calculate the uncertainty of the mean and variance of a normally distributed sample, but I do not know how to do it for this sample.
I feel this should be a super simple exercise but I just cannot seem to get a real hold on it.
 
  • #2
t is the same for all sensors?

I don't see how much you would learn from attempt 1.

Attempt 2 is interesting if you are just interested in the total number of failures so far. The number of failed sensors follows a binomial distribution (which can be approximated as a Poisson distribution here). To avoid bias, I would use the present time, not the time of the last failure.
If the difference between expected and observed failures is significant, you know the sensors are more or less reliable than predicted.
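A sketch of that count check using the Poisson approximation (all numbers hypothetical; ##T## is the present time, as suggested):

```python
from scipy.stats import norm, poisson

N = 500                        # hypothetical number of sensors bought
n_failed = 3                   # hypothetical failures seen so far
T = 1300.0                     # present time in hours, not the last failure
mu_f, sigma_f = 2000.0, 300.0  # claimed failure-time distribution

# Expected number of failures by the present time T under the claim.
expected = N * norm.cdf(T, loc=mu_f, scale=sigma_f)

# Probability of at least n_failed failures (Poisson approximation).
print("expected:", expected)
print("P(>= n_failed):", poisson.sf(n_failed - 1, expected))
```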

Attempt 3 is interesting if you try to track the actual distribution over time. If there is a significant difference, you know it doesn't follow a Gaussian distribution, but you don't necessarily know what is wrong.

As far as I understand attempt 4, it is for cases where you don't have an expectation about the true parameters.
 
  • #3
Strum said:
(Sorry for the terrible title. If anybody has a better idea, post and I will edit. Also, I had no idea of the level, so I put undergraduate since the problem is fairly easy to state.)
Suppose I buy ##N## sensors which the manufacturer tells me will fail at some point, with the failure distribution given by a normal distribution, ## p_{f}(t) = N[\mu_{f},\sigma_{f}^{2}](t) ##. Now, after a time ## t<\mu_{f} ##, I have ## n = |\mathbf{n}| \ll N ## failed sensors, where ## \mathbf{n} ##...

It really depends on what you want out of this. My suspicion is you're posting this because you think you got a very bad deal... if that's the case, then you're probably interested in tail outcomes, which indicates you should be able to use the mean and Chernoff bounds to get a real simple cut at the likelihood function. (If for some reason you don't think i.i.d. failures are appropriate, then maybe Chebyshev -- but non-i.i.d. failures are something I would have expected you to mention. The i.i.d. assumption naturally leads to the normal you mentioned, though strictly speaking other assumptions can as well.)
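To spell out the bound under the claimed normal model (for other lifetime distributions you would optimize over the moment generating function instead): for the sum ## S ## of ## n ## i.i.d. ## N[\mu_{f},\sigma_{f}^{2}] ## lifetimes and any ## x>0 ##,
\begin{equation}
P(S \leq n\mu_{f} - x) \leq \min_{t>0} e^{-tx}\, E\left[e^{-t(S-n\mu_{f})}\right] = \exp\left(-\frac{x^{2}}{2n\sigma_{f}^{2}}\right)
\end{equation}
so an observed sum of failure times far below ## n\mu_{f} ## can be flagged without tracking the full likelihood.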

From here it's a question of whether you want a classical or Bayesian take. I'm partial to the latter, but it becomes a bit of a fudge if you don't have prior information to incorporate. The big question is just how many sensors you have. If it's like 5 vs. 5000... it makes a big difference on many levels, including things like whether or not you have to worry about having a useful prior.

If you want to put a finer point on the distribution, I'm generally skeptical about assuming normality off the bat when you clearly have an asymmetric distribution -- i.e. the failures occur in the interval ##[0, \infty)##. This feels more like something that would be Poisson modeled (which mfb spotted), and Poissons can be approximated by a Gaussian, of course, with enough of them.

The nice thing about using bounds, though, is that as long as you have a well-defined mean, you don't need to get into the minutiae of what distribution is being used with respect to your likelihood function.

Summary: if each individual sensor has a finite expected life, and their lives are independent of one another, then you can estimate the actual expected life per sensor and compare against what you were told (##\frac{\mu_{f}}{N}##). That's the simple way I'd think about it.
 
  • #4
StoneTemplePython said:
If you want to put a finer point on the distribution, I'm generally skeptical about assuming normality off the bat when you clearly have an asymmetric distribution -- i.e. the failures occur in the interval ##[0, \infty)##. This feels more like something that would be Poisson modeled (which mfb spotted), and Poissons can be approximated by a Gaussian, of course, with enough of them.
My comment about Poisson was about the distribution of failures before time T.

A normal distribution for the failure probability looks odd, especially in the tails we are discussing here, but if that is what the manufacturer claims: so what. We can test it.
 
  • #5
Thank you both for your answers. A few comments.

mfb said:
t is the same for all sensors?

Attempt 2 is interesting if you are just interested in the total number of failures so far. The number of failed sensors follows a binomial distribution (which can be approximated as a Poisson distribution here). To avoid bias, I would use the present time, not the time of the last failure.
If the difference between expected and observed failures is significant, you know the sensors are more or less reliable than predicted.

Attempt 3 is interesting if you try to track the actual distribution over time. If there is a significant difference, you know it doesn't follow a Gaussian distribution, but you don't necessarily know what is wrong.
Yes, t is the same for all sensors; it is the time since installation. I do not understand why the distribution of failed sensors will be binomial. Do you mean the distribution of the number of sensors that have failed at a given time? Can you elaborate?
It is a really good point that I should use the present time instead of the last failure.

I do not understand your comment on attempt 3. Since I track the sensors for ## t<\mu_f ##, wouldn't I expect the distribution of when a sensor breaks to follow a Gaussian cut off at t = present time?

StoneTemplePython said:
It really depends on what you want out of this. My suspicion is you're posting this because you think you got a very bad deal... if that's the case, then you're probably interested in tail outcomes, which indicates you should be able to use the mean and Chernoff bounds to get a real simple cut at the likelihood function.

If you want to put a finer point on the distribution, I'm generally skeptical about assuming normality off the bat when you clearly have an asymmetric distribution -- i.e. the failures occur in the interval ##[0, \infty)##. This feels more like something that would be Poisson modeled (which mfb spotted), and Poissons can be approximated by a Gaussian, of course, with enough of them.

The nice thing about using bounds, though, is that as long as you have a well-defined mean, you don't need to get into the minutiae of what distribution is being used with respect to your likelihood function.

Summary: if each individual sensor has a finite expected life, and their lives are independent of one another, then you can estimate the actual expected life per sensor and compare against what you were told (##\frac{\mu_{f}}{N}##). That's the simple way I'd think about it.

From here it's a question of whether you want a classical or Bayesian take. I'm partial to the latter, but it becomes a bit of a fudge if you don't have prior information to incorporate. The big question is just how many sensors you have. If it's like 5 vs. 5000... it makes a big difference on many levels, including things like whether or not you have to worry about having a useful prior.

The Chernoff bound is really nice. I did not know about it and will definitely use it in the future. Also, the comment on the distribution makes a lot of sense. The supplier never really tells you the distribution, so I just picked one. Of course it should have been a Poisson (or something on ##[0, \infty)##) instead.

You seem to suggest that I can figure other things out (or maybe refine the bounds) using Bayesian statistics, but I do not quite follow. Do you have anything specific in mind?
 
  • #6
Strum said:
Yes, t is the same for all sensors; it is the time since installation. I do not understand why the distribution of failed sensors will be binomial. Do you mean the distribution of the number of sensors that have failed at a given time? Can you elaborate?
It is a really good point that I should use the present time instead of the last failure.
For a fixed present time T, every sensor has the same failure probability p. Assuming the manufacturer is right, you can calculate this p by integrating the Gaussian distribution. What is the probability that n sensors out of N failed, if every sensor had a failure probability p? There is the binomial distribution.
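As a minimal sketch of that recipe with the exact binomial (hypothetical numbers, matching the Poisson sketch in post #2):

```python
from scipy.stats import norm, binom

N = 500                        # hypothetical number of sensors bought
n_failed = 3                   # hypothetical failures observed by time T
T = 1300.0                     # present time in hours
mu_f, sigma_f = 2000.0, 300.0  # claimed failure-time distribution

# Step 1: per-sensor failure probability p by integrating the Gaussian.
p = norm.cdf(T, loc=mu_f, scale=sigma_f)

# Step 2: probability of at least n_failed failures out of N under the claim;
# a very small value suggests the sensors fail earlier than claimed.
print("p =", p)
print("P(>= n_failed failures) =", binom.sf(n_failed - 1, N, p))
```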

Strum said:
I do not understand your comment on attempt 3. Since I track the sensors for ## t<\mu_f ##, wouldn't I expect the distribution of when a sensor breaks to follow a Gaussian cut off at t = present time?
Sure. I never said anything against that.
My comment was for the case "what happens if the KS test fails?". The failure alone doesn't tell you what went wrong. It just tells you the Gaussian failure distribution is probably not right.
Strum said:
The supplier never really tells you the distribution, so I just picked one. Of course it should have been a Poisson (or something on ##[0, \infty)##) instead.
Wait, what did you pick and what did you get from the manufacturer?
 
  • #7
Sorry for the late reply. I have been traveling.

mfb said:
For a fixed present time T, every sensor has the same failure probability p. Assuming the manufacturer is right, you can calculate this p by integrating the Gaussian distribution. What is the probability that n sensors out of N failed, if every sensor had a failure probability p? There is the binomial distribution.
Ah, of course, I get it now. Thanks for the help!

mfb said:
Wait, what did you pick and what did you get from the manufacturer?

What they give is a number for how long they say the sensor should at least work (for example, 2000 hours). If the sensor company does not want too many RMAs, this number will be some low quantile of a (to me) unknown probability distribution. To get anywhere, it is necessary to assume how small a quantile it is, and to assume a distribution (including its parametrization if necessary). What I did was assume a Gaussian, but it should probably have been some distribution on ##[0,\infty)## instead. Since I tried to make the question fairly general, I did not think much about my choice (I just wanted something with an adjustable mean and width).
After you have helped me figure out a good way to quantify the failure rate of my sample, I would then change my assumed parametrization and see how the results vary.
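For example, the back-out step could look like this (a sketch; the 5% quantile choice, the 2000-hour spec, and both candidate distributions are assumptions):

```python
import numpy as np
from scipy.stats import norm

spec_hours = 2000.0  # the manufacturer's "should at least work" number
q = 0.05             # assumed quantile the spec corresponds to

# Candidate 1: Gaussian with an assumed width; solve spec = mu + sigma * z_q.
sigma_assumed = 300.0
mu_implied = spec_hours - sigma_assumed * norm.ppf(q)
print("implied Gaussian mean:", mu_implied)

# Candidate 2: exponential on [0, inf); solve 1 - exp(-spec/theta) = q.
theta_implied = -spec_hours / np.log1p(-q)
print("implied exponential mean:", theta_implied)
```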
 
  • #8
Strum said:
What they give is a number for how long they say the sensor should at least work (for example, 2000 hours). If the sensor company does not want too many RMAs, this number will be some low quantile of a (to me) unknown probability distribution.
This is completely different from what you said in post 1, and it leads to a completely different approach.

The failures will not follow a Gaussian distribution, especially not in the tails where you are. And the shape of the tail won't tell you where the peak is.
 
  • #9
I cannot see why this changes a whole lot. I just assumed some distribution of when a sensor will fail.
What sort of distribution will they follow, since it is not Gaussian (ignoring the fact that it extends to ##-\infty##)? You obviously know a lot more about this than me, so please enlighten me. I tried calculating the failure rate as a function of time for a few different distributions, and they all seemed qualitatively the same for ##t<\mu##.
I know the shape might not give me the peak, but by assuming a distribution I can find a peak within that assumption, which could be used to gauge whether the sensors are breaking way too early.
 
  • #10
Strum said:
I just assumed some distribution of when a sensor will fail.
And what if this assumption is wrong? Different assumptions will lead to completely different results.
Strum said:
What sort of distribution will they follow, since it is not Gaussian
No idea. It would need a dedicated failure analysis of the components or simply a large test sample running until a significant fraction fails.

If you want to try different distributions: Gaussian, Crystal Ball, logistic, Student-t; Gaussian plus a constant; sums of two of the previous options. Maybe even an exponential distribution to get a completely different result.
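A sketch of running that comparison with scipy's built-in fitters (the data here is a placeholder, the distribution list is abbreviated, and the KS p-values are only indicative when the parameters are fitted from the same sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
failure_times = rng.normal(2000.0, 300.0, size=50)  # placeholder data

# A few of the suggested candidates; scipy also provides stats.crystalball.
for dist in (stats.norm, stats.logistic, stats.t, stats.expon):
    params = dist.fit(failure_times)
    stat, p = stats.kstest(failure_times, dist.name, args=params)
    print(f"{dist.name:10s} KS={stat:.3f} p={p:.3f}")
```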

How large are ##N## and ##|\mathbf{n}|##?
 

