Collecting, Analysing, and Exploiting Failure Data from Large IT Installations

  • Thread starter: Ivan Seeking
  • Tags: Data, Failure

Discussion Overview

The discussion revolves around the challenges of component failure in large-scale IT installations, particularly the analysis of failure data from various production systems. Participants discuss failure rates, monitoring practices, and parallels with machinery component monitoring.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested

Main Points Raised

  • Some participants highlight the lack of publicly available failure data in IT, which forces reliance on anecdotal evidence and rough calculations.
  • One participant draws an analogy between IT failure monitoring and machinery component monitoring, expressing surprise at the lack of similar practices in IT.
  • Concerns are raised about the presentation style of the speaker, with mixed reactions regarding the effectiveness of the points made.
  • A participant shares experiences with Beowulf clusters, noting that the average lifespan of hard drives leads to frequent replacements in large setups, compounded by inadequate cooling practices.
  • Another participant mentions Google's approach to managing large clusters, suggesting that the cost of monitoring and fixing individual machines may not be justified.
  • It is noted that the failure rate of nodes can depend on the type of activity being performed, not just the level of activity.

Areas of Agreement / Disagreement

Participants express a range of opinions about the presentation and its content, with some finding value in the points raised while others criticize the delivery. There is no consensus on the effectiveness of the speaker or the applicability of the discussed concepts.

Contextual Notes

Participants reference specific experiences and practices in managing IT installations, indicating variability in approaches and outcomes based on different operational contexts.

Who May Find This Useful

This discussion may be of interest to IT professionals, researchers in computer science and engineering, and those involved in large-scale system management or failure analysis.

Ivan Seeking
Staff Emeritus, Science Advisor, Gold Member
ABSTRACT
Component failure in large-scale IT installations is becoming an ever-larger problem as the number of processors, memory chips, and disks in a single cluster approaches a million. Yet virtually no data on failures in real systems is publicly available, forcing researchers to base their work on anecdotes and back-of-the-envelope calculations. In this talk, we will present results from our analysis of failure data from 26 large-scale production systems at three different organizations, including two high-performance computing sites and one large internet service provider.
http://www.youtube.com/watch?v=p2FWMO2QonY&feature=dir
 
I can't watch YouTube here, but this sounds exactly analogous to machinery component monitoring. I'm really surprised that no one does this in the IT setting.
 
AAARGHH, who let this woman speak? It's like listening to someone drag their fingernails over a chalkboard. She's so nervous about speaking it's painful. I had to shut it off.

Ivan, does she end up making any points? You're a better person than I am if you could sit through almost an hour of this.
 
Yes, I thought she made a number of interesting points, but I'm not an IT person, and I have no idea how much might be common knowledge to a pro.
 
I didn't watch the video, but I used to run Beowulf clusters (lots of desktop computers wired together into one big computer).
The numbers do come back to bite you: if your hard drives have an average life of 3 years (about 150 weeks) and your cluster has 150 machines, you can expect to be replacing a disk every week.
In practice, because we were using cheap home machines, we also weren't cooling them properly (large AC is expensive), so we had even more failures than you would expect.
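As a rough sketch of that arithmetic, assuming drive failures are independent and occur at a constant rate (an exponential model, which real drives only approximate):

```python
import math

def expected_weekly_failures(num_drives: int, mean_life_weeks: float) -> float:
    """Expected drive failures per week if each drive fails
    independently at a constant rate of 1/mean_life_weeks."""
    return num_drives / mean_life_weeks

def prob_any_failure_in_week(num_drives: int, mean_life_weeks: float) -> float:
    """Chance of at least one failure in a given week, treating the
    weekly failure count as Poisson-distributed."""
    lam = expected_weekly_failures(num_drives, mean_life_weeks)
    return 1 - math.exp(-lam)

# 150 machines, drives averaging ~3 years (~150 weeks) of life:
print(expected_weekly_failures(150, 150))  # 1.0 replacement per week on average
print(prob_any_failure_in_week(150, 150))  # ~0.63, so most weeks see at least one failure
```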

Google used to claim that with their clusters of several thousand machines, it wasn't worth even finding a broken machine and turning it off, never mind trying to fix it.

For most real installations, you tend to use fewer, higher-powered, better-engineered servers, which are easier to monitor and manage, rather than thousands of PCs. In fact, an increasingly common technique is to use virtual machine software to run many independent machine instances on a single large server.
 
Evo said:
You're a better person than I am if you could sit through almost an hour of this.

She is a bit like Julia Child with a German accent.

After twenty years of marriage, I can handle anything! :biggrin: :rolleyes:

I thought that one of the more interesting points was that the node failure rate was dependent on the type of activity, and not just the level of activity at a node.
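As a minimal sketch of how one might test that claim (the log format, field names, and numbers below are hypothetical, not from the talk): group failure events by workload type and compare failures per unit of exposure rather than raw counts:

```python
from collections import defaultdict

# Hypothetical failure log: (node_id, workload_type, node_hours, failures).
# These records are made up purely to illustrate the comparison.
records = [
    ("n01", "cpu-bound",    8760, 1),
    ("n02", "cpu-bound",    8760, 2),
    ("n03", "io-bound",     8760, 5),
    ("n04", "io-bound",     8760, 4),
    ("n05", "memory-bound", 8760, 3),
]

# Aggregate exposure (node-hours) and failures for each workload type.
totals = defaultdict(lambda: [0, 0])
for _node, wtype, hours, fails in records:
    totals[wtype][0] += hours
    totals[wtype][1] += fails

# Normalizing by exposure separates "type of activity" from "amount of activity".
for wtype, (hours, fails) in sorted(totals.items()):
    print(f"{wtype:13s} {fails / hours * 8760:.2f} failures per node-year")
```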
 
