Collecting, Analysing, and Exploiting Failure Data from large IT installations

  • Thread starter: Ivan Seeking
  • Tags: Data, Failure
AI Thread Summary
Component failure in large-scale IT installations is a growing concern as systems expand to nearly a million components. There is a lack of publicly available data on these failures, leading researchers to rely on anecdotal evidence. An analysis of failure data from 26 large-scale production systems revealed insights into the frequency and types of failures experienced. The discussion highlighted that the failure rate of nodes is influenced not only by the level of activity but also by the type of tasks being performed. Additionally, using inexpensive, less robust machines can exacerbate failure rates due to inadequate cooling and maintenance. In contrast, many organizations are shifting towards using fewer, higher-quality servers or virtual machines to improve reliability and manageability.
Ivan Seeking
ABSTRACT
Component failure in large-scale IT installations is becoming an ever larger problem as the number of processors, memory chips, and disks in a single cluster approaches a million. Yet, virtually no data on failures in real systems is publicly available, forcing researchers to base their work on anecdotes and back-of-the-envelope calculations. In this talk, we will present results from our analysis of failure data from 26 large-scale production systems at three different organizations, including two high-performance computing sites and one large internet service provider.
http://www.youtube.com/watch?v=p2FWMO2QonY&feature=dir
 
I can't watch YouTube here, but this sounds exactly analogous to machinery component monitoring. I am really surprised that hardly anyone does that in an IT setting.
 
AAARGHH, who let this woman speak? It's like listening to someone drag their fingernails over a chalkboard. She's so nervous about speaking it's painful. I had to shut it off.

Ivan, does she end up making any points? You're a better person than I am if you could sit through almost an hour of this.
 
Yes, I thought she made a number of interesting points, but I'm not an IT person, and I have no idea how much might be common knowledge to a pro.
 
I didn't watch the video, but I used to run Beowulf clusters (lots of desktop computers wired together into one big computer).
Numbers do come back to bite you: if you have a hard drive with an average life of 3 years (150 weeks) and a cluster of 150 machines, you can expect to be replacing a disk every week.
In practice, because we were using cheap home machines, we were also not cooling them properly (large AC is expensive), so we had even more failures than you would expect.
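
The disk-replacement arithmetic above can be sketched in a few lines of Python. This is only a rough illustration, assuming every disk fails independently at a constant rate (an exponential lifetime model, which the post does not state) and taking the rounded figure of 150 weeks for 3 years (closer to 156). The function names are made up for the example.

```python
import math


def expected_failures_per_week(n_disks: int, mean_life_weeks: float) -> float:
    """Expected number of disk failures per week across the whole cluster.

    Assumes each disk fails at a constant rate of 1 / mean_life_weeks,
    so the cluster-wide rate is simply the sum over all disks.
    """
    return n_disks / mean_life_weeks


def prob_at_least_one_failure(n_disks: int, mean_life_weeks: float,
                              weeks: float = 1.0) -> float:
    """Probability of seeing one or more failures within `weeks`.

    Under the constant-rate assumption, failures form a Poisson process,
    so P(no failure) = exp(-rate * weeks).
    """
    rate = expected_failures_per_week(n_disks, mean_life_weeks)
    return 1.0 - math.exp(-rate * weeks)


if __name__ == "__main__":
    # Figures from the post: 150 machines, disks lasting about 150 weeks.
    per_week = expected_failures_per_week(150, 150)
    p_week = prob_at_least_one_failure(150, 150)
    print(f"Expected failures per week: {per_week:.2f}")
    print(f"Chance of at least one failure in a given week: {p_week:.0%}")
```

With the post's numbers the expected rate comes out at one disk per week. Under the same assumption, roughly two weeks in three see at least one failure. Poor cooling would show up in this model as a shorter mean life, which pushes both figures up.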

Google used to claim that, with their clusters of several thousand machines, it wasn't worth even finding the broken machine and turning it off, never mind trying to fix it.

For most real installations you tend to use fewer, higher-powered, better-engineered servers, which are easier to monitor and manage, instead of thousands of PCs. In fact, an increasingly common technique is to use virtual machine software to run many independent copies of machines on a single large machine.
 
Evo said:
You're a better person than I am if you could sit through almost an hour of this.

She is a bit like Julia Child with a German accent.

After twenty years of marriage, I can handle anything! :biggrin: :rolleyes:

I thought that one of the more interesting points was that the node failure rate was dependent on the type of activity, and not just the level of activity at a node.
 