
Collecting, Analysing, and Exploiting Failure Data from large IT installations

  1. Jul 24, 2007 #1

    Ivan Seeking

    Staff Emeritus
    Science Advisor
    Gold Member

    http://www.youtube.com/watch?v=p2FWMO2QonY&feature=dir
     
  3. Jul 24, 2007 #2

    FredGarvin

    Science Advisor

    I can't watch YouTube here, but this sounds exactly analogous to machinery component monitoring. I am really surprised no one does that in the IT setting.
     
  4. Jul 24, 2007 #3

    Evo


    Staff: Mentor

    AAARGHH, who let this woman speak? It's like listening to someone drag their fingernails over a chalkboard. She's so nervous about speaking it's painful. I had to shut it off.

    Ivan, does she end up making any points? You're a better person than I am if you could sit through almost an hour of this.
     
  5. Jul 24, 2007 #4

    Ivan Seeking

    Staff Emeritus
    Science Advisor
    Gold Member

    Yes, I thought she made a number of interesting points, but I'm not an IT person, and I have no idea how much might be common knowledge to a pro.
     
  6. Jul 25, 2007 #5

    mgb_phys

    Science Advisor
    Homework Helper

    I didn't watch the video, but I used to run Beowulf clusters (lots of desktop computers wired together into one big computer).
    The numbers do come back to bite you: if a hard drive has an average life of 3 years (about 150 weeks) and you have a cluster of 150 machines, you can expect to be replacing a disk every week.
    In practice, because we were using cheap home machines, we were also not cooling them properly (large AC is expensive), so we had even more failures than you would expect.
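
    To make the disk arithmetic above concrete, here is a rough sketch in Python. It assumes one disk per machine and a constant, independent failure rate for every disk (an exponential lifetime model), which is a simplification and not something measured on a real cluster. The function names are made up for illustration.

```python
import math


def expected_failures_per_week(num_disks, mean_life_weeks):
    """Expected number of disk failures per week across the whole cluster.

    Assumes each disk fails independently at a constant rate of
    1 / mean_life_weeks failures per week.
    """
    return num_disks / mean_life_weeks


def prob_at_least_one_failure(num_disks, mean_life_weeks, weeks=1.0):
    """Probability of seeing at least one failure within `weeks`.

    Under the constant-rate assumption the failure count is Poisson,
    so P(no failures) = exp(-rate * weeks).
    """
    rate = expected_failures_per_week(num_disks, mean_life_weeks)
    return 1.0 - math.exp(-rate * weeks)


if __name__ == "__main__":
    # Figures from the post: 150 machines, ~150 weeks average disk life.
    print(expected_failures_per_week(150, 150))
    print(round(prob_at_least_one_failure(150, 150), 3))
```

    With those figures the expected rate works out to one failure per week, and under the same assumption there is roughly a 63% chance of at least one failure in any given week. Poor cooling would push the real rate higher than this simple model suggests.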

    Google used to claim that with their clusters of several thousand machines it wasn't even worth finding the broken machine and turning it off, never mind trying to fix it.

    For most real installations you tend to use fewer, higher-powered, better-engineered servers instead of thousands of PCs, because the servers are easier to monitor and manage. In fact, an increasingly common technique is to use virtual machine software to run many independent copies of machines on a single large machine.
     
  7. Jul 25, 2007 #6

    Ivan Seeking

    Staff Emeritus
    Science Advisor
    Gold Member

    She is a bit like Julia Child with a German accent.

    After twenty years of marriage, I can handle anything! :biggrin: :uhh:

    I thought that one of the more interesting points was that the node failure rate was dependent on the type of activity, and not just the level of activity at a node.
     
    Last edited: Jul 25, 2007