SMART reporting and Hard Disk buzzing sound

Vanadium 50 · Dec 29, 2016

I had a drive that every few minutes makes a buzzing sound. Here's what SMART is telling me.

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   071   071   016    Pre-fail  Always       -       61213753
  2 Throughput_Performance  0x0005   139   139   054    Pre-fail  Offline      -       71
  3 Spin_Up_Time            0x0007   160   160   024    Pre-fail  Always       -       399 (Average 317)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       87
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0

Amazingly, here's the SMART health report:

Code:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

How it thinks this drive is healthy with 61 million read errors (since Tuesday) is beyond me.

I'm doing a surface scan of its replacement now. I hope it finishes and is good before this one gives up the ghost. Anyone thing I am being too paranoid?

fresh_42 · Dec 29, 2016

Vanadium 50 said:

Anyone thing I am being too paranoid?

Not me. Who manufactured the drive?

Vanadium 50 · Dec 29, 2016

Toshiba. It's a DT01ACA300.

fresh_42 · Dec 29, 2016

Vanadium 50 said:

Toshiba. It's a DT01ACA300.

Thanks. Looks like an internal. They seem to be far less reliable than external ones (my experiences).

Vanadium 50 · Dec 29, 2016

It is an internal. 3.5", 3TB, SATA-3, 3 TB. It is the oldest of the four drives in the array, and has January 2014 on it. 24799 powered on hours.

Oddly, the Raw_Read_Error_Rate dropped to zero - but is creeping upward again. It's at 28 now.

Routaran · Dec 29, 2016

Vanadium 50 said:
I had a drive that every few minutes makes a buzzing sound. Here's what SMART is telling me.
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   071   071   016    Pre-fail  Always       -       61213753
  2 Throughput_Performance  0x0005   139   139   054    Pre-fail  Offline      -       71
  3 Spin_Up_Time            0x0007   160   160   024    Pre-fail  Always       -       399 (Average 317)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       87
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
Amazingly, here's the SMART health report:
Code:
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
How it thinks this drive is healthy with 61 million read errors (since Tuesday) is beyond me.

I'm doing a surface scan of its replacement now. I hope it finishes and is good before this one gives up the ghost. Anyone thing I am being too paranoid?

I wouldn't worry about it. The Raw_Read_Error_Rate is the indicator for the rate of sector read operations. There are always errors when attempting to read and this is dealt with by the drive's error correction mechanisms. The RAW_VALUE field is supposed to be the number of read errors but that value is only actually reported by Seagate drives so you can safely ignore this value.

The important bit to compare here is to compare the Worst field to the Thresh field. If the Worst value drops below the Thresh value, then the drive is considered as failed.

https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
"(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number."I suggest that you pull out the drive from the case and reseat it. Also reseat the SATA cable on both ends. See if that does anything to resolve the noise issue.

Vanadium 50 · Dec 29, 2016

Routaran said:

I suggest that you pull out the drive from the case and reseat it. Also reseat the SATA cable on both ends. See if that does anything to resolve the noise issue.

Done (it's in its own enclosure) and no difference.

While errors are normal, it seems to me that 61 million errors is excessive. If that number is meaningless, the other three drives are at 100 for Worst, and this drive is at 71. I'm also a bit concerned that 87 sectors have been reallocated since Tuesday. Reading every byte in use normally takes 2 hours, but Tuesday it took 9. However, no data was lost. (Compared to the data on the drive's mirror and the checksum)

Routaran · Dec 29, 2016

Vanadium 50 said:

Done (it's in its own enclosure) and no difference.

While errors are normal, it seems to me that 61 million errors is excessive. If that number is meaningless, the other three drives are at 100 for Worst, and this drive is at 71. I'm also a bit concerned that 87 sectors have been reallocated since Tuesday. Reading every byte in use normally takes 2 hours, but Tuesday it took 9. However, no data was lost. (Compared to the data on the drive's mirror and the checksum)

The 61 million probably doesn't actually mean 61 million read errors. Remember, most vendors don't report this value so it's probably just a meaningless number.

This is a 3TB drive, each sector is 4k in length. So we're talking 87 sectors in 805 million. The surface scan you ran should mark those sectors as bad and that should be the end of it. You're pretty far off the threshold for Reallocated Sectors.

Here's the output from one of my drives which is working just fine.

Code:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   143   140   021    Pre-fail  Always       -       3825
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4454
  5 Reallocated_Sector_Ct   0x0033   147   147   140    Pre-fail  Always       -       417
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   026   026   000    Old_age   Always       -       54256

The reallocated count on this drive is 417, it's a 500GB drive so 417 out of 131 million. This is still normal for a 4-5 year old drive. I should potentially consider getting a new replacement drive and keeping it on standby because my Worst is pretty close to my Thresh but it's still fine.

You said you had it back in its enclosure. This sounds to me like a external drive you're connecting with USB. Does the caddy you are using support USB3 and were you plugged into a USB2 port when you did your test that took 9 hours? That would explain the 4x longer scan as USB2 is around 4 times slower than USB3.

Vanadium 50 · Dec 29, 2016

The enclosure is SATA - it's an Icy Dock 4-drives-in-the-space-of-3 thing, but the nice thing is I get front access to the drives. The spare is going through badblocks now, and when it finishes, I'll swap it for the loud drive. Then I'll badblock the heck of it out of the suspect drive and based on the output decide if I want to keep it or not.

The 9 hours was my weekly RAID verification. It's run weekly since April. It's typically 2-3 hours and runs Tuesday nights. My first sign something was wrong was that in the morning it hadn't finished, and the LED from the drive in question was on solid. There were no errors the week before or from the previous weekly SMART long test. (I do a weekly SMART long, a daily SMART short, and a weekly RAID check and compare)

Oh, and the drive is getting louder. And there's definitely a correlation between its LED and the sound.

fresh_42 · Dec 29, 2016

Vanadium 50 said:

Oh, and the drive is getting louder.

This is definitely a red alert. Read it as long as you can and make a copy. If the head is already maladjusted ...

Routaran · Dec 29, 2016

fresh_42 said:

This is definitely a red alert. Read it as long as you can and make a copy. If the head is already maladjusted ...

I also agree. If the drive is getting louder as its spinning then its most likely approaching failure. Backup your data while you still have time.

Vanadium 50 said:

The 9 hours was my weekly RAID verification. It's run weekly since April. It's typically 2-3 hours and runs Tuesday nights.

I'm curious, what is your RAID setup, and are you running the Intel RAID Volume Data Verify and Repair?

Vanadium 50 · Dec 29, 2016

It's ZFS. I have four 3 TB drives, configured as 2 3+3 mirrors.

Routaran · Dec 30, 2016

I suspect that the weekly verification of the checksums may have accelerated the drive failure. This is pretty heavy usage for consumer grade hardware. I would suggest that you do not do the checks so often. Drives these days are pretty reliable and a RAID3 array keeps a parity to ensure the data is safe.
If the data is really important and you need the peace of mind, then perhaps do the checks once a month or once every other month, or invest in enterprise grade hard drives that are meant to see heavy use.

Vanadium 50 · Dec 30, 2016

If a drive can't stand up to 52 reads per year, that is a problem. After all, it holds my login area, so files like .bashrc get read many more than 50 times in a year.

I would much rather have a drive fail after a year in a way that the data is recoverable than last twice as long but lose data when it fails. Weekly scans protect against silent corruption. Besides, that's the general recommendation - weekly for consumer-grade disks, monthly for enterprise-grade, if possible. There are some very large pools out there, and monthly scrubs would mean non-stop scrubbing. And to be fair, the system worked as intended - it alerted me to a probable failing drive in time to do something about it.

The drive has been swapped, and is resilvering now. This takes 2-3 hours. The original disk was hot when I took it out. Not warm like other disks - hot. Like fresh from the oven cookies. I'm going to run a couple of R/W badblock tests on it, and if it looks mostly OK, I may keep it around as an emergency spare, but right now I doubt that this will be possible.

Vanadium 50 · Jan 2, 2017

Update: the drive has been replaced. I erased the original drive and it reallocated another sector. It's much quieter in the external USB "toaster".

In it's normal position, it was running about 65 C. The other three drives (and now the replacement) range from 38 to 41 C or so. In the external USB adapter, drives are vertical and have air on all four sides. Drives run at 25 C or so at idle and high 30's under heavy load. The questionable drive idles at 40 C and is 48-51 C or so at load. My conclusion is that something mechanical in the drive likely has more friction than it should, and it's only a matter of time before it goes. The immediate symptoms are second-order effects.

Oh, and no data loss. Swapping the drives and rebuilding the RAID was a 5 minute job, plus the time it took to rebuild.

russ_watters · Jan 2, 2017

Vanadium 50 said:

How it thinks this drive is healthy with 61 million read errors (since Tuesday) is beyond me.

It doesn't say "healthy", it says "passed" and "pre-fail". That's a C- in my book...

Anyone thin[k] I am being too paranoid?

Nope. I'me very paranoid about hard drive failures and I think justifiably. A failed hard drive is the data equivalent of burning down your house: It's not about the money, it's about the potentially irreplaceable things you lose (if not properly backed-up).

Since we're on the subject of failing hard drives, I'm going to whine a bit about my Crucial M4 SSD (again). After 6 months or so of being installed in my laptop it turned into a brick for no apparent reason. Google informed me that it had a bug that made it brick after a certain number of hours of use (a counter overflow or something). Fixed with a firmware flash. Awesome. But then I found that if my laptop ever locked-up and had to be hard reset for any reason, it would brick again. Google informed me that this was an "issue" with the drive's fault response system and could be recovered with a cumbersome series of 20 minute power cycles. Crucial didn't consider this to be a problem worthy of recall (since if it comes back to life it isn't really dead, right?), and since it was expensive I put it into a media center PC. Well last night it crashed again. I recovered it, but still, it is really annoying. [/rant]

Vanadium 50 · Jan 2, 2017

I've had good luck with Crucial - or rather, I had some bad luck, but the company really made it right. It was failing main memory, and the replacement memory was also failing. It tested fine with one stick, but any more and it failed. Randomly. On different motherboards. They sent me a big pile of memory and asked me to find a pair that worked and send the rest back. Oh, and they sent me a memory fan as well.

Kingston on the other hand...I had a 128 GB SSD that failed. Eventually they agreed to replace it, and they replaced it with a 120 GB SSD. Their position was, "hey, close enough. Like it or lump it."

Routaran · Jan 3, 2017

I've had absolutely rotten luck with Crucial and Samsung SSDs. 100% failure rate for me. I've tried 3 different ones over last year all failed within a month or two. 1 other person at work who I helped swap their spinning 2.5" for a SSD also failed pretty fast. I finally gave up and went for a RAID5 with spinny disks. I obviousy had a run of really bad luck but i really wish i can some day work up the courge try a SSD raid lol

Vanadium 50 · Jan 6, 2017

Oddly, I have had very good luck with a Samsung SSD 840 EVO.

I've hammered on it pretty hard. I've written 24.50 TB to it. 22204 power on hours. No problems ever.

Maybe I had good luck because it's an @Evo.

russ_watters · Jan 22, 2017

Vanadium 50 said:

Oddly, I have had very good luck with a Samsung SSD 840 EVO.

I had a Samsung SSD 840 Pro until a few minutes ago when it turned into a brick. [sigh] Windows gave me the old "configuring windows, do not turn off your computer" troll -- After two hours, I turned it off and now the SSD won't detect.

Vanadium 50 · Jan 23, 2017

Did it eventually recover? Can you take it out and attach it to a USB dongle?

russ_watters · Jan 23, 2017

Vanadium 50 said:

Did it eventually recover? Can you take it out and attach it to a USB dongle?

Nope; I did and nope.

Fortunately, I was speaking Samsung customer service's language and they gave me an RMA after just the 30 second description of the problem. Normally I wouldn't go after $150 spent three years ago, but I want these guys to know that SSDs are waaaay too unreliable. Also fortunately, there was nothing of value on this drive; I used it only for programs and the OS. Virtually no data (that's mostly on spinning drives, RAID 1...though some is on my laptop on an SSD).

Psinter · Feb 6, 2017

russ_watters said:

but I want these guys to know that SSDs are waaaay too unreliable.

That is what I have been saying for like a thousand years, but nobody believes me, russ! Nobody!

There are white papers out there that clearly show that consumer grade SSDs are a crappy technology despite the huge effort that is put into making them. I mean, the white papers are not written in a way to negatively criticize the technology, they are objective, but if you read them you realize how crappy the consumer grade technology is. One of the various variables is that the manufacturers try to fix the hardware quality inconveniences with software because doing a good quality hardware product is too expensive (also big in size) and consumers will be very unlikely to buy them unless they reduce the prices. So they reduce the prices by reducing the hardware quality and you end up with all sort of problems. Specially in the software that keeps growing and growing to make up for the hardware quality. As the software grows, more bugs will appear.

Vanadium 50, as soon as I hear buzzing sounds, I begin looking for a new one. I'd recommend you ignore the SMART report and begin looking for a new one.

rcgldr · Feb 6, 2017

The raw read errors in itself doesn't mean much without also knowing the number of raw reads, in order to get an idea of the raw read error rate. You would also want to know what raw read rate error corresponds to the drives read failure rate (usually 1 in 10^14 or more bits read). As hard drives increase density, the normal trend is to expect higher raw read error rates, countered with stronger error correction, to maintain the stated read failure rate. Sector remapping algorithms are vendor specific. I had the impression that there was some smart related sequence that indicated if a drive was becoming marginal and should be backed up and replaced.

ssd's

One concern here is the mapping of sectors done to distribute writes somewhat evenly on the internal memory, combined with garbage collection performed to repack the data, as overwritten sectors end up as duplicates in the memory with a flag obsoleting the "older" copies. Operating systems that use the trim command to note which sectors have been deleted helps with the garbage collection. It's a pretty complex process. During a power loss, it would be nice if the SSD drive had enough energy storage (capacitor) in order to complete any on going garbage collection cycle.

As noted by russ_waters, it's probably best to keep SSD's for operating system and programs but not data. Only the stuff that doesn't change much (except for the near daily updates of Windows defender ...) , using image backups to regular hard drives just to be safe.

Vanadium 50 · Dec 27, 2018

To follow up, I had another Toshiba DT01ACA300 fail, this time with 22,000 hours on it. I replaced the two oldest drives in the array with larger ones, and then tested them to see if I could use them as spares. The older of the two made it through 6 badblocks r/w passes without error, but started showing errors on the 7th.

I also replaced a Maxtor 200 GB that had been running continuously for 13 years. (!) I'd tell you what the power-on time was, except that the SMART register for it has overflowed.

rbelli1 · Jan 3, 2019

Vanadium 50 said:

lose data when it fails.

If you lose any significant amount of data when a drive fails you are doing it wrong.

Back everything up. Then you could heave all of your primary drives into a volcano and you lose no data.

Been there enough times when storage was expensive. Now storage is practically free.

BoB

Vanadium 50 · Jan 4, 2019

Of course it's backed up. That doesn't mean I am happy when disks fail, especially unexpectedly.

I have fire insurance for my house. Doesn't mean I'd be OK with it catching fire.

Vanadium 50 · Sep 21, 2019

Follow-up.

Had a 4TB drive in the array fail in late July. Hard. If it was plugged in, the machine wouldn't power on - most likely a short somewhere. RMA'ed it to Toshiba with the receipt. Today I got a refund - of about $40 more than the drive cost. That's service!

Vanadium 50 · Dec 23, 2019

Another update.

The raw read error rate is not the raw read error rate. :doh:

It's actually logarithmic. A value of 73 means that you get an error every 20,000,000 reads. (10^7.3). Failing at 006 means every fourth read fails, and I would not use the word "failing" to describe that. More like "failed".

I just replaced a 320 GB Seagate DB35.3 that was in 24/7 use for 12 years. The only problem was that SMART tests took a (very) long time to finish, but after 12 years I figured I got my money's worth. The replacement isn't exactly new - it is a Seagate 500 GB that Amazon listed as New, but it wasn't new old stock or even new: it had 100 hours on it. And a filesystem. Amazon refunded my money and told me not to bother shipping it back. Can't complain about the service, but I suspect the business model is to sell lightly used drives as new, and do a refund only if someone notices and complains.

Vanadium 50 · Feb 8, 2020

Vanadium 50 said:

Today I got a refund - of about $40 more than the drive cost. That's service!

I got another refund, this time for exactly what the drive cost. No tax, no shipping back to Toshiba. They're off my list. They seem to have a high failure rate in my application (oddly, the N300 NAS drives seem to be worse than the non-NAS drives) and I am much more willing to play RMA roulette when they cover the full cost. No more "that's service!".

SMART reporting and Hard Disk buzzing sound

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

France to ditch Windows for Linux

Is This Music AI?

Help me build my server with a laptop that has a broken screen

Gmail AI summaries

Warning: Bad actors may already be in store-now-decrypt-later mode

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect