A number of popular-level articles have been published saying that RAID5 is increasingly impractical at larger data volumes due to the chance of an individual HDD having an unrecoverable read error (URE) during the rebuild phase. I think the underlying math which produced this conclusion may be in error, and I'd like someone to check it.
Published reasoning: typical HDDs have a URE rate of 1 in 10^14 reads (interpreted as bytes). If a 16TB 8-drive RAID5 array has a single URE, that HDD must be replaced and its data rebuilt onto the spare. During the rebuild no further errors can be tolerated, or else the entire array is lost. Apparent mathematical reasoning: 1 URE per 10^14 bytes / 8 HDDs per array = 1 URE per 12.5 TB read from the 8-drive array. Conclusion: given that URE rate, there is a nearly 100% chance of a 2nd HDD failing while rebuilding the 16TB array. Example article: http://www.zdnet.com/article/has-raid5-stopped-working/
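To make the published figure concrete, here is a rough sketch of how I would turn "1 URE per 12.5 TB read" into a probability. It assumes UREs occur independently at a constant rate per byte (a Poisson-style model, which is my assumption and not something the articles state), that 16TB is the raw capacity in decimal terabytes, and that the 8 drives are equal in size. The names are made up for illustration.

```python
import math

URE_INTERVAL_BYTES = 12.5e12  # published figure: 1 URE per 12.5 TB read
ARRAY_BYTES = 16e12           # assumed: 16 TB raw capacity, decimal terabytes
N_DRIVES = 8

def p_at_least_one_ure(bytes_read, bytes_per_ure):
    """Probability of one or more UREs, assuming independent errors
    at a constant rate per byte (Poisson-style assumption)."""
    expected_ures = bytes_read / bytes_per_ure
    return 1.0 - math.exp(-expected_ures)

# Reading the whole array once.
p_full = p_at_least_one_ure(ARRAY_BYTES, URE_INTERVAL_BYTES)

# A rebuild reads only the surviving drives (assumed: 7 of 8 equal drives).
rebuild_bytes = ARRAY_BYTES * (N_DRIVES - 1) / N_DRIVES
p_rebuild = p_at_least_one_ure(rebuild_bytes, URE_INTERVAL_BYTES)

print(f"full read: {p_full:.3f}, rebuild read: {p_rebuild:.3f}")
```

Under these assumptions the result is high but short of 100%, so the "nearly 100%" wording depends on how the articles modeled it.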
I think this is incorrect for several reasons:
(1) Common sense: if the chance of a URE is 1 per 12.5 TB read, large RAID0 and RAID5 arrays would be failing at an incredible rate.
(2) The specified URE rate is 1 per 10^14 *reads*, not bytes. Modern drives read in 4096-byte (4K) sectors, not in single bytes or 512-byte sectors. So translated to bytes, the URE rate is 1 per (10^14 reads * 4096 bytes per sector) = 1 per 4.096 * 10^17 bytes, or 1 URE per 409,600 terabytes read.
(3) While the number of drives in the array increases the chance of an individual failure per unit of *operating time*, it does not increase the failure chance per volume of data transferred vs a single HDD of equal capacity. E.g., if reading the *same* data volume from a single HDD vs an n-drive RAID array, each drive in the array will do only 1/n of the reads, and hence has 1/n of the failure chance per aggregate volume of data. Therefore, for a given number of reads, the URE probability is about the same for a single drive and a multi-drive RAID array.
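Points (2) and (3) can be checked with a short sketch. It takes my reading in (2) at face value (the spec counts 4096-byte sector reads, which is exactly what I would like confirmed) and again assumes independent errors at a constant rate. The function and variable names are hypothetical.

```python
import math

READS_PER_URE = 1e14   # spec figure, read here as sector reads (my interpretation)
SECTOR_BYTES = 4096    # assumed 4K sectors
TB = 1e12              # decimal terabyte

bytes_per_ure = READS_PER_URE * SECTOR_BYTES
tb_per_ure = bytes_per_ure / TB  # point (2): TB read per URE

def expected_ures(bytes_read, bytes_per_ure):
    """Expected URE count for a given read volume on one drive."""
    return bytes_read / bytes_per_ure

def expected_ures_array(total_bytes, n_drives, bytes_per_ure):
    """Point (3): split the same volume evenly over n drives and
    add up the per-drive expectations."""
    per_drive = total_bytes / n_drives
    return sum(expected_ures(per_drive, bytes_per_ure) for _ in range(n_drives))

volume = 16 * TB
single = expected_ures(volume, bytes_per_ure)
array = expected_ures_array(volume, 8, bytes_per_ure)

# Probability of at least one URE while reading the full volume.
p_ure = 1.0 - math.exp(-single)

print(f"{tb_per_ure:,.0f} TB per URE")
print(f"expected UREs: single={single:.3e}, 8-drive array={array:.3e}")
print(f"P(at least one URE over 16 TB) = {p_ure:.3e}")
```

The single-drive and 8-drive expectations come out the same because the total volume read is the same. Only the bytes-per-URE figure changes the outcome, so the whole disagreement comes down to the unit in the spec.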