Why aren't ditto blocks used more often and more transparently?

  • Thread starter Vanadium 50
  • #1
Vanadium 50
Staff Emeritus
Science Advisor
Education Advisor
2023 Award
TL;DR Summary
Why aren't ditto blocks used more often and more transparently?

First, what is a ditto block? It is simply a copy of a block stored elsewhere on the disks. It provides an additional level of redundancy beyond RAID, and faster read speeds. If you have a 3-disk RAID5 with data distributed as D1 (data) on drive 1, D2 on drive 2, and P1 (parity) on drive 3, and you add ditto copies of D2, P1, and D1 to those drives respectively, you can lose any two drives and still get your data back.
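
A minimal sketch of that recovery, assuming one-byte blocks and XOR parity (the layout table and the recover() helper below are purely illustrative, not any filesystem's actual code):

```python
# Minimal sketch: one-byte "blocks", XOR parity, and the ditto layout above.
# Drive 1 holds D1 plus a ditto of D2, drive 2 holds D2 plus a ditto of P1,
# drive 3 holds P1 plus a ditto of D1. Any two drives can then be lost.
from itertools import combinations

D1, D2 = 0xA5, 0x3C        # the two data blocks
P1 = D1 ^ D2               # RAID5 parity

drives = {
    1: {"D1": D1, "D2": D2},   # primary D1, ditto of D2
    2: {"D2": D2, "P1": P1},   # primary D2, ditto of P1
    3: {"P1": P1, "D1": D1},   # primary P1, ditto of D1
}

def recover(surviving):
    """Rebuild (D1, D2) from the blocks left on the surviving drive(s)."""
    blocks = {}
    for d in surviving:
        blocks.update(drives[d])
    d1, d2, p1 = blocks.get("D1"), blocks.get("D2"), blocks.get("P1")
    if d1 is None:
        d1 = d2 ^ p1           # reconstruct missing data from parity
    if d2 is None:
        d2 = d1 ^ p1
    return d1, d2

for lost in combinations(drives, 2):
    survivors = [d for d in drives if d not in lost]
    assert recover(survivors) == (D1, D2)
    print(f"lost drives {lost}: data rebuilt from drive {survivors[0]}")
```

Losing drives 1 and 2, for example, leaves P1 and the ditto of D1 on drive 3, and D2 falls straight out of the XOR.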

This is routinely done with critical data, such as the superblock.

So why isn't it done transparently with data? What's the downside? If the disk array is 40% full, I can use the remaining space for one more copy of everything, and two more copies of some things. As the array fills up, the number of copies is reduced.

I see two downsides, but both are easily fixed. One is that writes take longer, because you are writing N copies. Sure, but once you have one complete copy, you can report the write as successful and finish the replication in the background. The other is that you use up space N times faster, but again, if this is dynamic, N can decrease as the array fills up.
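
A quick sketch of that dynamic policy (the max_copies() function and the numbers are hypothetical, just to make the 40%-full example concrete):

```python
# Hypothetical policy: how many complete copies of the live data fit in the
# array, shrinking automatically (never below one) as the array fills up.
def max_copies(capacity_tb: float, live_data_tb: float) -> int:
    if live_data_tb <= 0:
        return 1
    return max(1, int(capacity_tb // live_data_tb))

# The 40%-full example above: two full copies fit, and the leftover 20% of
# the array could still hold a third copy of half the data.
for used_fraction in (0.10, 0.40, 0.70, 0.95):
    copies = max_copies(100.0, 100.0 * used_fraction)
    print(f"{used_fraction:.0%} full -> {copies} cop{'y' if copies == 1 else 'ies'}")
```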

So, why don't we see more of this out 'in the wild'?
 
  • #2
Vanadium 50 said:
TL;DR Summary: Why aren't ditto blocks used more often and more transparently?

So, why don't we see more of this out 'in the wild'?
Probably seen as a threat to the Data-Backup Industrial Complex.

There's another keyword : "dup profile".

Kneejerk reaction says that the radial portion of the read seek-time improves 25% (for two full, contiguous partitions, and not bothering to account for track-length differences on a platter). Rotational latency and settle-time stay the same, of course.
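
If you wanted to sanity-check that kneejerk figure, a quick Monte Carlo along these lines would do it. Everything here is an assumption: uniform head position, uniform block placement, the duplicated block sitting at the same offset in each half of the platter, and reads always going to the nearer copy; the exact improvement depends on those choices.

```python
# Quick-and-dirty Monte Carlo of the radial seek distance only. Rotational
# latency, settle time and track-length (zoning) differences are ignored,
# as in the back-of-envelope above.
import random

random.seed(1)
N = 1_000_000
single = dup = 0.0
for _ in range(N):
    head = random.random()                  # current radial head position, 0..1
    pos = random.random()                   # block position with a single copy
    single += abs(head - pos)
    p = random.random() / 2                 # block duplicated at p and p + 0.5
    dup += min(abs(head - p), abs(head - (p + 0.5)))

print(f"mean radial seek, one copy  : {single / N:.3f}")
print(f"mean radial seek, two copies: {dup / N:.3f}")
```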

Sounds worth a shot on a HDD, if you don't mind doubling the write wear-n-tear. SSD you only get the security benefit : no improvement in access time.
 
Last edited:
  • #3
Ah, but does it really double the write wear and tear? If your drive is nearly empty, sure. As it fills up, it can't ditto so often. Indeed, the write wear-and-tear increase is at its largest when disks are nearly empty, so if you have a weak drive, better it fail now than later. Because your data has multiple copies, recovery from a bad drive goes faster.

I think resi8liency is a better word than security. It's actually less secure since the data can be recovered from fewer disks.
 
  • #4
Vanadium 50 said:
I think resi8liency is a better word than security.
Was this a finger-slip or did you really mean 'r8y'?

In s/w dev we use terms like i18n and l10n. r8y works perfectly.
 
  • #5
Finger slip.

But my point is that having more data on the drive makes it less secure against bad actors, although it is more robust against hardware failures.
 
  • #6
There may be backup parameters that can get you close, albeit without the read-seek improvement.

(It's been a while since I bothered - M$'s forced mediocrity-at-best has taken my soul - but I'm one of those guys who likes to micromanage i/o parameters on a PC)
Vanadium 50 said:
Finger slip.

But my point is that having more data on the drive makes it less secure against bad actors, although it is more robust against hardware failures.
Actually, your point was wondering why dup'ing data wasn't done more, so... ?

Dup'ing data in the manner we've been talking about does not increase the security risk : the data is still on one disk, accessed by one controller, and managed by one chunk of software : there's no increase in the number of exploitable entry vectors.
 
Last edited:
  • #7
Well, there are more vendors than Microsoft. :)

You could do this manually. Start with a 10-way mirror, and when that fills up, make it a 5-way, and when that fills up, make it a 3-way, and so on. But that seems like something better handled automatically.

The security hole that I was thinking of involved returning bad drives. If your drive has only every Nth bit on it, there's not so much risk. If it has a complete copy, in N different places, there is more. A past employer had a service contract where we didn't return bad drives for just that reason.
 
  • #8
In my experience, the drive(s) used for paging and temporary files are the ones that fail most often.

The extra seek operations, at least for drives installed lying flat, seem to be the limiting factor in hardware lifetime.

So having the paging/temporary file drive(s) separated from your data is some extra insurance.

Tip: I've had luck recovering data from a failed drive by changing its orientation and doing an image copy.

Cheers,
Tom
 
  • #9
I think in this day and age, page drives are moving towards SSDs. Why not? Indeed, you can make them cheap/fragile SSDs; if they fail, there's no real loss, just annoyance.

Actual data looks to be on spinning rust for the foreseeable future. 20 TB drives are not uncommon in data centers (and you can pre-order 26 TB drives), and a robust 20 TB SSD would cost a fortune.

My experience is that drives follow the bathtub curve - infant mortality and old age are the problems. But if you can put 1000 hours on a drive, you can usually put 10,000 hours on it. So I don't see this increased "wear and tear" as a problem. And as I said, if this is going to happen, I'd rather have it happen early and with lots of replicas.

So why don't we see more of this?
 
  • #10
With my first SSD, I turned paging off and ran all the download/temp/cache/spool directories in RAM, to keep from wearing it out. Worked fine.
 
Last edited:
  • #11
That just tells me you didn't need a page/swap area to begin with.

The "you'll write your SSD to death" meme is kind of oversold. You get 10,000 writes, right? Call it 8000 from a worse-than-average SSD. Filling a 1 TB SSD takes about an hour, and there are 8000 hours in a year. So you have a year of non-stop writing before the drive falls over. How many calendar years will this take? Five? Ten?
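
Written out, that back-of-envelope looks something like this (the cycle count and write rate are illustrative assumptions, not the spec sheet of any particular drive):

```python
# The back-of-envelope above, written out with assumed numbers.
capacity_gb   = 1000      # 1 TB drive
pe_cycles     = 8000      # assumed full-drive write cycles before wear-out
write_rate_gb = 0.3       # ~0.3 GB/s sustained -> roughly an hour per fill

hours_per_fill   = capacity_gb / write_rate_gb / 3600
hours_to_wearout = pe_cycles * hours_per_fill
print(f"one full fill   : ~{hours_per_fill:.1f} h")
print(f"non-stop writing: ~{hours_to_wearout:.0f} h "
      f"(~{hours_to_wearout / 8760:.1f} calendar years at 100% duty cycle)")
```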

For normal workloads, SSDs will last roughly as long as HDDs. Can I come up with a workload that breaks SSDs? Sure - fill the drive up to 99% and hammer on that last 1%. Is this what people typically do? No.

So I don't think "You only get 9995 writes out of your SSD and not 10,000!" is a compelling argument.

Further, consider the excitement about the opposite design, deduplication - where if two files store the same block, only one is stored. This, as far as I can tell, has only one use case where it matters, makes the data more fragile, and hogs memory. That should scream out "go the other direction!"
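
For contrast, here is a toy illustration of the dedup idea being referred to: blocks keyed by a content hash, so identical blocks are stored once and merely reference-counted. Purely a sketch of the concept, not any real implementation.

```python
# Toy content-addressed block store. Real dedup tables live in RAM, which is
# exactly the memory hogging complained about above.
import hashlib

store, refcount = {}, {}

def write_block(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()
    if key not in store:
        store[key] = data                  # first occurrence: actually stored
    refcount[key] = refcount.get(key, 0) + 1
    return key                             # file metadata keeps only this key

write_block(b"the same 4 KiB block")
write_block(b"the same 4 KiB block")       # second write stores nothing new
print(len(store), "stored block(s),", sum(refcount.values()), "references")
```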
 
  • #12
Vanadium 50 said:
That just tells me you didn't need a page/swap area to begin with.
Well, yeah, which is why I turned it off ; had to keep an eye on things, of course.

Also, it did away with other usually-unnecessary disk-writes, such as print spooling, non-persistent temporary files, downloads which are going to be unpacked anyways, etc.

So, anyways, your objection to your own proposal, on security grounds, is that the decreased lifetime of a disk means it gets binned faster, i.e. more potential breaches.
 
  • #13
I don't think the number of writes is substantially higher with this implemented. Sure, if the drive is empty it can add a whole bunch of ditto blocks. But if the drive is 2/3 full, only half the data can be dittoed.

Additionally, in principle the system can move the drive head around in a more logical way ("elevator seeking") which might, again in theory, reduce wear on the stepper motors. In practice, I am not so sure. Modern drives do a lot of lying about their geometry to their hosts.
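
For reference, the classic elevator (SCAN) ordering being alluded to looks roughly like this - a generic sketch, not how any particular drive or OS schedules its queue, and (as noted) real drives hide their geometry anyway:

```python
# Generic sketch of elevator (SCAN) ordering: service pending requests in the
# current sweep direction, then the rest on the way back, instead of in
# arrival order. Cylinder numbers are made up.
def elevator_order(pending, head, ascending=True):
    if ascending:
        ahead  = sorted(c for c in pending if c >= head)
        behind = sorted((c for c in pending if c < head), reverse=True)
        return ahead + behind
    below = sorted((c for c in pending if c <= head), reverse=True)
    above = sorted(c for c in pending if c > head)
    return below + above

pending = [98, 183, 37, 122, 14, 124, 65, 67]
print("arrival order :", pending)
print("elevator order:", elevator_order(pending, head=53))
```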
 
  • #14
Vanadium 50 said:
The "you'll write your SSD to death" meme is kind of oversold.
I recall small embedded computers with flash drives and Linux dying after a week or so because someone forgot to disable sync when setting up the drive mounts. Of course, flash drives back then did not have the fancy life-prolonging firmware that modern SSDs have, so when one was just used like an HDD it died pretty quickly.
 

