
What's the fastest way to increase file size?

  1. Mar 18, 2007 #1
    So say I'm writing a program with an infinite loop, and I'm trying to write a file as large as the disk drive in the smallest possible time.

    What would be the best algorithm to do it?

    Clearly, such an algorithm would consume as many system resources as it possibly could, so it would be limited by the system rather than by its own complexity - we need not concern ourselves with factorials or exponentials.

    fprintf(fp, "blahblah");

    and blahblah would be "output text". Say blahblah was a huge amount of text, and the loop was a for loop that output blahblah an unbounded number of times (it writes as it goes through the loop, so the loop doesn't need to finish for the file to be written). The question is - how many MB/second are usually written in the process? (and would there be a maximum value, given the limited speed of file writing/reading?) I know that it's correlated with CPU speed. And I don't want to try it myself yet (to avoid stressing out the hard disk) - though it probably has been tried by people who forgot to close the loop up. Anyway, is it conceivable that the entire hard disk space could be eaten up in a matter of seconds? Given that it does take time to transfer system files, I don't think so.
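    For what it's worth, here is a minimal sketch of the kind of loop described above, in plain C, with a rough MB/s estimate bolted on. The file name, chunk size and 1 GB cap are arbitrary placeholders; the figure you get will be limited by the drive's sequential write speed rather than by the CPU.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        enum { CHUNK = 1024 * 1024 };          /* write in 1 MB chunks (arbitrary) */
        char *buf = malloc(CHUNK);
        if (!buf) return 1;
        memset(buf, 'x', CHUNK);               /* the "blahblah" payload */

        FILE *fp = fopen("bigfile.dat", "wb"); /* hypothetical scratch file */
        if (!fp) { perror("fopen"); return 1; }

        time_t start = time(NULL);
        long long written = 0;
        for (int i = 0; i < 1024; i++) {       /* 1 GB here; an endless loop only stops when the disk is full */
            if (fwrite(buf, 1, CHUNK, fp) != CHUNK) break;
            written += CHUNK;
        }
        fflush(fp);                            /* the OS may still be caching part of it */

        double secs = difftime(time(NULL), start);
        if (secs < 1.0) secs = 1.0;            /* avoid dividing by zero on very fast runs */
        printf("wrote %lld MB in %.0f s (about %.1f MB/s)\n",
               written / (1024 * 1024), secs, written / (1024.0 * 1024.0) / secs);

        fclose(fp);
        free(buf);
        return 0;
    }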
     
  3. Mar 18, 2007 #2

    AlephZero

    Science Advisor
    Homework Helper

    The fastest algorithm would be not to write any data at all.

    Just modify the file allocation table (or whatever it's called in your preferred operating system) to say all the free space on the disk belongs to a new file.
     
  4. Mar 18, 2007 #3
    What AlephZero has said. The file system is different for different operating systems so the method to use depends on your OS. Using the Win32 API you could do this:

    // fileHandle was obtained from CreateFile with GENERIC_WRITE access
    LONG sizeLow = <low 32 bits of a 64 bit size value>;
    LONG sizeHigh = <high 32 bits of a 64 bit size value>;
    SetFilePointer(fileHandle, sizeLow, &sizeHigh, FILE_BEGIN); // move the file pointer to the desired size
    SetEndOfFile(fileHandle); // extend the file to that position without writing any data

    The size value is whatever you need it to be. If you want to use up the whole drive then query the file system to determine how much free space is available on the drive using the GetDiskFreeSpace function.
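    As a rough sketch of that query (the drive letter and the minimal error handling are placeholders; GetDiskFreeSpace reports clusters, so the byte count has to be computed):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        DWORD sectorsPerCluster, bytesPerSector, freeClusters, totalClusters;

        /* "D:\\" is just an example root path; use the drive you intend to fill. */
        if (!GetDiskFreeSpaceA("D:\\", &sectorsPerCluster, &bytesPerSector,
                               &freeClusters, &totalClusters))
            return 1;

        ULONGLONG freeBytes =
            (ULONGLONG)freeClusters * sectorsPerCluster * bytesPerSector;
        printf("free space: %llu bytes\n", (unsigned long long)freeBytes);

        /* Split into the low/high halves that SetFilePointer expects. */
        LONG sizeLow  = (LONG)(freeBytes & 0xFFFFFFFF);
        LONG sizeHigh = (LONG)(freeBytes >> 32);
        (void)sizeLow; (void)sizeHigh;  /* pass these to SetFilePointer as shown above */
        return 0;
    }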


    If you used a loop instead of setting the file size, the program would do a lot of file access but still not use very much CPU time. The task is I/O intensive (disk access is relatively slow) but you should still be able to use your computer to perform other non-IO-intensive tasks. CPU speed is not so important since there is little computation involved. You will need to look at the IO specifications of your system to see how fast data can be written. Different architectures provide different results. For example, a SCSI drive will be much faster than an external USB drive.

    Don't worry about "stressing out" the hard drive with your test. Drives are designed to access data at a specified rate and that's all your test would do. It's not like testing your car at maximum speed. Of course, if you fill up the drive used by your operating system, you may find that the system becomes sluggish or unresponsive for lack of room to create temporary files and for tasks that need disk access. Under Windows, you should probably test this with a secondary drive or partition instead of your C:\ drive.
     
  5. Mar 18, 2007 #4

    DaveC426913

    Gold Member

    I am having trouble envisioning an application for this that isn't malicious. :bugeye:
     
  6. Mar 18, 2007 #5
    The process is used to securely wipe out disk space. Previous file content is retrievable by checking unused disk sectors. It can be erased for good if you fill all available space with random data in a humongous file, then delete it.
     
  7. Mar 18, 2007 #6
    Not really.
    Even if you overwrite the whole disk with random data it still can be retrieved. :smile:

    By the way, there are professional applications that already do this for you, but they are a bit more advanced than just overwriting everything with random numbers.
     
  8. Mar 18, 2007 #7
    There are different standards of security, of course. Overwriting a sector with new data will indeed prevent normal users from accessing what was there before, since the system will now read the new data instead. This is good enough for most people even though it will not stop more sophisticated inquisitors. The US DoD uses two methods: the simpler one has three passes (all 0s, all 1s, then random data); the second one has seven passes. Peter Gutmann presents a method with 35 passes. These are all aimed at making it increasingly unlikely that any data could be retrieved from magnetic media. The best method remains to incinerate the drive. But this is getting OTer and OTer.
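    As an illustration only (this is not the actual DoD specification, just the three passes described above applied to a single hypothetical file), such a wipe might be sketched in C like this; a real tool would also force the data out of the OS cache, use a proper random source, and handle errors:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Overwrite the first `size` bytes of an open file with one pattern or with pseudo-random bytes. */
    static void wipe_pass(FILE *fp, long size, int pattern, int randomize)
    {
        char buf[4096];
        fseek(fp, 0, SEEK_SET);
        for (long done = 0; done < size; ) {
            long chunk = (size - done < (long)sizeof buf) ? size - done : (long)sizeof buf;
            if (randomize)
                for (long i = 0; i < chunk; i++)
                    buf[i] = (char)rand();
            else
                memset(buf, pattern, (size_t)chunk);
            fwrite(buf, 1, (size_t)chunk, fp);
            done += chunk;
        }
        fflush(fp);
    }

    int main(void)
    {
        FILE *fp = fopen("secret.dat", "r+b");   /* hypothetical file to be wiped */
        if (!fp) { perror("fopen"); return 1; }
        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);

        wipe_pass(fp, size, 0x00, 0);   /* pass 1: all 0s */
        wipe_pass(fp, size, 0xFF, 0);   /* pass 2: all 1s */
        wipe_pass(fp, size, 0,    1);   /* pass 3: random data */

        fclose(fp);
        return 0;
    }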


    Freeware is always nice. I run this one weekly on my business computer:

    http://www.heidi.ie/eraser/

    I set it for the 3-pass DoD method which is good enough for me. No state secret.
     
  9. Mar 18, 2007 #8

    AlephZero

    Science Advisor
    Homework Helper

    Sometimes it is useful to create a large file for later use with random-access reads and writes. If you create it all in one go, it's more likely to be unfragmented. Also, if you know the file size you need when running on a multi-user system, it makes sense to allocate the resources you need up front, rather than run a long computation for 2 or 3 weeks and then have it fail near the end because the disk was full.

    Re the security issue, this used to be a security hole in some operating systems (e.g. the original Cray supercomputer OS back in the 1980s). Cray put speed before everything, so they didn't wipe the disk when you created a large file. You could read whatever was left on the disk by earlier programs - not good.

    A clever OS doesn't waste time overwriting the existing data either, it just remembers which disk sectors you haven't written yourself and returns zeroes if you try to read parts of the file before writing it. I've no idea if Windoze is that clever (and can't be bothered to find out).
     
    Last edited: Mar 18, 2007
  10. Mar 22, 2007 #9
    Thanks for the replies everyone!

    If old data can still be retrieved when overwritten (through whatever processes it takes), then is disk-writing an irreversible process? Does that mean the more you write to and delete from the disk, the more worn out the hard disk will become? And do different types of disks have different levels of tolerance to writing/deleting - USB drives, hard drives, CD-RWs? I know that CD-RWs burn mini-holes into the disc, so it appears that CD-RW burning is an irreversible process. Yet it has always appeared as if you could bring a hard disk back to new by formatting it.

    And what would be more stressful to the disk - one huge 800 MB file or a bunch of small files that add up to 800 MB? And are all files considered equal by the filesystem (considering that it's all just 0s and 1s)?

    Also, just out of curiosity - where is the registry info stored? What folder and what file?
     
    Last edited: Mar 23, 2007
  11. Mar 23, 2007 #10
    Hmm. I keep hearing on the net, over and over again, that one overwrite isn't good enough, that 'some trace' of your data remains, and that professional data recovery firms and/or the NSA can recover the files. As someone dipping their toe into information theory at the moment, this sounds a bit woo to me.

    Is this actually TRUE? Has this data-recovery been demonstrated conclusively?

    If it were, then really a 20GB hard drive is more of a 40GB or even 60GB hard drive! Albeit one that needs hi-tech equipment to access the extra info...

    Here's a link that covers my concerns... I read a more thorough article last month but can't find it now. :frown:

    http://www.actionfront.com/ts_dataremoval.aspx
     
  12. Mar 23, 2007 #11
    As I understand it, it's a matter of recording accuracy. You may be old enough to have recorded music on cassette tapes (an analog recording technology used by our ancestors before the invention of fire). If so then you've undoubtedly heard phantom sounds in the background when playing re-recorded tapes. You could recognize a faint rendition of the previous recording made on the same tape. VHS tapes can also produce a similar effect: you can sometimes see ghost images of the previous recording on poor quality equipment.

    The fact is that a magnetic surface is not an absolutely perfect medium and recording equipment does not provide 100% reliability either. Even though a hard drive is immensely more precise than an old audio cassette, perfection does not exist, or if it does at least it is not affordable to the masses. Importantly, each data bit is not encoded using a single molecule of magnetic substrate. If it were then there would be no problem, the bit could definitely be only on or off, and changing this state would provide 100% reliable erasure. But instead, each bit is represented by a number of "microbits" (I don't know the actual term) in each pit on the platter. The recording head will set most of them to either 1 or 0 but not necessarily 100% of them because it's not technically necessary and because affordability of the drive matters. You only need to have a clear majority, so maybe 70-80% predominance of a positive or negative value is sufficient, perhaps 90% is the goal, an engineer in this field might know exact figures that I don't have. Regardless, the principle stands: imperfections permit recovery. It also means that newer technology makes recovery harder and harder because recording accuracy gets better and better.

    I don't know what technology actually exists for recovery since I don't work for any security organization. But if I had to do it and if I had the appropriate resources, here's one thing I would look at. I know that each bit is not recorded at 100%, say each bit overwrite is 90% effective. Then in any pit of the platter that shows a 1, I expect the magnetic state of this pit to be 1 at 90% and something else at 10% (the latent state of the old value). If the previous value was also a 1 then I expect 90% of this latent 10% to be 1 so the pit should be 1 at 99%. If the previous value was a 0 then I expect 90% of the latent 10% to be 0 (i.e. 10% of the 10% will be a 1) so the current pit value should be 1 at 91%. The hard drive itself is not designed to differentiate between 91% and 99% positive, it just reads a 1 as it is designed to do. But in theory there should be 4 possible values in each pit, corresponding to 0 over 0, 0 over 1, 1 over 0 and 1 over 1. And, still in theory, you could repeat the calculation one more time to determine other values corresponding to 1 over 1 over 1 (99.9%), 1 over 0 over 0 (90.1%), and so on. In practice you are limited by equipment accuracy.
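    Just to make that arithmetic concrete, here is a tiny sketch of the toy model above; the 90% figure is the hypothetical one from this post, not a real drive parameter.

    #include <stdio.h>

    #define EFF 0.90   /* hypothetical: each write is only 90% effective */

    /* history[] lists the bits written to one spot on the platter, oldest first.
       The first write is modelled as EFF toward its bit and (1 - EFF) toward the
       opposite value; each later write keeps (1 - EFF) of whatever level was there. */
    static double final_level(const int *history, int n)
    {
        double level = history[0] ? EFF : 1.0 - EFF;
        for (int i = 1; i < n; i++)
            level = EFF * history[i] + (1.0 - EFF) * level;
        return level;
    }

    int main(void)
    {
        int one_over_one[]  = {1, 1};   /* a 1 written over a 1 */
        int one_over_zero[] = {0, 1};   /* a 1 written over a 0 */

        printf("1 over 1: %.3f\n", final_level(one_over_one, 2));   /* 0.990 */
        printf("1 over 0: %.3f\n", final_level(one_over_zero, 2));  /* 0.910 */
        return 0;
    }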

    What I would need then is to be able to read each pit more accurately than to obtain just a positive or negative value: I would need to know the exact percentage instead. It could be enough to replace or bypass the electronics of the hard drive so that I could measure the raw signal coming from the reading head. Or it could be necessary to extract the platter and place it in a purpose designed enclosure at the CIA lab. While I'm using the ultra-precise reading head at the lab, I would probably also position it to look at the edges of each track since another source of inaccuracy is the slight but inevitable misalignment of the recording head over the track. There must be more clues there.

    But for today, let me get back to the normal world and closer to the OP. I think you can safely wipe out disk space by filling the drive with random data in a humongous file and then delete it. Let the NSA jump through hoops to retrieve your secret chicken recipe if they find it necessary for national security. They won't bother of course, they'll just beat it out of you.
     
  13. Apr 18, 2007 #12

    rcgldr

    Homework Helper

    Use these instead since they are more 64-bit friendly (they use the LARGE_INTEGER union, which holds a quad-word, i.e. 64-bit, value).

    GetDiskFreeSpaceEx

    SetFilePointerEx

    Sequence (a rough sketch in C follows the list):

    CreateFile
    GetDiskFreeSpaceEx
    SetFilePointerEx
    SetEndOfFile // this causes the space to be allocated and the cluster table updated
    FlushFileBuffers // should cause a wait until the clusters are committed to the disk

    SetFilePointerEx // set pointer back to start of file
    WriteFile // write data in a loop
    CloseHandle // Win32 closes files with CloseHandle
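
    Putting that sequence together, a minimal sketch might look like the following (the drive letter, file name and missing error handling are placeholders, and in practice you would not claim every last free byte, since the OS needs some room):

    #include <windows.h>

    int main(void)
    {
        ULARGE_INTEGER freeBytes, total, totalFree;
        LARGE_INTEGER size, zero;
        zero.QuadPart = 0;

        /* CreateFile - "D:\\bigfile.dat" is just an example path. */
        HANDLE h = CreateFileA("D:\\bigfile.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        /* GetDiskFreeSpaceEx - how much room is left on the drive. */
        if (!GetDiskFreeSpaceExA("D:\\", &freeBytes, &total, &totalFree)) {
            CloseHandle(h);
            return 1;
        }
        size.QuadPart = (LONGLONG)freeBytes.QuadPart;

        SetFilePointerEx(h, size, NULL, FILE_BEGIN); /* move the pointer to the desired size */
        SetEndOfFile(h);                             /* allocate the clusters */
        FlushFileBuffers(h);                         /* wait until the allocation is committed */

        SetFilePointerEx(h, zero, NULL, FILE_BEGIN); /* back to the start of the file */
        /* ... WriteFile in a loop here if the data itself matters ... */

        CloseHandle(h);                              /* close the file */
        return 0;
    }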


    The disk won't get filled up in just seconds. Streaming transfer rates on hard drives range from 30 megabytes/second on inner cylinders to around 70 megabytes/second on outer cylinders. Newer hard drives bump this to about 38 MB/s to 80 MB/s, while the Seagate Cheetah 15,000 rpm SCSI/SAS 15.5K family of drives raises the range to 80 MB/s to 140 MB/s. Assume an average rate of 50 MB/s (i.e. 1 GB every 20 s) and a drive size of 250 GB; then it takes 5000 seconds, or 1 hour 23 minutes and 20 seconds, to wipe the disk.

    If this is done electrically, then bits are only read when transitions flow past a read head. Hard drives are already pushing the envelope in terms of bit density, so trying to read finer "sub-bits" wouldn't really be possible. However, maybe something along the lines of an electron microscope could "see" the bits, but I don't know about this.
     
    Last edited: Apr 18, 2007