1. Limited time only! Sign up for a free 30min personal tutor trial with Chegg Tutors
    Dismiss Notice
Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Homework Help: Multithreaded File IO

  1. Sep 25, 2011 #1
    1. The problem statement, all variables and given/known data

    Perform file IO with 3 reader and 3 writer threads using a circular buffer. Read the (text) file in and write the same file out.

    3. The attempt at a solution

    I have an assignment that requires me to use multiple (3) threads to read data from a file, and multiple threads (again, 3) to write this data out to a new file, but the spec is very vague after that and I'm having trouble working out the logic. To compound the issue, I need to use a circular buffer which I have not implemented before.

    I'd just like to understand this problem better before I get to coding.

    Would it make sense to have each reader thread only concern itself with a portion of the file? So reader_thread_1 read()'s from the start of the file to 1/3 in, and reader_thread_2 reads from 1/3 to 2/3, with reader_thread_3 from 2/3 to the end, and writer_thread_1-3 working in a similar fashion? I will need to pass file offset information to each thread on creation.

    With regards to the circular buffer, it seems I will need to wait for reader_thread_1 to complete a full read all the way to 1/3 in, at which time reader_thread_2 can start adding its lines to the buffer. Writer_thread_1 can start consumining lines as soon as they are available but how do I know when to stop, and start writer_thread_2? I'm a bit confused here.

    In any case, does this seem like a logical path to take? I realise that this probably isn't the most efficient way to perform File IO but I guess it's an exercise in thread synchronisation first and foremost.
  2. jcsd
  3. Sep 26, 2011 #2
    Are all three reader threads accessing the same text file?

    One way to approach reading a file with one thread and writing with another is to use a mutex for the first thread, so the second thread would wait until the first thread is done filling a buffer in memory, then it would hand over control of the mutex to the second thread. The second thread would write to file and release the mutex back to the first thread, allowing it to continue reading. With this method, the reader would block the writer until it's filled its buffer or finished the whole file, and the writer would block the reader from re-filling the read buffer until the writer writes it all to file. Eg:

    20mb file, 8mb buffer in memory
    Reader enters critical section and reads 8mb, writer waits.
    Reader exits critical section, writer enters critical section.
    Writer writes 8mb from the reader, reader waits.

    This continues until the reader reaches the end of the file, then the reader thread terminates, and the writer thread, allowed to resume with an empty read buffer, also terminates.

    I don't know how or why you'd use three threads for a single file though.
  4. Sep 26, 2011 #3
    Yes, all 3 reader threads are reading the same file, and 3 writer threads are writing to the same output file. As I mentioned I think this is quite an inefficient way to do File IO but it's an assignment, and we are studying operating systems, processes, threads, etc

    I've been reading up on semaphores, I'm going to need to use more than just a mutex I think.

    If 3 threads are all reading from the same file, do I need to lock the file with a mutex if they all open it as read only? I think I can get away without one in that case. However, I will need to use one when writing to the shared buffer, and the writing threads will definitely need to lock it the output file.

    I think a counting semaphore will help with the circular buffer too, where reader_thread_1 can increment the semaphore with each line it adds to the buffer, and writer_thread_1 can decrement it each time it grabs one, and then I start writer_thread_2 when it's done.

    I have some test programs to write I think!
  5. Sep 26, 2011 #4
    Yes, a counting semaphore would probably be great with multiple threads. Increment after reading a buffer in a reader thread, decrement when writing a buffer in a writer thread.

    Rather than have each thread read one third sequentially of the file, you might wish to have each reader thread read a buffer amount (say, 1mb) whenever it reads, so you would have 3 writers writing out the read buffers. This is where the circular buffer comes in: each reader reads the next "block" in order, then appends in order. If it can't append in order (i.e. block read pending somewhere between it and the head of the circular buffer), it'll spin until it can place its buffer as the head of the circular buffer. Each writer would lock, write and remove the tail of the circular buffer.
  6. Sep 26, 2011 #5

    D H

    User Avatar
    Staff Emeritus
    Science Advisor

    You really don't want to do it this way unless the file is small. At the start you will have five threads waiting for the first reader to finish reading its share of the file, and at the end you once again only have one active thread. You have pretty much maximized this inherent lack of parallelism at start and end by reading and writing thirds of the file. Reading and writing one byte at a time will minimize this particular problem, but this single byte approach obviously isn't optimal either. There's a buffer size that represents a happy ground somewhere between these extremes.

    Parallelizing your I/O can make the output appear to be cost free. For example, suppose the input file is on one drive, the output file on another. Having parallel readers and writers can halve the time needed to accomplish the task compared to that taken by a single threaded process.

    That said, this mandated architecture is a bit silly. The point is to make you learn how to do threading, how to control what those threads are doing. You are moving from a subroutine model of programming to a coroutine model, and making that paradigm switch is perhaps easiest to accomplish when you are working on a problem that is a bit artificial.

    A state machine might be of help as well. Your readers are, at a minimum, waiting to read or reading. Depending on your architecture, a reader might also go into output modes to transfer its newly acquired data to some consumer of the data. A similar concept applies to your writers.
  7. Sep 27, 2011 #6
    OK I have it working with 1 reader and 1 writer thread, a mutex for the buffer and a counting semaphore to regulate production and consumption and it appears to be working well o_O my output file is correct. Now to add some more threads and see what happens. eek! haha
  8. Oct 5, 2011 #7
    I have been working on this for a week or so, and I have a problem. For the record, "fill count" and "empty count" are counting semaphores and I use two more semaphores as mutexes for each file (input and output). My logic is like this:

    Spawn 3 readers, spawn 3 writers (all explicitly JOINABLE by setting an attribute variable). Each thread function is a while(1) loop with an if-clause for the termination condition that is executed each loop.

    A reader thread will:

    Code (Text):

    lock the input file (sem down)
    if(fgets a line == NULL)
        unlock the input file (sem up)
        set a global flag "reading complete"
    decrement emptycount (sem down)
    lock the shared buffer (sem down)
    unlock the input file (sem up) /* the 2 locks overlap here to enforce line order */
    insert the line into the buffer
    unlock the shared buffer (sem up)
    increment the fillcount (sem up)
    when I break (above) I free the line resource and call pthread_exit(). The reader threads seem to work well, and they all exit appropriately when fgets == NULL.

    My writer threads are not so well behaved. Their logic is similar:

    Code (Text):

    if(reading is complete && the buffer is empty)
        break; /* cleanup and pthread_exit() */
    decrement fill count (sem down)
    lock the shared buffer (sem down)
    remove the next line from the buffer
    lock the output file (sem down)
    unlock the shared buffer (sem up) /* again the 2 locks overlap to enforce line order */
    increment empty count (sem up)
    insert the line into the output file
    unlock the output file (sem up)
    Now, here and there my writer threads will all finish as expected, and the output file is correct, but that is very much the exception rather than the rule! More often, my program hangs at pthread_join() waiting for either the 1st or 2nd writer thread to finish.

    So it seems I have some kind of deadlock but I'm at a bit of a loss as to how to debug it. The writer threads are not terminating (gdb reports thread termination) but they are also not looping (I can use a printf in my loop and it will stop printing at this point), so I think they're stuck on one of the semaphores but which one, and how to find out?

    Also, I'm not entirely sure that my method of overlapping semaphore locks (the buffer and file mutexes) to enforce line order is appropriate ... perhaps I should be using some other kind of method.

    Sometimes when I do get the program to finish, I have a single line output twice (in a row) to the output file, which doesn't seem possible, but there it is ...

    Oh, I can post some code, but since it's an assignment I decided not to at this stage. I think my pseudo-code should make it clear what's going on.
    Last edited: Oct 6, 2011
  9. Oct 6, 2011 #8
    Interestingly, I forgot to add this, if I comment out my pthread_join() calls for the writer threads, the program terminates without a problem, but I still sometimes get a single double line output to the file (only ever once, and always well into the middle of the file, never at the start or end). It's a 10000-odd line file (a movie script).
Share this great discussion with others via Reddit, Google+, Twitter, or Facebook

Similar Threads for Multithreaded File
Read specific lines from a file using fortran90