How Can I Merge .dat Files into One Using Python?

member 428835 · Apr 22, 2020

Hi PF!

There are directories 0 5 10 15 20 (and so on), each containing a single volFieldValue.dat file (same name in each directory). I would like to successively combine each .dat file into one. So far what I have is this:

Python:

#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 20
increment = 5
dirNames  = np.arange(first_dir, last_dir, increment)

print(len(dirnames)

# LOOP THROUGH ALL DIRECTORIES
for j in range(0, len(dirNames))

    # OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE 
    with open('./0/volFieldValue.dat', 'w') as outfile: 

        # OPEN EACH SUCCESSIVE FILE IN READ MODE 
        with open('./'+str(dirNames(j))+'/volFieldValue.dat') as infile: 
  
            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat 
            outfile.write(infile.read()) 
  
        # LINEBREAK TO SEPARATE SUCCESSIVE FILES
        outfile.write("\n")

but I'm getting the error

Code:

  File "file_merge.py", line 20
    with open('./0/volFieldValue.dat', 'w') as outfile: 
       ^
SyntaxError: invalid syntax

Your help is greatly appreciated!

PeterDonis · Apr 22, 2020

joshmccraney said:

I'm getting the error

You need a colon at the end of the for statement in line 17. One of the warts of Python is that syntax errors often get attributed to the wrong line, as in this case; because of the way the lexer and parser work, Python attributes the error to the fact that the with line is indented when it shouldn't be, rather than to the fact that you forgot to put a colon to indicate an indented block on a previous line. Which is kind of frustrating since a for statement always starts an indented block, but that's how it works; Python doesn't yet know, when it generates the syntax error, that the previous line is a for statement (since that stage of parsing hasn't happened yet), so all it can do is flag wrong indentation at that point.
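One way to check which line the interpreter blames for a given mistake is to hand a snippet to the built-in compile() and look at the resulting SyntaxError. The sketch below is illustrative only: the helper syntax_error_line and both snippets are made up for the experiment, and the reported line and message can vary between Python versions. In the CPython 3 versions this was written against, a missing colon on its own is reported on the for line itself, while an unclosed parenthesis earlier in the file (like the print call on line 14 of the posted script) is a common reason for an error to be reported on a different line from the one that needs fixing.

```python
def syntax_error_line(source):
    """Return the line number Python reports for a syntax error, or None."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as err:
        return err.lineno
    return None

# Hypothetical snippet: the colon is missing at the end of the for statement.
missing_colon = (
    "for j in range(3)\n"
    "    print(j)\n"
)

# Hypothetical snippet: the print call on line 1 is never closed.
unclosed_paren = (
    "print(len('abc')\n"
    "for j in range(3):\n"
    "    print(j)\n"
)

print("missing colon reported on line", syntax_error_line(missing_colon))
print("unclosed paren reported on line", syntax_error_line(unclosed_paren))
```

Running this against the interpreter you actually use is the quickest way to see how it attributes each kind of mistake.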

member 428835 · Apr 22, 2020

PeterDonis said:

You need a colon at the end of the for statement in line 17. One of the warts of Python is that syntax errors often get attributed to the wrong line, as in this case; because of the way the lexer and parser work, Python attributes the error to the fact that the with line is indented when it shouldn't be, rather than to the fact that you forgot to put a colon to indicate an indented block on a previous line. Which is kind of frustrating since a for statement always starts an indented block, but that's how it works; Python doesn't yet know, when it generates the syntax error, that the previous line is a for statement (since that stage of parsing hasn't happened yet), so all it can do is flag wrong indentation at that point.

This is really good to know: thanks a ton!

Okay, so the new code both includes the colon and changes line 23 to dirNames[j] (evidently the () was invalid syntax too).

However, one big problem persists: all my code seemed to do is delete the previous .dat file and replace it with the new .dat file. So nothing is merged. Any idea what's wrong?

EDIT: never mind, all I needed to do was switch the order of the for loop and the open in write mode. It makes sense now why it kept deleting and replacing the file. Thanks again!
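The behaviour described in the edit follows from 'w' mode truncating the file every time it is opened. A minimal sketch of the difference, using a temporary file and made-up chunks in place of the real .dat files (the names workdir, out_path and chunks are assumptions for illustration):

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
out_path = os.path.join(workdir, "merged.dat")
chunks = ["first\n", "second\n", "third\n"]  # stand-ins for the input files

# Opening in 'w' mode inside the loop truncates the file on every pass,
# so only the last chunk survives.
for chunk in chunks:
    with open(out_path, "w") as outfile:
        outfile.write(chunk)
with open(out_path) as f:
    reopened_each_time = f.read()

# Opening once, outside the loop, keeps everything that was written.
with open(out_path, "w") as outfile:
    for chunk in chunks:
        outfile.write(chunk)
with open(out_path) as f:
    opened_once = f.read()

print(repr(reopened_each_time))  # 'third\n'
print(repr(opened_once))         # 'first\nsecond\nthird\n'
```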

member 428835 · Apr 22, 2020

One last question: I am trying to delete the first 4 lines of each of the .dat files before merging them. So far what I have is this

Python:

#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE 
with open('./0/volFieldValue.dat', 'w') as outfile:

    # LOOP THROUGH ALL DIRECTORIES
    for j in range(0, len(dirNames)):

        # OPEN EACH SUCCESSIVE FILE IN READ MODE 
        with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r+') as infile:
            lines = infile.readlines()
            infile.writelines(lines[4:])
  
            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat 
            outfile.write(infile.read()) 
  
        # LINEBREAK TO SEPARATE SUCCESSIVE FILES
        # outfile.write("\n")

but now everything is deleted. I think it is due to line 22. Any ideas?

PeterDonis · Apr 22, 2020

joshmccraney said:

but now everything is deleted

Why not just do outfile.write(lines[4:]) in line 26?

joshmccraney said:

now everything is deleted

Is the input file actually empty, or just the output file?

joshmccraney said:

I think it is due to line 22.

Why would that line cause a problem?

The problem is most likely to be lines 23 and 26: after you've written to infile in line 23, the file pointer is at the end of the file, so when you read from it in line 26, you get an empty string, and that's what gets written to outfile.

It's generally bad practice to open a file in read/write mode, since it forces you to think about where the file pointer is. Most of the time you don't actually need to.
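The file-pointer effect can be reproduced in isolation. The following sketch uses a throwaway file with invented contents (four header lines and two data lines); it only reads, so it shows the "empty string after readlines()" part of the explanation rather than the full r+ behaviour:

```python
import os
import tempfile

# Throwaway sample file; the contents are made up for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.dat")
with open(path, "w") as f:
    f.write("h1\nh2\nh3\nh4\ndata1\ndata2\n")

with open(path, "r") as infile:
    lines = infile.readlines()   # consumes the whole file
    leftover = infile.read()     # the pointer is at the end, so nothing is left

print(len(lines))      # 6
print(repr(leftover))  # ''
```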

member 428835 · Apr 23, 2020

PeterDonis said:

Why not just do outfile.write(lines[4:]) in line 26?

I was not aware this was an option. But when I tried your suggestion, I got an error about a string. So I changed to outfile.write(str(lines[4:])) and then got an output file that was very wrong (lots of arbitrary text, very confusing).

PeterDonis said:

Is the input file actually empty, or just the output file?

Just the output.

PeterDonis said:

Why would that line cause a problem?

I'm not sure why, but I know it does, because when I comment out line 23 but not 22, the output file is blank. However, when I comment out line 22, the output file works except for the first four lines not being deleted.

PeterDonis said:

The problem is most likely to be lines 23 and 26: after you've written to infile in line 23, the file pointer is at the end of the file, so when you read from it in line 26, you get an empty string, and that's what gets written to outfile.

It's generally bad practice to open a file in read/write mode, since it forces you to think about where the file pointer is. Most of the time you don't actually need to.

Okay, so should I rewrite the code to look like this:

Python:

#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# THIS LOOP SHOULD DELETE THE FIRST 4 LINES OF THE .dat FILES
for j in range(0, len(dirNames)):
    
    # OPEN THE .dat FILES IN READ MODE
    with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r') as f:

            # LINES FROM THE .dat FILE
            lines = f.readlines()

            # DELETE FIRST 4 LINES
            lines[4:] = []

# OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE 
with open('./0/volFieldValue.dat', 'w') as outfile:

    # LOOP THROUGH ALL DIRECTORIES
    for j in range(0, len(dirNames)):

        # OPEN EACH SUCCESSIVE FILE IN READ MODE 
        with open('./'+str(dirNames[j])+'/volFieldValue.dat') as infile:
  
            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat 
            outfile.write(infile.read())  
        # LINEBREAK TO SEPARATE SUCCESSIVE FILES
        # outfile.write("\n")

(NOTE: I'm not sure if line 24 is deleting the line or not.)

PeterDonis · Apr 23, 2020

joshmccraney said:

I was not aware this was an option. But when I tried your suggestion, I got an error about a string.

Yes, my bad, it should have been outfile.writelines(lines[4:]).

Why wouldn't this be an option? You want to write all but the first four lines from the input file, to the output file. The code just above is the obvious simplest way to do that, since you already have the lines read in from the input file. Writing the lines back to the input file, then reading them back in, is an unnecessary complication.
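The difference between the three calls discussed here can be seen with an in-memory buffer. This is only a sketch: io.StringIO stands in for the real output file, and the sample lines are invented.

```python
import io

lines = ["h1\n", "h2\n", "h3\n", "h4\n", "1.0 2.0\n", "3.0 4.0\n"]

# write() expects a single string, so passing a list raises TypeError.
buf = io.StringIO()
try:
    buf.write(lines[4:])
    write_error = None
except TypeError as err:
    write_error = err

# str() on a list gives its repr, brackets and quotes included,
# which is where the confusing extra text in the output comes from.
as_repr = str(lines[4:])

# writelines() writes each string in the list as-is.
buf = io.StringIO()
buf.writelines(lines[4:])
merged = buf.getvalue()

print(type(write_error).__name__)
print(as_repr)
print(merged, end="")
```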

joshmccraney said:

when I comment out line 23 but not 22, the output file is blank.

Yes, because once you've done infile.readlines(), the file pointer is at the end of the file, so when line 26 reads from the same file, it gets an empty string.

joshmccraney said:

However, when I comment out line 22, the output file works except for the first four lines not being deleted.

Do you mean "when I comment out line 22 as well as line 23"? If so, this makes sense, as the only file operation then is line 26, which will just read the whole file since the file pointer hasn't been moved from the start of the file by any lines before it.

If line 22 is commented out but line 23 is not, however, you should get an exception at line 23 on the first iteration of the loop because the lines variable hasn't yet been initialized with any data (because line 22 was commented out and it's the only place where the lines variable is assigned to).

The lesson here is that "comment out debugging" isn't always the best way to find errors. I strongly suggest reading the Python documentation on r+ file mode; it will at least give you a start on understanding all of the pitfalls involved.

joshmccraney said:

Okay, so should I rewrite the code to look like this

No. You're not thinking through what the code is doing.

Here is what you want the code to do (the simplest version):

(1) Open the output file for writing.

(2) Loop through all the input files: for each input file, read in all its lines, then write all but the first four lines to the output file.
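The two steps above can be sketched as follows. The function name merge_dat_files, its parameters, and the demonstration data are assumptions made for illustration; the demonstration writes to a temporary directory rather than to ./0/volFieldValue.dat.

```python
import os
import tempfile

def merge_dat_files(dir_names, out_path, filename="volFieldValue.dat", skip=4):
    """Append every input file, minus its first `skip` lines, to out_path."""
    with open(out_path, "w") as outfile:            # step (1)
        for name in dir_names:                      # step (2)
            in_path = os.path.join(str(name), filename)
            with open(in_path) as infile:
                lines = infile.readlines()
            outfile.writelines(lines[skip:])

# Small demonstration in a throwaway directory tree (made-up data).
root = tempfile.mkdtemp()
for name in (5, 10, 15):
    os.makedirs(os.path.join(root, str(name)))
    with open(os.path.join(root, str(name), "volFieldValue.dat"), "w") as f:
        f.write("# h1\n# h2\n# h3\n# h4\n" + str(name) + " 0.1\n")

out_path = os.path.join(root, "merged.dat")
merge_dat_files([os.path.join(root, str(n)) for n in (5, 10, 15)], out_path)
with open(out_path) as f:
    merged = f.read()
print(merged, end="")
```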

Alternatively, if for some reason you need to have the input files changed as well (to have the first 4 lines eliminated from each--although there is no need to actually do that if all you want is to get the output file right), then you want the code to do this (this is the version you appear to have attempted to code):

(1) Loop through all the input files: for each input file, read in all its lines, then truncate the file and write out all but the first four lines.

(2) Open the output file for writing.

(3) Loop through all the input files: for each input file, read in all its data and write that data to the output file.

Can you see, first, why the rewritten code you posted is not doing the above, and second, why the simplest version I gave above is cleaner and simpler than the version you tried to code?
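For comparison, the three-step alternative might be sketched like this. The helper names strip_header_in_place and concatenate are hypothetical, and the sample files are invented. Note that each input file is opened twice, once for reading and once for writing, so no r+ mode is needed.

```python
import os
import tempfile

def strip_header_in_place(path, skip=4):
    """Rewrite `path` so that its first `skip` lines are removed."""
    with open(path) as f:          # read everything first
        lines = f.readlines()
    with open(path, "w") as f:     # 'w' truncates, then write the remainder
        f.writelines(lines[skip:])

def concatenate(paths, out_path):
    """Copy the full contents of each file in `paths` to out_path."""
    with open(out_path, "w") as outfile:
        for path in paths:
            with open(path) as infile:
                outfile.write(infile.read())

# Made-up input files in a temporary directory.
root = tempfile.mkdtemp()
paths = []
for name in ("5", "10"):
    path = os.path.join(root, name + ".dat")
    with open(path, "w") as f:
        f.write("h1\nh2\nh3\nh4\n" + name + " 0.1\n")
    paths.append(path)

for path in paths:                       # step (1)
    strip_header_in_place(path)
out_path = os.path.join(root, "merged.dat")
concatenate(paths, out_path)             # steps (2) and (3)

with open(out_path) as f:
    merged = f.read()
with open(paths[0]) as f:
    first_input_after = f.read()
```

The extra pass over the inputs, and the fact that the originals are modified, is what makes this version heavier than the two-step one.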

PeterDonis · Apr 23, 2020

joshmccraney said:

I'm not sure if line 24 is deleting the line or not

Line 24 is deleting all but the first four lines from the list of lines in memory. First, those aren't the lines you want to delete; they're the lines you want to keep. Second, deleting them from the list in memory, stored in the lines variable, does nothing at all to the file on disk.
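The slice semantics can be checked on a plain list, with no files involved. The sample list below is made up; the variable names are for illustration only.

```python
lines = ["h1", "h2", "h3", "h4", "d1", "d2"]

kept_first_four = list(lines)
kept_first_four[4:] = []        # removes everything from index 4 onward

dropped_first_four = list(lines)
del dropped_first_four[:4]      # removes indices 0 through 3

sliced_copy = lines[4:]         # a new list; `lines` itself is unchanged

print(kept_first_four)     # ['h1', 'h2', 'h3', 'h4']
print(dropped_first_four)  # ['d1', 'd2']
print(sliced_copy)         # ['d1', 'd2']
```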

member 428835 · Apr 23, 2020

PeterDonis said:

Yes, my bad, it should have been outfile.writelines(lines[4:]).

Why wouldn't this be an option? You want to write all but the first four lines from the input file, to the output file. The code just above is the obvious simplest way to do that, since you already have the lines read in from the input file. Writing the lines back to the input file, then reading them back in, is an unnecessary complication.

I agree, this is why I was trying to do it this way at first. The reason I asked about redoing it a different way is because I was struggling to get it working.

PeterDonis said:

Yes, because once you've done infile.readlines(), the file pointer is at the end of the file, so when line 26 reads from the same file, it gets an empty string.

So the file pointer reads through the entire .dat file, pointing at each line, and then when it gets to the end it points at nothing because the lines have all been pointed at?

PeterDonis said:

Do you mean "when I comment out line 22 as well as line 23"? If so, this makes sense, as the only file operation then is line 26, which will just read the whole file since the file pointer hasn't been moved from the start of the file by any lines before it.

Yes, this is what I mean: thanks for finessing my language.

PeterDonis said:

If line 22 is commented out but line 23 is not, however, you should get an exception at line 23 on the first iteration of the loop because the lines variable hasn't yet been initialized with any data (because line 22 was commented out and it's the only place where the lines variable is assigned to).

Yep, totally agree.

PeterDonis said:

The lesson here is that "comment out debugging" isn't always the best way to find errors. I strongly suggest reading the Python documentation on r+ file mode; it will at least give you a start on understanding all of the pitfalls involved.

Got it, I started to, hence changing it from post 1 to post 4, but I'll go deeper into it. Thanks for the suggestions.

PeterDonis said:

No. You're not thinking through what the code is doing.

Here is what you want the code to do (the simplest version):

(1) Open the output file for writing.

(2) Loop through all the input files: for each input file, read in all its lines, then write all but the first four lines to the output file.

Alternatively, if for some reason you need to have the input files changed as well (to have the first 4 lines eliminated from each--although there is no need to actually do that if all you want is to get the output file right), then you want the code to do this (this is the version you appear to have attempted to code):

(1) Loop through all the input files: for each input file, read in all its lines, then truncate the file and write out all but the first four lines.

(2) Open the output file for writing.

(3) Loop through all the input files: for each input file, read in all its data and write that data to the output file.

Can you see, first, why the rewritten code you posted is not doing the above, and second, why the simplest version I gave above is cleaner and simpler than the version you tried to code?

I think so, and it seems, as you pointed out above, that a big issue here is my lack of understanding pointers. I'll read up on these so I can perhaps problem solve this issue on my own next time. But thank you for calling attention to this: it's hard to know what I don't know.

One last request: the Python code in post 6, line 10 gives the last directory number. As mentioned in post 1, there are several file directories. Do you know how I can find the one with the largest number, so I don't have to hard-code a number here? The solutions I've seen online hard-code each of the file names, which seems pointless.

PeterDonis · Apr 23, 2020

joshmccraney said:

I agree, this is why I was trying to do it this way at first. The reason I asked about redoing it a different way is because I was struggling to get it working.

What were you struggling with?

joshmccraney said:

So the file pointer reads through the entire .dat file, pointing at each line, and then when it gets to the end it points at nothing because the lines have all been pointed at?

If all you do is call the readlines method, yes, that's what will happen.

There are other methods of file objects that you can use to move the file pointer; the Python documentation describes them. But those are really overkill for what you are doing; they're meant more for text editing or database type programs that need continual random access to the file.
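For completeness, here is a small sketch of the pointer-moving methods mentioned above, tell() and seek(), on a throwaway file with invented contents. It is not needed for the merge itself.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sample.dat")
with open(path, "w") as f:
    f.write("a\nb\nc\n")

with open(path) as f:
    start = f.tell()             # position before anything is read
    first_pass = f.readlines()   # moves the pointer to the end of the file
    second_pass = f.readlines()  # nothing left to read
    f.seek(0)                    # move the pointer back to the start
    third_pass = f.readlines()   # the whole file again

print(start)        # 0
print(second_pass)  # []
print(third_pass)   # ['a\n', 'b\n', 'c\n']
```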

joshmccraney said:

I think so, and it seems, as you pointed out above, that a big issue here is my lack of understanding pointers.

File pointers are good to understand, yes, but as I noted above, using the methods of file objects that move the file pointer is overkill for what you are trying to do. The reason your revised code is not doing what you want has nothing to do with file pointers; it has to do with what I pointed out in post #8.

In any case, getting the simpler method working (the one I asked why you were struggling with above) is going to be a better way of solving this particular problem than trying to fix your more complicated code.

member 428835 · Apr 23, 2020

PeterDonis said:

What were you struggling with?

I was struggling to get your simple method working (the typo).

PeterDonis said:

In any case, getting the simpler method working (the one I asked why you were struggling with above) is going to be a better way of solving this particular problem than trying to fix your more complicated code.

Yep, I have it working; your simple suggestion is perfect!

Did you have any advice on the last request regarding directory numbers?

joshmccraney said:

One last request: the Python code in post 6, line 10 gives the last directory number. As mentioned in post 1, there are several file directories. Do you know how I can find the one with the largest number, so I don't have to hard-code a number here? The solutions I've seen online hard-code each of the file names, which seems pointless.

Nick-stg · Apr 23, 2020

First, let me say that the advice from PeterDonis is good.

I would just like to raise a few more points to make things more readable and thus easier to understand.

Python:

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# THIS LOOP SHOULD DELETE THE FIRST 4 LINES OF THE .dat FILES
for j in range(0, len(dirNames)):
    # OPEN THE .dat FILES IN READ MODE
    with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r') as f:
    ...

The use of numpy as shown above is not required; it is redundant here. You can use range directly, without dirNames.

Python:

first_dir = 5
last_dir  = 15
increment = 5
for i in range(first_dir, last_dir+1, increment):
    with open('./'+str(i)+'/volFieldValue.dat', 'r') as f:
    ...

Also, you do not need to use the with open() as f: construct, not that there is anything wrong with it. To open the file for writing, you can use outfile = open('/path/to/file.dat', 'w'); then, when you need to write, call outfile.write('foo'). Finally, when you're done, you will need to call the close method: outfile.close().

So to bring this all together:

Python:

first_dir = 5
last_dir  = 15
increment = 5
outifile = open('./0/volFieldValue.dat', 'w')

for i in range(first_dir, last_dir+1, increment):
    with open('./'+str(i)+'/volFieldValue.dat', 'r') as f:
        outfile.writelines(f.readlines[4:])

outfile.close()

PeterDonis · Apr 23, 2020

joshmccraney said:

Did you have any advice on the last request regarding directory numbers?

You can use os.listdir to get a list of file and directory names in the current directory, and os.path.isdir to find out which of them are directory names. If the directory names will sort lexicographically, then that should be what you want.

Also, you can iterate through the directory names with just for dirname in dirNames. If all you want is to use each element in a list, that's the simplest way to do it.
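A sketch of that approach is below. It assumes the directories of interest are named with whole numbers only (0, 5, 10, ...); the helper name numeric_dirs and the extra entries in the demonstration (postProcessing, and a plain file called 100) are invented. Because names such as '10' sort before '5' as strings, the sketch sorts with key=int rather than relying on lexicographic order.

```python
import os
import tempfile

def numeric_dirs(root="."):
    """Return sub-directory names of `root` that are whole numbers, sorted numerically."""
    names = [
        name for name in os.listdir(root)
        if name.isdecimal() and os.path.isdir(os.path.join(root, name))
    ]
    return sorted(names, key=int)

# Demonstration in a temporary directory with made-up entries.
root = tempfile.mkdtemp()
for name in ("0", "5", "10", "15", "20", "postProcessing"):
    os.mkdir(os.path.join(root, name))
open(os.path.join(root, "100"), "w").close()   # a file, not a directory

dirs = numeric_dirs(root)
last_dir = int(dirs[-1])
lexicographic = sorted(dirs)

# Iterate over the names directly; no index variable is needed.
for dirname in dirs:
    print(dirname)
print("largest:", last_dir)
```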

PeterDonis · Apr 23, 2020

Nick-stg said:

you do not need to use the with open() as f: construct

Yes, you do, to make sure the output file gets properly closed if there is an error. @joshmccraney is doing that part correctly.
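The point about errors can be illustrated with a deliberately raised exception. This is a sketch with a temporary file and a simulated failure; the RuntimeError stands in for whatever might go wrong while merging.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "merged.dat")

try:
    with open(path, "w") as outfile:
        outfile.write("partial\n")
        raise RuntimeError("simulated failure while merging")
except RuntimeError:
    pass

# The with block closed (and therefore flushed) the file on the way out,
# even though the body did not finish.
closed_after_error = outfile.closed
with open(path) as f:
    flushed = f.read()

print(closed_after_error)  # True
print(repr(flushed))       # 'partial\n'
```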

PeterDonis · Apr 23, 2020

Nick-stg said:

So to bring this all together

This code has several mistakes in it. You need to be more careful if you are going to post actual code as a suggestion. A good general rule is, never post actual code that you yourself haven't run and verified that it works properly.

Nick-stg · Apr 23, 2020

PeterDonis said:

Yes, you do, to make sure the output file gets properly closed if there is an error.

Ok, my bad, I was going to warn about that. I would agree if the code were part of a program; my understanding is that this is a simple script for merging a few files, in which case you could handle such an issue manually. But strictly speaking, yes, I agree.

PeterDonis said:

This code has several mistakes in it.

Besides the with open(), I found the i vs j variable, which I already edited (copy and paste mistake). Was there anything else?

PeterDonis · Apr 23, 2020

Nick-stg said:

Was there anything else?

You have a typo in a variable name.
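For reference, the combined snippet under discussion appears to contain two slips: the file is opened as outifile but used as outfile, and f.readlines is sliced without being called. A corrected sketch, which also uses with for the output file as recommended earlier in the thread, might look like the following. The function name merge and the root parameter are additions for illustration so that the example can run in a temporary directory with made-up data.

```python
import os
import tempfile

def merge(root, first_dir, last_dir, increment, skip=4):
    """Merge root/<n>/volFieldValue.dat for n in the given range into root/0/."""
    out_path = os.path.join(root, "0", "volFieldValue.dat")
    with open(out_path, "w") as outfile:              # one consistent name
        for i in range(first_dir, last_dir + 1, increment):
            in_path = os.path.join(root, str(i), "volFieldValue.dat")
            with open(in_path) as f:
                # readlines must be called before its result is sliced
                outfile.writelines(f.readlines()[skip:])
    return out_path

# Throwaway directory tree with invented contents.
root = tempfile.mkdtemp()
for i in (0, 5, 10, 15):
    os.makedirs(os.path.join(root, str(i)))
for i in (5, 10, 15):
    with open(os.path.join(root, str(i), "volFieldValue.dat"), "w") as f:
        f.write("h1\nh2\nh3\nh4\n" + str(i) + " 0.2\n")

out_path = merge(root, 5, 15, 5)
with open(out_path) as f:
    merged = f.read()
print(merged, end="")
```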
