# Merge .dat files into one

Gold Member
Hi PF!

There are directories 0 5 10 15 20 (and so on), each containing a single volFieldValue.dat file (same name in each directory). I would like to successively combine each .dat file into one. So far what I have is this:

Python:
#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 20
increment = 5
dirNames  = np.arange(first_dir, last_dir, increment)

print(len(dirnames)

# LOOP THROUGH ALL DIRECTORIES
for j in range(0, len(dirNames))

    # OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE
    with open('./0/volFieldValue.dat', 'w') as outfile:

        # OPEN EACH SUCCESSIVE FILE IN READ MODE
        with open('./'+str(dirNames(j))+'/volFieldValue.dat') as infile:

            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat
            outfile.write(infile.read())
        # LINEBREAK TO SEPARATE SUCCESSIVE FILES
        outfile.write("\n")

but I'm getting the error

Code:
  File "file_merge.py", line 20
    with open('./0/volFieldValue.dat', 'w') as outfile:
    ^
SyntaxError: invalid syntax


Mentor
I'm getting the error

You need a colon at the end of the for statement in line 17. One of the warts of Python is that syntax errors often get attributed to the wrong line, as in this case. Because of the way the lexer and parser work, the error is reported at the first token the parser cannot make sense of, here the with on line 20, rather than at the earlier line where something was left out. That is frustrating, since a for statement always starts an indented block, but the parser only knows that it ran into something unexpected at that point, not what you meant to write.
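One way to see where the interpreter pins a syntax error, without running the whole script, is to hand a snippet to the built-in compile() and look at the line number stored on the SyntaxError. The sketch below is only an illustration: the two broken snippets and the helper name reported_line are made up, and the exact line and message differ between Python versions, so the printed numbers are something to inspect rather than a fixed result.

```python
def reported_line(source):
    """Return the line number Python attaches to the SyntaxError, or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as err:
        return err.lineno
    return None


# Hypothetical snippet: missing colon at the end of the for statement.
missing_colon = "for j in range(3)\n    print(j)\n"

# Hypothetical snippet: unclosed parenthesis on line 1, so the parser
# keeps reading onto the following lines before it gives up.
unclosed_paren = "print(len('abc')\n\nfor j in range(3):\n    print(j)\n"

for name, src in [("missing colon", missing_colon),
                  ("unclosed paren", unclosed_paren)]:
    print(name, "-> error reported at line", reported_line(src))
```

Note that the script in the first post also leaves the parenthesis of print(len(dirnames) unclosed, which may be part of why the report lands on the with line rather than on the for line.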

joshmccraney
Gold Member
You need a colon at the end of the for statement in line 17.
This is really good to know: thanks a ton!

Okay, so the new code both includes the colon and changes line 23 to dirNames[j] (evidently the () was invalid syntax too).

However, one big problem persists: all my code seemed to do is delete the previous .dat file and replace it with the new .dat file. So nothing is merged. Any idea what's wrong?

EDIT: never mind, all I needed to do was switch the order of the for loop and the open in write mode. It makes sense now why it deleted and replaced the file each time. Thanks again!
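The order matters because mode 'w' truncates the file every time it is opened. The sketch below contrasts the two orderings; it uses a temporary directory and made-up contents (the chunks list stands in for the input files), so none of these names come from the original script.

```python
import os
import tempfile

chunks = ["first\n", "second\n", "third\n"]  # stand-ins for the input files

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "merged.dat")

    # Opening in 'w' mode inside the loop truncates the file on every pass,
    # so only the last chunk survives.
    for chunk in chunks:
        with open(path, "w") as outfile:
            outfile.write(chunk)
    with open(path) as f:
        inside_loop = f.read()

    # Opening once, before the loop, keeps everything that is written.
    with open(path, "w") as outfile:
        for chunk in chunks:
            outfile.write(chunk)
    with open(path) as f:
        before_loop = f.read()

print(repr(inside_loop))
print(repr(before_loop))
```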

Gold Member
One last question: I am trying to delete the first 4 lines of each of the .dat files before merging them. So far what I have is this

Python:
#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE
with open('./0/volFieldValue.dat', 'w') as outfile:

    # LOOP THROUGH ALL DIRECTORIES
    for j in range(0, len(dirNames)):

        # OPEN EACH SUCCESSIVE FILE IN READ MODE
        with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r+') as infile:
            lines = infile.readlines()
            infile.writelines(lines[4:])

            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat
            outfile.write(infile.read())
            # LINEBREAK TO SEPARATE SUCCESSIVE FILES
            # outfile.write("\n")

but now everything is deleted. I think it is due to line 22. Any ideas?

Mentor
but now everything is deleted

Why not just do outfile.write(lines[4:]) in line 26?

now everything is deleted

Is the input file actually empty, or just the output file?

I think it is due to line 22.

Why would that line cause a problem?

The problem is most likely to be lines 23 and 26: after you've written to infile in line 23, the file pointer is at the end of the file, so when you read from it in line 26, you get an empty string, and that's what gets written to outfile.

It's generally bad practice to open a file in read/write mode, since it forces you to think about where the file pointer is. Most of the time you don't actually need to.
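A minimal sketch of the effect described here, assuming a small throwaway file in a temporary directory (the contents are invented): after readlines() has consumed the file, a second read on the same file object returns an empty string.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "volFieldValue.dat")
    with open(path, "w") as f:
        f.write("a\nb\nc\n")

    with open(path) as infile:
        lines = infile.readlines()   # consumes the whole file
        leftover = infile.read()     # pointer is at the end: nothing left

print(lines)
print(repr(leftover))
```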

Gold Member
Why not just do outfile.write(lines[4:]) in line 26?
I was not aware this was an option. But when I tried your suggestion, I got an error about a string. So I changed to outfile.write(str(lines[4:])) and then got an output file that was very wrong (lots of arbitrary text, very confusing).

Is the input file actually empty, or just the output file?
Just the output.

Why would that line cause a problem?
I'm not sure why, but I know it does, because when I comment out line 23 but not 22, the output file is blank. However, when I comment out line 22, the output file works except for the first four lines not being deleted.

The problem is most likely to be lines 23 and 26: after you've written to infile in line 23, the file pointer is at the end of the file, so when you read from it in line 26, you get an empty string, and that's what gets written to outfile.

It's generally bad practice to open a file in read/write mode, since it forces you to think about where the file pointer is. Most of the time you don't actually need to.

Okay, so should I rewrite the code to look like this:

Python:
#!/usr/bin/python3
import numpy as np

#-----------------------SCRIPT DESCRIPTION----------------------#
# THIS SCRIPT MERGES ALL .dat FILES FOR POST-PROCESSING
#---------------------------------------------------------------#

# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# THIS LOOP SHOULD DELETE THE FIRST 4 LINES OF THE .dat FILES
for j in range(0, len(dirNames)):

    # OPEN THE .dat FILES IN READ MODE
    with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r') as f:

        # LINES FROM THE .dat FILE
        lines = f.readlines()

        # DELETE FIRST 4 LINES
        lines[4:] = []

# OPEN volFieldValue.dat IN THE DIRECTORY ./0/ IN WRITE MODE
with open('./0/volFieldValue.dat', 'w') as outfile:

    # LOOP THROUGH ALL DIRECTORIES
    for j in range(0, len(dirNames)):

        # OPEN EACH SUCCESSIVE FILE IN READ MODE
        with open('./'+str(dirNames[j])+'/volFieldValue.dat') as infile:

            # READ DATA FROM SUCCESSIVE FILE AND WRITE IN ./0/volFieldValue.dat
            outfile.write(infile.read())
            # LINEBREAK TO SEPARATE SUCCESSIVE FILES
            # outfile.write("\n")

(NOTE: I'm not sure if line 24 is deleting the line or not.)

Mentor
I was not aware this was an option. But when I tried your suggestion, I got an error about a string.

Yes, my bad, it should have been outfile.writelines(lines[4:]).

Why wouldn't this be an option? You want to write all but the first four lines from the input file, to the output file. The code just above is the obvious simplest way to do that, since you already have the lines read in from the input file. Writing the lines back to the input file, then reading them back in, is an unnecessary complication.

when I comment out line 23 but not 22, the output file is blank.

Yes, because once you've done infile.readlines(), the file pointer is at the end of the file, so when line 26 reads from the same file, it gets an empty string.

However, when I comment out line 22, the output file works except for the first four lines not being deleted.

Do you mean "when I comment out line 22 as well as line 23"? If so, this makes sense, as the only file operation then is line 26, which will just read the whole file since the file pointer hasn't been moved from the start of the file by any lines before it.

If line 22 is commented out but line 23 is not, however, you should get an exception at line 23 on the first iteration of the loop because the lines variable hasn't yet been initialized with any data (because line 22 was commented out and it's the only place where the lines variable is assigned to).

The lesson here is that "comment out debugging" isn't always the best way to find errors. I strongly suggest reading the Python documentation on r+ file mode; it will at least give you a start on understanding all of the pitfalls involved.

Okay, so should I rewrite the code to look like this

No. You're not thinking through what the code is doing.

Here is what you want the code to do (the simplest version):

(1) Open the output file for writing.

(2) Loop through all the input files: for each input file, read in all its lines, then write all but the first four lines to the output file.

Alternatively, if for some reason you need to have the input files changed as well (to have the first 4 lines eliminated from each--although there is no need to actually do that if all you want is to get the output file right), then you want the code to do this (this is the version you appear to have attempted to code):

(1) Loop through all the input files: for each input file, read in all its lines, then truncate the file and write out all but the first four lines.

(2) Open the output file for writing.

(3) Loop through all the input files: for each input file, read in all its data and write that data to the output file.

Can you see, first, why the rewritten code you posted is not doing the above, and second, why the simplest version I gave above is cleaner and simpler than the version you tried to code?
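For reference, the simplest version described above could be sketched roughly as follows. This is an illustration under assumptions, not the code that was eventually used in the thread: the function name merge_dat_files and its parameters are hypothetical, and it assumes every directory in dir_names contains the file.

```python
import os


def merge_dat_files(dir_names, out_path, filename="volFieldValue.dat", skip=4):
    """Write every input file, minus its first `skip` lines, to out_path."""
    with open(out_path, "w") as outfile:                  # step (1)
        for name in dir_names:                            # step (2)
            in_path = os.path.join(str(name), filename)
            with open(in_path) as infile:
                lines = infile.readlines()
            # Only the output file is written; the inputs stay untouched.
            outfile.writelines(lines[skip:])
```

With the layout from the first post, a call might look like merge_dat_files(range(5, 16, 5), "./0/volFieldValue.dat").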

Mentor
I'm not sure if line 24 is deleting the line or not

Line 24 is deleting all but the first four lines from the list of lines in memory. First, those aren't the lines you want to delete, they're the lines you want to keep; and second, deleting them from the list of lines in memory, stored in the lines variable, does nothing at all to the file on disk.
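The difference between the two slices is easy to check on a plain list. In this small sketch the list contents are invented stand-ins for the lines of one .dat file.

```python
lines = ["h1\n", "h2\n", "h3\n", "h4\n", "data1\n", "data2\n"]

kept_head = list(lines)
kept_head[4:] = []             # removes everything AFTER the first four items

kept_tail = list(lines)
del kept_tail[:4]              # removes the first four items

without_mutation = lines[4:]   # new list; `lines` itself is unchanged

print(kept_head)
print(kept_tail)
print(without_mutation)
```

None of these operations touches a file; they only change lists held in memory.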

Gold Member
Yes, my bad, it should have been outfile.writelines(lines[4:]).

Why wouldn't this be an option? You want to write all but the first four lines from the input file, to the output file. The code just above is the obvious simplest way to do that, since you already have the lines read in from the input file. Writing the lines back to the input file, then reading them back in, is an unnecessary complication.
I agree, this is why I was trying to do it this way at first. The reason I asked about redoing it a different way is because I was struggling to get it working.

Yes, because once you've done infile.readlines(), the file pointer is at the end of the file, so when line 26 reads from the same file, it gets an empty string.
So the file pointer reads through the entire .dat file, pointing at each line, and then when it gets to the end it points at nothing because the lines have all been pointed at?

Do you mean "when I comment out line 22 as well as line 23"? If so, this makes sense, as the only file operation then is line 26, which will just read the whole file since the file pointer hasn't been moved from the start of the file by any lines before it.
Yes, this is what I mean: thanks for finessing my language.

If line 22 is commented out but line 23 is not, however, you should get an exception at line 23 on the first iteration of the loop because the lines variable hasn't yet been initialized with any data (because line 22 was commented out and it's the only place where the lines variable is assigned to).
Yep, totally agree.

The lesson here is that "comment out debugging" isn't always the best way to find errors. I strongly suggest reading the Python documentation on r+ file mode; it will at least give you a start on understanding all of the pitfalls involved.
Got it, I started to, hence changing it from post 1 to post 4, but I'll go deeper into it. Thanks for the suggestions.

Can you see, first, why the rewritten code you posted is not doing the above, and second, why the simplest version I gave above is cleaner and simpler than the version you tried to code?
I think so, and it seems, as you pointed out above, that a big issue here is my lack of understanding pointers. I'll read up on these so I can perhaps problem solve this issue on my own next time. But thank you for calling attention to this: it's hard to know what I don't know.

One last request: the python code in post 6, line 10 gives the last directory number. As mentioned in post 1, there are several file directories. Do you know how I can find the one with the largest number, so I don't have to hard-code a number here? The literature I've seen online entail solutions that hard-code each of the file names, which seems pointless.

Mentor
I agree, this is why I was trying to do it this way at first. The reason I asked about redoing it a different way is because I was struggling to get it working.

What were you struggling with?

So the file pointer reads through the entire .dat file, pointing at each line, and then when it gets to the end it points at nothing because the lines have all been pointed at?

If all you do is call the readlines method, yes, that's what will happen.

There are other methods of file objects that you can use to move the file pointer; the Python documentation describes them. But those are really overkill for what you are doing; they're meant more for text editing or database type programs that need continual random access to the file.
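As a small illustration of those methods, the sketch below uses seek() and tell() on an io.StringIO object, which stands in for a text file opened from disk; the contents are invented.

```python
import io

# io.StringIO behaves like an open text file for this purpose.
f = io.StringIO("a\nb\nc\n")

first = f.readlines()      # pointer is now at the end of the data
position_after_read = f.tell()

f.seek(0)                  # move the pointer back to the start
second = f.readlines()     # the same lines can be read again

print(position_after_read)
print(first == second)
```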

I think so, and it seems, as you pointed out above, that a big issue here is my lack of understanding pointers.

File pointers are good to understand, yes, but as I noted above, using the methods of file objects that move the file pointer is overkill for what you are trying to do. The reason your revised code is not doing what you want has nothing to do with file pointers; it has to do with what I pointed out in post #8.

In any case, getting the simpler method working (the one I asked why you were struggling with above) is going to be a better way of solving this particular problem than trying to fix your more complicated code.

Gold Member
What were you struggling with?
I was struggling to get your simple method working (the typo).

In any case, getting the simpler method working (the one I asked why you were struggling with above) is going to be a better way of solving this particular problem than trying to fix your more complicated code.
Yep, I have it working; your simple suggestion is perfect!

Did you have any advice on the last request regarding directory numbers?


Nick-stg
First, let me say that the advice from PeterDonis is good.

I would just like to raise a few more points to make things more readable and thus easier to understand:
Python:
# Create a list of DIRECTORY NAMES
first_dir = 5
last_dir  = 15
increment = 5
dirNames  = np.arange(first_dir, last_dir+1, increment)

# THIS LOOP SHOULD DELETE THE FIRST 4 LINES OF THE .dat FILES
for j in range(0, len(dirNames)):
    # OPEN THE .dat FILES IN READ MODE
    with open('./'+str(dirNames[j])+'/volFieldValue.dat', 'r') as f:
        ...

The use of numpy as shown above is not required; it is redundant here. You can use range directly without "dirNames".

Python:
first_dir = 5
last_dir  = 15
increment = 5
for i in range(first_dir, last_dir+1, increment):
    with open('./'+str(i)+'/volFieldValue.dat', 'r') as f:
        ...

Also, you do not need to use the with open() as f: construct, not that there is anything wrong with it. To open the file for writing, use outfile = open('/path/to/file.dat', 'w'). Then, when you need to write, call outfile.write('foo'). Finally, when you're done, you will need to call the close method, outfile.close().

So to bring this all together:
Python:
first_dir = 5
last_dir  = 15
increment = 5
outifile = open('./0/volFieldValue.dat', 'w')

for i in range(first_dir, last_dir+1, increment):
    with open('./'+str(i)+'/volFieldValue.dat', 'r') as f:

outfile.close()

Mentor
Did you have any advice on the last request regarding directory numbers?

You can use os.listdir to get a list of file and directory names in the current directory, and os.path.isdir to find out which of them are directory names. If the directory names will sort lexicographically, then that should be what you want.

Also, you can iterate through the directory names with just for dirname in dirNames. If all you want is to use each element in a list, that's the simplest way to do it.
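A rough sketch of that approach, assuming the directories of interest are exactly the ones whose names are whole numbers (the helper name numeric_dirs is made up). Names such as 5 and 10 do not sort lexicographically in numeric order ("10" comes before "5"), so this version sorts by integer value instead.

```python
import os


def numeric_dirs(root="."):
    """Return sub-directories of root named with whole numbers, in numeric order."""
    names = [
        name for name in os.listdir(root)
        if name.isdecimal() and os.path.isdir(os.path.join(root, name))
    ]
    # Sort by integer value so that "5" comes before "10".
    return sorted(names, key=int)
```

The last entry, numeric_dirs()[-1], would then be the largest directory number, and the returned list can be looped over directly with for dirname in numeric_dirs().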

Mentor
you do not need to use the with open() as f: construct

Yes, you do, to make sure the output file gets properly closed if there is an error. @joshmccraney is doing that part correctly.
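To make the point concrete, here is a small sketch with a simulated failure inside the with block; the file name and the RuntimeError are invented for the demonstration. The file is closed, and what was written before the error is on disk, even though the block was left through an exception.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.dat")
    try:
        with open(path, "w") as outfile:
            outfile.write("partial\n")
            raise RuntimeError("simulated failure while merging")
    except RuntimeError:
        pass

    closed_after_error = outfile.closed   # the with block closed the file
    with open(path) as f:
        flushed = f.read()                # data written before the error

print(closed_after_error)
print(repr(flushed))
```

With a bare open() and a later close(), the close() call would be skipped whenever an exception is raised in between, unless it is wrapped in try/finally.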

Mentor
So to bring this all together

This code has several mistakes in it. You need to be more careful if you are going to post actual code as a suggestion. A good general rule is, never post actual code that you yourself haven't run and verified that it works properly.

Nick-stg
Yes, you do, to make sure the output file gets properly closed if there is an error.
Ok, my bad, I was going to warn about that. I would agree if the code were part of a larger program; my understanding is that this is a simple script for merging a few files, in which case you could handle such an issue manually. But strictly speaking, yes, I agree.

This code has several mistakes in it.
Besides the with open(), I found the i vs j variable, which I already edited (copy and paste mistake). Was there anything else?

Mentor
Was there anything else?

You have a typo in a variable name.