Python Match Numbers in Two Files & Get Results: Results of Comparing .lw & .pw Files

AI Thread Summary
The discussion centers on a Python script intended to match numbers from two files with extensions .lw and .pw, specifically targeting the identifier "X2d12G". The user, Balaji, provides examples of the data format in both files and expresses difficulty in obtaining the expected output when running the script. The output should list matched entries in a specific format, but the script currently produces no results, leading to a 0kb output file.

Key points of feedback include concerns about the method of parsing the input files using fixed-width slices, which may lead to errors if the data format varies. Suggestions are made to utilize the `split()` method for more reliable parsing, as well as to consider using regular expressions for improved accuracy. Additionally, the code's repetitive structure is highlighted, indicating a need for refactoring to enhance readability and maintainability. The discussion emphasizes the importance of clarifying what output is currently being generated versus the expected results to diagnose the issue effectively.
Bala06
Dear Members

I would like to match numbers in two files with the extensions .lw and .pw and write out the results for the matching numbers.

For example, the .lw file contains data as

59880 SPC X2d12G 4714 UNK X 900B

and .pw file has

59474 SPC X2c8bG 991 ILE A 118B
59726 SPC X2cdfG 1803 SER A 168B
59876 SPC X2d11G 4055 ASP A 356B
59879 SPC X2d12G 3849 ASN A 344B

I want to match on the identifier "X2d12G" and write the output like this (result):
431-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]
453-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]

Because attachments are limited, I couldn't attach the file 453-hydrogen-bond-frame_lw.txt. It contains the same data as 431-hydrogen-bond-frame_lw.txt.

Whenever I run my Python code, I don't get the expected result.

I run the script as: python water_cont.py > summary.txt

I'm posting the Python code for your reference.

Code:
#! /usr/bin/env python
import sys, os, math, glob
#  Run as:  python water_cont.py 
#  This script will provide the summary of waters along the trajectory.  
#  Use it after the run of python_water_cont.py
#  Bala 28 May. 2011
#

def read_lig_wat(file):
    file = open (file, "r")
    data=file.readlines()
    atom_number1 = map(lambda x: int(x[0:7]), data)
    resname1 = map(lambda x: x[10:13], data)
    res_number1=map(lambda x: x[17:23], data)
    atom_number2=map(lambda x: int(x[27:36]), data)
    resname2=map(lambda x: x[37:40], data)
    res_number2= map(lambda x: x[44:50], data)
    return atom_number1, resname1, res_number1, atom_number2, resname2, res_number2

def read_prot_wat(file1):
    file1 = open (file, "r")
    data1=file.readlines()
    atom_number11 = map(lambda x: int(x[0:7]), data)
    resname11 = map(lambda x: x[10:13], data)
    res_number11=map(lambda x: x[17:23], data)
    atom_number22=map(lambda x: int(x[27:36]), data)
    resname22=map(lambda x: x[37:40], data)
    res_number22= map(lambda x: x[44:50], data)
    return atom_number11, resname11, res_number11, atom_number22, resname22, res_number22 

 
for filename in glob.glob1("/home/water", "*.lw"):
   atom_number1, resname1, res_number1, atom_number2, resname2, res_number2 =read_lig_wat(filename)
#   column_file=summary+".lw"
#   file2=open( column_file, "w")
   text=len(atom_number1)

for filename in glob.glob1("/home/water", "*.pw"):
   atom_number11, resname11, res_number11, atom_number22, resname22, res_number22 =read_lig_wat(filename)
#   column_file=filename+".lw"
#   file2=open( column_file, "w")
   text1=len(atom_number11)

   List=[]

   for i in range(text):
      for j in range(text1):
          
#         print  res_number2[i], res_number22[j]
#         if res_number2[i]==res_number11[j]:
#             print res_number1[i], res_number2[i]
         if res_number1[i]==res_number11[j] or res_number1[i]==res_number22[j]\
            or res_number2[i]==res_number11[j] or res_number2[i]==res_number22[j]:
#            print atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i]

            List.append((atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i], atom_number11[j], resname11[j], res_number11[j], atom_number22[j], resname22[j], res_number22[j]))
#            print List
            print filename, List

#             file2.write("%5i%8s%11s%8i%8s%11s%5i%8s%11s%8i%8s%11s \n" % (atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i], atom_number11[i], resname11[i], res_number11[i], atom_number22[i], resname22[i], res_number22[i]  ))

Kindly advise.

Many Thanks
Balaji
 

First off: I think using slices in this way is a very bad idea and not the way most Python programmers would do it. Rather than expecting fixed-width character fields (!), a more sensible approach is to do the readlines and then call split() on each line, which returns a list of the whitespace-delimited tokens in that line. For starters, what if ONE LINE in your file is malformed and has, say, one whitespace character too many? Also, if I try to run the program "in my head" (I haven't tried to run it on disk yet), the very first thing I notice is that your sample inputs begin with a five-character ID, yet when you parse the files you first attempt to grab a seven-character token from the beginning of the string. Are you sure this is correct?
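As a rough illustration of the split() approach, here is a minimal sketch. The function name `parse_pw_line` is hypothetical, and the seven-field layout (number, residue name, identifier, number, residue name, chain, residue number) is only an assumption read off the .pw sample line in the first post. It is written for Python 3.

```python
# Hypothetical helper, not part of the original script.
# Assumed layout (taken from the .pw sample line in the thread):
#   number, residue name, identifier, number, residue name, chain, residue number
def parse_pw_line(line):
    tokens = line.split()  # splits on any run of whitespace
    if len(tokens) != 7:
        raise ValueError("expected 7 fields, got %d: %r" % (len(tokens), line))
    n1, res1, ident, n2, res2, chain, resnum = tokens
    # Rejoin chain and residue number to mirror the 'A 344B' style
    # used in the desired output.
    return (int(n1), res1, ident, int(n2), res2, chain + " " + resnum)

record = parse_pw_line("59879 SPC X2d12G 3849 ASN A 344B")
print(record)
```

Because split() ignores how many spaces separate the tokens, a line with extra whitespace parses to the same tuple, and a line with the wrong number of fields fails loudly instead of silently producing shifted columns.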

Second off, even if you were to use slices in this way, I think there is something bad about your repeated "map lambda" construction. A rule of thumb: if you find yourself repeating yourself in a computer program, that is a good place to factor the repetition out into a function. I would be very uncomfortable if this were my program until I moved that repeated map lambda construction into a separate slice_out_field(0, 7, data) function. The problem is that you've copied and pasted this so many times; what if there was an error in one of your copies? It would be very easy to overlook.
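One possible shape for such a helper is sketched below. Only the name `slice_out_field` comes from the post above; the `convert` parameter is an addition for illustration, and the column offsets are an assumption that holds only if the real file is single-spaced exactly like the sample line.

```python
def slice_out_field(start, stop, lines, convert=str.strip):
    """Return columns [start:stop] of every line, passed through `convert`."""
    # A list comprehension (rather than map) returns a real list,
    # so len() and indexing work on the result.
    return [convert(line[start:stop]) for line in lines]

# Sample record copied from the .pw example in the thread.
data = ["59879 SPC X2d12G 3849 ASN A 344B"]

# Offsets assume single-space separators as in the sample above.
atom_number1 = slice_out_field(0, 5, data, int)
resname1 = slice_out_field(6, 9, data)
res_number1 = slice_out_field(10, 16, data)
print(atom_number1, resname1, res_number1)
```

With the slicing in one place, a wrong offset shows up as one bad argument rather than hiding inside one of a dozen near-identical lambdas.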

Third off: you say "I'm not getting the result as expected". What result are you getting instead?

I think you should start by just rewriting this to use a more conventional text parsing method like split() or, even better, a regular expression (these are easy to use in Python and well suited to your problem). You have what looks to me like error-prone code and you are trying to chase a mysterious error in it... cleaning things up is a good first step. It is probably fixable as is, though, if you give us some more information (what is it doing now instead of working, and why is it 0:7 then 10:13 and not 0:5 and then 6:9?).
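For the regular expression route, a minimal sketch might look like the following. The pattern and the name `parse_record` are assumptions based solely on the .pw sample line in the thread; a real file may need a looser or stricter pattern, and the .lw layout would need its own.

```python
import re

# Assumed .pw layout: number, residue name, identifier,
# number, residue name, one-letter chain, residue number.
RECORD = re.compile(
    r"^\s*(\d+)\s+(\w+)\s+(\w+)\s+(\d+)\s+(\w+)\s+(\w)\s+(\w+)\s*$"
)

def parse_record(line):
    """Return the parsed fields as a tuple, or None if the line does not match."""
    m = RECORD.match(line)
    if m is None:
        return None
    n1, res1, ident, n2, res2, chain, resnum = m.groups()
    return (int(n1), res1, ident, int(n2), res2, chain + " " + resnum)

print(parse_record("59879 SPC X2d12G 3849 ASN A 344B"))
print(parse_record("not a record"))
```

Returning None for lines that do not fit the pattern makes malformed input easy to count or log, which helps when the symptom is an empty output file.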
 
Dear Python Users

I have to match the text "X2d12G" in the two files (.lw and .pw) and produce output containing the contents of both files.

A small correction to the .lw file. It should look like this:
" 4714 UNK X 900B 59880 SPC X2d12G"

The format in the attachment was not correct.

For example, the output should look like this:

431-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]

Now, when I run the script, it doesn't produce any output for me (the output file size is 0 KB).

Kindly advise

Many Thanks
Balaji
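The matching described in this post can be sketched as a lookup keyed on the identifier, which replaces the nested loops and slice comparisons. This is only a sketch under stated assumptions: the helper names (`parse_lw`, `parse_pw`, `match_by_id`) are hypothetical, the .lw layout follows the corrected sample line above, the .pw layout follows the sample in the first post, and the code is written for Python 3.

```python
from collections import defaultdict

def parse_lw(line):
    # Assumed .lw layout: number, residue name, chain, residue number,
    # number, residue name, identifier.
    n1, res1, chain, resnum, n2, res2, ident = line.split()
    return ident, (int(n1), res1, chain + " " + resnum, int(n2), res2, ident)

def parse_pw(line):
    # Assumed .pw layout: number, residue name, identifier,
    # number, residue name, chain, residue number.
    n1, res1, ident, n2, res2, chain, resnum = line.split()
    return ident, (int(n1), res1, ident, int(n2), res2, chain + " " + resnum)

def match_by_id(lw_lines, pw_lines):
    """Pair every .lw record with each .pw record sharing its identifier."""
    by_id = defaultdict(list)
    for line in pw_lines:
        if line.strip():  # skip blank lines
            ident, rec = parse_pw(line)
            by_id[ident].append(rec)
    matches = []
    for line in lw_lines:
        if line.strip():
            ident, rec = parse_lw(line)
            for pw_rec in by_id.get(ident, []):
                matches.append(rec + pw_rec)
    return matches

# Sample lines copied from the thread.
lw = [" 4714 UNK X 900B 59880 SPC X2d12G"]
pw = [
    "59474 SPC X2c8bG 991 ILE A 118B",
    "59726 SPC X2cdfG 1803 SER A 168B",
    "59876 SPC X2d11G 4055 ASP A 356B",
    "59879 SPC X2d12G 3849 ASN A 344B",
]
print(match_by_id(lw, pw))
```

On the sample lines this yields a single 12-field tuple in the same layout as the desired output, with 3849 as the tenth field because that is the value in the posted .pw sample. Comparing whole tokens also avoids the situation where a fixed-width slice captures part of a neighbouring column and the equality test never succeeds.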
 