Match Numbers in Two Files & Get Results: Results of Comparing .lw & .pw Files

  • Context: Python 
  • Thread starter: Bala06
  • Tags: files Match Numbers
SUMMARY

The discussion focuses on matching numbers in two file types, .lw and .pw, using a Python script. The user, Balaji, seeks to extract and compare data based on a specific identifier, "X2d12G," but encounters issues with the script producing no output. Key problems identified include improper parsing of file lines and the use of fixed-width character slicing, which can lead to errors if the input format varies. Recommendations include using the split() method for more robust parsing and refactoring repetitive code segments for clarity and maintainability.

PREREQUISITES
  • Understanding of Python programming, specifically file I/O operations.
  • Familiarity with string manipulation techniques in Python, including slicing and splitting.
  • Knowledge of data structures in Python, particularly lists and tuples.
  • Basic understanding of regular expressions for text parsing.
NEXT STEPS
  • Refactor the Python script to utilize the split() method for parsing lines from .lw and .pw files.
  • Implement regular expressions to enhance the robustness of data extraction from the files.
  • Debug the script by adding print statements to track variable values and flow of execution.
  • Explore Python's built-in libraries for handling file formats and data comparison more efficiently.
USEFUL FOR

This discussion is beneficial for Python developers, data analysts, and anyone involved in file processing and data extraction tasks, particularly those working with custom file formats.

Bala06
Dear Members

I would like to match numbers in two files with the extensions .lw and .pw and output the combined records for the matching numbers.

For example, the .lw file contains data as

59880 SPC X2d12G 4714 UNK X 900B

and .pw file has

59474 SPC X2c8bG 991 ILE A 118B
59726 SPC X2cdfG 1803 SER A 168B
59876 SPC X2d11G 4055 ASP A 356B
59879 SPC X2d12G 3849 ASN A 344B

I want to match on the identifier "X2d12G" and write output like this (result):
431-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]
453-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]

Since attachments are limited, I couldn't attach the file 453-hydrogen-bond-frame_lw.txt. It contains the same data as 431-hydrogen-bond-frame_lw.txt.

Whenever I run my Python code, I don't get the expected result.

I run the Python script as (python water_cont.py > summary.txt)

I'm posting the Python code for your reference.

Code:
#! /usr/bin/env python
import sys, os, math, glob
#  Run as:  python water_cont.py 
#  This script will provide the summary of waters along the trajectory.  
#  Use it after the run of python_water_cont.py
#  Bala 28 May. 2011
#

def read_lig_wat(file):
    file = open (file, "r")
    data=file.readlines()
    atom_number1 = map(lambda x: int(x[0:7]), data)
    resname1 = map(lambda x: x[10:13], data)
    res_number1=map(lambda x: x[17:23], data)
    atom_number2=map(lambda x: int(x[27:36]), data)
    resname2=map(lambda x: x[37:40], data)
    res_number2= map(lambda x: x[44:50], data)
    return atom_number1, resname1, res_number1, atom_number2, resname2, res_number2

def read_prot_wat(file1):
    file1 = open (file, "r")
    data1=file.readlines()
    atom_number11 = map(lambda x: int(x[0:7]), data)
    resname11 = map(lambda x: x[10:13], data)
    res_number11=map(lambda x: x[17:23], data)
    atom_number22=map(lambda x: int(x[27:36]), data)
    resname22=map(lambda x: x[37:40], data)
    res_number22= map(lambda x: x[44:50], data)
    return atom_number11, resname11, res_number11, atom_number22, resname22, res_number22 

 
for filename in glob.glob1("/home/water", "*.lw"):
   atom_number1, resname1, res_number1, atom_number2, resname2, res_number2 =read_lig_wat(filename)
#   column_file=summary+".lw"
#   file2=open( column_file, "w")
   text=len(atom_number1)

for filename in glob.glob1("/home/water", "*.pw"):
   atom_number11, resname11, res_number11, atom_number22, resname22, res_number22 =read_lig_wat(filename)
#   column_file=filename+".lw"
#   file2=open( column_file, "w")
   text1=len(atom_number11)

   List=[]

   for i in range(text):
      for j in range(text1):
          
#         print  res_number2[i], res_number22[j]
#         if res_number2[i]==res_number11[j]:
#             print res_number1[i], res_number2[i]
         if res_number1[i]==res_number11[j] or res_number1[i]==res_number22[j]\
            or res_number2[i]==res_number11[j] or res_number2[i]==res_number22[j]:
#            print atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i]

            List.append((atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i], atom_number11[j], resname11[j], res_number11[j], atom_number22[j], resname22[j], res_number22[j]))
#            print List
            print filename, List

#             file2.write("%5i%8s%11s%8i%8s%11s%5i%8s%11s%8i%8s%11s \n" % (atom_number1[i], resname1[i], res_number1[i], atom_number2[i], resname2[i], res_number2[i], atom_number11[i], resname11[i], res_number11[i], atom_number22[i], resname22[i], res_number22[i]  ))

Kindly advise.

Many Thanks
Balaji
 

First off: I think using slices in this way is a very bad idea and not the way most Python programmers would do it. Rather than expecting fixed-width character fields (!), a more sensible approach would be to do the readlines, then call split() on each line, which returns a list of the whitespace-delimited tokens in that line. For starters, what if ONE LINE in your file is malformed and has, say, one whitespace character too many? Also, if I try to run the program "in my head" (I haven't tried to run it on disk yet), the very first thing I notice is that your sample inputs begin with a five-character ID, yet when you parse the files you first attempt to grab a seven-character token from the beginning of the string. Are you sure this is correct?
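As a rough illustration of the split() approach, here is a minimal sketch in Python 3 (the posted script uses Python 2 print statements). The names parse_line and to_record are hypothetical, and the layout is an assumption taken only from the sample lines in this thread: seven whitespace-separated tokens per line, where one record has a one-token residue ID such as "X2d12G" and the other a two-token ID such as "A 344B". It also assumes the second token of a two-token ID is never purely numeric; real files may not hold to that.

```python
def to_record(fields):
    """Turn [atom_number, resname, id_part, ...] into a tuple.

    The residue ID may span one or two tokens, so the remaining tokens
    are rejoined with a single space.
    """
    return int(fields[0]), fields[1], " ".join(fields[2:])


def parse_line(line):
    """Split one line into two (atom_number, resname, residue_id) records.

    Assumption (from the sample data only): each line has exactly seven
    whitespace-separated tokens.
    """
    tokens = line.split()
    if len(tokens) != 7:
        raise ValueError("expected 7 fields, got %d: %r" % (len(tokens), line))
    if tokens[3].isdigit():
        # Token 3 is an atom number, so the first record is 3 tokens long.
        first, second = tokens[:3], tokens[3:]
    else:
        # Token 3 is part of a two-token residue ID such as "X 900B".
        first, second = tokens[:4], tokens[4:]
    return to_record(first), to_record(second)


if __name__ == "__main__":
    print(parse_line("59879 SPC X2d12G 3849 ASN A 344B"))
    print(parse_line(" 4714 UNK X 900B 59880 SPC X2d12G"))
```

Because split() ignores runs of whitespace, an extra or missing space in a line no longer shifts every field, and a line with the wrong number of tokens raises an error instead of being silently misread.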

Second off, even if you were to use slices in this way, I think there is something bad about your repeated "map lambda" construction. A rule of thumb: if you find yourself repeating yourself in a computer program, this is a good place to refactor. I would be very uncomfortable if this were my program until I had moved that repeated map lambda construction into a separate slice_out_field(0,7, data) function. The problem is that you've copied and pasted this so many times; what if there was an error in one of your copies? It would be very easy to overlook.
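For what it's worth, the refactoring could look something like the sketch below (Python 3). Only the name slice_out_field comes from the suggestion above; its exact signature, FIELD_SPECS, and read_columns are hypothetical. The column offsets are copied from the posted script and have not been checked against the real files. Unlike the posted script, this version strips surrounding whitespace from the text fields.

```python
def slice_out_field(start, end, lines, convert=str.strip):
    """Return convert(line[start:end]) for every line in lines."""
    return [convert(line[start:end]) for line in lines]


# (start, end, converter) triples copied from the posted script.
# Whether these offsets match the real file layout is unverified.
FIELD_SPECS = [
    (0, 7, int),          # first atom number
    (10, 13, str.strip),  # first residue name
    (17, 23, str.strip),  # first residue ID
    (27, 36, int),        # second atom number
    (37, 40, str.strip),  # second residue name
    (44, 50, str.strip),  # second residue ID
]


def read_columns(lines):
    """Return one list per field, in the order given by FIELD_SPECS."""
    return tuple(slice_out_field(s, e, lines, c) for s, e, c in FIELD_SPECS)
```

With the offsets held in one table, a wrong offset has to be fixed in only one place, and the same function can serve both the .lw and the .pw files. Note that int() raises ValueError on an empty or non-numeric slice, so a line that is too short fails loudly.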

Third off-- you say "I'm not getting the result as expected". What result are you getting instead?

I think you should start by just rewriting this to use a more conventional text parsing method like split(), or even better a regular expression (these are easy to use in Python and well suited to your problem). You have what looks to me like error-prone code and you are trying to chase a mysterious error in it... cleaning things up is a good first step. It is probably fixable as is, though, if you give us some more information (what is it doing now instead of working, and why is it 0:7 then 10:13 and not 0:5 and then 6:9?).
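To make the regular-expression idea concrete, here is a hedged sketch (Python 3) that pulls the water identifier out of each line and pairs up lines that share it. The names WATER_RE, water_id, and match_by_water are hypothetical, and the pattern assumes the identifier is always the token that follows the residue name SPC, which is true of the samples in this thread but may not hold for every file.

```python
import re

# Assumption: the water ID is the token right after the residue name "SPC".
WATER_RE = re.compile(r"\bSPC\s+(\S+)")


def water_id(line):
    """Return the water identifier in a line, or None if there is none."""
    match = WATER_RE.search(line)
    return match.group(1) if match else None


def match_by_water(lw_lines, pw_lines):
    """Pair every .lw line with each .pw line that has the same water ID."""
    pw_by_id = {}
    for line in pw_lines:
        wid = water_id(line)
        if wid is not None:
            pw_by_id.setdefault(wid, []).append(line.strip())

    matches = []
    for line in lw_lines:
        wid = water_id(line)
        for pw_line in pw_by_id.get(wid, []):
            matches.append((wid, line.strip(), pw_line))
    return matches


if __name__ == "__main__":
    lw = [" 4714 UNK X 900B 59880 SPC X2d12G"]
    pw = [
        "59474 SPC X2c8bG 991 ILE A 118B",
        "59879 SPC X2d12G 3849 ASN A 344B",
    ]
    for wid, lw_line, pw_line in match_by_water(lw, pw):
        print(wid, "|", lw_line, "|", pw_line)
```

Indexing the .pw lines in a dictionary keyed by water ID replaces the nested loop over every pair of lines, and the match no longer depends on which column the identifier happens to sit in.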
 
Dear Python Users

I have to match the text "X2d12G" in two files, .lw and .pw, and produce output containing the contents of both files.

A small correction to the .lw file. It should look like this:
" 4714 UNK X 900B 59880 SPC X2d12G"

The format in the attachment was not correct.

The desired output looks, for example, like this:

431-hydrogen-bond-frame.dat.c.d.pw [(4714, 'UNK', 'X 900B', 59880, 'SPC', 'X2d12G', 59879, 'SPC', 'X2d12G', 4186, 'ASN', 'A 344B')]

Now, when I run the script, it doesn't produce any output (the output file size is 0 KB).

Kindly advise

Many Thanks
Balaji
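An empty summary.txt can have several causes, and the thread does not establish which one applies here. Two things are worth keeping in mind. First, `>` redirects only standard output, so any error message or traceback would show up on the terminal rather than in the file. Second, if the glob patterns match no files, the loops never run and nothing is printed. A minimal sketch (Python 3) for checking the second case follows; list_inputs is a hypothetical helper, and it uses the documented glob.glob with a joined path rather than the script's glob.glob1, so it returns full paths that can be opened from any working directory.

```python
import glob
import os
import sys


def list_inputs(directory, pattern):
    """Return the sorted full paths in directory that match pattern.

    A warning goes to stderr when nothing matches, so it stays visible
    even when stdout is redirected to a file.
    """
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    if not paths:
        sys.stderr.write("no files match %s in %s\n" % (pattern, directory))
    return paths


if __name__ == "__main__":
    # "/home/water" is the directory used in the posted script.
    for pattern in ("*.lw", "*.pw"):
        print(pattern, len(list_inputs("/home/water", pattern)))
```

If both patterns report at least one file, the next things to check are whether parsing raises an error and whether any pair of lines actually satisfies the comparison.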
 