Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

Program to Pick Out Files From List

  1. Mar 15, 2013 #1
    Hi Everyone!

    This is my first question for programming. I have very little experience, so if you could explain as simply as possible.

    I have a list of output that actually looks like this:

    # Query: 2678.m000169 conserved hypothetical protein
    # Database: EST_matched_Genome.fas.txt
    # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
    # 3 hits found
    2678.m000169 NODE_295089_length_165_cov_8.775758 100 209 0 0 120 328 1 209 2.00E-107 387
    2678.m000169 NODE_575647_length_122_cov_7.155738 100 165 0 0 378 542 166 2 6.00E-83 305
    2678.m000169 NODE_311774_length_123_cov_6.276423 100 159 0 0 1 159 159 1 1.00E-79 294
    # BLASTN 2.2.23+
    # Query: 2678.m000182 hypothetical protein
    # Database: EST_matched_Genome.fas.txt
    # Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
    # 8 hits found
    2678.m000182 NODE_207397_length_1580_cov_14.322152 99.87 1597 2 0 6354 7950 1598 2 0 2939
    2678.m000182 NODE_217031_length_1396_cov_9.866762 99.86 1439 2 0 4724 6162 1440 2 0 2647
    2678.m000182 NODE_298029_length_1052_cov_10.303232 100 1096 0 0 10519 11614 1096 1 0 2025
    2678.m000182 NODE_224620_length_710_cov_8.192958 100 751 0 0 1357 2107 4 754 0 1387


    It's output from a BLAST program.

    I would like to write a script in Python, or whichever language is easiest, that (1) checks to see if NODE_X=NODE_Y where NODE_X is the value in the second column directly preceding NODE_Y, and (2) if NODE_X=NODE_Y, it finds the file corresponding to the value in the first column (e.g., 2678.m000182.txt) in a database and puts it into a new folder entitled "2_Same_Nodes."

    Does anyone have any suggestions for how I would go about this?

    Thanks
     
  2. jcsd
  3. Mar 16, 2013 #2

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Try this, and let us know if it does not work.

    (1) Type the following code into a Bourne or bash shell file named myprog.sh, where in01 is your input file name (containing your BLAST output), and dir01 is the database directory name.

    #!/bin/sh
    # Created: 2013-03-16, nvn.
    awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
    | awk 'NF>1 {if($2==$4)
    printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01​

    (2) On linux bash, create your output directory, by issuing the following bash command: mkdir 2-same-nodes

    (3) Issue the following bash command: sh myprog.sh

    (4) Issue the following bash command: sh out01
     
    Last edited: Mar 16, 2013
  4. Mar 17, 2013 #3
    Thanks nvn!

    It's giving me output that looks like a long list of this:

    cp: /Users/username/databasegene/2003.m000043.txt: No such file or directory

    Which file or directory is it referring to?

    Another question, how exactly do I set up the database? I have a bunch of sequences listed in one text file under the headings abc.m000yz , but I would like to break it apart into individual files and then create a database with them. Then I would be able to call them individually if they meet the criterion of myprog.sh. Does this make sense?

    Thanks for your help.
     
  5. Mar 18, 2013 #4

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Manually find one of the database files you want to copy to 2-same-nodes, then manually enter a cp -p dir01/filename01 2-same-nodes command yourself, at the bash prompt. Debug this cp -p command until it works the way you want. Then enter your debugged cp -p command in myprog.sh. Try it, and see if it works.

    In other words, the error message in post 3 is telling you, "A file named 2003.m000043.txt does not exist in the path and directory /Users/username/databasegene/, or a directory named /Users/username/databasegene/ does not exist." Find the directory where file 2003.m000043.txt is located, and then substitute that path and directory name for dir01 in myprog.sh.
     
    Last edited: Mar 18, 2013
  6. Mar 19, 2013 #5
    Great, that worked. Thanks a lot for your help, and thanks for your advice to test it manually.

    Now I'm trying to build the database. I have a large .txt file m.txt in that looks like this:

    ...
    >something.m.something1.txt
    ATCTC......etc....ACTCCCCATTTT...
    >something.m.something2.txt
    GATTA...etc...GCTTTGGGG

    ...

    It's about 900 entries long. I want to break it up into a database, so that it's broken up by the ">" character. so

    >something.m.something1.txt
    ATCTC......etc....ACTCCCCATTTT...

    would be a file called something.m.something1.txt

    I've looked around and come up with:

    $ awk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt

    I'm not sure how to tailor this to what I want to do.
     
  7. Mar 19, 2013 #6

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Issue the bash command "mkdir db01", then execute the following. Try it, and see if it works.

    awk '{if(/^>/)printf substr($0,2)" ";else print $0}' m.txt > tmp01
    awk '{name1=$1;dat1=substr($0,length(name1)+2);
    print dat1 > "db01/"name1}' tmp01​
     
    Last edited: Mar 19, 2013
  8. Mar 20, 2013 #7
    Hi nvn,

    I'm getting:

    awk: syntax error at source line 2
    context is
    print dat1 > >>> "db01/"name1 <<< }
    awk: illegal statement at source line 2
     
  9. Mar 20, 2013 #8

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Perhaps try removing the line break between the second and third line. The third line in post 6 is a continuation of the second line.

    Is directory db01 in the same directory you are running this awk statement in? Directory db01 needs to exist in the directory you are running this awk statement in.

    If that fails, perhaps try the following.

    awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);
    print dat1 > name1}' tmp01​

    If that fails, perhaps show us the first several lines in your files m.txt and tmp01. (Enclose the lines within [noparse]
    Code (Text):
     ...
    [/noparse] tags, for posting.)
     
    Last edited: Mar 20, 2013
  10. Mar 21, 2013 #9
    Hi nvn,

    It is in the same directory, removing the line break (;, right?) does not work.

    tmp01 is blank.

    m.txt looks like this:

    Code (Text):
     
    >1123.m000123 conserved hypothetical protein
    ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAAAATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
    >1243.m000121 hypothetical protein
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGTTTGA
     
     
  11. Mar 21, 2013 #10

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: (1) Do not remove the semicolon; remove only the line break.

    (2) Try the following.

    awk --lint '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01
    awk --lint '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 > name1;close(name1)}' tmp01​

    (3) If that fails, try removing --lint.

    (4) If that fails, try changing awk to gawk.

    (5) If that fails, show us the contents of your file tmp01.
     
  12. Mar 22, 2013 #11
    Ok, great! db01 was created. I believe it was (3) that worked. Just a general question: what language is this? Awk?

    tmp01 is the same as m.txt. However, the db01 files are incorrect. It seems to have not broken it up the way it should have. I'm not sure that I decribed m.txt very well. It should be:
    Code (Text):

    >1123.m000123 conserved hypothetical protein
    ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
    AATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
    ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
    AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
    CCTCCCCAATTTTAATGAAAACCGGGGAATTCTCACT
    ACATCCAT
    >1113.m00028 conserved hypothetical protein
    AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
    AAAAACCTTAAA
    >1243.m000121 hypothetical protein
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
    TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
    AACACACATACACATTTGGGAC
     
    So, the sequence can be any number of lines.

    If the code is:

    Code (Text):
     awk '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01
    Which means Look at each line of m.txt "if > print source filename to substring (what's $1?, a variable?). If not, print $0 (what's this?). And this is being done from m.txt to tmp01.

    Code (Text):
    awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 >  
    name1;close(name1)}' tmp01
    Which means look through tmp01 and for name1=db01/$1 (the first file in db01) make it the substring --what does length(name1)+2 mean? then since dat1 (or name 1) = name1? ...a bit lost

    Would you be able to explain what is happening here? If not, I understand, you've been very helpful and I don't mean to impose any more than I already have.
     
  13. Mar 22, 2013 #12

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Yes, the language is awk, sometimes spelled gawk. awk is a command for processing one input record (line) at a time. Yes, $0 is a variable; it means the entire current record. $1 means the first string on the current record.

    To learn what each statement or entity means, you can start reading awk or gawk documentation. Issue the command "man awk". Or search internet for awk or gawk documentation. There are countless awk user guides, awk reference manuals, awk tutorials, and awk man pages on internet, ranging from introductory to advanced. Find one or some that you like. Start with introductory ones, to learn the basics, and to learn what each statement or entity means. Search for the entities in the documentation, to look them up. Try this first, and see if it helps answer some of your preliminary questions.

    OK, instead of using the awk commands in post 10, try the following. Ensure the first record in file m.txt begins with greater-than sign (>) in column 1 before running the following.

    awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
    print $0 > file1}' m.txt​
     
    Last edited: Mar 22, 2013
  14. Mar 22, 2013 #13
    Worked like a charm! Thanks nvn.
     
  15. Mar 26, 2013 #14
    nvn,

    I've been trying to learn awk, but I still have some questions.

    Code (Text):
     awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
    print $0 > file1}' m.txt
    I read this as: if a line matches ">" at the top, then close file 1 where file1="db01/"substr($1,2). Then, print the current record to file 1. I'm confused about the close() function; is it literally closing a file?


    Code (Text):
     awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
    | awk 'NF>1 {if($2==$4)
    printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01
    This one is saying if the first field is not # print the first, second, then on the next line print the first, second. Then | means for the next awk statement to take that output and see if the NODES are the same for each line in the record--NF>1. If the nodes are the same, print the name of that m to a file in 2 same nodes. What's out01?
     
  16. Mar 27, 2013 #15

    nvn

    User Avatar
    Science Advisor
    Homework Helper

    blueberryfive: Aside:
    awk is a command intended for short, "one-liner" programs. Many aspects of awk are indeed great, although it is somewhat annoying (and limited) in the sense that the entire program must be a quoted string, quoted using apostrophes. Therefore, it is no replacement for a good programming language. But I currently do not have comments on which good programming language generally would be the best one to learn. That would be a topic or debate you could post in a new or different thread. I am, of course, not implying herein that awk is a replacement for a good programming language. I only used awk herein because it is perhaps one of the quickest, easiest ways to perform the current task.​

    Moving on to your post 14 questions, yes, the close(file1) function literally closes file1, if file1 is open. If file1 is not open, close(file1) has no effect.

    out01 is the arbitrary name of the output file, for the output from the second awk command.
     
    Last edited: Mar 27, 2013
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook