# Program to Pick Out Files From List

1. Mar 15, 2013

### blueberryfive

Hi Everyone!

This is my first question for programming. I have very little experience, so if you could explain as simply as possible.

I have a list of output that actually looks like this:

# Query: 2678.m000169 conserved hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
2678.m000169 NODE_295089_length_165_cov_8.775758 100 209 0 0 120 328 1 209 2.00E-107 387
2678.m000169 NODE_575647_length_122_cov_7.155738 100 165 0 0 378 542 166 2 6.00E-83 305
2678.m000169 NODE_311774_length_123_cov_6.276423 100 159 0 0 1 159 159 1 1.00E-79 294
# BLASTN 2.2.23+
# Query: 2678.m000182 hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 8 hits found
2678.m000182 NODE_207397_length_1580_cov_14.322152 99.87 1597 2 0 6354 7950 1598 2 0 2939
2678.m000182 NODE_217031_length_1396_cov_9.866762 99.86 1439 2 0 4724 6162 1440 2 0 2647
2678.m000182 NODE_298029_length_1052_cov_10.303232 100 1096 0 0 10519 11614 1096 1 0 2025
2678.m000182 NODE_224620_length_710_cov_8.192958 100 751 0 0 1357 2107 4 754 0 1387

It's output from a BLAST program.

I would like to write a script in Python, or whichever language is easiest, that (1) checks to see if NODE_X=NODE_Y where NODE_X is the value in the second column directly preceding NODE_Y, and (2) if NODE_X=NODE_Y, it finds the file corresponding to the value in the first column (e.g., 2678.m000182.txt) in a database and puts it into a new folder entitled "2_Same_Nodes."

Thanks

2. Mar 16, 2013

### nvn

blueberryfive: Try this, and let us know if it does not work.

(1) Type the following code into a Bourne or bash shell file named myprog.sh, where in01 is your input file name (containing your BLAST output), and dir01 is the database directory name.

#!/bin/sh
# Created: 2013-03-16, nvn.
awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \ | awk 'NF>1 {if($2==$4) printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01​

(2) On linux bash, create your output directory, by issuing the following bash command: mkdir 2-same-nodes

(3) Issue the following bash command: sh myprog.sh

(4) Issue the following bash command: sh out01

Last edited: Mar 16, 2013
3. Mar 17, 2013

### blueberryfive

Thanks nvn!

It's giving me output that looks like a long list of this:

cp: /Users/username/databasegene/2003.m000043.txt: No such file or directory

Which file or directory is it referring to?

Another question, how exactly do I set up the database? I have a bunch of sequences listed in one text file under the headings abc.m000yz , but I would like to break it apart into individual files and then create a database with them. Then I would be able to call them individually if they meet the criterion of myprog.sh. Does this make sense?

4. Mar 18, 2013

### nvn

blueberryfive: Manually find one of the database files you want to copy to 2-same-nodes, then manually enter a cp -p dir01/filename01 2-same-nodes command yourself, at the bash prompt. Debug this cp -p command until it works the way you want. Then enter your debugged cp -p command in myprog.sh. Try it, and see if it works.

In other words, the error message in post 3 is telling you, "A file named 2003.m000043.txt does not exist in the path and directory /Users/username/databasegene/, or a directory named /Users/username/databasegene/ does not exist." Find the directory where file 2003.m000043.txt is located, and then substitute that path and directory name for dir01 in myprog.sh.

Last edited: Mar 18, 2013
5. Mar 19, 2013

### blueberryfive

Great, that worked. Thanks a lot for your help, and thanks for your advice to test it manually.

Now I'm trying to build the database. I have a large .txt file m.txt in that looks like this:

...
>something.m.something1.txt
ATCTC......etc....ACTCCCCATTTT...
>something.m.something2.txt
GATTA...etc...GCTTTGGGG

...

It's about 900 entries long. I want to break it up into a database, so that it's broken up by the ">" character. so

>something.m.something1.txt
ATCTC......etc....ACTCCCCATTTT...

would be a file called something.m.something1.txt

I've looked around and come up with:

$awk '{if(match($0, /^$(.+?)$/, k)){name=k[1]}} {print >name".txt" }' entry.txt

I'm not sure how to tailor this to what I want to do.

6. Mar 19, 2013

### nvn

blueberryfive: Issue the bash command "mkdir db01", then execute the following. Try it, and see if it works.

awk '{if(/^>/)printf substr($0,2)" ";else print$0}' m.txt > tmp01
awk '{name1=$1;dat1=substr($0,length(name1)+2);
print dat1 > "db01/"name1}' tmp01​

Last edited: Mar 19, 2013
7. Mar 20, 2013

### blueberryfive

Hi nvn,

I'm getting:

awk: syntax error at source line 2
context is
print dat1 > >>> "db01/"name1 <<< }
awk: illegal statement at source line 2

8. Mar 20, 2013

### nvn

blueberryfive: Perhaps try removing the line break between the second and third line. The third line in post 6 is a continuation of the second line.

Is directory db01 in the same directory you are running this awk statement in? Directory db01 needs to exist in the directory you are running this awk statement in.

If that fails, perhaps try the following.

awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);
print dat1 > name1}' tmp01​

If that fails, perhaps show us the first several lines in your files m.txt and tmp01. (Enclose the lines within [noparse]
Code (Text):
...
[/noparse] tags, for posting.)

Last edited: Mar 20, 2013
9. Mar 21, 2013

### blueberryfive

Hi nvn,

It is in the same directory, removing the line break (;, right?) does not work.

tmp01 is blank.

m.txt looks like this:

Code (Text):

>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAAAATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGTTTGA

10. Mar 21, 2013

### nvn

blueberryfive: (1) Do not remove the semicolon; remove only the line break.

(2) Try the following.

awk --lint '{if(/^>/)printf "%s ",substr($1,2);else print$0}' m.txt > tmp01
awk --lint '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 > name1;close(name1)}' tmp01​

(3) If that fails, try removing --lint.

(4) If that fails, try changing awk to gawk.

(5) If that fails, show us the contents of your file tmp01.

11. Mar 22, 2013

### blueberryfive

Ok, great! db01 was created. I believe it was (3) that worked. Just a general question: what language is this? Awk?

tmp01 is the same as m.txt. However, the db01 files are incorrect. It seems to have not broken it up the way it should have. I'm not sure that I decribed m.txt very well. It should be:
Code (Text):

>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
CCTCCCCAATTTTAATGAAAACCGGGGAATTCTCACT
ACATCCAT
>1113.m00028 conserved hypothetical protein
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
AAAAACCTTAAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
AACACACATACACATTTGGGAC

So, the sequence can be any number of lines.

If the code is:

Code (Text):
awk '{if(/^>/)printf "%s ",substr($1,2);else print$0}' m.txt > tmp01
Which means Look at each line of m.txt "if > print source filename to substring (what's $1?, a variable?). If not, print$0 (what's this?). And this is being done from m.txt to tmp01.

Code (Text):
awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 >
name1;close(name1)}' tmp01
print $0 > file1}' m.txt​ Last edited: Mar 22, 2013 13. Mar 22, 2013 ### blueberryfive Worked like a charm! Thanks nvn. 14. Mar 26, 2013 ### blueberryfive nvn, I've been trying to learn awk, but I still have some questions. Code (Text): awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
print $0 > file1}' m.txt I read this as: if a line matches ">" at the top, then close file 1 where file1="db01/"substr($1,2). Then, print the current record to file 1. I'm confused about the close() function; is it literally closing a file?

Code (Text):
awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \ | awk 'NF>1 {if($2==$4) printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01
This one is saying if the first field is not # print the first, second, then on the next line print the first, second. Then | means for the next awk statement to take that output and see if the NODES are the same for each line in the record--NF>1. If the nodes are the same, print the name of that m to a file in 2 same nodes. What's out01?

15. Mar 27, 2013

### nvn

blueberryfive: Aside:
awk is a command intended for short, "one-liner" programs. Many aspects of awk are indeed great, although it is somewhat annoying (and limited) in the sense that the entire program must be a quoted string, quoted using apostrophes. Therefore, it is no replacement for a good programming language. But I currently do not have comments on which good programming language generally would be the best one to learn. That would be a topic or debate you could post in a new or different thread. I am, of course, not implying herein that awk is a replacement for a good programming language. I only used awk herein because it is perhaps one of the quickest, easiest ways to perform the current task.​

Moving on to your post 14 questions, yes, the close(file1) function literally closes file1, if file1 is open. If file1 is not open, close(file1) has no effect.

out01 is the arbitrary name of the output file, for the output from the second awk command.

Last edited: Mar 27, 2013