Program to Pick Out Files From List

AI Thread Summary
The discussion revolves around a user seeking help with processing output from a BLAST program using scripting. The user wants to create a Python or shell script that checks for matching nodes in the output and copies corresponding files to a new directory. Initial attempts using shell commands resulted in errors indicating missing files, prompting further questions about setting up the database and handling file paths correctly. The conversation includes troubleshooting steps, such as creating directories and using `awk` to manipulate text files. The user also seeks clarification on how to break a large text file into individual files based on specific headers and how to understand the `awk` commands being used. Several solutions are proposed, including using `awk` to format the input and create the necessary output files. The user successfully creates the required directory and files after adjusting the commands based on feedback. The discussion highlights the importance of understanding file paths, command syntax, and the functionality of `awk` for text processing tasks.
blueberryfive
Messages
36
Reaction score
0
Hi Everyone!

This is my first question for programming. I have very little experience, so if you could explain as simply as possible.

I have a list of output that actually looks like this:

# Query: 2678.m000169 conserved hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
2678.m000169 NODE_295089_length_165_cov_8.775758 100 209 0 0 120 328 1 209 2.00E-107 387
2678.m000169 NODE_575647_length_122_cov_7.155738 100 165 0 0 378 542 166 2 6.00E-83 305
2678.m000169 NODE_311774_length_123_cov_6.276423 100 159 0 0 1 159 159 1 1.00E-79 294
# BLASTN 2.2.23+
# Query: 2678.m000182 hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 8 hits found
2678.m000182 NODE_207397_length_1580_cov_14.322152 99.87 1597 2 0 6354 7950 1598 2 0 2939
2678.m000182 NODE_217031_length_1396_cov_9.866762 99.86 1439 2 0 4724 6162 1440 2 0 2647
2678.m000182 NODE_298029_length_1052_cov_10.303232 100 1096 0 0 10519 11614 1096 1 0 2025
2678.m000182 NODE_224620_length_710_cov_8.192958 100 751 0 0 1357 2107 4 754 0 1387

It's output from a BLAST program.

I would like to write a script in Python, or whichever language is easiest, that (1) checks to see if NODE_X=NODE_Y where NODE_X is the value in the second column directly preceding NODE_Y, and (2) if NODE_X=NODE_Y, it finds the file corresponding to the value in the first column (e.g., 2678.m000182.txt) in a database and puts it into a new folder entitled "2_Same_Nodes."

Does anyone have any suggestions for how I would go about this?

Thanks
 
Technology news on Phys.org
blueberryfive: Try this, and let us know if it does not work.

(1) Type the following code into a Bourne or bash shell file named myprog.sh, where in01 is your input file name (containing your BLAST output), and dir01 is the database directory name.

#!/bin/sh
# Created: 2013-03-16, nvn.
awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
| awk 'NF>1 {if($2==$4)
printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01​

(2) On linux bash, create your output directory, by issuing the following bash command: mkdir 2-same-nodes

(3) Issue the following bash command: sh myprog.sh

(4) Issue the following bash command: sh out01
 
Last edited:
Thanks nvn!

It's giving me output that looks like a long list of this:

cp: /Users/username/databasegene/2003.m000043.txt: No such file or directory

Which file or directory is it referring to?

Another question, how exactly do I set up the database? I have a bunch of sequences listed in one text file under the headings abc.m000yz , but I would like to break it apart into individual files and then create a database with them. Then I would be able to call them individually if they meet the criterion of myprog.sh. Does this make sense?

Thanks for your help.
 
blueberryfive: Manually find one of the database files you want to copy to 2-same-nodes, then manually enter a cp -p dir01/filename01 2-same-nodes[/color] command yourself, at the bash prompt. Debug this cp -p command until it works the way you want. Then enter your debugged cp -p command in myprog.sh. Try it, and see if it works.

In other words, the error message in post 3 is telling you, "A file named 2003.m000043.txt does not exist in the path and directory /Users/username/databasegene/, or a directory named /Users/username/databasegene/ does not exist." Find the directory where file 2003.m000043.txt is located, and then substitute that path and directory name for dir01 in myprog.sh.
 
Last edited:
Great, that worked. Thanks a lot for your help, and thanks for your advice to test it manually.

Now I'm trying to build the database. I have a large .txt file m.txt in that looks like this:

...
>something.m.something1.txt
ATCTC...etc...ACTCCCCATTTT...
>something.m.something2.txt
GATTA...etc...GCTTTGGGG

...

It's about 900 entries long. I want to break it up into a database, so that it's broken up by the ">" character. so

>something.m.something1.txt
ATCTC...etc...ACTCCCCATTTT...

would be a file called something.m.something1.txt

I've looked around and come up with:

$ awk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt

I'm not sure how to tailor this to what I want to do.
 
blueberryfive: Issue the bash command "mkdir db01", then execute the following. Try it, and see if it works.

awk '{if(/^>/)printf substr($0,2)" ";else print $0}' m.txt > tmp01
awk '{name1=$1;dat1=substr($0,length(name1)+2);
print dat1 > "db01/"name1}' tmp01​
 
Last edited:
Hi nvn,

I'm getting:

awk: syntax error at source line 2
context is
print dat1 > >>> "db01/"name1 <<< }
awk: illegal statement at source line 2
 
blueberryfive: Perhaps try removing the line break between the second and third line. The third line in post 6 is a continuation of the second line.

Is directory db01 in the same directory you are running this awk statement in? Directory db01 needs to exist in the directory you are running this awk statement in.

If that fails, perhaps try the following.

awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);
print dat1 > name1}' tmp01​

If that fails, perhaps show us the first several lines in your files m.txt and tmp01. (Enclose the lines within [noparse]
Code:
 ...
[/noparse] tags, for posting.)
 
Last edited:
Hi nvn,

It is in the same directory, removing the line break (;, right?) does not work.

tmp01 is blank.

m.txt looks like this:

Code:
>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAAAATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGTTTGA
 
  • #10
blueberryfive: (1) Do not remove the semicolon; remove only the line break.

(2) Try the following.

awk --lint '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01
awk --lint '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 > name1;close(name1)}' tmp01​

(3) If that fails, try removing --lint.

(4) If that fails, try changing awk to gawk.

(5) If that fails, show us the contents of your file tmp01.
 
  • #11
Ok, great! db01 was created. I believe it was (3) that worked. Just a general question: what language is this? Awk?

tmp01 is the same as m.txt. However, the db01 files are incorrect. It seems to have not broken it up the way it should have. I'm not sure that I decribed m.txt very well. It should be:
Code:
>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
CCTCCCCAATTTTAATGAAAACCGGGGAATTCTCACT
ACATCCAT
>1113.m00028 conserved hypothetical protein
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
AAAAACCTTAAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
AACACACATACACATTTGGGAC

So, the sequence can be any number of lines.

If the code is:

Code:
 awk '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01

Which means Look at each line of m.txt "if > print source filename to substring (what's $1?, a variable?). If not, print $0 (what's this?). And this is being done from m.txt to tmp01.

Code:
awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 >  
name1;close(name1)}' tmp01

Which means look through tmp01 and for name1=db01/$1 (the first file in db01) make it the substring --what does length(name1)+2 mean? then since dat1 (or name 1) = name1? ...a bit lost

Would you be able to explain what is happening here? If not, I understand, you've been very helpful and I don't mean to impose any more than I already have.
 
  • #12
blueberryfive: Yes, the language is awk, sometimes spelled gawk. awk is a command for processing one input record (line) at a time. Yes, $0 is a variable; it means the entire current record. $1 means the first string on the current record.

To learn what each statement or entity means, you can start reading awk or gawk documentation. Issue the command "man awk". Or search internet for awk or gawk documentation. There are countless awk user guides, awk reference manuals, awk tutorials, and awk man pages on internet, ranging from introductory to advanced. Find one or some that you like. Start with introductory ones, to learn the basics, and to learn what each statement or entity means. Search for the entities in the documentation, to look them up. Try this first, and see if it helps answer some of your preliminary questions.

OK, instead of using the awk commands in post 10, try the following. Ensure the first record in file m.txt begins with greater-than sign (>) in column 1 before running the following.

awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
print $0 > file1}' m.txt​
 
Last edited:
  • #13
Worked like a charm! Thanks nvn.
 
  • #14
nvn,

I've been trying to learn awk, but I still have some questions.

Code:
 awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
print $0 > file1}' m.txt

I read this as: if a line matches ">" at the top, then close file 1 where file1="db01/"substr($1,2). Then, print the current record to file 1. I'm confused about the close() function; is it literally closing a file?
Code:
 awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
| awk 'NF>1 {if($2==$4)
printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01

This one is saying if the first field is not # print the first, second, then on the next line print the first, second. Then | means for the next awk statement to take that output and see if the NODES are the same for each line in the record--NF>1. If the nodes are the same, print the name of that m to a file in 2 same nodes. What's out01?
 
  • #15
blueberryfive: Aside:
awk is a command intended for short, "one-liner" programs. Many aspects of awk are indeed great, although it is somewhat annoying (and limited) in the sense that the entire program must be a quoted string, quoted using apostrophes. Therefore, it is no replacement for a good programming language. But I currently do not have comments on which good programming language generally would be the best one to learn. That would be a topic or debate you could post in a new or different thread. I am, of course, not implying herein that awk is a replacement for a good programming language. I only used awk herein because it is perhaps one of the quickest, easiest ways to perform the current task.​

Moving on to your post 14 questions, yes, the close(file1) function literally closes file1, if file1 is open. If file1 is not open, close(file1) has no effect.

out01 is the arbitrary name of the output file, for the output from the second awk command.
 
Last edited:

Similar threads

Replies
8
Views
2K
Replies
1
Views
3K
Replies
75
Views
6K
Replies
34
Views
3K
Replies
16
Views
2K
Replies
5
Views
2K
Back
Top