Program to Pick Out Files From List

In summary, the conversation discusses troubleshooting for a programming project involving a BLAST program and creating a script in Python or another language. The project involves checking for matching NODE values and copying corresponding files into a new folder. Suggestions are given for how to go about this, including using the Bash shell and the Awk programming language. The conversation also discusses how to break up a large text file into a database and the process of creating the script. The details and steps of the code are explained.
  • #1
blueberryfive
36
0
Hi Everyone!

This is my first question for programming. I have very little experience, so if you could explain as simply as possible.

I have a list of output that actually looks like this:

# Query: 2678.m000169 conserved hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
2678.m000169 NODE_295089_length_165_cov_8.775758 100 209 0 0 120 328 1 209 2.00E-107 387
2678.m000169 NODE_575647_length_122_cov_7.155738 100 165 0 0 378 542 166 2 6.00E-83 305
2678.m000169 NODE_311774_length_123_cov_6.276423 100 159 0 0 1 159 159 1 1.00E-79 294
# BLASTN 2.2.23+
# Query: 2678.m000182 hypothetical protein
# Database: EST_matched_Genome.fas.txt
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 8 hits found
2678.m000182 NODE_207397_length_1580_cov_14.322152 99.87 1597 2 0 6354 7950 1598 2 0 2939
2678.m000182 NODE_217031_length_1396_cov_9.866762 99.86 1439 2 0 4724 6162 1440 2 0 2647
2678.m000182 NODE_298029_length_1052_cov_10.303232 100 1096 0 0 10519 11614 1096 1 0 2025
2678.m000182 NODE_224620_length_710_cov_8.192958 100 751 0 0 1357 2107 4 754 0 1387


It's output from a BLAST program.

I would like to write a script in Python, or whichever language is easiest, that (1) checks to see if NODE_X=NODE_Y where NODE_X is the value in the second column directly preceding NODE_Y, and (2) if NODE_X=NODE_Y, it finds the file corresponding to the value in the first column (e.g., 2678.m000182.txt) in a database and puts it into a new folder entitled "2_Same_Nodes."

Does anyone have any suggestions for how I would go about this?

Thanks
 
Technology news on Phys.org
  • #2
blueberryfive: Try this, and let us know if it does not work.

(1) Type the following code into a Bourne or bash shell file named myprog.sh, where in01 is your input file name (containing your BLAST output), and dir01 is the database directory name.

#!/bin/sh
# Created: 2013-03-16, nvn.
awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
| awk 'NF>1 {if($2==$4)
printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01​

(2) On linux bash, create your output directory, by issuing the following bash command: mkdir 2-same-nodes

(3) Issue the following bash command: sh myprog.sh

(4) Issue the following bash command: sh out01
 
Last edited:
  • #3
Thanks nvn!

It's giving me output that looks like a long list of this:

cp: /Users/username/databasegene/2003.m000043.txt: No such file or directory

Which file or directory is it referring to?

Another question, how exactly do I set up the database? I have a bunch of sequences listed in one text file under the headings abc.m000yz , but I would like to break it apart into individual files and then create a database with them. Then I would be able to call them individually if they meet the criterion of myprog.sh. Does this make sense?

Thanks for your help.
 
  • #4
blueberryfive: Manually find one of the database files you want to copy to 2-same-nodes, then manually enter a cp -p dir01/filename01 2-same-nodes command yourself, at the bash prompt. Debug this cp -p command until it works the way you want. Then enter your debugged cp -p command in myprog.sh. Try it, and see if it works.

In other words, the error message in post 3 is telling you, "A file named 2003.m000043.txt does not exist in the path and directory /Users/username/databasegene/, or a directory named /Users/username/databasegene/ does not exist." Find the directory where file 2003.m000043.txt is located, and then substitute that path and directory name for dir01 in myprog.sh.
 
Last edited:
  • #5
Great, that worked. Thanks a lot for your help, and thanks for your advice to test it manually.

Now I'm trying to build the database. I have a large .txt file m.txt in that looks like this:

...
>something.m.something1.txt
ATCTC...etc...ACTCCCCATTTT...
>something.m.something2.txt
GATTA...etc...GCTTTGGGG

...

It's about 900 entries long. I want to break it up into a database, so that it's broken up by the ">" character. so

>something.m.something1.txt
ATCTC...etc...ACTCCCCATTTT...

would be a file called something.m.something1.txt

I've looked around and come up with:

$ awk '{if(match($0, /^\[ (.+?) \]/, k)){name=k[1]}} {print >name".txt" }' entry.txt

I'm not sure how to tailor this to what I want to do.
 
  • #6
blueberryfive: Issue the bash command "mkdir db01", then execute the following. Try it, and see if it works.

awk '{if(/^>/)printf substr($0,2)" ";else print $0}' m.txt > tmp01
awk '{name1=$1;dat1=substr($0,length(name1)+2);
print dat1 > "db01/"name1}' tmp01​
 
Last edited:
  • #7
Hi nvn,

I'm getting:

awk: syntax error at source line 2
context is
print dat1 > >>> "db01/"name1 <<< }
awk: illegal statement at source line 2
 
  • #8
blueberryfive: Perhaps try removing the line break between the second and third line. The third line in post 6 is a continuation of the second line.

Is directory db01 in the same directory you are running this awk statement in? Directory db01 needs to exist in the directory you are running this awk statement in.

If that fails, perhaps try the following.

awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);
print dat1 > name1}' tmp01​

If that fails, perhaps show us the first several lines in your files m.txt and tmp01. (Enclose the lines within [noparse]
Code:
 ...
[/noparse] tags, for posting.)
 
Last edited:
  • #9
Hi nvn,

It is in the same directory, removing the line break (;, right?) does not work.

tmp01 is blank.

m.txt looks like this:

Code:
>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAAAATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGTTTGA
 
  • #10
blueberryfive: (1) Do not remove the semicolon; remove only the line break.

(2) Try the following.

awk --lint '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01
awk --lint '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 > name1;close(name1)}' tmp01​

(3) If that fails, try removing --lint.

(4) If that fails, try changing awk to gawk.

(5) If that fails, show us the contents of your file tmp01.
 
  • #11
Ok, great! db01 was created. I believe it was (3) that worked. Just a general question: what language is this? Awk?

tmp01 is the same as m.txt. However, the db01 files are incorrect. It seems to have not broken it up the way it should have. I'm not sure that I decribed m.txt very well. It should be:
Code:
>1123.m000123 conserved hypothetical protein
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCCTCCGCAATTTATAA
ATAAAGAACGCCGGaaaAATGCCTCCCCAATTTTAAA
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
CCTCCCCAATTTTAATGAAAACCGGGGAATTCTCACT
ACATCCAT
>1113.m00028 conserved hypothetical protein
AATTTGAAAACCGGGGAAACCACCGGGGAAACCCT
AAAAACCTTAAA
>1243.m000121 hypothetical protein
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
ATGAGAAAATAATGGAAATTTTTTATCTTTTAATGT
TTGATTGATTGATTGATTGATTGAATCCATCCCAAA
AACACACATACACATTTGGGAC

So, the sequence can be any number of lines.

If the code is:

Code:
 awk '{if(/^>/)printf "%s ",substr($1,2);else print $0}' m.txt > tmp01

Which means Look at each line of m.txt "if > print source filename to substring (what's $1?, a variable?). If not, print $0 (what's this?). And this is being done from m.txt to tmp01.

Code:
awk '{name1="db01/"$1;dat1=substr($0,length(name1)+2);print dat1 >  
name1;close(name1)}' tmp01

Which means look through tmp01 and for name1=db01/$1 (the first file in db01) make it the substring --what does length(name1)+2 mean? then since dat1 (or name 1) = name1? ...a bit lost

Would you be able to explain what is happening here? If not, I understand, you've been very helpful and I don't mean to impose any more than I already have.
 
  • #12
blueberryfive: Yes, the language is awk, sometimes spelled gawk. awk is a command for processing one input record (line) at a time. Yes, $0 is a variable; it means the entire current record. $1 means the first string on the current record.

To learn what each statement or entity means, you can start reading awk or gawk documentation. Issue the command "man awk". Or search internet for awk or gawk documentation. There are countless awk user guides, awk reference manuals, awk tutorials, and awk man pages on internet, ranging from introductory to advanced. Find one or some that you like. Start with introductory ones, to learn the basics, and to learn what each statement or entity means. Search for the entities in the documentation, to look them up. Try this first, and see if it helps answer some of your preliminary questions.

OK, instead of using the awk commands in post 10, try the following. Ensure the first record in file m.txt begins with greater-than sign (>) in column 1 before running the following.

awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
print $0 > file1}' m.txt​
 
Last edited:
  • #13
Worked like a charm! Thanks nvn.
 
  • #14
nvn,

I've been trying to learn awk, but I still have some questions.

Code:
 awk '{if(/^>/){close(file1);file1="db01/"substr($1,2)};
print $0 > file1}' m.txt

I read this as: if a line matches ">" at the top, then close file 1 where file1="db01/"substr($1,2). Then, print the current record to file 1. I'm confused about the close() function; is it literally closing a file?
Code:
 awk '$1 !~ /^#/ {printf "%s %s\n%s %s ",$1,$2,$1,$2}' in01 \
| awk 'NF>1 {if($2==$4)
printf "cp -p dir01/%s.txt 2-same-nodes\n",$1}' > out01

This one is saying if the first field is not # print the first, second, then on the next line print the first, second. Then | means for the next awk statement to take that output and see if the NODES are the same for each line in the record--NF>1. If the nodes are the same, print the name of that m to a file in 2 same nodes. What's out01?
 
  • #15
blueberryfive: Aside:
awk is a command intended for short, "one-liner" programs. Many aspects of awk are indeed great, although it is somewhat annoying (and limited) in the sense that the entire program must be a quoted string, quoted using apostrophes. Therefore, it is no replacement for a good programming language. But I currently do not have comments on which good programming language generally would be the best one to learn. That would be a topic or debate you could post in a new or different thread. I am, of course, not implying herein that awk is a replacement for a good programming language. I only used awk herein because it is perhaps one of the quickest, easiest ways to perform the current task.​

Moving on to your post 14 questions, yes, the close(file1) function literally closes file1, if file1 is open. If file1 is not open, close(file1) has no effect.

out01 is the arbitrary name of the output file, for the output from the second awk command.
 
Last edited:

What is a "Program to Pick Out Files From List"?

A "Program to Pick Out Files From List" is a software tool that allows you to select specific files from a list of files on your computer. This can be useful for organizing and managing large amounts of data, such as images, documents, or music files.

How does the program work?

The program works by allowing you to input a list of files, either by manually entering the file names or by selecting them from a folder on your computer. You can then specify criteria for which files to pick out, such as file type, size, or date created. The program will then filter through the list and select the files that meet your criteria.

What are some potential use cases for this program?

This program can be useful for a variety of purposes, such as organizing files for a project, selecting specific images for a presentation, or finding all documents created within a certain time frame. It can also be helpful for cleaning up cluttered folders or backing up important files.

Is this program user-friendly?

Yes, most "Programs to Pick Out Files From List" have a user-friendly interface and simple instructions for use. However, some programs may have more advanced features that require some technical knowledge. It is always recommended to read the instructions or watch a tutorial before using a new program.

Are there any limitations to this program?

Like any software tool, there may be limitations to what a "Program to Pick Out Files From List" can do. For example, some programs may only work with certain file types or have a maximum limit for the number of files that can be selected. It is important to research the capabilities of a program before using it to ensure it fits your needs.

Similar threads

  • Programming and Computer Science
Replies
12
Views
1K
  • Programming and Computer Science
Replies
5
Views
1K
  • Programming and Computer Science
3
Replies
75
Views
4K
  • Programming and Computer Science
Replies
8
Views
1K
  • Programming and Computer Science
Replies
1
Views
3K
  • Programming and Computer Science
Replies
6
Views
3K
  • Programming and Computer Science
Replies
1
Views
2K
  • Programming and Computer Science
Replies
4
Views
4K
  • Programming and Computer Science
Replies
1
Views
3K
  • Engineering and Comp Sci Homework Help
Replies
4
Views
2K
Back
Top