- #1
FabianS
- 4
- 1
Homework Statement
I am given the first human chromosome as a long list of bases: ATGCCTGC... (I have N bases) I have to divide the the chromosome into subsequences of length L, so subsequences are as follows:
- From base 1 to base L
- From base 2 to base L+1
- From base 3 to base L+2
- ...
- From base N-L to N
I then have to take each subsequence and see how many times it is repeated throughout the genome (ideally, I could store the position at which it is repeated, as well, but it is not 100% necessary).
Once this is done, I increase L to L' = L +1, and repeat.
N is approximately 250 million base pairs.
L starts at 20.
Homework Equations
The Attempt at a Solution
I am using MATLAB do my processing. I first used Python to convert each A -> 1, T -> 2, G -> 3, C -> 4. I have tried using multiple methods, but they are just too slow.
First, I looked at the first subsequence of length L. I stored it in a hashtable, it's key being the subsequence, and value a position vector. I then moved on to the next subsequence: if that subsequence already exists in the hash table, then update the value vector to include another element: the starting position of this subsequence. If it doesn't exist, create a new key/value pair in the hash table. The problem with this method is that it requires A LOT of memory.
I then tried the 'naive' way. I take the first subsequence, scan along the whole sequence, from 1 to N and see how many times it is repeated. I then select the second subsequence and repeat the process. This does not require so much memory (since I only store a vector of positions, and I only store them if the subsequence is actually repeated), but is extremely slow.
Now, I am using a similar method, this time implemented using the 'strfind()' function in MATLAB. Same thing as before: less memory needed and, though it is faster than method 2, it's still quite slow (around 2 seconds per subsequence... WAY too long).
I have a Dell laptop, Intel Core i5 @ 2.5GHz, with 6GB of RAM.
I am very lost on how to do this. Any help is greatly appreciated.
I am willing to use and learn a different programming language if it will render better results.