Mining Large Quantities of Data (~250 million vectors)

FabianS · May 2, 2011

Homework Statement

I am given the first human chromosome as a long list of bases: ATGCCTGC... (I have N bases) I have to divide the the chromosome into subsequences of length L, so subsequences are as follows:

From base 1 to base L
From base 2 to base L+1
From base 3 to base L+2
...
From base N-L to N

I then have to take each subsequence and see how many times it is repeated throughout the genome (ideally, I could store the position at which it is repeated, as well, but it is not 100% necessary).

Once this is done, I increase L to L' = L +1, and repeat.

N is approximately 250 million base pairs.
L starts at 20.

Homework Equations

The Attempt at a Solution

I am using MATLAB do my processing. I first used Python to convert each A -> 1, T -> 2, G -> 3, C -> 4. I have tried using multiple methods, but they are just too slow.

First, I looked at the first subsequence of length L. I stored it in a hashtable, it's key being the subsequence, and value a position vector. I then moved on to the next subsequence: if that subsequence already exists in the hash table, then update the value vector to include another element: the starting position of this subsequence. If it doesn't exist, create a new key/value pair in the hash table. The problem with this method is that it requires A LOT of memory.

I then tried the 'naive' way. I take the first subsequence, scan along the whole sequence, from 1 to N and see how many times it is repeated. I then select the second subsequence and repeat the process. This does not require so much memory (since I only store a vector of positions, and I only store them if the subsequence is actually repeated), but is extremely slow.

Now, I am using a similar method, this time implemented using the 'strfind()' function in MATLAB. Same thing as before: less memory needed and, though it is faster than method 2, it's still quite slow (around 2 seconds per subsequence... WAY too long).

I have a Dell laptop, Intel Core i5 @ 2.5GHz, with 6GB of RAM.

I am very lost on how to do this. Any help is greatly appreciated.
I am willing to use and learn a different programming language if it will render better results.

Mark44 · May 2, 2011

With so much data to work with, it makes sense to me to work with a compiled language, such as C or C++, rather than an interpreted language, such as MATLAB or Python. In an interpreted language, each statement has to be translated into machine code and then executed. With a compiled language, the whole program is translated all at once, so what executes is pure machine code.

For another thing, since there are only four bases, A, T, C, and G, you can encode these as 0, 1, 2, or 3. This means that you can store a base in two bits. Using this strategy, you can store 16 bases in a 32-bit integer, and 32 bases in a 64-bit integer. If you are encoding A, T, C, and G as integer values 1, 2, 3, or 4, you are probably using 32 bits for each base, instead of just 2 bits as I described.

FabianS · May 2, 2011

Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:

I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?
When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?

Grep · May 2, 2011

You could always try C++'s bitset class:

http://www.cplusplus.com/reference/stl/bitset/

There's also vector<bool>, which is a special case of vector that packs bits for space. The difference is that it can be resized, but you probably don't need that.

Mark44 · May 2, 2011

FabianS said:

Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:

I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?

Yes, you would have to do the encoding, and no, you don't want to use variables of type Boolean, since these variable use 32 bits to store a 1 or 0.

I'm not sure how your text file is formatted. Possibly it would have strings consisting of the characters '0', '1', '2', and '3'. Each of these characters is stored in 8 bits. Your program would need to read in some number of these characters, and convert each to a 2-bit quantity, and pack it into an int or long int.

FabianS said:

[*]When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
[*]Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?

Yes, I think so.

Mining Large Quantities of Data (~250 million vectors)

Discussion Overview

Discussion Character

Main Points Raised

Areas of Agreement / Disagreement

Contextual Notes

Homework Statement

Homework Equations

The Attempt at a Solution

Similar threads

Four L-shaped members: Mechanical Analysis Problem

Engineering Joint and Marginal Distributions of a Randomly Selected Test Answer

Engineering Half wave voltage doubler

Engineering Drawing Acceleration Diagram with Coriolis Acceleration

Truss analysis problem

Insights Revisiting the Velocity-Time Function

Insights Remote Operated Gate Control System

Insights AI Enriched Problem Solving

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect