Mining Large Quantities of Data (~250 million vectors)

Click For Summary

Discussion Overview

The discussion revolves around the challenge of efficiently processing a large dataset representing the first human chromosome, specifically focusing on dividing the chromosome into subsequences and counting their occurrences. The conversation includes considerations of programming languages, data encoding, and algorithmic efficiency.

Discussion Character

  • Homework-related
  • Technical explanation
  • Exploratory

Main Points Raised

  • One participant describes a method using a hashtable to store subsequences and their positions, but notes that it requires significant memory.
  • Another participant suggests using a naive scanning method that is less memory-intensive but very slow.
  • A suggestion is made to use compiled languages like C or C++ for better performance compared to interpreted languages like MATLAB or Python.
  • Encoding bases as 0, 1, 2, or 3 to utilize 2 bits per base is proposed as a way to reduce memory usage.
  • Questions arise about how to implement this encoding in C/C++, including whether special handling is needed for reading from a text file and the efficiency of bitwise operations for comparing sequences.
  • References to C++ features such as bitset and vector are provided as potential solutions for efficient data handling.

Areas of Agreement / Disagreement

Participants generally agree on the potential benefits of using compiled languages and efficient encoding methods, but there is no consensus on the best approach for implementing the scanning process or the specifics of data encoding.

Contextual Notes

Participants express uncertainty about the best methods for encoding and processing the data, particularly regarding the format of input files and the efficiency of different algorithms. There are also unresolved questions about the optimal approach for subsequence comparison.

FabianS
Messages
3
Reaction score
1

Homework Statement


I am given the first human chromosome as a long list of bases: ATGCCTGC... (I have N bases) I have to divide the the chromosome into subsequences of length L, so subsequences are as follows:

  • From base 1 to base L
  • From base 2 to base L+1
  • From base 3 to base L+2
  • ...
  • From base N-L to N

I then have to take each subsequence and see how many times it is repeated throughout the genome (ideally, I could store the position at which it is repeated, as well, but it is not 100% necessary).

Once this is done, I increase L to L' = L +1, and repeat.

N is approximately 250 million base pairs.
L starts at 20.

Homework Equations



The Attempt at a Solution



I am using MATLAB do my processing. I first used Python to convert each A -> 1, T -> 2, G -> 3, C -> 4. I have tried using multiple methods, but they are just too slow.

First, I looked at the first subsequence of length L. I stored it in a hashtable, it's key being the subsequence, and value a position vector. I then moved on to the next subsequence: if that subsequence already exists in the hash table, then update the value vector to include another element: the starting position of this subsequence. If it doesn't exist, create a new key/value pair in the hash table. The problem with this method is that it requires A LOT of memory.

I then tried the 'naive' way. I take the first subsequence, scan along the whole sequence, from 1 to N and see how many times it is repeated. I then select the second subsequence and repeat the process. This does not require so much memory (since I only store a vector of positions, and I only store them if the subsequence is actually repeated), but is extremely slow.

Now, I am using a similar method, this time implemented using the 'strfind()' function in MATLAB. Same thing as before: less memory needed and, though it is faster than method 2, it's still quite slow (around 2 seconds per subsequence... WAY too long).

I have a Dell laptop, Intel Core i5 @ 2.5GHz, with 6GB of RAM.

I am very lost on how to do this. Any help is greatly appreciated.
I am willing to use and learn a different programming language if it will render better results.
 
Physics news on Phys.org
With so much data to work with, it makes sense to me to work with a compiled language, such as C or C++, rather than an interpreted language, such as MATLAB or Python. In an interpreted language, each statement has to be translated into machine code and then executed. With a compiled language, the whole program is translated all at once, so what executes is pure machine code.

For another thing, since there are only four bases, A, T, C, and G, you can encode these as 0, 1, 2, or 3. This means that you can store a base in two bits. Using this strategy, you can store 16 bases in a 32-bit integer, and 32 bases in a 64-bit integer. If you are encoding A, T, C, and G as integer values 1, 2, 3, or 4, you are probably using 32 bits for each base, instead of just 2 bits as I described.
 
Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:
  • I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?
  • When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
  • Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?
 
You could always try C++'s bitset class:

http://www.cplusplus.com/reference/stl/bitset/

There's also vector<bool>, which is a special case of vector that packs bits for space. The difference is that it can be resized, but you probably don't need that.
 
FabianS said:
Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:
  • I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?
  • Yes, you would have to do the encoding, and no, you don't want to use variables of type Boolean, since these variable use 32 bits to store a 1 or 0.

    I'm not sure how your text file is formatted. Possibly it would have strings consisting of the characters '0', '1', '2', and '3'. Each of these characters is stored in 8 bits. Your program would need to read in some number of these characters, and convert each to a 2-bit quantity, and pack it into an int or long int.

    FabianS said:
    [*]When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
    [*]Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?
    Yes, I think so.
 

Similar threads

  • · Replies 4 ·
Replies
4
Views
2K
  • · Replies 9 ·
Replies
9
Views
3K
  • · Replies 1 ·
Replies
1
Views
2K
  • · Replies 32 ·
2
Replies
32
Views
5K
  • · Replies 6 ·
Replies
6
Views
4K
  • · Replies 10 ·
Replies
10
Views
4K
  • · Replies 6 ·
Replies
6
Views
2K
  • · Replies 3 ·
Replies
3
Views
2K
Replies
23
Views
2K
  • · Replies 9 ·
Replies
9
Views
2K