Mining Large Quantities of Data (~250 million vectors)

In summary: You could use a mask to isolate the two bits, and then compare them. If they match, you have a match. If they don't match, you skip ahead to the next position.In summary, the task at hand is to divide the first human chromosome into subsequences of increasing length and then determine how many times each subsequence is repeated throughout the genome. This can be done using various methods, such as storing the subsequences in a hash table or using bitwise operations in a compiled language like C or C++. However, with such a large dataset, it may be more efficient to use a compiled language and encode the bases as 2-bit quantities.
  • #1
FabianS
4
1

Homework Statement


I am given the first human chromosome as a long list of bases: ATGCCTGC... (I have N bases) I have to divide the the chromosome into subsequences of length L, so subsequences are as follows:

  • From base 1 to base L
  • From base 2 to base L+1
  • From base 3 to base L+2
  • ...
  • From base N-L to N

I then have to take each subsequence and see how many times it is repeated throughout the genome (ideally, I could store the position at which it is repeated, as well, but it is not 100% necessary).

Once this is done, I increase L to L' = L +1, and repeat.

N is approximately 250 million base pairs.
L starts at 20.

Homework Equations



The Attempt at a Solution



I am using MATLAB do my processing. I first used Python to convert each A -> 1, T -> 2, G -> 3, C -> 4. I have tried using multiple methods, but they are just too slow.

First, I looked at the first subsequence of length L. I stored it in a hashtable, it's key being the subsequence, and value a position vector. I then moved on to the next subsequence: if that subsequence already exists in the hash table, then update the value vector to include another element: the starting position of this subsequence. If it doesn't exist, create a new key/value pair in the hash table. The problem with this method is that it requires A LOT of memory.

I then tried the 'naive' way. I take the first subsequence, scan along the whole sequence, from 1 to N and see how many times it is repeated. I then select the second subsequence and repeat the process. This does not require so much memory (since I only store a vector of positions, and I only store them if the subsequence is actually repeated), but is extremely slow.

Now, I am using a similar method, this time implemented using the 'strfind()' function in MATLAB. Same thing as before: less memory needed and, though it is faster than method 2, it's still quite slow (around 2 seconds per subsequence... WAY too long).

I have a Dell laptop, Intel Core i5 @ 2.5GHz, with 6GB of RAM.

I am very lost on how to do this. Any help is greatly appreciated.
I am willing to use and learn a different programming language if it will render better results.
 
Physics news on Phys.org
  • #2
With so much data to work with, it makes sense to me to work with a compiled language, such as C or C++, rather than an interpreted language, such as MATLAB or Python. In an interpreted language, each statement has to be translated into machine code and then executed. With a compiled language, the whole program is translated all at once, so what executes is pure machine code.

For another thing, since there are only four bases, A, T, C, and G, you can encode these as 0, 1, 2, or 3. This means that you can store a base in two bits. Using this strategy, you can store 16 bases in a 32-bit integer, and 32 bases in a 64-bit integer. If you are encoding A, T, C, and G as integer values 1, 2, 3, or 4, you are probably using 32 bits for each base, instead of just 2 bits as I described.
 
  • #3
Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:
  • I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?
  • When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
  • Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?
 
  • #4
You could always try C++'s bitset class:

http://www.cplusplus.com/reference/stl/bitset/

There's also vector<bool>, which is a special case of vector that packs bits for space. The difference is that it can be resized, but you probably don't need that.
 
  • #5
FabianS said:
Cool, thanks for the reply.

What you're saying about encoding the bases using just 2 bits makes sense. I have some questions, though, regarding the use of C/C++ in this case:
  • I would be importing a text file of 0,1,2,3 sequences. Would I have to do anything special to FORCE it to encode these values using 2-bits? Or would I just use two variables of type BOOL for each base?
  • Yes, you would have to do the encoding, and no, you don't want to use variables of type Boolean, since these variable use 32 bits to store a 1 or 0.

    I'm not sure how your text file is formatted. Possibly it would have strings consisting of the characters '0', '1', '2', and '3'. Each of these characters is stored in 8 bits. Your program would need to read in some number of these characters, and convert each to a 2-bit quantity, and pack it into an int or long int.

    FabianS said:
    [*]When I start the scanning process, should I just use the naive method mentioned in my first post (second method)? Or is there a better way of doing this?
    [*]Finally, I'm assuming that using the bitwise operations in C++ (seq1 & seq2) would be the fastest way of comparing two sequences. Is this correct?
    Yes, I think so.
 

1. What is "Mining Large Quantities of Data"?

"Mining Large Quantities of Data" refers to the process of extracting useful information or patterns from a large dataset, typically consisting of millions or even billions of data points. This can involve using various techniques such as data mining, machine learning, and statistical analysis to uncover insights and make predictions.

2. Why is mining large quantities of data important?

Mining large quantities of data allows us to gain a deeper understanding of complex systems and make more informed decisions. It can reveal hidden patterns and relationships that may not be apparent through traditional methods, and can also help to identify trends and make predictions for the future.

3. What challenges are involved in mining large quantities of data?

Mining large quantities of data can be challenging due to the sheer volume of information involved. This can make it difficult to process and analyze the data in a timely manner, and can also lead to issues with data quality. Additionally, as the amount of data increases, so does the complexity of the analysis, requiring specialized tools and techniques.

4. How is "Mining Large Quantities of Data" used in real-world applications?

Mining large quantities of data has a wide range of applications in various industries. For example, it is used in finance to detect fraud and make investment decisions, in healthcare to identify patterns in patient data and improve treatment protocols, and in marketing to target specific audiences and personalize advertising. It is also used in scientific research to analyze large datasets and make discoveries.

5. What are some ethical considerations when mining large quantities of data?

As with any use of data, there are ethical considerations to keep in mind when mining large quantities of data. This includes ensuring the privacy and security of individuals' data, being transparent about the use of data and obtaining consent when necessary, and avoiding bias or discrimination in the analysis and interpretation of the data. It is important to follow ethical guidelines and regulations to protect both the individuals whose data is being mined and the integrity of the data itself.

Similar threads

  • Engineering and Comp Sci Homework Help
Replies
4
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
9
Views
2K
  • Engineering and Comp Sci Homework Help
Replies
1
Views
954
  • Engineering and Comp Sci Homework Help
Replies
6
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
32
Views
3K
  • Engineering and Comp Sci Homework Help
Replies
10
Views
1K
  • Advanced Physics Homework Help
Replies
6
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
2
Views
1K
  • Engineering and Comp Sci Homework Help
Replies
1
Views
1K
  • Calculus and Beyond Homework Help
Replies
14
Views
595
Back
Top