Finding K-Byte Substrings in Two Long Strings

  • Thread starter Hurkyl
  • Start date
  • Tags
    Algorithm
In summary, the conversation discusses different approaches to finding every k-byte substring that occurs in two very long strings. The current algorithm of using a hash table has a high memory requirement, while the naive method has poor performance. Suggestions are made to use a tree or a finite state machine, and the Boyer-Moore algorithm is mentioned as a potential solution. The conversation also mentions the possibility of changing one string frequently and the need to find a faster algorithm. One suggestion is to use the suffix tree data structure.
  • #1
Hurkyl
Staff Emeritus
Science Advisor
Gold Member
14,981
26
I have two very long strings of length s and t. I want to find every k-byte substring that occurs in both of them. What good ways are there to do that?

My current algorithm is to store each k-byte substring of the first in a hash table, then look up every k-byte substring of the second. A reasonable algorithm, but the O(k*s) memory requirement is getting to be bothersome.

On the other hand, I simply cannot afford the O(k*s*t) performance of the naive method.


For some reason, I'm sure there has to be a good algorithm for doing exactly this, but I can't remember it.


Now, I have thought that I might be able to build a finite state machine that recognizes k-byte substrings of s, and will probably only take up O(s) memory, and might even be faster than the hash table. But, I haven't worked out a way to build the state machine. (It defeats my purposes if I need more than O(s) memory to build the state machine!)


I'm sorry if any of these are easy questions... I'm fairly exhausted today. =)
 
Computer science news on Phys.org
  • #2
Oh, by the way... if you can take advantage of the fact that one of these strings will generally remain constant, but I might want to change the other one a lot, that would be nice too. =) Pretty easy with the hash table idea.
 
  • #3
See the Boyer-Moor algorithm. I believe the best case is O(n/m) and the worst case is O(n).
 
  • #4
Hmm

Instead of using a hash, I would use a tree, which branches on the k elements in your substring.

Then, for each of the leaf nodes in the tree, place a reference to the corresponding branch for the 'tail' of that node. For example, the 'asdf' node would point to the 'sdf' node. That way, as you parse through the second string, you can remove and add one character to your window at a time easily.

Because there is overhead associated with the back references and whatnot, you won't get the best possible storage, but it should have good storage complexity as related to string size.
 
  • #5
Hrm. So, I think you're suggesting I take each k-byte substring of the first one, and look for it in the other using this algorithm.

I get to pick k, I'm not sure what I really want it to be... I was using 8 for my hashtable implementation, but a few hundred might be okay too.


This approach is clearly not memory intensive, but I'll have to figure out if it would be fast enough -- I'm pretty sure O(mn) is far too much, but maybe O(mn/k) won't be?


Or, maybe there's a clever way to tweak the algorithm for my problem -- it really seems like it should be adaptable to look for many patterns at once. My gut feeling is that I need to do this to get a reasonable running time.
 
  • #7
What you suggest, Nate, sounds similar to what I was thinking with my state machine idea. I don't think backreferences are necessary -- you just need to know to which state to transition on each possible character.

(Note that the hash table also allows quick sliding, at least with the appropriate hash function)


(P.S. for those interested I remember reading about an improvement on the Boyer-Moore algorithm -- "IntroSearch" or something like that. Basically, it keeps track of how poorly it does, and switches over to the other algorithm that has the better worst-case performance. I couldn't remember what it was until y'all reminded me though. There's a sorting algorithm with a similar idea, "IntroSort", which switches over to heap sort when the quicksort performs poorly. I think it's even implemented in GNU's implementation of the C++ STL)
 

1. What is the purpose of finding K-Byte substrings in two long strings?

The purpose of finding K-Byte substrings in two long strings is to identify and extract a specific sequence of characters (K-Byte substring) that appears in both strings. This can be useful for tasks such as DNA sequencing, text mining, and data analysis.

2. How is the length of the K-Byte substring determined?

The length of the K-Byte substring is determined by the value of K, which is typically a positive integer. This value can be chosen based on the specific needs of the task at hand, such as the expected length of the desired substring or the complexity of the strings being analyzed.

3. What is the algorithm used to find K-Byte substrings in two long strings?

The most commonly used algorithm for finding K-Byte substrings in two long strings is the sliding window algorithm. This involves moving a window of length K along both strings, comparing the characters within the window at each step to determine if they match. This process continues until all possible substrings have been checked.

4. Are there any limitations to finding K-Byte substrings in two long strings?

One limitation is that the algorithm may become computationally expensive for very large strings or when the value of K is large. Additionally, the algorithm may not be effective if the strings do not have enough overlap or if there are multiple matches for the K-Byte substring.

5. How can the results of finding K-Byte substrings be interpreted?

The results can be interpreted as a list of all the locations where the K-Byte substring appears in both strings, along with the actual substring itself. This information can be used for further analysis or for completing the desired task, such as identifying common patterns or relationships between the two strings.

Similar threads

  • Engineering and Comp Sci Homework Help
Replies
4
Views
1K
  • Programming and Computer Science
Replies
2
Views
160
  • Engineering and Comp Sci Homework Help
Replies
10
Views
1K
  • Programming and Computer Science
Replies
2
Views
2K
  • Computing and Technology
2
Replies
52
Views
3K
  • Engineering and Comp Sci Homework Help
Replies
4
Views
2K
  • Programming and Computer Science
Replies
3
Views
2K
  • Programming and Computer Science
Replies
29
Views
3K
  • Programming and Computer Science
Replies
9
Views
3K
  • Engineering and Comp Sci Homework Help
Replies
1
Views
3K
Back
Top