# A searching algorithm?

Staff Emeritus
Gold Member
I have two very long strings of length s and t. I want to find every k-byte substring that occurs in both of them. What good ways are there to do that?

My current algorithm is to store each k-byte substring of the first in a hash table, then look up every k-byte substring of the second. A reasonable algorithm, but the O(k*s) memory requirement is getting to be bothersome.

On the other hand, I simply cannot afford the O(k*s*t) performance of the naive method.

For some reason, I'm sure there has to be a good algorithm for doing exactly this, but I can't remember it.

Now, I have thought that I might be able to build a finite state machine that recognizes k-byte substrings of s, and will probably only take up O(s) memory, and might even be faster than the hash table. But, I haven't worked out a way to build the state machine. (It defeats my purposes if I need more than O(s) memory to build the state machine!)

I'm sorry if any of these are easy questions... I'm fairly exhausted today. =)

Staff Emeritus
Gold Member
Oh, by the way... if you can take advantage of the fact that one of these strings will generally remain constant, but I might want to change the other one a lot, that would be nice too. =) Pretty easy with the hash table idea.

dduardo
Staff Emeritus
See the Boyer-Moor algorithm. I believe the best case is O(n/m) and the worst case is O(n).

NateTG
Homework Helper
Hmm

Instead of using a hash, I would use a tree, which branches on the k elements in your substring.

Then, for each of the leaf nodes in the tree, place a reference to the corresponding branch for the 'tail' of that node. For example, the 'asdf' node would point to the 'sdf' node. That way, as you parse through the second string, you can remove and add one character to your window at a time easily.

Because there is overhead associated with the back references and whatnot, you won't get the best possible storage, but it should have good storage complexity as related to string size.

Staff Emeritus
Gold Member
Hrm. So, I think you're suggesting I take each k-byte substring of the first one, and look for it in the other using this algorithm.

I get to pick k, I'm not sure what I really want it to be... I was using 8 for my hashtable implementation, but a few hundred might be okay too.

This approach is clearly not memory intensive, but I'll have to figure out if it would be fast enough -- I'm pretty sure O(mn) is far too much, but maybe O(mn/k) won't be?

Or, maybe there's a clever way to tweak the algorithm for my problem -- it really seems like it should be adaptable to look for many patterns at once. My gut feeling is that I need to do this to get a reasonable running time.

Staff Emeritus