What Is the Shortest String Containing All m-Digit Substrings Made of n Symbols?

StatusX · Jun 27, 2005

Find (for specific cases, or a general algorithm) the shortest string which contain as substrings all m digit strings made of n symbols. For example, for (2,2), the possible substrings are 00,01,10,11. One string of minimal length (which in general is n^m+m-1 digits) which contains all of these as substrings is 00110.

I don't know of a general algorithim to generate the minimal superstring, or even whether it always exists. One thing I've noticed, though, is that the superstrings always seem to end with the same m-1 digits they start with, (which you can exploit to generate many more solutions), but I haven't been able to prove this.

BicycleTree · Jun 28, 2005

Is this homework?

NateTG · Jun 28, 2005

It shouldn't be that hard to figure out that:
It's easy to extend any non-repeating partial superstring up to some length to make a minimal superstring.

There is a large number of possible minimal superstrings. (Something like m!^n but I don't remember the particulars.)

StatusX · Jun 28, 2005

No. Try it and you'll see it's a lot harder than it looks. By the way here's some stuff if anyone wants it:

(2,3): 0011220210

(3,2): 0001011100

Proof of what I mentioned at the end of the first post, that the first m-1 digits are also the last m-1 digits:

Call the first (m-1) digits of the superstring A. The following strings must all appear exactly one time as substrings of the superstring: 0A, 1A, 2A,...,nA, (the xA) and A0, A1, A2,...,An (the Ax):. If A doesn't end the superstring, there need to be n copies of A throughout the superstring (including possible overlaps between copies of A. eg, if A=11, then there are 3 copies of A in 11122113.), one for each possible Ax. However, then there would only be (n-1) instances of xA (because one of the A's begins the superstring), and one of the strings above would not be included. The only way to add it without adding an extra Ax is to have an A at the end.

BicycleTree · Jun 28, 2005

Let's stop talking in white. This isn't like a quick brainteaser where you're worried about revealing answers.

Interesting. For binary and 3, I have 1100010111 and 0100011101. You can get four others from each of those by replacing 0 with 1 and vice versa, and by reversing the whole string, so I have 8 ways for that. Also for binary and 4 I got 1100001111010010110. Treating each 000, 001, 010, etc. as a two's complement binary number with 3 digits, the problem is the same as visiting each node exactly once of a graph where there is a directed edge between n and n*2, and between n and n*2+1, and between n and not(not(n)*2), and between n and not(not(n)*2+1).

NateTG · Jun 28, 2005

OK there is always a minimal loop:

Let's refer to the symbols as 0,1,...,n-1 since there are finitely many, they can be aribitarily numbered.
Start with 0,0,0,,,,0 (m 0's.) This forms a 'loop' in that the trailing m-1 symbols, and the leading m-1 symbols are the same in the same sequence.

For n=1 this is the 'superstring'.

Clearly, for n>1 we can break open this loop, and start appending any symbols that would not lead to the repetition of a substring. It's easy to prove that the only times that if there are no legal symbols to append, the superstring is a loop.

That means that we can 'close the loop' and break it open somewhere else to add new symbols, if there are any unused substrings that would allow it.

Now, let's say that there are no places to 'break open' and extend the loop.

Clearly, we have, at some point in the parital, the symbols a_1,a_2,...a_m in that order.
Clearly any substring of the form a_2,a_3...a_{m},x_1 could be added there, so the loop could be broken there. Therefore, all subtrings of the form a_2,a_3...a_m,x_1 must be in the loop.
Since any subtring of that form is in the loop, any substing of the form a_2,a_3...,x_1,x_2 must also be in the loop, otherwise it could be grafted on to theoccurance of a_2,a_3...x_1.
This leads to the conclusion that any substring of the form x_1,x_2...x_m must be in the loop, so the loop covers the entire space.

BicycleTree · Jun 28, 2005

What do you mean, "break open"? "Close"?

NateTG · Jun 28, 2005

Let's say we're looking at 3 bit strings:
000 001 010 011 100 101 110 111

Now, something like:
0001(00)->
is a loop in the sense that it's the same as:
0010(00)->
or
0100(01)->

So if you think of it as a loop then you have to 'break' it open in order to splice stuff in
0100(01) -> 01001 -> 010011 -> 010011(01)

I don't know if that clarifies things. The symbols in parens represent 'overlap'.

BicycleTree · Jun 28, 2005

Define "break open." I don't know exactly what you mean from the example.

StatusX · Jun 28, 2005

I followed you up to the last paragraph. I also thought of these strings as loops, and you can generate many superstrings by starting at any point in the loop and repeating the first m-1 digits again at the end. But I can't figure out how you would add a digit at some point in the loop and be sure you wouldn't duplicate any of the m substrings it becomes a part of. Could you maybe illustrate your method with an example of how you could create a superstring with it?

NateTG · Jun 29, 2005

StatusX said:

I followed you up to the last paragraph. I also thought of these strings as loops, and you can generate many superstrings by starting at any point in the loop and repeating the first m-1 digits again at the end. But I can't figure out how you would add a digit at some point in the loop and be sure you wouldn't duplicate any of the m substrings it becomes a part of. Could you maybe illustrate your method with an example of how you could create a superstring with it?

I'll try to be more formal:
Let's, assume, for a moment, that given a loop
A=a_1,a_2...a_k,a_1,a_2,...a_{m-1}
Claim:
There is either a substring
B=a_{o+1},a_{o+2}..a_{o+m-1},x
such that: B is not in A
or
A contains all of the possible substrings.

And, let's also assume that if
S=s_1,s_2,s_3...s_k
is a 'non-repeating partial superstring', that it is always possible to append symbols to it to form a loop.

So, let's start with an arbitrary 'non-repeating partial substring' - for example 0,0,...0. Clearly, based on the second assumption it can be extended into a loop
a_1,a_2...a_l,a_1,a_3...a_{m-1}
Now, if there is a suitable B=a_{o+1},a_{o+2}...a_{o+m-1},x
Then
a_{o+m},a_{o+m+1}...a_{l},a_1,a_2...a_{o+1},a_{o+2}...a_{o+m-1},x
must be a larger 'non-repeating partial superstring' that the inital one and we can repeat the process.
Otherwise, we have the minimal non-repeating loop and we are done.

Since the number of symbols in a non-repeating loop is bounded above by m^n+m-1 (or m^n depending on how you calculate loop length) and the process increases the length of the string by one symbol each iteration, it will run at most m^n+m-1 times before terminating.

For example, let's say we're looking at legth 3, 2-symbol strings.

So, we start with 0,0,0 which is already a loop.
But, we can find a 'suitable B': 0,0,1
So we have:
0,0,0,1
0,0,0,1,1
0,0,0,1,1,0
0,0,0,1,1,0,1
And, once again, there is a problem because we can't add any more symbols. At this point the only remaining B is 1,1,1
So, rotate
0,0,0,1,1,0,1 \rightarrow 0,1,0,0,0,1,1,
and append
0,1,0,0,0,1,1,1
And this is the string
0,1,0,0,0,1,1,1,(0,1)

StatusX · Jun 29, 2005

But you're assuming that rotating will preserve the property that no substrings are repeated. What about the string 000110, for (3,2)? No substrings are repeated here, but rotating by one will leave two copies of 000. The problem is easy to fix here because this is a very simple case, but we need a completely general formula if we want to handle larger m and n values.

NateTG · Jun 29, 2005

StatusX said:

But you're assuming that rotating will preserve the property that no substrings are repeated. What about the string 000110, for (3,2)? No substrings are repeated here, but rotating by one will leave two copies of 000. The problem is easy to fix here because this is a very simple case, but we need a completely general formula if we want to handle larger m and n values.

You chop off the overlap before you rotate. For example, if you start with
000110
You still add 0 (or actually 1) before you get into trouble:
0001100
Now the loop is really
00011
Which is what you rotate.

You're guaranteed to be able to close the loop for any non-repeating string, so you don't really lose any length when you do so.

StatusX · Jun 29, 2005

I understand what you're saying, but I don't think it will carry on to more complicated strings. When you add a digit, you aren't just adding a new substring and not affecting anything else. You are changing every substring within m digits of that spot. How can you be sure you aren't causing some substring to be duplicated or another to be removed?

For example, in (4,2), if the loop is 000011001(000), and you add a 1 in the middle to form 0000111001(000), you've added 0111 and 1110 but you've gotten rid of 0110.

NateTG · Jun 29, 2005

I don't care so much about losing substrings as long as I can keep making the partial longer until they're all there.

Regarding creating duplicates:
You can't create any new substrings on the left of the new symbol, and, if the string doesn't already loop, you're guaranteed to have unused substrings that allow you to close the loop. (It's pretty easy to prove).

StatusX · Jun 29, 2005

I'm not exactly sure what it is you're saying is easy to prove. Please feel free to show some of these easy proofs you've been mentioning. But in any case, your method, as I understand it, would come to a dead end on this string:

0010100001111(001)

I found this with a computer program I wrote, and there are probably simpler examples. There are no duplicates in this string, and yet there is no place to add a digit without creating a new loop which does have duplicates. Sorry if it feels like I'm attacking your idea, I just think this is a deeper, subtler problem and I'd be disappointed if it could be solved by such a simple, brute force approach.

NateTG · Jun 30, 2005

StatusX said:

Sorry if it feels like I'm attacking your idea, I just think this is a deeper, subtler problem and I'd be disappointed if it could be solved by such a simple, brute force approach.

No, no, please keep chopping away. It keeps me honest, and improves my ability to explain things. That said, I'm quite sure that what I have in my head is valid, but I have a hard time describing it clearly.

Now:
Your loop is:
0010100001111(001)

Based on the (001) I would guess that we're looking at 4 bit strings, so there are 2^4 = 16 of them. There are 13 substrings in the loop, so there are three missing. (0110, 1011, and 1101.)
I should be able to rotate to:
0000111100101(000)
'Chop off' the overlap
0000111100101
And add a 1
00001111001011
The following 0 is forced
000011110010110
And the following 1 is forced, and that's a minimal superstring.
0000111100101101(000)

P.S. What programming language are you using?

StatusX · Jun 30, 2005

NateTG said:

And add a 1
00001111001011

But now there are two copies of 1100. True, there are no duplicates if you don't consider this as a loop, but what if you couldn't append another digit at the end here? You'd have to rotate the string and then there would be duplicates, loop or not.

Maybe I'm still misunderstanding your method. It seems to involve some on the fly thinking, stuff that a computer couldn't do. Could you maybe outline the structure of a program that would find the minimal string?

P.S. What programming language are you using?

It's a program called jamagic. It's not the best thing for something like this (it's designed for making 2D and 3D games), but it's all I had on this computer.

NateTG · Jun 30, 2005

OK.

Think of the process this way:
Start with a loop. For example the (00...0) substring.
If there are no unused substrings, then we are done.
Otherwise, there is some substring for which the first m-1 symbols match a section of the loop. (This requires a proof.)
Rotate the loop so that it 'ends' with the matching section (after chopping off the overlap) - at this point we cease treating it as a loop.
Append the last symbol of said unused substring.
Continue appending symbols that do not cause a repetition until there are no such symbols.
At this point the superstring will, once again, form a loop (this requires a proof) so the process can be restarted.

Note: Although the process 'looses' substrings from chopping off the overlap, the 'reformation' of a loop guarantees that it will regain at least as many substrings as it lost. With the extra symbol that is added, this means that the number of substrings covered by the process is strictly increasing unless all are covered. Since there is a finite number of them, the process must terminate.

This seems like it would be somewhat annoying to program...

StatusX · Jun 30, 2005

At this point the superstring will, once again, form a loop (this requires a proof) so the process can be restarted.

Alright, here is the assumption you've been making that I wasn't clear on. Do you have a proof in mind for this? If so, I think this is a promising idea.

StatusX · Jun 30, 2005

Ok, you're right. I found a proof for it: Call the last m-1 digits A. If there is no digit that can be added at the end of A without forming duplicates, there must be n other copies of A throughout the string, each with a different digit at the end. Then there are n+1 total copies of A, and since there are no duplicates, only n of these can have a space at the front, which means the superstring must start with A, and is therefore a loop.

Nice job. I'll look through the whole proof just to make sure, but it looks like you've proved the existence of a minimal superstring. I'm still going to look at this problem, though, to see if I can find any patterns in this process to come up with an algorithm that takes as few steps as possible.

NateTG · Jun 30, 2005

StatusX said:

Nice job. I'll look through the whole proof just to make sure, but it looks like you've proved the existence of a minimal superstring. I'm still going to look at this problem, though, to see if I can find any patterns in this process to come up with an algorithm that takes as few steps as possible.

Are you trying to find all of them, or just one?

StatusX · Jun 30, 2005

NateTG said:

Are you trying to find all of them, or just one?

Just a simple pattern that generates the minimal strings. For example, for (m,2), the strings, minus the tails, are: 1122, 11121222, ... It feels like there could a pattern there, and maybe for other n values as well.

But I just have one more question about your method. After you have a loop and chop off the final m-1 digits, you need to cycle the string until you find a spot where you can add more digits. For one thing, you could always add digits without cycling and get those m-1 digits back, so you need to specify that you can't do that. (I'm only saying this because I'm a little concerned about how complicated the algorithm is getting). But assuming you do choose a different path, how can you be sure that it won't terminate even sooner?

NateTG · Jun 30, 2005

StatusX said:

But I just have one more question about your method. After you have a loop and chop off the final m-1 digits, you need to cycle the string until you find a spot where you can add more digits. For one thing, you could always add digits without cycling and get those m-1 digits back, so you need to specify that you can't do that. (I'm only saying this because I'm a little concerned about how complicated the algorithm is getting). But assuming you do choose a different path, how can you be sure that it won't terminate even sooner?

I specify that the first appended symbol is the one that adds a new substring. Any symbol that would lead to a shorter cycle, or a cycle of the same length, would requre that the substring was in the loop previously.

StatusX · Jul 1, 2005

NateTG said:

Any symbol that would lead to a shorter cycle, or a cycle of the same length, would requre that the substring was in the loop previously.

How do you know that?

NateTG · Jul 1, 2005

Because the symbol is being added 'just before' the overlap. If the loop ends up being the same length, thent the new symbol must be part of the overlap, thus that substring must have been in the prior loop.

StatusX · Jul 1, 2005

Sorry to drag this on so long, but I want to make sure the proof is completely solid. Say you come to this string in (3,2):

0101110001

At this point, there's no digit you can add at the end, so you need to make it a loop and chop off the tail:

01011100(01) -> 01011100

Now you can add a 1 at the end to form:

010111001

This is a perfectly valid operation under your algorithm, but now we've got a new, shorter loop. If the length is no longer strictly increasing, I don't see how the proof can stand.

NateTG · Jul 1, 2005

Actually, and this may have been too subtle, the process requires that the first substring that you create is not in the previous loop - so what you did would not be legal according to the method I proposed.

BicycleTree · Jul 1, 2005

Also, "minimal" is not the right word. Every set of strings for given n, m has a minimal-length member, because the set of string lengths is a set of integers, which is well-ordered. A term like "theoretic minimal" would be more accurate.

Ah, reading the rest of StatusX's posts, I see he already said what I said in the previous two posts, now deleted.

BicycleTree · Jul 1, 2005

NateTG said:

Because the symbol is being added 'just before' the overlap. If the loop ends up being the same length, thent the new symbol must be part of the overlap, thus that substring must have been in the prior loop.

How does that follow?

BicycleTree · Jul 1, 2005

I googled "superstring" and "substring" and it seems that finding the shortest common superstring of a set of substrings is not an unknown problem. StatusX, you never answered my question of whether this is homework.

NateTG · Jul 1, 2005

BicycleTree said:

That does not follow. The so-called "extra symbol" may be part of the m-1 overlap symbols, generating a new string of only the same length.

Since appending that symbol at that point cannot, by construction, generate a substring that was in the loop (before the chop occured) it cannot be the first of the m-1 symbols of overlap - since, in order to be that, it would have to match a segment of the original loop.

That said, I'm not entirely sure I've covered the case where the situation is something like:
Pre-chop: 10(011)
Post-chop: 1(011)

So the algorithm doens't necessarily terminate.

StatusX · Jul 1, 2005

BicycleTree said:

StatusX, you never answered my question of whether this is homework.

Yes I did. Its not.

BicycleTree · Jul 1, 2005

Note: I have edited and deleted some posts.

BicycleTree · Jul 1, 2005

StatusX said:

Yes I did. Its not.

OK... but you didn't say it before.

StatusX · Jul 1, 2005

I hate to nitpick, but in my second post, that "No" was for you. Nate's post wasn't there when I started or else I would have addressed you more clearly.

StatusX · Jul 1, 2005

NateTG said:

So the algorithm doens't necessarily terminate.

Well then I guess this is still an open question. I think this is a good path, so let's see if we can patch up this hole.

NateTG · Jul 1, 2005

Fixed:

A revised algoritm:
The m=1 and n=1 cases are trivial, so we assume m>1 and n>1.
0.Start with a non-repeating loop of length at least (m-1).
1. If there are no substrings of length (m-1) that occur, but occur fewer than n times then the loop covers all of the possible substrings, and we are done.
Proof:
Clearly there is at least one substring of length (m-1):
a_1,a_2,a_3...a_{m-1}. Since it occurs it must occur n times. That means that for all possible symbols x_1,
a_2,a_3...a_{m-1},x_1 occurs in the loop.
We can induct to show that for any possible symbols x_1,x_2...x_{m-1},x_1,x_2...x_{m-1} must occur.
Now, it follows that x_1,x_2...x_{m_1},x_m must also occur since each of the length (m-1) substrings must occur n times.
2. Since the algorithm has not terminated, we know that there is some length (m-1) substring that has not occurred n times in the loop, using the notion of treating the string as a loop, break it open so that said length (m-1) substring is the overlap section.
3. Extend the string (leaving the overlap in place) until there are no more possible additions.
At this point, the string must end in the same (m-1) symbols that it starts with so it can be rolled back into a loop.
Proof:
If the superstring cannot be extended, then there are no unused length m substrings that start with the last (m-1) sybols in the superstring.
There are n length m substrings that start with those (m-1) symbols, so if there are no unused substrings that start with the length (m-1) string, then it must have occurred n times previously - so it has appeared a total of n+1 times. However, there are only n substrings that end with the length (m-1) string, so, the only way that it can occur n+1 times is if it is a the start of the superstring.
Corrolary: At this point, the length (m-1) string from the start of the substring must occur n+1 times in the superstring.
Clearly the string was extended by at least (m-1) symbols in step 3, since no symbols were removed, and the number of occurances of the length (m-1) substring that occurs at the start were increased by at least 1. Since we have m>2, that means that the number of symbols was increased by at least 1.

Since the rolling and unrolling of the loop is a net change of zero symbols, that means that every iteration of steps 1 through 3 adds at least one symbol to the string. Therefore the process is properly increasing, and must terminate.

NateTG · Jul 1, 2005

Computer Implementation

I would store the superstring in a doubly linked list. This makes it inexpensive to splice things in, or rotate it.

For each of the possible length m-1 substrings, keep track of:
1. The number of times it has occured
and
2. One of nodes where it occurs if it has occured.

For each of the possible length m substrings, keep track of whether it has been used.

Using the data about the length m-1 substrings, it should be relatively painless to figure out how to rotate or splice the linked list so that it can keep expanding.

NateTG · Jul 1, 2005

StatusX said:

So let me get this straight. Instead of chopping the tail off the loop and adding to the end of it, you cycle until you reach a point where the first m-1 digits start a substring that isn't yet included in the loop, then you put those m-1 digits at then end and then add the appropriate digit. Is that right?

Yep, that's about the size of it. (You have to keep extending until you get a loop again - but I think you got that.)

P.S. The edit feature plays havoc with quoting. I suppose that's a sort of race condition.

StatusX · Jul 1, 2005

Sorry, I deleted my post, but I think I get it now. Sounds good.

NateTG · Jul 1, 2005

See, I told you it was good that you keep bugging me about stuff. ;)

P.S. I wonder if a 'greedy' algorithm would work for this:
Keep track of how many of each (m-1) sequence have been used, and keep trying to use the least used ones.

StatusX · Jul 1, 2005

You mean you think that would prevent you from having to cycle the string? I'll have to think about that. But in the mean time, I wrote a program to implement your method, and here are some of the minimal superstrings it found:

(2,2): 10011
(2,3): 2001021122
(2,4): 30010203112132233
(2,5): 40010203041121314223243344
(2,6): 5001020304051121314152232425334354455
(2,7): 60010203040506112131415162232425263343536445465566

(3,2): 1100010111
(3,3): 22000100201101202102211121222
(3,4): 330001002003011012013021022023031032033111211312212313213322232333

(4,2): 1110000100110101111
(4,3): 222000010002001100120021002201010201110112012101220202110212022102221111211221212222

(5,2): 111100000100011001010011101011011111

(6,2): 111110000001000011000101000111001001011001101001111010101110110111111

Maybe there are some patterns that can be found here to lead to a simpler algorithm. If anyone wants any strings that aren't here, let me know. From here, the next step could either be to find the minimal superstring for a set of substrings, not necessarily every possible one of a certain number of digits, or maybe finding how many possible minimal superstrings exist for each case.
I'll work on these on my own, but if anyone wants to post any work here, that'd be great, and if I find something particularly interesting, I'll post it here.

NateTG · Jul 2, 2005

My buddy has a very fast algorithm for them m=2 cases. I'll try to get a write-up later.

StatusX · Jul 2, 2005

Well, looking at the strings I listed, a pretty simple pattern comes out:

(m-1)0010203...0(m-1)1121314...1(m-1)2232425...2(m-1)... ...(m-3)(m-3)(m-2)(m-3)(m-1)(m-2)(m-2)(m-1)(m-1)

What Is the Shortest String Containing All m-Digit Substrings Made of n Symbols?

Similar threads

Hot Threads

What do feral children teach us about the development of IQ? (moved to Discussion)

Why do we spend so much time learning grammar in the public school system?

What is the deepest/most impactful statement that you have ever seen?

News Gerard't Hooft Interview

'How Does Your Garden Grow' thread - raised beds

Recent Insights

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers

Insights Fermat's Last Theorem