What Is the Shortest String Containing All m-Digit Substrings Made of n Symbols?

  • Context: Graduate 
  • Thread starter Thread starter StatusX
  • Start date Start date
Click For Summary

Discussion Overview

The discussion revolves around finding the shortest string that contains all possible m-digit substrings made from n symbols. Participants explore specific cases, propose algorithms, and discuss the properties of such superstrings, including their structure and potential methods for generation.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant seeks a general algorithm for generating the shortest superstring containing all m-digit strings made of n symbols, noting specific examples for (2,2).
  • Another participant suggests that extending any non-repeating partial superstring can lead to a minimal superstring, although the exact number of possible minimal superstrings is uncertain.
  • Some participants discuss the property that the superstring often ends with the same m-1 digits it starts with, but this remains unproven.
  • Several examples of minimal superstrings are provided, but the method for generating them remains unclear to some participants.
  • There is a discussion about the concept of "breaking open" a loop to add new symbols without duplicating existing substrings, with varying levels of understanding among participants.
  • Concerns are raised about whether rotating a superstring preserves the property of having no repeated substrings, particularly in specific cases.

Areas of Agreement / Disagreement

Participants express differing views on the existence of a general algorithm and the properties of superstrings. Some agree on certain structural aspects, while others challenge the methods proposed for generating superstrings and the implications of rotating them.

Contextual Notes

Participants acknowledge limitations in their understanding of the problem, particularly regarding the generalizability of their methods and the assumptions made about substring repetition and loop structures.

  • #31
I googled "superstring" and "substring" and it seems that finding the shortest common superstring of a set of substrings is not an unknown problem. StatusX, you never answered my question of whether this is homework.
 
Mathematics news on Phys.org
  • #32
BicycleTree said:
That does not follow. The so-called "extra symbol" may be part of the m-1 overlap symbols, generating a new string of only the same length.

Since appending that symbol at that point cannot, by construction, generate a substring that was in the loop (before the chop occurred) it cannot be the first of the m-1 symbols of overlap - since, in order to be that, it would have to match a segment of the original loop.

That said, I'm not entirely sure I've covered the case where the situation is something like:
Pre-chop: 10(011)
Post-chop: 1(011)

So the algorithm doens't necessarily terminate.
 
  • #33
BicycleTree said:
StatusX, you never answered my question of whether this is homework.

Yes I did. Its not.
 
  • #34
Note: I have edited and deleted some posts.
 
  • #35
StatusX said:
Yes I did. Its not.
OK... but you didn't say it before.
 
  • #36
I hate to nitpick, but in my second post, that "No" was for you. Nate's post wasn't there when I started or else I would have addressed you more clearly.
 
  • #37
NateTG said:
So the algorithm doens't necessarily terminate.

Well then I guess this is still an open question. I think this is a good path, so let's see if we can patch up this hole.
 
Last edited:
  • #38
Fixed:

A revised algorithm:
The m=1 and n=1 cases are trivial, so we assume m>1 and n>1.
0.Start with a non-repeating loop of length at least (m-1).
1. If there are no substrings of length (m-1) that occur, but occur fewer than n times then the loop covers all of the possible substrings, and we are done.
Proof:
Clearly there is at least one substring of length (m-1):
a_1,a_2,a_3...a_{m-1}. Since it occurs it must occur n times. That means that for all possible symbols x_1,
a_2,a_3...a_{m-1},x_1 occurs in the loop.
We can induct to show that for any possible symbols x_1,x_2...x_{m-1},x_1,x_2...x_{m-1} must occur.
Now, it follows that x_1,x_2...x_{m_1},x_m must also occur since each of the length (m-1) substrings must occur n times.
2. Since the algorithm has not terminated, we know that there is some length (m-1) substring that has not occurred n times in the loop, using the notion of treating the string as a loop, break it open so that said length (m-1) substring is the overlap section.
3. Extend the string (leaving the overlap in place) until there are no more possible additions.
At this point, the string must end in the same (m-1) symbols that it starts with so it can be rolled back into a loop.
Proof:
If the superstring cannot be extended, then there are no unused length m substrings that start with the last (m-1) sybols in the superstring.
There are n length m substrings that start with those (m-1) symbols, so if there are no unused substrings that start with the length (m-1) string, then it must have occurred n times previously - so it has appeared a total of n+1 times. However, there are only n substrings that end with the length (m-1) string, so, the only way that it can occur n+1 times is if it is a the start of the superstring.
Corrolary: At this point, the length (m-1) string from the start of the substring must occur n+1 times in the superstring.
Clearly the string was extended by at least (m-1) symbols in step 3, since no symbols were removed, and the number of occurances of the length (m-1) substring that occurs at the start were increased by at least 1. Since we have m>2, that means that the number of symbols was increased by at least 1.

Since the rolling and unrolling of the loop is a net change of zero symbols, that means that every iteration of steps 1 through 3 adds at least one symbol to the string. Therefore the process is properly increasing, and must terminate.
 
Last edited:
  • #39
Computer Implementation

I would store the superstring in a doubly linked list. This makes it inexpensive to splice things in, or rotate it.

For each of the possible length m-1 substrings, keep track of:
1. The number of times it has occurred
and
2. One of nodes where it occurs if it has occurred.

For each of the possible length m substrings, keep track of whether it has been used.

Using the data about the length m-1 substrings, it should be relatively painless to figure out how to rotate or splice the linked list so that it can keep expanding.
 
  • #40
StatusX said:
So let me get this straight. Instead of chopping the tail off the loop and adding to the end of it, you cycle until you reach a point where the first m-1 digits start a substring that isn't yet included in the loop, then you put those m-1 digits at then end and then add the appropriate digit. Is that right?

Yep, that's about the size of it. (You have to keep extending until you get a loop again - but I think you got that.)

P.S. The edit feature plays havoc with quoting. I suppose that's a sort of race condition.
 
  • #41
Sorry, I deleted my post, but I think I get it now. Sounds good.
 
  • #42
See, I told you it was good that you keep bugging me about stuff. ;)

P.S. I wonder if a 'greedy' algorithm would work for this:
Keep track of how many of each (m-1) sequence have been used, and keep trying to use the least used ones.
 
  • #43
You mean you think that would prevent you from having to cycle the string? I'll have to think about that. But in the mean time, I wrote a program to implement your method, and here are some of the minimal superstrings it found:

(2,2): 10011
(2,3): 2001021122
(2,4): 30010203112132233
(2,5): 40010203041121314223243344
(2,6): 5001020304051121314152232425334354455
(2,7): 60010203040506112131415162232425263343536445465566

(3,2): 1100010111
(3,3): 22000100201101202102211121222
(3,4): 330001002003011012013021022023031032033111211312212313213322232333

(4,2): 1110000100110101111
(4,3): 222000010002001100120021002201010201110112012101220202110212022102221111211221212222

(5,2): 111100000100011001010011101011011111

(6,2): 111110000001000011000101000111001001011001101001111010101110110111111

Maybe there are some patterns that can be found here to lead to a simpler algorithm. If anyone wants any strings that aren't here, let me know. From here, the next step could either be to find the minimal superstring for a set of substrings, not necessarily every possible one of a certain number of digits, or maybe finding how many possible minimal superstrings exist for each case.
I'll work on these on my own, but if anyone wants to post any work here, that'd be great, and if I find something particularly interesting, I'll post it here.
 
Last edited:
  • #44
My buddy has a very fast algorithm for them m=2 cases. I'll try to get a write-up later.
 
  • #45
Well, looking at the strings I listed, a pretty simple pattern comes out:

(m-1)0010203...0(m-1)1121314...1(m-1)2232425...2(m-1)... ...(m-3)(m-3)(m-2)(m-3)(m-1)(m-2)(m-2)(m-1)(m-1)
 

Similar threads

  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 29 ·
Replies
29
Views
5K
  • · Replies 6 ·
Replies
6
Views
3K
  • · Replies 26 ·
Replies
26
Views
6K
  • · Replies 41 ·
2
Replies
41
Views
14K
  • · Replies 21 ·
Replies
21
Views
6K
  • · Replies 10 ·
Replies
10
Views
4K
  • · Replies 5 ·
Replies
5
Views
2K
  • · Replies 62 ·
3
Replies
62
Views
11K
  • · Replies 67 ·
3
Replies
67
Views
16K