Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

String difference

  1. Oct 24, 2006 #1

    0rthodontist

    User Avatar
    Science Advisor

    I have a need to compute a "difference" between two strings A and B. The idea is that B is similar to A, except it has had operations of string deletion and string insertion performed on it. An operation of deletion transforms a string xyz into a string xz (x, y, z being strings) and an operation of insertion transforms a string xz into a string xyz.

    What I want is an input of A and B, where A is fairly similar to B, and an output which is a series of deletion and insertion operations that transforms A into B. Hopefully this will be a shortest series, but reasonably close is good enough.

    Any idea where I should look for information on this before I try to hack something up?

    edit: actually I hadn't really done enough looking, 10 more seconds of time brings me to the Levenshtein distance. Though this isn't exactly what I want since I'd rather be able to say "insert the string 'abcdxyzqrstuvqwertyz' at character position 900" than have to give that as 20 different character insertions.

    edit2: Perhaps I'm basically looking to implement a custom version of unix's "diff" that goes character-by-character and not line-by-line. Unless "diff" works by treating lines as single symbols, in which case that's not what I need.

    edit3: I could be more specific on the criterion I want for "shortest series of insertions and deletions." What I mean is roughly, let an insertion be denoted by a string +n,k,[s1..sk] where n is a number indicating the position of the first character to be inserted, k is the number of characters that follow, and s1..sk are the characters to be inserted. Similarly a deletion can be denoted by a string -n,k where k is the # of characters after position n to be deleted. I want to roughly minimize the total length of the descriptions of all insertions and deletions required to transform A into B. To simplify the problem, you could similarly assume that the cost of an insertion is p + L where L is the length of the inserted string, and p is a small "penalty" constant, and the cost of a deletion is p.
     
    Last edited: Oct 24, 2006
  2. jcsd
  3. Oct 24, 2006 #2

    0rthodontist

    User Avatar
    Science Advisor

    Further thought: I could use the longest common substring algorithm as in the diff algorithm to generate the character-by-character operations to transform one string into the other. Then I could treat the operations as the set of all removals followed by the set of all additions, and unify adjacent single-character operations to create contiguous stretches that are summarized by multi-character operations. Then I could look for "islands" in the contiguous stretches of removals--short sequences between two removals--and replace these by a longer removal and then an addition. Letting r denote removed characters and k denote characters that are in both A and B, if the sequence goes like
    rrrrrkkrrrrrrkkrrrr
    which would be represented by two removals, then I could shorten this to
    rrrrrrrrrrrrrrrrrrr followed by the insertion of kkkk, at a cost of 2p + 4 instead of 3p.

    Also, the standard diff algorithm takes like the square of the string lengths to run (actually I haven't verified that but I believe it to be true), but it would be great if I had a diff-like algorithm that takes nearly linear time when the strings are nearly the same.

    The purpose of all this is to conserve bandwidth, sending differences instead of entire files when a low-bandwidth computer requests updates of a file from a server.
     
  4. Oct 24, 2006 #3

    0rthodontist

    User Avatar
    Science Advisor

    Well, fortuitously, Python happens to have a function that does almost precisely what I need, and Python is the language I have been working in. The difflib library and the SequenceMatcher class do it all.
     
    Last edited: Oct 24, 2006
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: String difference
  1. String Theory (Replies: 1)

  2. Ball of string (Replies: 7)

  3. Difference Operator (Replies: 3)

  4. Squares difference (Replies: 4)

  5. Averaging differences (Replies: 5)

Loading...