1. Not finding help here? Sign up for a free 30min tutor trial with Chegg Tutors
    Dismiss Notice
Dismiss Notice
Join Physics Forums Today!
The friendliest, high quality science and math community on the planet! Everyone who loves science is here!

String difference

  1. Oct 24, 2006 #1

    0rthodontist

    User Avatar
    Science Advisor

    I have a need to compute a "difference" between two strings A and B. The idea is that B is similar to A, except it has had operations of string deletion and string insertion performed on it. An operation of deletion transforms a string xyz into a string xz (x, y, z being strings) and an operation of insertion transforms a string xz into a string xyz.

    What I want is an input of A and B, where A is fairly similar to B, and an output which is a series of deletion and insertion operations that transforms A into B. Hopefully this will be a shortest series, but reasonably close is good enough.

    Any idea where I should look for information on this before I try to hack something up?

    edit: actually I hadn't really done enough looking, 10 more seconds of time brings me to the Levenshtein distance. Though this isn't exactly what I want since I'd rather be able to say "insert the string 'abcdxyzqrstuvqwertyz' at character position 900" than have to give that as 20 different character insertions.

    edit2: Perhaps I'm basically looking to implement a custom version of unix's "diff" that goes character-by-character and not line-by-line. Unless "diff" works by treating lines as single symbols, in which case that's not what I need.

    edit3: I could be more specific on the criterion I want for "shortest series of insertions and deletions." What I mean is roughly, let an insertion be denoted by a string +n,k,[s1..sk] where n is a number indicating the position of the first character to be inserted, k is the number of characters that follow, and s1..sk are the characters to be inserted. Similarly a deletion can be denoted by a string -n,k where k is the # of characters after position n to be deleted. I want to roughly minimize the total length of the descriptions of all insertions and deletions required to transform A into B. To simplify the problem, you could similarly assume that the cost of an insertion is p + L where L is the length of the inserted string, and p is a small "penalty" constant, and the cost of a deletion is p.
     
    Last edited: Oct 24, 2006
  2. jcsd
  3. Oct 24, 2006 #2

    0rthodontist

    User Avatar
    Science Advisor

    Further thought: I could use the longest common substring algorithm as in the diff algorithm to generate the character-by-character operations to transform one string into the other. Then I could treat the operations as the set of all removals followed by the set of all additions, and unify adjacent single-character operations to create contiguous stretches that are summarized by multi-character operations. Then I could look for "islands" in the contiguous stretches of removals--short sequences between two removals--and replace these by a longer removal and then an addition. Letting r denote removed characters and k denote characters that are in both A and B, if the sequence goes like
    rrrrrkkrrrrrrkkrrrr
    which would be represented by two removals, then I could shorten this to
    rrrrrrrrrrrrrrrrrrr followed by the insertion of kkkk, at a cost of 2p + 4 instead of 3p.

    Also, the standard diff algorithm takes like the square of the string lengths to run (actually I haven't verified that but I believe it to be true), but it would be great if I had a diff-like algorithm that takes nearly linear time when the strings are nearly the same.

    The purpose of all this is to conserve bandwidth, sending differences instead of entire files when a low-bandwidth computer requests updates of a file from a server.
     
  4. Oct 24, 2006 #3

    0rthodontist

    User Avatar
    Science Advisor

    Well, fortuitously, Python happens to have a function that does almost precisely what I need, and Python is the language I have been working in. The difflib library and the SequenceMatcher class do it all.
     
    Last edited: Oct 24, 2006
Know someone interested in this topic? Share this thread via Reddit, Google+, Twitter, or Facebook

Have something to add?



Similar Discussions: String difference
  1. String Theory (Replies: 1)

  2. Ball of string (Replies: 7)

  3. Difference Operator (Replies: 3)

  4. Squares difference (Replies: 4)

  5. Averaging differences (Replies: 5)

Loading...