Compute String Difference: Minimize Inserts/Deletes

  • Thread starter Thread starter 0rthodontist
  • Start date Start date
  • Tags Tags
    Difference String
0rthodontist
Science Advisor
Messages
1,229
Reaction score
0
I have a need to compute a "difference" between two strings A and B. The idea is that B is similar to A, except it has had operations of string deletion and string insertion performed on it. An operation of deletion transforms a string xyz into a string xz (x, y, z being strings) and an operation of insertion transforms a string xz into a string xyz.

What I want is an input of A and B, where A is fairly similar to B, and an output which is a series of deletion and insertion operations that transforms A into B. Hopefully this will be a shortest series, but reasonably close is good enough.

Any idea where I should look for information on this before I try to hack something up?

edit: actually I hadn't really done enough looking, 10 more seconds of time brings me to the Levenshtein distance. Though this isn't exactly what I want since I'd rather be able to say "insert the string 'abcdxyzqrstuvqwertyz' at character position 900" than have to give that as 20 different character insertions.

edit2: Perhaps I'm basically looking to implement a custom version of unix's "diff" that goes character-by-character and not line-by-line. Unless "diff" works by treating lines as single symbols, in which case that's not what I need.

edit3: I could be more specific on the criterion I want for "shortest series of insertions and deletions." What I mean is roughly, let an insertion be denoted by a string +n,k,[s1..sk] where n is a number indicating the position of the first character to be inserted, k is the number of characters that follow, and s1..sk are the characters to be inserted. Similarly a deletion can be denoted by a string -n,k where k is the # of characters after position n to be deleted. I want to roughly minimize the total length of the descriptions of all insertions and deletions required to transform A into B. To simplify the problem, you could similarly assume that the cost of an insertion is p + L where L is the length of the inserted string, and p is a small "penalty" constant, and the cost of a deletion is p.
 
Last edited:
Mathematics news on Phys.org
Further thought: I could use the longest common substring algorithm as in the diff algorithm to generate the character-by-character operations to transform one string into the other. Then I could treat the operations as the set of all removals followed by the set of all additions, and unify adjacent single-character operations to create contiguous stretches that are summarized by multi-character operations. Then I could look for "islands" in the contiguous stretches of removals--short sequences between two removals--and replace these by a longer removal and then an addition. Letting r denote removed characters and k denote characters that are in both A and B, if the sequence goes like
rrrrrkkrrrrrrkkrrrr
which would be represented by two removals, then I could shorten this to
rrrrrrrrrrrrrrrrrrr followed by the insertion of kkkk, at a cost of 2p + 4 instead of 3p.

Also, the standard diff algorithm takes like the square of the string lengths to run (actually I haven't verified that but I believe it to be true), but it would be great if I had a diff-like algorithm that takes nearly linear time when the strings are nearly the same.

The purpose of all this is to conserve bandwidth, sending differences instead of entire files when a low-bandwidth computer requests updates of a file from a server.
 
Well, fortuitously, Python happens to have a function that does almost precisely what I need, and Python is the language I have been working in. The difflib library and the SequenceMatcher class do it all.
 
Last edited:
Insights auto threads is broken atm, so I'm manually creating these for new Insight articles. In Dirac’s Principles of Quantum Mechanics published in 1930 he introduced a “convenient notation” he referred to as a “delta function” which he treated as a continuum analog to the discrete Kronecker delta. The Kronecker delta is simply the indexed components of the identity operator in matrix algebra Source: https://www.physicsforums.com/insights/what-exactly-is-diracs-delta-function/ by...
Fermat's Last Theorem has long been one of the most famous mathematical problems, and is now one of the most famous theorems. It simply states that the equation $$ a^n+b^n=c^n $$ has no solutions with positive integers if ##n>2.## It was named after Pierre de Fermat (1607-1665). The problem itself stems from the book Arithmetica by Diophantus of Alexandria. It gained popularity because Fermat noted in his copy "Cubum autem in duos cubos, aut quadratoquadratum in duos quadratoquadratos, et...
Thread 'Imaginary Pythagorus'
I posted this in the Lame Math thread, but it's got me thinking. Is there any validity to this? Or is it really just a mathematical trick? Naively, I see that i2 + plus 12 does equal zero2. But does this have a meaning? I know one can treat the imaginary number line as just another axis like the reals, but does that mean this does represent a triangle in the complex plane with a hypotenuse of length zero? Ibix offered a rendering of the diagram using what I assume is matrix* notation...
Back
Top