Unsolvable python code bug? (finding the difference between two input strings)

AI Thread Summary
The discussion focuses on finding an effective method to compare two bodies of text and identify their differences in the correct order. The initial code provided attempts to compare words from two input texts but outputs differences incorrectly. Suggestions include utilizing the Linux diff command, which employs the Hunt-Szymanski algorithm for efficient comparison, and implementing visual diff tools that highlight line-by-line discrepancies. Alternatives discussed involve using queues for line comparison and enhancing the algorithm to recognize repeated lines, which can complicate standard diff outputs. The conversation emphasizes the importance of matching substrings accurately and suggests advanced data structures, such as linked lists in trees, to improve the mapping of word occurrences and their contexts in the texts. This approach could yield more meaningful insights into the differences between the two texts.
Sucvicbil
Messages
5
Reaction score
0
TL;DR Summary
Finding differences between two bodies of text, difference output in order
All day I have been trying to find a method of finding the difference between two input (body of text), and get the difference but in order. I have tried breaking down the text into separate lines, then try to find the difference between both lines, but somehow this doesn't solve the problem and the code seems to keep outputting the difference but in the wrong order, can anybody help? this is my current code, it keeps outputting the difference but in the wrong order:
Python:
def compare_word_groups(group1, group2):
    words1 = group1.split()
    words2 = group2.split()

    removed_words = []
    i = j = 0
    while i < len(words1) and j < len(words2):
        if words1[ i].lower() == words2[j].lower():
            i += 1
            j += 1
        else:
            removed_words.append(words1[ i])
            i += 1
   
    # Add any remaining words from group1
    removed_words.extend(words1[ i:])

    print(f"[{', '.join(removed_words)}]")

Example usage

group1 = """should have known, like it was a red flag because for one he was asking to use my car. what's up y'all? welcome back to my channel. it's payden, and today i'm back with another video. today's video is going to be a very long overdue gardening wing stop mukbang. I've been meaning to do another Wingstop Mukbang for a long time. y'all, Something just told me: let me come and give y'all a story time while I eat some Wingstop. Today's video is going to be my story time on my worst situation shift, literally ever in my entire life. Y'all, I had to come and tell y'all this story time just because I went through so much and like y'all might already know what this video is going to be about, who this video is going to be about, i'm not going to say any names, but- and i would prefer if no one comments any names or mentions any names in the comment section- got this wing stop. i got my wing stop order. i'm so freaking hungry right now. y'all, it's about to be two in the morning, but i didn't let that stop me from coming and making this video for y'all. i did not want to do this video because the person who was involved with this story time is actually going to be really happy that i did this. this person is very happy when it comes to getting attention and throughout our relationship- which i'll explain a little bit more about the relationship that we had- i could just tell that they were trying to use me. i'm just going to tell you guys all the crazy stuff that happened just as quick as it started. it ended quickly as well. i'm also going to be getting lit in today's video. okay, i have this little pre-roll. let's pop my little pre-roll off camera. make sure you guys are subscribed on patreon. I don't want them to age restrict this video. They might age restrict it anyway. I always see you guys in the comments asking why I have to like bleep everything out and do all this extra stuff. It's basically just because YouTube is really strict. you guys Sorry that I can't show everything. If you guys want to, y'all can subscribe to Patreon if you can. If y'all want to see the extended version, just go on Patreon. Let's go ahead and get started. And by the time I come back, the food is going to be warmed up, And I warmed all my food up. Now let me tell you guys what I ordered. I ordered a lot of Parmesan cinders and honey hot tenders. i've never tried honey hot before. i actually hate trying new things, but i decided to try something new today. i'm mad that i even tried honey hot, because i didn't know it was dry rub. i thought it was going to be wet. if i would have not, i would have just got buffalo. this is buffalo and garlic parmesan- all flats, extra fried, crispy bone and wings. this is the honey hot and garlic parmesan tenders. and then i got some voodoo fries, but they didn't put no rant. i'ma put the ranch. they honestly pissed me off when i went to go and get this. and of course, y'all can't go wrong with the ranch and the honey mustard. i'm definitely not. as so y'all this story time. it goes back to like last year."""

group2 = """should have known, like it was a red flag because for one he was asking to use my car. it's payden, and today i'm back with another video. today's video is going to be a very long overdue gardening wing stop mukbang. I've been meaning to do another Wingstop Mukbang for a long time. y'all, Something just told me: let me come and give y'all a story time while I eat some Wingstop. Today's video is going to be my story time on my worst situation shift, literally ever in my entire life. Y'all, I had to come and tell y'all this story time just because I went through so much and like y'all might already know what this video is going to be about, who this video is going to be about, i'm not going to say any names, but- and i would prefer if no one comments any names or mentions any names in the comment section- got this wing stop. y'all, it's about to be two in the morning, but i didn't let that stop me from coming and making this video for y'all. i did not want to do this video because the person who was involved with this story time is actually going to be really happy that i did this. this person is very happy when it comes to getting attention and throughout our relationship- which i'll explain a little bit more about the relationship that we had- i could just tell that they were trying to use me. i'm just going to tell you guys all the crazy stuff that happened just as quick as it started. i'm also going to be getting lit in today's video. let's pop my little pre-roll off camera. They might age restrict it anyway. I always see you guys in the comments asking why I have to like bleep everything out and do all this extra stuff. It's basically just because YouTube is really strict. you guys Sorry that I can't show everything. If y'all want to see the extended version, just go on Patreon. Now let me tell you guys what I ordered. i actually hate trying new things, but i decided to try something new today. i thought it was going to be wet. this is the honey hot and garlic parmesan tenders. and then i got some voodoo fries, but they didn't put no rant. they honestly pissed me off when i went to go and get this. and of course, y'all can't go wrong with the ranch and the honey mustard. i'm definitely not. as so y'all this story time. it goes back to like last year."""

compare_word_groups(group1, group2)
 
Last edited by a moderator:
Technology news on Phys.org
This is an exciting and somewhat complex subject.

Have you used the Linux diff command, i.e., placed the two texts in separate files and used the diff command to compare them?

https://en.wikipedia.org/wiki/Diff

The diff command uses the Hunt-Szymanski algorithm:

https://en.wikipedia.org/wiki/Hunt–Szymanski_algorithm

One common usage is in visual diff displays, as can be found in some programmer IDE tools. These tools display line-by-line differences, showing lines in one file, not in the other, or vice-versa.

---

I implemented something like this many years ago using two queues, one for each file. I would read the lines into their respective queues and compare the A and B queues, looking for two lines that matched.

I used a two-line comparison for alignment as a single-line comparison came up with less meaningful differences.

On no match, I would read a line in from each file and add to its respective queue.

On a match of two lines, I would print the non-matching lines of both queues and remove them from the queues.

It's akin to a left-lookahead scheme for diffing.

---

One thing that can make for a more helpful diff command is discovering when people have replicated a few lines repeatedly in the second text, as happens when using a text editor.

You might decide to add logging statements to some code and so you enter one line, copy it and then paste in several spots. To the diff algorithm they are just inserted lines not present in the older file. However with an improved diff, you could mark the lines as copies of a common line.
 
  • Like
Likes harborsparrow, Swamp Thing and berkeman
Don’t forget you don’t need Linux, difflib is a standard Python3 library.


The reason your algorithm won’t work is consider s1=“I threw the book at the dumb cat” s2= “I threw the book (about the boy at play with the childless cat lady) at the dumb cat” notice how your algorithm will match things inside the parens to outside in the other string. That’s why the longest matching substrings matter. It’s a super interesting space though, with ties to how search engines index, and LLMs if you go deep enough. If you want to play with it, a useful data structure is two linked lists in a tree. The trees will serve as indexes for each word, recording their instances of appearance. The linked list will tell what words come after or before with a word instance. Your job is to make a one to one map. So in the example above, “at” will occur once in s1, twice in s2. You can then see that the first occurrence in s2 is followed by “play” unlike in s1, but the second is followed by “the cat” like in s1, so that’s the one that should be mapped to. The thing is everything having to do with this has applications in a million other places.
 
  • Like
Likes harborsparrow, PhDeezNutz and jedishrfu
Dear Peeps I have posted a few questions about programing on this sectio of the PF forum. I want to ask you veterans how you folks learn program in assembly and about computer architecture for the x86 family. In addition to finish learning C, I am also reading the book From bits to Gates to C and Beyond. In the book, it uses the mini LC3 assembly language. I also have books on assembly programming and computer architecture. The few famous ones i have are Computer Organization and...
I had a Microsoft Technical interview this past Friday, the question I was asked was this : How do you find the middle value for a dataset that is too big to fit in RAM? I was not able to figure this out during the interview, but I have been look in this all weekend and I read something online that said it can be done at O(N) using something called the counting sort histogram algorithm ( I did not learn that in my advanced data structures and algorithms class). I have watched some youtube...
Back
Top