What is the best approach for extracting sentences from a text?

AI Thread Summary
The discussion revolves around the challenges of developing an effective plagiarism detection program, particularly in identifying full sentences. Key points include the complexity of distinguishing sentence boundaries due to punctuation ambiguities, such as periods that can signify both sentence endings and abbreviations. Participants suggest alternative strategies, such as breaking text into smaller character strings or focusing on matching unique text strings rather than full sentences, as plagiarists may manipulate sentence structures to evade detection. The conversation also touches on calculating shared word percentages between sentences and exploring cluster-based approaches for plagiarism detection. Participants emphasize the importance of minimizing errors in identifying sentence boundaries while acknowledging that achieving near-perfect accuracy is unrealistic. Suggestions include simplifying the algorithm to handle common punctuation patterns and considering the context of words to improve detection accuracy. Overall, the discussion highlights the need for innovative methods to enhance plagiarism detection while recognizing inherent limitations.
Jamin2112
I can't think of an elegant way to do it, and yet I need to, somehow.

If you have a period followed by a space, then you know you're at the end of a sentence. Same with a period followed by a closing quotation mark (in the American convention the period sits inside the quotation marks; in the British convention the closing quotation mark comes before the period). But wait ... I just realized something else. If you have a name like E.B. White, that'll get cut off. D'oh! It gets more complicated when someone encloses multiple consecutive sentences in parentheses.
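
For what it's worth, here is a rough Python sketch of the "punctuation + space" rule with a small exception list for initials and abbreviations. The function name and the abbreviation list are purely illustrative, and it still gets plenty of cases wrong, which is rather the problem:

```python
import re

# Arbitrary exception list; a real one would be much longer.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "e.g.", "i.e."}

def naive_split_sentences(text):
    # Candidate chunk: anything ending in . ! or ?, optionally followed by a
    # closing quote/bracket, then whitespace (or trailing text with no
    # terminal punctuation at all).
    chunks = re.findall(r'\S.*?[.!?]+[")\]]*(?:\s+|$)|\S.+?$', text, re.S)
    sentences, buffer = [], ""
    for chunk in chunks:
        buffer = (buffer + " " + chunk).strip()
        last = buffer.split()[-1].lower()
        # Don't end a sentence on a known abbreviation or on initials such
        # as "E." or "E.B." -- glue the next chunk on instead.
        if last in ABBREVIATIONS or re.fullmatch(r"(?:[a-z]\.)+", last):
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```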

Is this futile? I might have to go about my plagiarism detection program by other means. Looking for highly correlated sentences between texts can't possibly work unless there's a good way of finding sentences.
 
Why do you need to worry about full sentences? Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.
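
A minimal sketch of that chunking idea, assuming Python; the 5 and 10 are the offsets suggested above, and everything else (the name, the handling of the final chunk) is just one way to do it:

```python
def chunk_by_periods(text, back=5, skip=10):
    # End each chunk 'back' characters before a period, discard the next
    # 'skip' characters, then start the next chunk.
    chunks, start = [], 0
    while True:
        dot = text.find(".", start)
        if dot == -1:
            tail = text[start:].strip()
            if tail:
                chunks.append(tail)
            return chunks
        end = max(start, dot - back)
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        # max(...) guarantees we always move past the period we just found.
        start = max(end + skip, dot + 1)
```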
 
phinds said:
Why do you need to worry about full sentences?

Because people think in terms of sentences. That's how we separate our thoughts. By the way, my program catches cases where a plagiarist has tried to combine or split sentences in an effort to distance his writing from the copied text.

phinds said:
Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.

Hmmm ... Not a bad idea.
 
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

Why not just look for matching character strings, e.g. fairly short strings that occur rarely in the allegedly plagiarized text?

I once wrote some software to do "industrial archeology" on different versions of computer software that worked like that. Hunting down matching occurrences of strings like "i = 0" was a waste of time, but there were plenty of unique text strings to match up, without "knowing" anything about the programming language syntax or what the code was supposed to be doing. I don't think I used any knowledge about the language syntax except to ignore all white space, regardless of whether or not it was significant.
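
A minimal sketch of that approach, assuming Python: collect fixed-length character n-grams with all whitespace removed (as described above), keep the ones that occur rarely in the suspect text, and report which of those also appear in the source. The gram length and the rarity threshold here are arbitrary.

```python
from collections import Counter

def rare_shared_strings(suspect, source, n=30, max_count=2):
    # Ignore all whitespace, as described above.
    squash = lambda s: "".join(s.split())
    a, b = squash(suspect), squash(source)
    # Count every n-character substring of the suspect text ...
    grams = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    # ... keep the rare ones, and report those that also occur in the source.
    return [g for g, c in grams.items() if c <= max_count and g in b]
```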
 
AlephZero said:
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

I'm looking at the percentage of words shared in common between pairs of sentences. The percentage is calculated relative to the size of the smaller sentence (so there will be a 100% match if one sentence is contained in another). If S and T are the sets of words in two sentences, then I'm looking at

|S ∩ T| / min(|S|, |T|)
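
In code, that ratio is just the word-set intersection normalised by the smaller set. A minimal sketch, with hypothetical names:

```python
def shared_word_ratio(sent_a, sent_b):
    S, T = set(sent_a.lower().split()), set(sent_b.lower().split())
    if not S or not T:
        return 0.0
    # Words shared, divided by the size of the smaller word set, so a
    # sentence fully contained in the other scores 1.0.
    return len(S & T) / min(len(S), len(T))
```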
 
Detecting plagiarism usually starts with breaking text into words, doing word counts, and looking for phrases that are within a small Levenshtein distance of each other.
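
For reference, the standard dynamic-programming Levenshtein distance looks roughly like this (a sketch only; in practice you would use an existing library):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]
```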

On the next level try a cluster-based approach:
http://www.uni-weimar.de/medien/webis/research/events/pan-10/pan10-papers-final/pan10-plagiarism-detection/zou10-a-cluster-based-plagiarism-detection-method.pdf

Google uses this algorithm to winnow out unique hits on a web search.

Reinventing plagiarism-detection software is fun, but your result will always be compared to what is already out there. Have fun.
 
Jamin2112: A period can be used at the end of a sentence, or can be embedded within a sentence, so it is virtually impossible to distinguish an end-of-sentence period anywhere near 100 % of the time. You can therefore immediately abandon the idea that the program will ever be anywhere near 100 % correct, and avoid wasting a large amount of time trying to achieve something that is unachievable. Instead, focus on "How do I easily minimize the percentage of wrong end-of-sentence breaks?" Perhaps try the following. It is fast and simplistic, because you already know you will never achieve anything near perfection.

First, convert any double blank characters to single blank characters. Next, if any of the following strings are followed by a capital letter, then convert the subject blank character to an end-of-sentence marker, or newline, or whatever you are using to denote end-of-sentence. (The backslash is a quoting character here.)

". "
"! "
"? "
".\" "
"!\" "
"?\" "
".) "
"!) "
"?) "
".] "
"!] "
"?] "​

Notice that the above algorithm will interpret the blank character in "E. White" as an end of sentence. This is just part of your losses, since no algorithm can be perfect, due to the ambiguous usage of the period. There is no way to distinguish "E. White" from "in figure B. The". Instead, you just want to reduce your losses. If the above algorithm gets 98 % of the end-of-sentence periods correct (and gets only 2 % wrong), then that would be great.
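
A minimal Python sketch of that scheme, using a newline as the end-of-sentence marker (any marker would do):

```python
import re

def mark_sentence_ends(text):
    # Double blanks -> single blanks.
    text = re.sub(r"  +", " ", text)
    # . ! or ?, optionally followed by " ) or ], then a blank, then a
    # capital letter: replace the blank with a newline.
    return re.sub(r'([.!?][")\]]?) (?=[A-Z])', r"\1\n", text)
```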
 
You could make a reasonable guess that a "word" of one upper-case letter is not the last word in a sentence. You might need to think about two letter "words" as well, like St. or Mc. But St. might be the end of a sentence if it is an abbreviation for "street" and not "saint".
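
One way to fold that into the regex sketch above (illustrative only; two-letter abbreviations like "St." would still need their own exception list):

```python
import re

# Same as the sketch above, but skip the break when the punctuation is
# immediately preceded by a lone capital letter (a one-letter "word").
BREAK = re.compile(r'(?<! [A-Z])([.!?][")\]]?) (?=[A-Z])')

def mark_sentence_ends_v2(text):
    text = re.sub(r"  +", " ", text)
    return BREAK.sub(r"\1\n", text)
```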
 
AlephZero: If you do that (the single-letter "word" exception suggested above), then I think the program would start making a lot of mistakes for two-character words, and also for one-character words. Check out how many mistakes it would then make in just this one arbitrary, common (real-life) example, quoted (paraphrased) from an actual document.

You have a round rod of diameter Xd. The surface of the metal block is point A. The cantilever tip is point B. The load applied at point B is force P. Reaction force Fc is located at point C. Reaction force Fd is located at point D. The tip of the embedded portion of the rod is point E.

The cantilever length (distance AB) is L. The distance from point A to point C is Xa. The distance from point A to point D is Xb. The distance from point A to point E is La. Summing horizontal forces, and solving for Fd, gives Fd = Fc - P.

E.g., let's say La = 10.5 mm. Let's say you assume a center of rotation at 3.5 mm from point A. Therefore, we have, Fc = 457.1 N. And Fd = 57.1 N.

You can adjust dimensions Xa and Xb slightly, to see what effect it has on reaction forces Fc and Fd.
 
I'm thinking that the best way to approach this wouldn't be to look at sentences, but at groups of words. A word is easy to isolate based on spaces. Define a word to have a minimum length, say 4, to get rid of short words like "it", "is", "the", "and", etc.

Then build your own "sentences" of, say, 10 words, sentence one being words 1-10, sentence two being words 2-11, etc., and build a heuristic based on the number of sentences in source 1 that share X words in common with a sentence in source 2.
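
A rough sketch of that heuristic, with illustrative names and arbitrary thresholds: drop short words, build overlapping 10-word windows, and count how many windows from one source share at least X words with some window from the other.

```python
def word_windows(text, size=10, min_len=4):
    # Drop short words ("it", "is", "the", ...) and build overlapping windows.
    words = [w for w in text.lower().split() if len(w) >= min_len]
    return [set(words[i:i + size]) for i in range(len(words) - size + 1)]

def matching_windows(text_a, text_b, size=10, min_shared=6):
    # Count windows from text_a sharing at least 'min_shared' words
    # with some window from text_b.
    windows_b = word_windows(text_b, size)
    return sum(
        1 for wa in word_windows(text_a, size)
        if any(len(wa & wb) >= min_shared for wb in windows_b)
    )
```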
 
Take a look at context-free grammars to classify and extract information and its structure.

Once you have this and the corresponding structure, you can classify the information, choose the classifications you want, and design filters and transformations to suit your program's needs.
 
