What is the best approach for extracting sentences from a text?

  • Thread starter Jamin2112
  • #1
986
9

Main Question or Discussion Point

I can't think of an elegant way to do it, and yet I need to, somehow.

If you have a period followed by a space, then you know you're at the end of a sentence. The same goes for a period followed by a quotation mark (in the American style; in the British style, the quotation mark goes inside the period). But wait ... I just realized something else. If you have a name like E.B. White, that'll get cut off. D'oh! It gets more complicated when someone encloses multiple consecutive sentences in parentheses.

Is this futile? I might have to go about my plagiarism detection program by another means. Looking for highly correlative sentences between texts cannot possibly work unless there's a good way of finding sentences.
 

Answers and Replies

  • #2
phinds
Science Advisor
Insights Author
Gold Member
2019 Award
16,223
6,293
Why do you need to worry about full sentences? Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.
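
Here is one possible reading of that chunking scheme, sketched in Python. The function name, the parameter names, and the handling of trailing text with no period are assumptions for illustration; the post does not specify them.

```python
def chunks_around_periods(text, before=5, skip=10):
    """Cut text into chunks that stop `before` characters ahead of each
    period, then skip `skip` characters before starting the next chunk."""
    chunks = []
    start = 0
    while start < len(text):
        # Look for the next period far enough ahead that the chunk is not negative-length.
        dot = text.find(".", start + before)
        if dot == -1:
            # No further period: keep whatever is left as a final chunk (assumption).
            chunks.append(text[start:])
            break
        end = dot - before
        if end > start:
            chunks.append(text[start:end])
        # Discard the characters around the period and resume after them.
        start = end + skip
    return chunks


sample = "The quick brown fox jumps. Over the lazy dog it goes. End"
print(chunks_around_periods(sample))
```

With the defaults, the discarded 10 characters straddle each period, so where a sentence ends has little effect on which chunks come out.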
 
  • #3
986
9
Why do you need to worry about full sentences?
Because people think in terms of sentences. That's how we separate our thoughts. By the way, my program catches if a plagiarist tried to combine or split sentences in an effort to distance his writing from the copied text.

Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.
Hmmm ... Not a bad idea.
 
  • #4
AlephZero
Science Advisor
Homework Helper
6,994
291
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

Why not just look for matching character strings, e.g. fairly short strings that occur rarely in the allegedly plagiarized text?

I once wrote some software to do "industrial archeology" on different versions of computer software that worked like that. Hunting down matching occurrences of strings like "i = 0" was a waste of time, but there were plenty of unique text strings to match up, without "knowing" anything about the programming language syntax or what the code was supposed to be doing. I don't think I used any knowledge about the language syntax except to ignore all white space, regardless of whether or not it was significant.
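
A rough Python sketch of this kind of matching, not the original software: strip all whitespace, cut the suspect text into fixed-length substrings, keep only those that occur exactly once there, and report the ones that also appear in the other text. The substring length `k`, the "occurs once" threshold, and the function names are assumptions.

```python
import re
from collections import Counter


def strip_whitespace(text):
    # Ignore all white space, whether or not it is significant.
    return re.sub(r"\s+", "", text)


def rare_shared_substrings(suspect, source, k=12):
    """Return length-k substrings that occur exactly once in `suspect`
    and also occur somewhere in `source` (both with whitespace removed)."""
    a = strip_whitespace(suspect)
    b = strip_whitespace(source)
    counts = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    source_grams = {b[i:i + k] for i in range(len(b) - k + 1)}
    return sorted(g for g, n in counts.items() if n == 1 and g in source_grams)


suspect = "i = 0\ni = 0\ntotal = compute_checksum(buffer)\n"
source = "i = 0\nresult = compute_checksum( buffer )\n"
print(rare_shared_substrings(suspect, source, k=8))
```

Common fragments such as "i = 0" drop out because they repeat in the suspect text, while the rarer `compute_checksum(buffer)` fragments survive.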
 
  • #5
986
9
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.
I'm looking at the percentage of words shared in common between pairs of sentences. The percentage is calculated on the size of the smaller sentence (so there will be a 100% match if one sentence is contained in another). If S and T are the sets of words in two sentences, then I'm looking at

(number of elements in smaller{S,T} that are contained in larger{S,T}) / sizeof(smaller{S,T})
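
In Python, that ratio might look like the sketch below. The tokenizer (lower-cased runs of letters, digits, and apostrophes) and the function names are assumptions; the formula itself is the one above.

```python
import re


def word_set(sentence):
    # Assumed tokenizer: lower-case words made of letters, digits, apostrophes.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap_ratio(s, t):
    """Share of the smaller sentence's words that also occur in the larger one."""
    a, b = word_set(s), word_set(t)
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    if not small:
        return 0.0  # avoid dividing by zero for an empty sentence
    return len(small & large) / len(small)


print(overlap_ratio("The cat sat.", "The cat sat on the mat."))  # 1.0
print(overlap_ratio("red green", "green blue"))                  # 0.5
```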
 
  • #6
jim mcnamara
Mentor
3,949
2,351
Detecting plagiarism usually starts with breaking text into words, doing word counts, and looking for phrases that are separated by only a small Levenshtein distance.
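
For reference, a minimal Python sketch of the Levenshtein (edit) distance using the standard dynamic-programming recurrence. It works on any two sequences, so it can compare phrases character by character or word by word; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence `a` into sequence `b`."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))                          # 3
print(levenshtein("the cat sat".split(), "the dog sat".split())) # 1
```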

On the next level try a cluster-based approach:
http://www.uni-weimar.de/medien/webis/research/events/pan-10/pan10-papers-final/pan10-plagiarism-detection/zou10-a-cluster-based-plagiarism-detection-method.pdf [Broken]

Google uses this algorithm to winnow out unique hits on a web search.

Reinventing the plagiarism software is fun but your result will always be compared to what is out there already. Have fun.
 
  • #7
nvn
Science Advisor
Homework Helper
2,128
32
Jamin2112: A period can be used at the end of a sentence, or can be embedded within a sentence. Therefore, it is virtually impossible to distinguish an end-of-sentence period anywhere near 100 % of the time. You can immediately abandon the idea that you will ever get the program anywhere near 100 % correct, and thereby avoid wasting a large amount of time trying to achieve something that is unachievable. Instead, you could focus on, "How do I easily minimize the percentage of wrong end-of-sentence breaks?" Perhaps try the following. It is fast and simplistic, because you already know you will never achieve anything near perfection.

First, convert any double blank characters to single blank characters. Next, if any of the following strings are followed by a capital letter, then convert the subject blank character to an end-of-sentence marker, or newline, or whatever you are using to denote end-of-sentence. (The backslash is a quoting character here.)

". "
"! "
"? "
".\" "
"!\" "
"?\" "
".) "
"!) "
"?) "
".] "
"!] "
"?] "

Notice, the above algorithm will interpret the blank character in "E. White" as an end-of-sentence. This is just part of your losses, since no algorithm can be perfect, due to the ambiguous usage of period. There is no way to distinguish "E. White" from "in figure B. The". Instead, you just want to reduce your losses. If the above algorithm gets 98 % of the end-of-sentence periods correct (and gets only 2 % wrong), then that would be great.
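
The rule above can be sketched in Python with one regular expression. The function name is an assumption, and folding every run of whitespace (not only double blanks) into a single blank is a small extension of the first step.

```python
import re


def split_sentences(text):
    """Split after . ! ? (optionally followed by a closing quote, paren, or
    bracket) whenever the next character after the blank is a capital letter."""
    # Collapse runs of whitespace to a single blank (extends "double blank" to newlines).
    text = re.sub(r"\s+", " ", text).strip()
    # Replace the blank at each presumed sentence end with a newline marker.
    marked = re.sub(r'([.!?]["\)\]]?) (?=[A-Z])', r"\1\n", text)
    return marked.split("\n")


print(split_sentences("It rained.  We stayed in! Was that wise? Perhaps."))
print(split_sentences("I read E. White today."))  # shows the known false break
```

The second call reproduces the "E. White" loss described above: the output is `['I read E.', 'White today.']`.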
 
  • #8
AlephZero
Science Advisor
Homework Helper
6,994
291
You could make a reasonable guess that a "word" of one upper-case letter is not the last word in a sentence. You might need to think about two letter "words" as well, like St. or Mc. But St. might be the end of a sentence if it is an abbreviation for "street" and not "saint".
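
As a sketch of that guess in Python: apply the same capital-letter splitting rule as in post #7, but skip a period that directly follows a lone upper-case letter. The function name is an assumption, and two-letter cases such as St. or Mc. are deliberately not handled here.

```python
import re


def split_sentences_skip_initials(text):
    """Split on . ! ? followed by a blank and a capital letter, except when
    a period follows a single upper-case letter (treated as an initial)."""
    text = re.sub(r"\s+", " ", text).strip()
    sentences, start = [], 0
    for m in re.finditer(r'[.!?]["\)\]]? (?=[A-Z])', text):
        before = text[:m.start()]
        # A capital letter not preceded by another letter, right before the period.
        lone_capital = re.search(r"(?<![A-Za-z])[A-Z]$", before)
        if m.group().startswith(".") and lone_capital:
            continue
        sentences.append(text[start:m.end() - 1])  # drop the trailing blank
        start = m.end()
    sentences.append(text[start:])
    return sentences


print(split_sentences_skip_initials("I read E. White today. It was good."))
print(split_sentences_skip_initials("The tip is point B. The load is force P."))
```

The first call keeps "E. White" together, but the second returns the two sentences as one, which is the cost of this guess on text that labels things with single letters.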
 
  • #9
nvn
Science Advisor
Homework Helper
2,128
32
AlephZero: If you do that (as mentioned in post 8), then I think the program would start making a lot of mistakes, for two-character words, and also for one-character words. Check out how many mistakes it would then make in just this one arbitrary, common (real-life) example, quoted (paraphrased) from an actual document.

You have a round rod of diameter Xd. The surface of the metal block is point A. The cantilever tip is point B. The load applied at point B is force P. Reaction force Fc is located at point C. Reaction force Fd is located at point D. The tip of the embedded portion of the rod is point E.

The cantilever length (distance AB) is L. The distance from point A to point C is Xa. The distance from point A to point D is Xb. The distance from point A to point E is La. Summing horizontal forces, and solving for Fd, gives Fd = Fc - P.

E.g., let's say La = 10.5 mm. Let's say you assume a center of rotation at 3.5 mm from point A. Therefore, we have, Fc = 457.1 N. And Fd = 57.1 N.

You can adjust dimensions Xa and Xb slightly, to see what effect it has on reaction forces Fc and Fd.
 
  • #10
402
120
I'm thinking that the best way to approach this would be to look not at sentences but at groups of words. A word is easy to isolate based on spaces. Define a word to be of a minimum length, say 4, to get rid of "it", "is", "the", "and", etc.

Then build your own sentences of, say, 10 words (sentence 1 being words 1-10, sentence 2 being words 2-11, etc.), and build a heuristic based on the number of sentences in source 1 that share X words in common with a sentence in source 2.
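
A small Python sketch of this sliding-window idea. The function names, the tokenizer, and the default threshold of 7 shared words (the "X" above) are assumptions chosen for illustration.

```python
import re


def long_words(text, min_len=4):
    # Keep only words of at least min_len letters; drops "it", "is", "the", "and".
    return [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= min_len]


def windows(words, size=10):
    """Overlapping word windows: words 1-10, 2-11, and so on, as sets."""
    if len(words) <= size:
        return [set(words)] if words else []
    return [set(words[i:i + size]) for i in range(len(words) - size + 1)]


def matching_windows(text1, text2, size=10, min_shared=7, min_len=4):
    """Count windows of text1 sharing at least min_shared words with some window of text2."""
    w1 = windows(long_words(text1, min_len), size)
    w2 = windows(long_words(text2, min_len), size)
    return sum(1 for a in w1 if any(len(a & b) >= min_shared for b in w2))


t1 = "alpha bravo charlie delta echo foxtrot"
t2 = "zulu alpha bravo charlie delta yankee"
print(matching_windows(t1, t2, size=4, min_shared=3))  # 2
```

Comparing every window against every other window is quadratic in the text length, so this is only meant to show the heuristic, not an efficient way to run it on long documents.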
 
  • #11
chiro
Science Advisor
4,790
132
Take a look at context-free grammars to classify and extract information and its structure.

Once you have this and the corresponding structure, you can classify the information and choose what classifications you want as well as design filters and transformations for your program needs.
 
