What is the best approach for extracting sentences from a text?

In summary, the conversation discusses the difficulty of detecting plagiarism and suggests different approaches, such as looking for matching character strings or using a cluster-based method. It is noted that detecting plagiarism is not a perfect science and any algorithm will have its limitations. Different suggestions are given to minimize mistakes in detecting sentence boundaries, such as converting double blank characters to single ones and using certain delimiter strings to denote end-of-sentence. The conversation also provides an example of how even a plausible refinement, such as exempting one-character words from end-of-sentence detection, can itself introduce mistakes.
  • #1
Jamin2112
I can't think of an elegant way to do it, and yet I need to, somehow.

If you have a period followed by a space, then you know you're at the end of a sentence. Same with a period followed by a closing quotation mark (in the American style of writing, the period goes inside the quotation marks; in the British style, it goes outside). But wait ... I just realized something else. If you have a name like E.B. White, that'll get cut off. D'oh! It gets more complicated when someone encloses multiple consecutive sentences in parentheses.

Is this futile? I might have to go about my plagiarism detection program by another means. Looking for highly correlative sentences between texts cannot possibly work unless there's a good way of finding sentences.
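For a first pass, a heuristic splitter along these lines might look like the sketch below (the function name and the abbreviation list are my own, illustrative and incomplete; it splits on a terminator followed by whitespace and a capital letter, and special-cases known abbreviations like "Dr." and "E.B."):

```python
import re

# Illustrative, incomplete list of abbreviations that should not end a sentence.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St.", "E.B.", "e.g.", "i.e."}

# Terminator, optional closing quote/bracket, whitespace, then a capital letter.
BOUNDARY = re.compile(r"([.!?]['\"\)\]]*)\s+(?=[A-Z])")

def split_sentences(text):
    """Split text at candidate sentence boundaries, skipping boundaries
    whose preceding token looks like a known abbreviation."""
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        last_word = text[start:m.end(1)].split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # probably an abbreviation, not a real boundary
        sentences.append(text[start:m.end(1)])
        start = m.end()
    if start < len(text):
        sentences.append(text[start:])  # trailing sentence
    return sentences
```

As the thread goes on to discuss, no abbreviation list will ever be complete, so this only reduces the error rate rather than eliminating it.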
 
  • #2
Why do you need to worry about full sentences? Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.
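That chunking scheme might be sketched as follows (function and parameter names are my own; the 5- and 10-character offsets are the ones suggested above, and the idea is that small edits near the cut points get discarded):

```python
def chunk_text(text, back=5, skip=10):
    """Cut the text at each period: each chunk ends `back` characters
    before the period, and the next chunk starts `skip` characters
    after the cut point."""
    chunks, start = [], 0
    i = text.find('.', start)
    while i != -1:
        end = max(start, i - back)
        if end > start:
            chunks.append(text[start:end])
        start = end + skip          # discard the region around the period
        i = text.find('.', start)
    if start < len(text):
        chunks.append(text[start:])  # trailing chunk after the last period
    return chunks
```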
 
  • #3
phinds said:
Why do you need to worry about full sentences?

Because people think in terms of sentences. That's how we separate our thoughts. By the way, my program catches if a plagiarist tried to combine or split sentences in an effort to distance his writing from the copied text.

phinds said:
Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.

Hmmm ... Not a bad idea.
 
  • #4
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

Why not just look for matching character strings, e.g. fairly short strings that occur rarely in the allegedly plagiarized text?

I once wrote some software to do "industrial archeology" on different versions of computer software that worked like that. Hunting down matching occurrences of strings like "i = 0" was a waste of time, but there were plenty of unique text strings to match up, without "knowing" anything about the programming language syntax or what the code was supposed to be doing. I don't think I used any knowledge about the language syntax except to ignore all white space, regardless of whether or not it was significant.
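The rare-shared-strings idea could be sketched like this (illustrative only; the n-gram length and rarity threshold are arbitrary choices, and whitespace is collapsed first, as in the software described above):

```python
from collections import Counter

def rare_shared_ngrams(text_a, text_b, n=20, max_count=1):
    """Find character n-grams that occur at most `max_count` times in
    text_a and also appear somewhere in text_b. Rare shared strings are
    much stronger evidence of copying than common ones like 'i = 0'."""
    def normalize(t):
        return ' '.join(t.split())  # ignore all whitespace differences
    a, b = normalize(text_a), normalize(text_b)
    counts = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    return sorted(g for g, c in counts.items() if c <= max_count and g in b)
```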
 
  • #5
AlephZero said:
If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

I'm looking at the percentage of words shared in common between pairs of sentences. The percentage is calculated on the size of the smaller sentence (so there will be a 100% match if one sentence is contained in another). If S and T are the sets of words in two sentences, then I'm looking at

|S ∩ T| / min(|S|, |T|)
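That ratio could be computed as in the sketch below (the function name and the lowercasing step are my additions):

```python
def overlap_fraction(sentence_a, sentence_b):
    """Fraction of the smaller sentence's word set contained in the
    larger's: |S ∩ T| / min(|S|, |T|). Returns 1.0 when one word set
    is entirely contained in the other."""
    s = set(sentence_a.lower().split())
    t = set(sentence_b.lower().split())
    smaller, larger = sorted((s, t), key=len)
    if not smaller:
        return 0.0  # avoid dividing by zero on an empty sentence
    return len(smaller & larger) / len(smaller)
```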
 
  • #6
Detecting plagiarism usually starts with breaking text into words, doing word counts, and looking for phrases separated by small Levenshtein distances.

On the next level try a cluster-based approach:
http://www.uni-weimar.de/medien/webis/research/events/pan-10/pan10-papers-final/pan10-plagiarism-detection/zou10-a-cluster-based-plagiarism-detection-method.pdf

Google uses this algorithm to winnow out unique hits on a web search.

Reinventing plagiarism-detection software is fun, but your result will always be compared to what is already out there. Have fun.
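The edit-distance step mentioned above can be sketched with the standard dynamic program (an illustrative implementation, not from the thread; it works on strings or on lists of words):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences,
    using a rolling row so memory is O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]
```

Run on word lists rather than raw characters, this gives a phrase-level distance that tolerates single-word substitutions.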
 
  • #7
Jamin2112: A period can be used at the end of a sentence, or can be embedded within a sentence, so it is virtually impossible to distinguish an end-of-sentence period anywhere near 100% of the time. You can therefore immediately abandon the idea that the program will ever be anywhere near 100% correct, and thereby avoid wasting a large amount of time trying to achieve something that is unachievable. Instead, focus on, "How do I easily minimize the percentage of wrong end-of-sentence breaks?" Perhaps try the following. It is fast and simplistic, because you already know you will never achieve anything near perfection.

First, convert any double blank characters to single blank characters. Next, if any of the following strings are followed by a capital letter, then convert the subject blank character to an end-of-sentence marker, or newline, or whatever you are using to denote end-of-sentence. (The backslash is a quoting character here.)

". "
"! "
"? "
".\" "
"!\" "
"?\" "
".) "
"!) "
"?) "
".] "
"!] "
"?] "​

Notice, the above algorithm will interpret the blank character in "E. White" as an end-of-sentence. This is just part of your losses, since no algorithm can be perfect, due to the ambiguous usage of the period. There is no way to distinguish "E. White" from "in figure B. The". Instead, you just want to reduce your losses. If the above algorithm gets 98% of the end-of-sentence periods correct (and only 2% wrong), then that would be great.
 
  • #8
You could make a reasonable guess that a "word" consisting of one upper-case letter is not the last word in a sentence. You might need to think about two-letter "words" as well, like St. or Mc. But St. might end a sentence if it is an abbreviation for "street" rather than "saint".
 
  • #9
AlephZero: If you do that (as mentioned in post 8), then I think the program would start making a lot of mistakes, for two-character words, and also for one-character words. Check out how many mistakes it would then make in just this one arbitrary, common (real-life) example, quoted (paraphrased) from an actual document.

You have a round rod of diameter Xd. The surface of the metal block is point A. The cantilever tip is point B. The load applied at point B is force P. Reaction force Fc is located at point C. Reaction force Fd is located at point D. The tip of the embedded portion of the rod is point E.

The cantilever length (distance AB) is L. The distance from point A to point C is Xa. The distance from point A to point D is Xb. The distance from point A to point E is La. Summing horizontal forces, and solving for Fd, gives Fd = Fc - P.

E.g., let's say La = 10.5 mm. Let's say you assume a center of rotation at 3.5 mm from point A. Therefore, we have, Fc = 457.1 N. And Fd = 57.1 N.

You can adjust dimensions Xa and Xb slightly, to see what effect it has on reaction forces Fc and Fd.​
 
  • #10
I'm thinking that the best way to approach this wouldn't be to look at sentences, but at groups of words. A word is easy to isolate based on spaces. Define a word to have a minimum length, say 4, to get rid of "it", "is", "the", "and", etc.

Then build your own "sentences" of, say, 10 words (sentence 1 being words 1-10, sentence 2 being words 2-11, etc.), and build a heuristic based on the number of sentences in source 1 that share X words in common with a sentence in source 2.
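That sliding-window idea might look like the sketch below (names, the length-4 cutoff, and the match threshold are illustrative choices, not fixed by the post):

```python
MIN_WORD_LEN = 4  # drop short words like "it", "is", "the", "and"

def word_windows(text, size=10):
    """Build overlapping pseudo-sentences: words 1-10, 2-11, ...,
    keeping only words of at least MIN_WORD_LEN characters."""
    words = [w for w in text.lower().split() if len(w) >= MIN_WORD_LEN]
    return [tuple(words[i:i + size])
            for i in range(max(1, len(words) - size + 1))]

def shared_window_score(text_a, text_b, size=10, min_common=7):
    """Count windows of text_a that share at least `min_common` words
    with some window of text_b."""
    wins_b = [set(w) for w in word_windows(text_b, size)]
    hits = 0
    for w in word_windows(text_a, size):
        ws = set(w)
        if any(len(ws & wb) >= min_common for wb in wins_b):
            hits += 1
    return hits
```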
 
  • #11
Take a look at context-free grammars to classify and extract information and its structure.

Once you have this and the corresponding structure, you can classify the information and choose what classifications you want as well as design filters and transformations for your program needs.
 

1. What are the commonly used methods for extracting sentences from a text?

Some commonly used methods for extracting sentences from a text include rule-based methods, statistical methods, and machine learning techniques.

2. How do rule-based methods work for extracting sentences from a text?

Rule-based methods use a set of predefined rules and patterns to identify and extract sentences from a text. These rules can include grammar rules, punctuation patterns, and sentence length.

3. What is the process for using statistical methods to extract sentences from a text?

Statistical methods use statistical models and algorithms to extract sentences from a text. This involves training the model on a large dataset of labeled sentences and then using the model to predict the sentences in a new text.

4. Can machine learning techniques be used to extract sentences from any type of text?

Yes, machine learning techniques can be used to extract sentences from any type of text. However, they may require a large amount of training data and may not perform well on texts with unique or rare sentence structures.

5. What factors should be considered when choosing the best approach for extracting sentences from a text?

The best approach for extracting sentences from a text will depend on the specific goals and requirements of the task. Factors to consider include the type and complexity of the text, the available resources and training data, and the desired level of accuracy and efficiency.
