What is the best approach for extracting sentences from a text?

  1. Dec 10, 2013 #1
    I can't think of an elegant way to do it, and yet I need to, somehow.

    If you have a period followed by a space, then you know you're at the end of a sentence. Same with a period followed by a quotation mark (in the American style of writing; in the British style, the closing quotation mark comes before the period). But wait ... I just realized something else. If you have a name like E.B. White, that'll get cut off. D'oh! It gets more complicated when someone encloses multiple consecutive sentences in parentheses.

    Is this futile? I might have to go about my plagiarism detection program by another means. Looking for highly correlative sentences between texts cannot possibly work unless there's a good way of finding sentences.
  3. Dec 10, 2013 #2


    Gold Member

    Why do you need to worry about full sentences? Surely there are other approaches. For example, break text up into strings that each end, say, 5 characters in FRONT of a period, discard the next 10 characters and then start the next string. Surely something even that simple would catch most high correlations.
  4. Dec 10, 2013 #3
    Because people think in terms of sentences. That's how we separate our thoughts. By the way, my program catches cases where a plagiarist has tried to combine or split sentences in an effort to distance his writing from the copied text.

    Hmmm ... Not a bad idea.
  5. Dec 10, 2013 #4


    Science Advisor
    Homework Helper

    If plagiarists are going to combine or break up "sentences" to hide their activity, looking for "sentences" is the wrong strategy anyway.

    Why not just look for matching character strings, e.g. fairly short strings that occur rarely in the allegedly plagiarized text?

    I once wrote some software to do "industrial archeology" on different versions of computer software that worked like that. Hunting down matching occurrences of strings like "i = 0" was a waste of time, but there were plenty of unique text strings to match up, without "knowing" anything about the programming language syntax or what the code was supposed to be doing. I don't think I used any knowledge about the language syntax except to ignore all white space, regardless of whether or not it was significant.
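
A minimal sketch of that idea in Python, not the software described in the post. It strips all whitespace, collects every run of `k` consecutive characters, and reports the runs that are rare in the suspect text but also appear in the original. The function names, the window length `k=12`, and the rarity cutoff `max_count=1` are all assumptions chosen for illustration.

```python
from collections import Counter


def strip_whitespace(text):
    """Remove all whitespace, whether or not it was significant."""
    return "".join(text.split())


def kgrams(text, k):
    """Count every run of k consecutive non-whitespace characters."""
    s = strip_whitespace(text)
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def rare_shared_strings(suspect, original, k=12, max_count=1):
    """Return k-character strings that are rare in `suspect` and occur in `original`."""
    suspect_counts = kgrams(suspect, k)
    original_counts = kgrams(original, k)
    return {
        gram
        for gram, count in suspect_counts.items()
        if count <= max_count and gram in original_counts
    }
```

With `k=12`, a short common fragment such as "i = 0" can never produce a hit, because it is only three characters long once the whitespace is gone, while a distinctive identifier or phrase will.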
  6. Dec 10, 2013 #5
    I'm looking at the percentage of words shared in common between pairs of sentences. The percentage is calculated on the size of the smaller sentence (so there will be a 100% match if one sentence is contained in another). If S and T are the sets of words in two sentences, then I'm looking at

    (number of elements in smaller{S,T} that are contained in larger{S,T}) / sizeof(smaller{S,T})
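
The ratio above could be sketched in Python roughly as follows. How a sentence is broken into words is not specified in the post, so the lowercase regex tokenizer here is an assumption, as are the function names.

```python
import re


def words(sentence):
    """Return the set of lowercase words in a sentence (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def overlap(sentence_a, sentence_b):
    """Fraction of the smaller word set that also appears in the larger one."""
    s, t = words(sentence_a), words(sentence_b)
    smaller, larger = (s, t) if len(s) <= len(t) else (t, s)
    if not smaller:
        return 0.0  # avoid dividing by zero for an empty sentence
    return len(smaller & larger) / len(smaller)
```

For example, `overlap("The cat sat", "The cat sat on the mat")` gives 1.0, since every word of the shorter sentence is contained in the longer one.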
  7. Dec 10, 2013 #6

    jim mcnamara

    Staff: Mentor

    Detecting plagiarism usually starts with breaking text into words, doing word counts, and looking for phrases that are separated by only a small Levenshtein distance.
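
For reference, the Levenshtein (edit) distance can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook version, not taken from any particular plagiarism tool; it works on strings (character edits) or on lists of words (word edits).

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the working row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Passing word lists, e.g. `levenshtein("the cat sat".split(), "the dog sat".split())`, measures how many words were changed rather than how many characters.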

    On the next level try a cluster-based approach:
    http://www.uni-weimar.de/medien/webis/research/events/pan-10/pan10-papers-final/pan10-plagiarism-detection/zou10-a-cluster-based-plagiarism-detection-method.pdf [Broken]

    Google uses this algorithm to winnow out unique hits on a web search.

    Reinventing the plagiarism software is fun but your result will always be compared to what is out there already. Have fun.
    Last edited by a moderator: May 6, 2017
  8. Dec 11, 2013 #7


    Science Advisor
    Homework Helper

    Jamin2112: A period can be used at the end of a sentence, or can be embedded within a sentence. Therefore, it is virtually impossible to distinguish an end-of-sentence period anywhere near 100 % of the time. Therefore, you can immediately abandon the idea that you will ever get the program anywhere near 100 % correct. And thereby avoid wasting a large amount of time trying to achieve something that is unachievable. Instead, you could focus on, "How do I easily minimize the percentage of wrong end-of-sentence breaks?" Perhaps try the following. It is fast and simplistic, because you already know you will never achieve anything near perfection.

    First, convert any double blank characters to single blank characters. Next, if any of the following strings is followed by a capital letter, then convert the blank character in that string to an end-of-sentence marker, or newline, or whatever you are using to denote end-of-sentence. (The backslash is a quoting character here.)

    ". "
    "! "
    "? "
    ".\" "
    "!\" "
    "?\" "
    ".) "
    "!) "
    "?) "
    ".] "
    "!] "
    "?] "

    Notice, the above algorithm will interpret the blank character in "E. White" as an end-of-sentence. This is just part of your losses, since no algorithm can be perfect, due to the ambiguous usage of period. There is no way to distinguish "E. White" from "in figure B. The". Instead, you just want to reduce your losses. If the above algorithm gets 98 % of the end-of-sentence periods correct (and gets only 2 % wrong), then that would be great.
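
One way to sketch the rule above in Python is with a single regular expression: a terminator (`.`, `!`, or `?`), an optional closing quote, parenthesis, or bracket, one blank, and then a capital letter. The function name and the choice of newline as the end-of-sentence marker are assumptions for illustration; any newlines already in the text will also act as breaks.

```python
import re

# Terminator, optional closing quote/paren/bracket, one blank, then a capital letter.
_BREAK = re.compile(r'([.!?]["\)\]]?) (?=[A-Z])')


def split_sentences(text):
    """Split text at blanks that follow a terminator and precede a capital letter."""
    text = re.sub(r" {2,}", " ", text)   # collapse runs of blanks to one blank
    marked = _BREAK.sub(r"\1\n", text)   # turn each qualifying blank into a newline
    return marked.split("\n")
```

Running it on `'It works. Does it? Yes! (Mostly.) Ask E. White.'` yields `['It works.', 'Does it?', 'Yes! (Mostly.)', 'Ask E.', 'White.']`, which shows both the intended breaks and the accepted loss at "E. White".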
    Last edited: Dec 11, 2013
  9. Dec 12, 2013 #8


    Science Advisor
    Homework Helper

    You could make a reasonable guess that a "word" of one upper-case letter is not the last word in a sentence. You might need to think about two letter "words" as well, like St. or Mc. But St. might be the end of a sentence if it is an abbreviation for "street" and not "saint".
  10. Dec 12, 2013 #9


    Science Advisor
    Homework Helper

    AlephZero: If you do that (as mentioned in post 8), then I think the program would start making a lot of mistakes, for two-character words, and also for one-character words. Check out how many mistakes it would then make in just this one arbitrary, common (real-life) example, quoted (paraphrased) from an actual document.

    You have a round rod of diameter Xd. The surface of the metal block is point A. The cantilever tip is point B. The load applied at point B is force P. Reaction force Fc is located at point C. Reaction force Fd is located at point D. The tip of the embedded portion of the rod is point E.

    The cantilever length (distance AB) is L. The distance from point A to point C is Xa. The distance from point A to point D is Xb. The distance from point A to point E is La. Summing horizontal forces, and solving for Fd, gives Fd = Fc - P.

    E.g., let's say La = 10.5 mm. Let's say you assume a center of rotation at 3.5 mm from point A. Therefore, we have, Fc = 457.1 N. And Fd = 57.1 N.

    You can adjust dimensions Xa and Xb slightly, to see what effect it has on reaction forces Fc and Fd.
  11. Dec 12, 2013 #10
    I'm thinking that the best way to approach this would be to look not at sentences but at groups of words. A word is easy to isolate based on spaces. Define a word to have a minimum length of, say, 4 to get rid of "it", "is", "the", "and", etc.

    Then build your own sentences of, say, 10 words, sentence 1 being words 1-10, sentence 2 being words 2-11, etc., and build a heuristic based on the number of sentences in source 1 that share X words in common with a sentence in source 2.
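
A rough Python sketch of this sliding-window heuristic follows. The minimum word length of 4 and the window size of 10 come from the post; the regex tokenizer, the function names, and the default `threshold=7` (standing in for "X") are assumptions.

```python
import re


def content_words(text, min_len=4):
    """Lowercase words of at least min_len letters, in order of appearance."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) >= min_len]


def windows(word_list, size=10):
    """Overlapping windows: words 1-10, then 2-11, and so on, each as a set."""
    if not word_list:
        return []
    if len(word_list) <= size:
        return [set(word_list)]
    return [set(word_list[i:i + size]) for i in range(len(word_list) - size + 1)]


def matching_windows(source1, source2, size=10, threshold=7):
    """Count windows in source1 sharing at least `threshold` words with a window in source2."""
    w1 = windows(content_words(source1), size)
    w2 = windows(content_words(source2), size)
    return sum(1 for a in w1 if any(len(a & b) >= threshold for b in w2))
```

Comparing every window against every other window is quadratic in the text length, which is fine for short documents but would need an index for large ones.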
  12. Dec 12, 2013 #11


    Science Advisor

    Take a look at context free grammars to classify and extract information and its structure.

    Once you have this and the corresponding structure, you can classify the information and choose what classifications you want as well as design filters and transformations for your program needs.