
Combinations of Word Pairs

  1. Jan 7, 2008 #1
    I developed a PHP page that lets you paste in text and lists all the words alphabetically with their counts, tab-delimited.

    http://www.cnetworksllc.com/word_lister [Broken]

    For instance, if I type "the quick red fox jumps over the lazy brown dog"

    I get:

    brown 1
    dog 1
    fox 1
    jumps 1
    lazy 1
    over 1
    quick 1
    red 1
    the 2

    I am wondering about pseudo-code to give me word pairs like:

    the quick 1
    quick red 1
    red fox 1
    fox jumps 1
    jumps over 1
    over the 1
    the lazy 1
    lazy brown 1
    brown dog 1

    PHP provides an explode function that converts a string to an array using a delimiter. I could probably use a reverse loop, counting down from the word count and concatenating two words at a time.
    Last edited by a moderator: May 3, 2017
  3. Jan 7, 2008 #2


    Science Advisor
    Homework Helper

    What about using a regular expression that matches any two words and applying a function like preg_match_all?
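
One caveat with the regular-expression idea: a plain two-word pattern consumes both words, so it would skip every other pair. A lookahead is one way around that. The following is only a minimal sketch, assuming words are runs of `\w` characters separated by whitespace; the function name `regex_word_pairs` is hypothetical.

```php
<?php
// Hypothetical helper: collect overlapping adjacent word pairs with a lookahead.
function regex_word_pairs(string $text): array
{
    $pairs = [];
    // \b anchors at a word boundary; the lookahead captures two words
    // without consuming them, so the next match can start at the second word.
    preg_match_all('/\b(?=(\w+)\s+(\w+))/', $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $pairs[] = $m[1] . ' ' . $m[2];
    }
    return $pairs;
}

print_r(regex_word_pairs("the quick red fox"));
```

Because `\w` does not match an apostrophe, words such as "it's" would be split by this pattern, so the character class would need adjusting for real text.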
  4. Jan 7, 2008 #3
    I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.
  5. Jan 8, 2008 #4
    Looping through should work fine. One technique is to use an array indexed by the pair text itself, and increment the count for each pair as you go through the sentence in a for loop.


    Then just loop through printing out the counts.
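
The text-indexed array could look roughly like this. It is a sketch rather than a finished implementation: the function name `count_word_pairs` is hypothetical, and it assumes the text has already been stripped of punctuation.

```php
<?php
// Hypothetical helper: count adjacent word pairs, keyed by the pair text.
function count_word_pairs(string $text): array
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        // The pair string itself is the array index.
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    return $counts;
}

// Print each pair and its count, tab-delimited.
foreach (count_word_pairs("the quick red fox jumps over the lazy brown dog") as $pair => $n) {
    echo $pair . "\t" . $n . "\n";
}
```

This makes a single pass over the words, so the work grows linearly with the length of the text.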
  6. Jan 8, 2008 #5


    Science Advisor
    Homework Helper

    Be careful with memory (and time!), because with n words you have [itex]\mathcal{O}(n^2)[/itex] bigrams.
  7. Jan 8, 2008 #6
    The number of bigrams won't be squared. I am only using neighboring words, so n words give n - 1 bigrams. For the 10-word example, I had 9 bigrams.

    I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

    The 2D array might not work for what I need.

    This is what I'm going for:
    $arr[0] = "the quick"
    $arr[1] = "quick red"
    $arr[2] = "red fox"

    and so on.

    My original string, $myString, contains the content of the text. A temporary string stores this value and is lowercased. Newlines and tabs are replaced with spaces, and all non-alphanumeric characters (except the apostrophe) are removed, which takes care of punctuation.
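
The cleanup steps described here might be sketched as follows. The function name `normalize_text` is hypothetical, and the sketch assumes plain ASCII text; accented or multibyte characters would need different handling.

```php
<?php
// Hypothetical helper: normalize text before splitting it into words.
function normalize_text(string $text): string
{
    $text = strtolower($text);
    // Newlines and tabs become spaces.
    $text = str_replace(["\r", "\n", "\t"], ' ', $text);
    // Drop everything except letters, digits, apostrophes, and spaces.
    $text = preg_replace("/[^a-z0-9' ]/", '', $text);
    // Collapse repeated spaces so explode() yields no empty words.
    return trim(preg_replace('/ +/', ' ', $text));
}

echo normalize_text("The quick,\tred fox.\nIt's here!");
```

Collapsing repeated spaces is an extra step not mentioned in the post; without it, explode with a single-space delimiter would produce empty strings wherever two spaces meet.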

    Explode creates the array, using space as a delimiter.

    Then I sort the array with sort and remove duplicates with array_unique.

    I can then loop through the array (using foreach) and call substr_count on each value to get the number of times that word pair is used in the text.
    Last edited: Jan 8, 2008
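
One thing to watch with substr_count is that it matches substrings rather than whole words, so a pair like "the quick" is also found inside "bathe quickly". The sketch below compares it with array_count_values applied to the list of pairs; the helper name `adjacent_pairs` and the sample sentence are made up for illustration.

```php
<?php
// Hypothetical helper: build the list of adjacent pairs from normalized text.
function adjacent_pairs(string $text): array
{
    $words = explode(' ', $text);
    $pairs = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    return $pairs;
}

$text = "the quick fox will bathe quickly";
// substr_count() also finds "the quick" inside "bathe quickly".
echo substr_count($text, "the quick"), "\n";

// Counting the pair strings themselves only matches whole words.
$counts = array_count_values(adjacent_pairs($text));
ksort($counts); // alphabetical order by pair
echo $counts["the quick"], "\n";
```

With array_count_values there is also no need for a separate array_unique pass, since the keys of the result are already the unique pairs.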
  8. Jan 8, 2008 #7


    Science Advisor
    Homework Helper

    So you could do something like
    Code (Text):

    $string = "The quick brown fox jumps over the lazy dog";
    $words = explode(' ', $string);
    // sort() works in place and returns a bool, so sort a copy and keep $words in order
    $singleWords = array_unique($words); // and whatever
    sort($singleWords);
    $wordPairs = array();
    // stop at the second-to-last word so that $words[$i + 1] always exists
    for($i = 0; $i < count($words) - 1; $i++)
      $wordPairs[] = $words[$i] . " " . $words[$i + 1];
    sort($wordPairs); // and array_unique, etc.
    Perhaps not the most optimized code, but should work.
  9. Jan 9, 2008 #8


    Science Advisor
    Homework Helper

    What I mean is that

    "apple bear cow apple bear apple cow bear apple bear cow apple"

    has 3 unique words, 6 = 3(3-1) unique bigrams, and 6 = 3(3-1)(3-2) unique trigrams of three distinct words (8 if you also count "apple bear apple" and "bear apple bear"). For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of possible bigrams.
    Last edited: Jan 9, 2008
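
The counts for the example string can be checked with a short script. This is a sketch with a hypothetical function name, `unique_ngram_count`. It counts every distinct run of n neighboring words, so trigrams that repeat a word, such as "apple bear apple", are included in the total.

```php
<?php
// Hypothetical helper: count the distinct n-grams of adjacent words.
function unique_ngram_count(string $text, int $n): int
{
    $words = explode(' ', $text);
    $seen = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        // Join $n neighboring words and use the result as a set key.
        $seen[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return count($seen);
}

$text = "apple bear cow apple bear apple cow bear apple bear cow apple";
echo unique_ngram_count($text, 1), "\n"; // unique words
echo unique_ngram_count($text, 2), "\n"; // unique bigrams
echo unique_ngram_count($text, 3), "\n"; // unique trigrams, repeats of a word included
```

For adjacent n-grams, the number of distinct keys can never exceed the number of words in the text, so memory use is bounded by the text length rather than by the square of the vocabulary size.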