# Combinations of Word Pairs

1. Jan 7, 2008

### nanoWatt

I developed a PHP page that lets you paste in text and lists all the words alphabetically with their counts in the text, tab-delimited.

http://www.cnetworksllc.com/word_lister [Broken]

For instance, if I type "the quick red fox jumps over the lazy brown dog"

I get:

brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
red 1
the 2

I am wondering about pseudo-code to give me word pairs like:

the quick 1
quick red 1
red fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy brown 1
brown dog 1

PHP provides an `explode` function that converts a string to an array using a delimiter. I could probably use a reverse loop counting down from the word count, concatenating two words at a time.

Last edited by a moderator: May 3, 2017

2. Jan 7, 2008

### CompuChip

What about using a regular expression that matches any two words and applying a function like `preg_match_all`?
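
As a rough sketch of how the regex idea might look: a zero-width lookahead lets `preg_match_all` report overlapping pairs, so "quick" can end one pair and start the next. The function name `wordPairsRegex` is made up for illustration, and `\w+` is assumed to be a good enough definition of a word (it splits on apostrophes). It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: collect neighbouring word pairs with a single regex.
function wordPairsRegex(string $text): array
{
    // \b anchors at a word boundary; the lookahead captures two words
    // without consuming them, so the next match can reuse the second word.
    preg_match_all('/\b(?=(\w+)\s+(\w+))/', $text, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[] = $m[1] . ' ' . $m[2];
    }
    return $pairs;
}

print_r(wordPairsRegex('the quick red fox'));
```

Without the lookahead, each match would consume both words and every second pair would be skipped.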

3. Jan 7, 2008

### nanoWatt

I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.
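
A minimal sketch of that loop, assuming the text has already been lowercased and stripped of punctuation. `countWordPairs` is a hypothetical name, and keying an associative array by the joined pair is just one way to keep the counts. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: count neighbouring word pairs in a string.
function countWordPairs(string $text): array
{
    // Split on runs of whitespace so double spaces don't yield empty "words".
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    // Stop at the second-to-last word so $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    return $counts;
}

$counts = countWordPairs('the quick red fox jumps over the lazy brown dog');
foreach ($counts as $pair => $n) {
    echo $pair, "\t", $n, "\n"; // tab-delimited, in order of first appearance
}
```

Calling `ksort($counts)` before printing would list the pairs alphabetically, like the single-word output.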

4. Jan 8, 2008

### MikeH

Looping through should work fine. One technique might be to use an array to store the counts using text indices as you go through the sentence in a for loop.

IE:

    $arr['the']['quick']++;
    $arr['quick']['red']++;
    $arr['red']['fox']++;

Then just loop through printing out the counts.

5. Jan 8, 2008

### CRGreathouse

Be careful with memory (and time!), because with n words you have $\mathcal{O}(n^2)$ bigrams.

6. Jan 8, 2008

### nanoWatt

The # of bigrams won't be squared. I am only using neighboring words. So for the 9-word example, I had 9 bigrams.

I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

The 2D array might not work for what I need. This is what I'm going for:

    $arr[0] = "the quick"
    $arr[1] = "quick red"
    $arr[2] = "red fox"

and so on.

My original string will be `$myString`, which contains the content of the text. A temporary string stores this value, is lowercased, and all punctuation is removed. Newlines and tabs are replaced with spaces, and non-alphanumeric characters (except for apostrophes) are removed.

`explode` creates the array, using space as a delimiter. Then I do `sort` to sort the array, and an `array_unique` to make the array unique.

I can then loop through the array (using `foreach`), and do `substr_count` to get the # of times the word pair is used, with each iteration of the array's value.

Last edited: Jan 8, 2008

7. Jan 8, 2008

### CompuChip

So you could do something like

    $string = "The quick brown fox jumps over the lazy dog";
    $words = explode(' ', $string);
    $singleWords = $words;
    sort($singleWords); // sort() works in place and returns a bool; then array_unique, and whatever
    $wordPairs = array();
    // Stop before the last word so that $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++)
        $wordPairs[] = $words[$i] . " " . $words[$i + 1];
    sort($wordPairs); // and array_unique, etc.

Perhaps not the most optimized code, but should work.
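
For the cleanup steps described earlier in the thread (lowercase, drop punctuation except apostrophes, collapse whitespace), a sketch could look like the following. `normalizeText` and `pairCounts` are hypothetical names, and replacing stripped characters with a space rather than deleting them is an assumption made here so that "dogs.The" does not fuse into one word. It counts pairs from the word array instead of using `substr_count`, because a substring search for "the lazy" would also match inside "bathe lazy". It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: lowercase, keep letters, digits and apostrophes.
function normalizeText(string $text): string
{
    $text = strtolower($text);
    // Any run of other characters (punctuation, tabs, newlines) becomes one space.
    $text = preg_replace("/[^a-z0-9']+/", ' ', $text);
    return trim($text);
}

// Hypothetical helper: pair counts, sorted alphabetically by pair.
function pairCounts(string $text): array
{
    $clean = normalizeText($text);
    $words = $clean === '' ? [] : explode(' ', $clean);
    $pairs = [];
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    $counts = array_count_values($pairs); // pair => number of occurrences
    ksort($counts);
    return $counts;
}

$sample = "Bathe lazy dogs.\nThe lazy dog's nap!";
print_r(pairCounts($sample));
// Substring counting also sees "the lazy" inside "bathe lazy":
echo substr_count(normalizeText($sample), 'the lazy'), "\n";
```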

8. Jan 9, 2008

### CRGreathouse

What I mean is that

"apple bear cow apple bear apple cow bear apple bear cow apple"

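has 3 unique words, 6 = 3(3-1) unique bigrams, and all 6 = 3(3-1)(3-2) trigrams of three distinct words (8 unique trigrams in total, counting "apple bear apple" and "bear apple bear"). For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of bigrams.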
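
To check counts like these on a given text, a small sketch that tallies the distinct n-grams actually occurring in a string may help. `uniqueNgramCount` is a hypothetical helper, and it assumes words are separated by whitespace only. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: number of distinct n-grams that occur in $text.
function uniqueNgramCount(string $text, int $n): int
{
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $seen = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        // Join n consecutive words into one key; duplicate keys collapse.
        $seen[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return count($seen);
}

$text = 'apple bear cow apple bear apple cow bear apple bear cow apple';
foreach ([1, 2, 3] as $n) {
    echo $n, '-grams: ', uniqueNgramCount($text, $n), "\n";
}
```

A text of n words contains at most n - 1 neighbouring pairs, so the number of bigrams that actually occur is limited by the length of the text as well as by the size of the vocabulary.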

Last edited: Jan 9, 2008