Combinations of Word Pairs

  • Thread starter nanoWatt
  • #1
88
2
I developed a PHP page that lets you paste in text and lists every word alphabetically together with its count in the text, tab-delimited.

http://www.cnetworksllc.com/word_lister [Broken]

For instance, if I type "the quick red fox jumps over the lazy brown dog"

I get:

brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
red 1
the 2


I am wondering about pseudo-code to give me word pairs like:

the quick 1
quick red 1
red fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy brown 1
brown dog 1

PHP provides an explode() function that converts a string to an array using a delimiter. I could probably use a reverse loop, counting down from the word count and concatenating two words at a time.
 

Answers and Replies

  • #2
CompuChip
Science Advisor
Homework Helper
4,306
48
What about using a regular expression that matches any two words and applying a function like preg_match_all ?
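A minimal sketch of the regular-expression idea, assuming PCRE and a lookahead so that each word can end one pair and start the next. The function name word_pairs_regex is hypothetical, and \w does not match apostrophes, so words like "don't" would be split:

```php
<?php
// Hypothetical helper: collect overlapping word pairs with preg_match_all.
function word_pairs_regex(string $text): array
{
    // (\w+) consumes one word; the lookahead captures the following word
    // without consuming it, so the next match can start on that word.
    preg_match_all('/(\w+)(?=\s+(\w+))/', strtolower($text), $matches, PREG_SET_ORDER);
    $pairs = array();
    foreach ($matches as $m) {
        $pairs[] = $m[1] . ' ' . $m[2];
    }
    return $pairs;
}

// array_count_values turns the list of pairs into pair => count.
print_r(array_count_values(word_pairs_regex('the quick red fox jumps over the lazy brown dog')));
```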
 
  • #3
88
2
I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.
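The loop described here could look roughly like the following. This is only a sketch: count_adjacent_pairs is a made-up name, the ?? operator assumes PHP 7 or later, and splitting on runs of whitespace is my own choice to avoid empty entries from double spaces:

```php
<?php
// Hypothetical helper: join word n and word n+1 with a space and count each pair.
function count_adjacent_pairs(string $text): array
{
    // Split on runs of whitespace and drop empty entries.
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    // Stop one short of the end: the last word has no right-hand neighbour.
    for ($n = 0; $n < count($words) - 1; $n++) {
        $pair = $words[$n] . ' ' . $words[$n + 1];
        $counts[$pair] = ($counts[$pair] ?? 0) + 1;
    }
    return $counts;
}

// Tab-delimited output, one pair per line.
foreach (count_adjacent_pairs('the quick red fox jumps over the lazy brown dog') as $pair => $count) {
    echo $pair . "\t" . $count . "\n";
}
```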
 
  • #4
29
0
Looping through should work fine. One technique is to store the counts in an array keyed by the words themselves as you go through the sentence in a for loop.

For example:
$arr['the']['quick']++;
$arr['quick']['red']++;
$arr['red']['fox']++;

Then just loop through printing out the counts.
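Spelled out as a loop, the nested-array idea might look like this. It is a sketch with an example sentence of my own; the isset() check is there because incrementing a key that does not exist yet can raise a notice or warning, depending on the PHP version:

```php
<?php
$words = explode(' ', 'the quick red fox jumps over the quick dog');
$arr = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $first = $words[$i];
    $second = $words[$i + 1];
    // Initialise missing entries before incrementing them.
    if (!isset($arr[$first][$second])) {
        $arr[$first][$second] = 0;
    }
    $arr[$first][$second]++;
}

// Print every pair with its count, tab-delimited.
foreach ($arr as $first => $followers) {
    foreach ($followers as $second => $count) {
        echo "$first $second\t$count\n";
    }
}
```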
 
  • #5
CRGreathouse
Science Advisor
Homework Helper
2,824
0
Be careful with memory (and time!), because with n unique words you can have up to [itex]\mathcal{O}(n^2)[/itex] distinct bigrams.
 
  • #6
88
2
The # of bigrams won't be squared. I am only using neighboring words. So for the 10-word example, I had 9 bigrams.

I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

The 2D array might not work for what I need.

This is what I'm going for:
$arr[0] = "the quick"
$arr[1] = "quick red"
$arr[2] = "red fox"

and so on.

My original string, $myString, contains the content of the text. A temporary string stores a copy of this value, which is lowercased. Newlines and tabs are replaced with spaces, and all non-alphanumeric characters (except apostrophes) are removed.

explode() creates the array, using a space as the delimiter.

Then I sort the array and apply array_unique to remove duplicates.

I can then loop through the array (using foreach) and, for each word pair in it, call substr_count to get the # of times that pair is used in the text.
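A rough sketch of that pipeline follows. Two things here are my own assumptions rather than part of the plan above: the helper name normalise_text, and counting with array_count_values instead of substr_count. I swapped the counting step because substr_count would also find "the quick" inside "bathe quickly" and does not count overlapping occurrences:

```php
<?php
// Hypothetical helper: lowercase, strip punctuation, collapse whitespace.
function normalise_text(string $myString): string
{
    $tmp = strtolower($myString);
    $tmp = preg_replace("/[^a-z0-9'\s]/", '', $tmp); // keep letters, digits, apostrophes, whitespace
    $tmp = preg_replace('/\s+/', ' ', $tmp);         // newlines, tabs and space runs -> one space
    return trim($tmp);
}

$text = normalise_text("The quick red fox.\nThe quick, lazy dog!");
$words = explode(' ', $text);

// Build the list of neighbouring word pairs.
$pairs = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $pairs[] = $words[$i] . ' ' . $words[$i + 1];
}

// Count each pair, then sort alphabetically by pair.
$counts = array_count_values($pairs);
ksort($counts);
foreach ($counts as $pair => $count) {
    echo $pair . "\t" . $count . "\n";
}
```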
 
  • #7
CompuChip
Science Advisor
Homework Helper
4,306
48
So you could do something like
Code:
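$string = "The quick brown fox jumps over the lazy dog";
$words = explode(' ', $string);
// sort() works in place and returns a bool, so sort a de-duplicated copy
// and leave $words in its original order for the pairing loop.
$singleWords = array_unique($words);
sort($singleWords);
$wordPairs = array();
// Stop at the second-to-last word: the last word has no neighbour to its right.
for ($i = 0; $i < count($words) - 1; $i++)
  $wordPairs[] = $words[$i] . " " . $words[$i + 1];
$wordPairs = array_unique($wordPairs);
sort($wordPairs);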

Perhaps not the most optimized code, but should work.
 
  • #8
CRGreathouse
Science Advisor
Homework Helper
2,824
0
nanoWatt said:
"The # of bigrams won't be squared. I am only using neighboring words. So for the 10-word example, I had 9 bigrams."

What I mean is that

"apple bear cow apple bear apple cow bear apple bear cow apple"

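has 3 unique words, 6 = 3(3-1) unique bigrams, and 8 unique trigrams: the 6 = 3(3-1)(3-2) with no repeated word, plus "apple bear apple" and "bear apple bear". For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of possible bigrams, although the number that actually occurs can never exceed the number of words in the text.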
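To check counts like these on a given text, something along these lines should do. It is only a sketch, and ngrams is a made-up helper name:

```php
<?php
// Hypothetical helper: list every run of $n neighbouring words.
function ngrams(array $words, int $n): array
{
    $grams = array();
    // The last valid start index is count($words) - $n.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $grams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $grams;
}

$words = explode(' ', 'apple bear cow apple bear apple cow bear apple bear cow apple');
foreach (array(1, 2, 3) as $n) {
    // array_unique drops repeats, so count() gives the number of distinct n-grams.
    echo $n . "-grams: " . count(array_unique(ngrams($words, $n))) . " unique\n";
}
```

The number of distinct n-grams is what drives memory use when the counts are stored in an array keyed by n-gram.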
 
