How can I write PHP code to generate word pairs from a given text?

nanoWatt · Jan 7, 2008

I developed a PHP page that lets you paste in text and lists all the words alphabetically with their counts in the text, tab-delimited.

http://www.cnetworksllc.com/word_lister

For instance, if I type "the quick red fox jumps over the lazy brown dog"

I get:

brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
red 1
the 2

I am wondering about pseudo-code to give me word pairs like:

the quick 1
quick red 1
red fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy brown 1
brown dog 1

PHP provides an explode function that converts a string to an array using a delimiter. I could probably use a reverse loop counting down from the word count, concatenating two words at a time.

CompuChip · Jan 7, 2008

What about using a regular expression that matches any two words and applying a function like preg_match_all?
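
A minimal sketch of that idea, assuming the text is already lowercased and stripped of punctuation. The function name wordPairsRegex is hypothetical. A plain two-word pattern would skip every other pair, because each match consumes both words, so this version captures the second word inside a lookahead.

```php
<?php
// Hypothetical helper: collect overlapping adjacent word pairs with a regex.
function wordPairsRegex(string $text): array
{
    // Match one word, then peek at the next word without consuming it,
    // so the following match can start on that next word.
    preg_match_all('/\b(\w+)(?=\s+(\w+))/', $text, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        // $m[1] is the matched word, $m[2] the word captured in the lookahead.
        $pairs[] = $m[1] . ' ' . $m[2];
    }
    return $pairs;
}

print_r(wordPairsRegex('the quick red fox'));
```

The last word never starts a pair, since the lookahead finds no following word.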

nanoWatt · Jan 7, 2008

I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.

MikeH · Jan 8, 2008

Looping through should work fine. One technique might be to use an array to store the counts using text indices as you go through the sentence in a for loop.

For example:
$arr['the']['quick']++;
$arr['quick']['red']++;
$arr['red']['fox']++;

Then just loop through printing out the counts.
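
A rough sketch of how that could look in full, assuming the words are already in an array. The function name countPairsNested is made up for illustration, and each slot is initialised before incrementing to avoid undefined-index notices.

```php
<?php
// Hypothetical helper: count adjacent pairs in a nested array,
// indexed as $counts[$first][$second].
function countPairsNested(array $words): array
{
    $counts = [];
    // Stop one short of the end so $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $first = $words[$i];
        $second = $words[$i + 1];
        if (!isset($counts[$first][$second])) {
            $counts[$first][$second] = 0;
        }
        $counts[$first][$second]++;
    }
    return $counts;
}

$words = explode(' ', 'the quick red fox jumps over the lazy brown dog');
foreach (countPairsNested($words) as $first => $followers) {
    foreach ($followers as $second => $n) {
        echo "$first $second\t$n\n"; // tab-delimited: pair, then count
    }
}
```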

CRGreathouse · Jan 8, 2008

Be careful with memory (and time!), because with n words you have O(n^2) bigrams.

nanoWatt · Jan 8, 2008

The # of bigrams won't be squared. I am only using neighboring words. So for the 10-word example, I had 9 bigrams.

I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

The 2D array might not work for what I need.

This is what I'm going for:
$arr[0] = "the quick"
$arr[1] = "quick red"
$arr[2] = "red fox"

and so on.

My original string, $myString, contains the content of the text. A temporary string stores this value and is lowercased. Newlines and tabs are replaced with spaces, and all non-alphanumeric characters (except apostrophes and spaces) are removed.
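
Those clean-up steps could be sketched roughly like this, assuming plain ASCII text (strtolower and the character class below do not handle accented letters). The function name normaliseWords is hypothetical.

```php
<?php
// Hypothetical helper: normalise raw text into a list of lowercase words.
function normaliseWords(string $text): array
{
    $text = strtolower($text);
    // Turn newlines, tabs and other whitespace runs into single spaces.
    $text = preg_replace('/\s+/', ' ', $text);
    // Drop everything except letters, digits, apostrophes and spaces.
    $text = preg_replace("/[^a-z0-9' ]/", '', $text);
    // PREG_SPLIT_NO_EMPTY skips empty entries caused by leading,
    // trailing or doubled spaces left behind by the removals.
    return preg_split('/ +/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(normaliseWords("The quick,\n\tred fox's tail!"));
```

Splitting with preg_split instead of explode is a design choice here: explode(' ', ...) would yield empty strings wherever two spaces end up next to each other.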

Explode creates the array, using space as a delimiter.

Then I call sort to sort the array, and array_unique to remove the duplicates.

I can then loop through the array (using foreach) and call substr_count on each value to get the number of times that word pair is used in the text.
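
One possible alternative to the substr_count step, sketched under the assumption that the words are already in an array: substr_count matches raw substrings, so a pair like "the lazy" would also be counted inside "bathe lazy". Counting the pair strings themselves with array_count_values sidesteps that. The function name countWordPairs is made up for illustration.

```php
<?php
// Hypothetical helper: count each adjacent word pair, sorted alphabetically.
function countWordPairs(array $words): array
{
    $pairs = [];
    // Stop one short of the end so $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    // array_count_values maps each distinct pair to its number of occurrences.
    $counts = array_count_values($pairs);
    ksort($counts, SORT_STRING); // alphabetical by pair
    return $counts;
}

$words = explode(' ', 'the quick red fox jumps over the lazy brown dog');
foreach (countWordPairs($words) as $pair => $n) {
    echo "$pair\t$n\n"; // tab-delimited, like the single-word list
}
```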

CompuChip · Jan 8, 2008

So you could do something like

Code:

$string = "The quick brown fox jumps over the lazy dog";
$words = explode(' ', $string);
$singleWords = $words;
sort($singleWords); // sort() works in place and returns a bool; then array_unique, and whatever
$wordPairs = array();
// Stop one short of the end so that $words[$i + 1] always exists.
for ($i = 0; $i < count($words) - 1; $i++) {
    $wordPairs[] = $words[$i] . " " . $words[$i + 1];
}
sort($wordPairs); // and array_unique, etc.

Perhaps not the most optimized code, but should work.

CRGreathouse · Jan 9, 2008

nanoWatt said:

The # of bigrams won't be squared. I am only using neighboring words. So for the 10-word example, I had 9 bigrams.

What I mean is that

"apple bear cow apple bear apple cow bear apple bear cow apple"

has 3 unique words, 6 = 3(3-1) unique bigrams, and 8 unique trigrams: the 6 = 3(3-1)(3-2) orderings of three distinct words, plus "apple bear apple" and "bear apple bear". For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of possible bigrams.
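
To check counts like these on a given text, a small sketch along the following lines may help. The helper name ngrams is hypothetical, and it assumes the words are separated by single spaces.

```php
<?php
// Hypothetical helper: list every run of $n consecutive words.
function ngrams(array $words, int $n): array
{
    $grams = [];
    // $i + $n <= count($words) keeps every slice exactly $n words long.
    for ($i = 0; $i + $n <= count($words); $i++) {
        $grams[] = implode(' ', array_slice($words, $i, $n));
    }
    return $grams;
}

$words = explode(' ', 'apple bear cow apple bear apple cow bear apple bear cow apple');
echo count(array_unique($words)), "\n";            // number of unique words
echo count(array_unique(ngrams($words, 2))), "\n"; // number of unique bigrams
echo count(array_unique(ngrams($words, 3))), "\n"; // number of unique trigrams
```

The unique counts are bounded both by the number of possible combinations and by the length of the text, since a text of N words contains at most N - 1 bigrams.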
