How can I create a PHP code to generate word pairs from a given text?

  • Thread starter Thread starter nanoWatt
  • Start date Start date
  • Tags Tags
    Combinations
AI Thread Summary
A PHP page has been developed to generate word counts from a given text, displaying results in a tab-delimited format. The discussion focuses on creating a PHP code to generate word pairs (bigrams) from the text, emphasizing the use of the EXPLODE function to convert strings into arrays. The proposed method involves looping through the array of words, concatenating each word with its neighbor, and storing the counts in a 2D array. Concerns about memory and processing time are noted, especially with larger texts, but the approach is deemed feasible for smaller datasets. The overall goal is to efficiently process text content and generate bigrams for analysis.
nanoWatt
Messages
85
Reaction score
2
I developed a php page that allows you to paste in text, and gives all the words listed alphabetically, and their counts in the text, tab delimited.

http://www.cnetworksllc.com/word_lister

for instance, if I type "the quick red fox jumps over the lazy brown dog"

I get:

brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
red 1
the 2


I am wondering about pseudo-code to give me word pairs like:

the quick 1
quick red 1
red fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy brown 1
brown dog 1

PHP gives an EXPLODE function that converts a string to an array using a delimiter. I probably could use a reverse loop counting down from the word count, and concatinating two words.
 
Last edited by a moderator:
Technology news on Phys.org
What about using a regular expression that matches any two words and applying a function like preg_match_all ?
 
I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.
 
Looping through should work fine. One technique might be to use an array to store the counts using text indices as you go through the sentence in a for loop.

IE:
$arr['the']['quick']++;
$arr['quick']['red']++;
$arr['red']['fox']++;

Then just loop through printing out the counts.
 
Be careful with memory (and time!), because with n words you have \mathcal{O}(n^2) bigrams.
 
The # of bigrams won't be squared. I am only using neighboring words. So for the 9-word example, I had 9 bigrams.

I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

The 2D array might not work for what I need.

This is what I'm going for:
$arr[0] = "the quick"
$arr[1] = "quick red"
$arr[2] = "red fox"

and so on.

My original string will be $myString that contains the content of the text. A temporary string stores this value, is lowercased, and all punctuation is removed. Newlines and tabs are replaced with spaces, and non alpha-numeric (except for apostrophe) are removed.

Explode creates the array, using space as a delimiter.

Then, I do sort to sort the array, and an array_unique to make the array unique.

I can then loop through the array (using foreach), and do substr_count to get the # of times the word-pair is used, with each iteration of the array's value.
 
Last edited:
So you could do something like
Code:
$string = "The quick brown fox jumps over the lazy dog";
$words = explode(' ', $string);
$singleWords = sort($words); // and array_unique, and whatever
$wordPairs = array();
for($i = 0; $i <= count($words); $i++) 
  $wordPairs[] = $words[$i] . " " . $words[$i + 1];
$wordPairs = sort($wordPairs); // and array_unique, etc.

Perhaps not the most optimized code, but should work.
 
nanoWatt said:
The # of bigrams won't be squared. I am only using neighboring words. So for the 9-word example, I had 9 bigrams.

What I mean is that

"apple bear cow apple bear apple cow bear apple bear cow apple"

has 3 unique words, 6 = 3(3-1) unique bigrams, and 6 = 3(3-1)(3-2) unique trigrams. For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of bigrams.
 
Last edited:
Back
Top