How can I write PHP code to generate word pairs from a given text?

  • Thread starter: nanoWatt
  • Tags: Combinations

Discussion Overview

The discussion revolves around creating PHP code to generate word pairs (bigrams) from a given text. Participants explore various methods for extracting and counting these pairs, focusing on techniques such as string manipulation, looping, and data structures.

Discussion Character

  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant describes a PHP page that lists words and their counts, suggesting a method to generate word pairs using a reverse loop and concatenation.
  • Another participant proposes using a regular expression with preg_match_all to match two words, although this is later deemed unnecessary by others.
  • Some participants suggest looping through the words and concatenating neighboring words, with one providing an example of using an array to store counts based on word indices.
  • Concerns are raised about memory and time complexity, with a participant clarifying that the number of bigrams is not squared since only neighboring words are considered.
  • A participant outlines their approach involving string manipulation to prepare the text, including lowercasing, punctuation removal, and creating an array of words using explode.
  • Another participant shares a code snippet illustrating how to generate word pairs using a loop, although they acknowledge it may not be the most optimized solution.
  • Discussion includes a clarification about the number of unique bigrams and trigrams in relation to the number of unique words in a larger corpus.

Areas of Agreement / Disagreement

Participants express various methods and ideas for generating word pairs, but there is no consensus on a single approach. Some techniques are debated, particularly regarding efficiency and memory usage.

Contextual Notes

Participants mention specific limitations related to memory and time complexity, as well as the need for careful handling of large text inputs. There are also considerations regarding the uniqueness of word pairs and the implications for larger datasets.

nanoWatt
I developed a PHP page that lets you paste in text and lists all the words alphabetically with their counts in the text, tab-delimited.

http://www.cnetworksllc.com/word_lister

For instance, if I type "the quick red fox jumps over the lazy brown dog"

I get:

brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
red 1
the 2


I am wondering about pseudo-code to give me word pairs like:

the quick 1
quick red 1
red fox 1
fox jumps 1
jumps over 1
over the 1
the lazy 1
lazy brown 1
brown dog 1

PHP provides an explode function that converts a string to an array using a delimiter. I could probably use a reverse loop, counting down from the word count and concatenating two words at a time.
 
What about using a regular expression that matches any two words and applying a function like preg_match_all?
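For reference, a regex-only version is possible. The sketch below is one way it might look, assuming words consist only of \w characters (so an apostrophe would split a word). The lookahead is what lets consecutive pairs overlap; a plain two-word pattern would consume both words and skip every other pair. The function name regexWordPairs is made up for illustration.

```php
<?php
// Hypothetical helper: collect neighbouring word pairs with a single regex.
function regexWordPairs(string $text): array
{
    // The lookahead (?=...) matches without consuming text, so the second
    // word of one pair can also be the first word of the next pair.
    preg_match_all('/(?=\b(\w+)\s+(\w+)\b)/', strtolower($text), $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        $pairs[] = $m[1] . ' ' . $m[2];
    }
    return $pairs;
}

$pairs = regexWordPairs('the quick red fox jumps over the lazy brown dog');
print_r($pairs);
```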
 
I actually don't need matching. I want to group the words. So I think I can loop through them, and concatenate word n and word n+1 with a space separator.
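A minimal sketch of that loop, assuming the text is already cleaned up and words are separated by whitespace. The helper name neighbourPairs is hypothetical, and array_count_values is just one convenient way to get the tab-delimited counts.

```php
<?php
// Hypothetical helper: join each word with its right-hand neighbour.
function neighbourPairs(string $text): array
{
    // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $pairs = [];
    // Stop at the second-to-last word so $words[$i + 1] always exists.
    for ($i = 0; $i < count($words) - 1; $i++) {
        $pairs[] = $words[$i] . ' ' . $words[$i + 1];
    }
    return $pairs;
}

$pairs = neighbourPairs('the quick red fox jumps over the lazy brown dog');

// array_count_values turns the list into pair => count.
$counts = array_count_values($pairs);
foreach ($counts as $pair => $count) {
    echo $pair . "\t" . $count . "\n";
}
```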
 
Looping through should work fine. One technique might be to use an array to store the counts using text indices as you go through the sentence in a for loop.

For example:
$arr['the']['quick']++;
$arr['quick']['red']++;
$arr['red']['fox']++;

Then just loop through printing out the counts.
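Filled out, the nested-array idea might look like the sketch below. It assumes a space-delimited, already-normalized sentence. The ?? 0 default is an addition: a bare ++ on a key that does not exist yet still works but raises an undefined-key warning in recent PHP versions.

```php
<?php
$words = explode(' ', 'the quick red fox jumps over the lazy brown dog');

$arr = [];
for ($i = 0; $i < count($words) - 1; $i++) {
    $first  = $words[$i];
    $second = $words[$i + 1];
    // ?? 0 supplies a starting count the first time a pair is seen.
    $arr[$first][$second] = ($arr[$first][$second] ?? 0) + 1;
}

// Print every pair with its count, tab-delimited.
foreach ($arr as $first => $followers) {
    foreach ($followers as $second => $count) {
        echo $first . ' ' . $second . "\t" . $count . "\n";
    }
}
```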
 
Be careful with memory (and time!), because with n words you have $\mathcal{O}(n^2)$ bigrams.
 
The number of bigrams won't be squared. I am only using neighboring words, so for the 10-word example I had 9 bigrams.

I'm using this to process content off of my own websites, so it will most likely have < 1000 words.

The 2D array might not work for what I need.

This is what I'm going for:
$arr[0] = "the quick"
$arr[1] = "quick red"
$arr[2] = "red fox"

and so on.

My original string, $myString, contains the content of the text. A temporary string stores this value, lowercased. Newlines and tabs are replaced with spaces, and punctuation and other non-alphanumeric characters (except apostrophes) are removed.

Explode creates the array, using space as a delimiter.
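The clean-up steps above could be sketched as follows. This assumes plain ASCII text; the helper name normalizeToWords and the sample sentence are made up for illustration, and collapsing repeated spaces is an extra step added so that explode does not return empty entries.

```php
<?php
// Hypothetical helper implementing the clean-up steps described in the post.
function normalizeToWords(string $myString): array
{
    $tmp = strtolower($myString);
    // Newlines and tabs become spaces.
    $tmp = str_replace(["\r", "\n", "\t"], ' ', $tmp);
    // Drop everything except letters, digits, apostrophes and spaces.
    $tmp = preg_replace("/[^a-z0-9' ]/", '', $tmp);
    // Collapse repeated spaces so explode() yields no empty entries.
    $tmp = trim(preg_replace('/ +/', ' ', $tmp));

    return $tmp === '' ? [] : explode(' ', $tmp);
}

$words = normalizeToWords("The quick, red fox\njumps over the dog's bowl!");
print_r($words);
```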

Then I sort the array with sort and remove duplicates with array_unique.

I can then loop through the array (using foreach) and call substr_count on each value to get the number of times that word pair is used.
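One caveat with substr_count is that it counts raw substrings rather than whole words, and it does not count overlapping occurrences. The sketch below swaps in array_count_values on the list of pairs instead, which is a substitution and not the method described above, and keeps one substr_count call for comparison. The sample text is contrived to show the difference.

```php
<?php
$text  = 'bathe cat and the cat';
$words = explode(' ', $text);

$pairs = [];
for ($i = 0; $i < count($words) - 1; $i++) {
    $pairs[] = $words[$i] . ' ' . $words[$i + 1];
}

// Count whole pairs, then sort by key for an alphabetical listing.
$counts = array_count_values($pairs);
ksort($counts);

// For comparison: the raw substring count also hits the tail of "bathe cat".
$substringCount = substr_count($text, 'the cat');

foreach ($counts as $pair => $count) {
    echo $pair . "\t" . $count . "\n";
}
```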
 
So you could do something like
Code:
$string = "The quick brown fox jumps over the lazy dog";
$words = explode(' ', $string);
$singleWords = array_unique($words); // copy first: sort() works in place and returns a bool
sort($singleWords);
$wordPairs = array();
// stop one short of the end so $words[$i + 1] always exists
for ($i = 0; $i < count($words) - 1; $i++)
  $wordPairs[] = $words[$i] . " " . $words[$i + 1];
$wordPairs = array_unique($wordPairs);
sort($wordPairs);

Perhaps not the most optimized code, but it should work.
 
nanoWatt said:
The number of bigrams won't be squared. I am only using neighboring words, so for the 10-word example I had 9 bigrams.

What I mean is that

"apple bear cow apple bear apple cow bear apple bear cow apple"

has 3 unique words, 6 = 3(3-1) unique bigrams, and 6 = 3(3-1)(3-2) unique trigrams made of three distinct words. For a large corpus you might have a few million words, 1 million unique words, and hundreds of billions of possible bigrams.
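The two counts being discussed can be checked directly. The sketch below uses a hypothetical ngrams helper to compare the total number of neighboring pairs, which grows linearly with the text, against the number of unique pairs, which is bounded by the square of the vocabulary size.

```php
<?php
// Hypothetical helper: list every run of $n neighbouring words.
function ngrams(array $words, int $n): array
{
    $out = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $out[] = implode(' ', array_slice($words, $i, $n));
    }
    return $out;
}

$words = explode(' ', 'apple bear cow apple bear apple cow bear apple bear cow apple');

$totalBigrams  = count(ngrams($words, 2));               // one fewer than the word count
$uniqueWords   = count(array_unique($words));
$uniqueBigrams = count(array_unique(ngrams($words, 2)));

echo "$totalBigrams bigrams, $uniqueWords unique words, $uniqueBigrams unique bigrams\n";
```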
 