Estimating Vocab Size of IRC Users

  • Context: Graduate 
  • Thread starter: mXSCNT
SUMMARY

This discussion focuses on estimating the vocabulary size of IRC users by analyzing their speech patterns through unique vocabulary words and total word counts. The key function, fp(n), represents the number of unique vocabulary words spoken by a user after n words. To accurately estimate the maximum vocabulary size (vp), participants suggest fitting a regression curve to fp(n) and employing statistical methods such as maximum likelihood estimation and Stirling numbers. The conversation also explores the implications of common versus uncommon words in vocabulary estimation.

PREREQUISITES
  • Understanding of statistical regression analysis
  • Familiarity with Stirling numbers of the second kind
  • Knowledge of maximum likelihood estimation techniques
  • Basic concepts of probability distributions, particularly Poisson distribution
NEXT STEPS
  • Research methods for fitting regression curves to empirical data
  • Learn about Stirling numbers and their applications in combinatorial problems
  • Explore maximum likelihood estimation in the context of vocabulary size estimation
  • Investigate the properties of Poisson distribution and its relevance to unique word counts
USEFUL FOR

Data scientists, linguists, and researchers interested in computational linguistics, vocabulary analysis, and statistical modeling of language use.

mXSCNT
For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone said "one two three four five six five one four", that would count as six unique vocabulary words and nine total words.

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the maximum vocabulary. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by person p is w, and to assume that the Wi are independent. Then fp(n) would be the number of unique values among Wi for i = 1, 2, ..., n. Where to go from here? I don't know.
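As a concrete starting point (not part of the original post), here is a minimal sketch of how the empirical curve fp(n) could be computed from one person's extracted lines. The `messages` list and the function name are mine; real logs would need tokenization and nick/URL filtering that is glossed over here.

```python
# A minimal sketch of computing the empirical curve f_p(n): the number of distinct
# words a person has used after their first n words. Assumes `messages` is a list of
# that person's IRC lines, already extracted from the logs.

def unique_word_curve(messages):
    """Return a list f where f[n-1] is the number of distinct words after n words."""
    seen = set()
    curve = []
    for line in messages:
        for word in line.lower().split():
            seen.add(word)
            curve.append(len(seen))
    return curve

# Example with the sample utterance from the post:
curve = unique_word_curve(["one two three four five six five one four"])
print(curve)                   # [1, 2, 3, 4, 5, 6, 6, 6, 6]
print(curve[-1], len(curve))   # 6 unique words, 9 total words
```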
 

fp(n) is a random variable. How about choosing an upper bound where the probability of exceeding it is about one in five billion? That is, on average we would expect at most one person on Earth to exceed it. We can plot this error bar, and if it looks like it follows some kind of function, we can perform a regression on it. First, though, find an estimate of this error bar for each f(n) up to some value N. This will give an estimate of the maximum vocabulary, but not the total vocabulary. One problem I see: no matter how large you make N, who is to say they are going to use their entire vocabulary?

What do you think the statistics are like for each f(n)? Do you think they follow a Poisson distribution?
http://en.wikipedia.org/wiki/Poisson_distribution
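One way to eyeball the statistics of f(n) before committing to a distribution is a quick simulation. The sketch below is my own, not from the thread: it draws word tokens i.i.d. from a Zipf-like distribution over a fixed vocabulary and looks at the spread of f(n) across simulated speakers; the vocabulary size, exponent, and n are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
v, n, trials = 2000, 5000, 500          # vocabulary size, words per speaker, simulated speakers
ranks = np.arange(1, v + 1)
probs = 1.0 / ranks                     # Zipf-like weights over the vocabulary
probs /= probs.sum()

f_n = np.empty(trials)
for t in range(trials):
    words = rng.choice(v, size=n, p=probs)     # n word tokens drawn i.i.d.
    f_n[t] = np.unique(words).size             # f(n): distinct words after n tokens

print("mean of f(n):    ", f_n.mean())
print("variance of f(n):", f_n.var())
# A Poisson variable would have variance equal to its mean; here the variance comes out
# well below the mean, since f(n) is a sum of negatively correlated indicators bounded by v.
```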
 
Let's solve the following problem first. Suppose that we have strings of n letters over an alphabet of size k. Let U(n,j,k) denote the number of such strings in which exactly j _different_ letters appear. For example, U(3,2,3) = 18, because the strings abb, bab, bba, acc, cac, cca, baa, aba, aab, bcc, cbc, ccb, caa, aca, aac, cbb, bcb, bbc are all the ways to form strings of length 3 over an alphabet of 3 letters in which exactly 2 different letters appear.

To form such a string with exactly j different letters, first choose which j letters appear in the string, for C(k, j) ways (j > 0, and C means choose). Then assign a letter to each of the n positions so that all j chosen letters actually appear. This is the number of onto functions from the n positions to the j chosen letters, which is j! times a Stirling number of the second kind, j! S(n, j). So,
$$U(n,j,k) = \binom{k}{j}\, j!\, S(n, j).$$
Let V(n,k) be the random variable giving the number of unique letters in a string of length n over an alphabet of size k. If each letter in the string is equally likely (and statistically independent of the other letters), then
$$P(V(n,k)=j) = \binom{k}{j} \frac{j!\, S(n, j)}{k^n}.$$
So the expected number of unique letters is
$$E(V(n,k)) = \sum_j j \binom{k}{j} \frac{j!\, S(n,j)}{k^n}.$$
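For what it's worth, these formulas can be checked numerically. The sketch below (mine, not from the thread) computes S(n, j) by the standard recurrence, evaluates P(V(n,k)=j) and E(V(n,k)) exactly, and compares the expectation with the known closed form k(1 - (1 - 1/k)^n) for the expected number of distinct values among n uniform independent draws.

```python
from math import comb, factorial
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, j):
    """Stirling number of the second kind S(n, j), via the standard recurrence."""
    if n == j:
        return 1
    if j == 0 or j > n:
        return 0
    return j * stirling2(n - 1, j) + stirling2(n - 1, j - 1)

def p_unique(n, k, j):
    """P(V(n, k) = j): probability of exactly j distinct letters in a uniform random string."""
    return Fraction(comb(k, j) * factorial(j) * stirling2(n, j), k ** n)

def expected_unique(n, k):
    """E(V(n, k)) computed from the distribution above."""
    return sum(j * p_unique(n, k, j) for j in range(1, min(n, k) + 1))

# Sanity checks against the U(3, 2, 3) = 18 example and the closed form:
assert comb(3, 2) * factorial(2) * stirling2(3, 2) == 18
assert sum(p_unique(3, 3, j) for j in range(1, 4)) == 1
print(expected_unique(3, 3))                      # 19/9
print(Fraction(3) * (1 - Fraction(2, 3) ** 3))    # 19/9, the closed form k*(1-(1-1/k)**n)
```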

Now, back to the problem of estimating a person's vocabulary. It's clear that the probability of a person saying the word "the" and the probability of the same person saying the word "zygomatic" are not the same. But it might be an acceptable simplifying assumption that, beyond the first thousand or so most common words, all the remaining words the person knows have equal probability of being said. (If we assume that the person has a finite vocabulary, we can't give their probability of saying a word an exponential distribution or any other common distribution, since those have infinite support.) Let C be the probability that a spoken word is among the 1000 most common, and let R be the number of words spoken before the 1000 most common words have appeared. The probability that a word is not among the 1000 most common is 1-C, so about (1-C)n of the first n words are uncommon. So then, for n > R, we can approximate
$$f_p(n) \approx V\big((1-C)n,\; v_p - 1000\big) + 1000.$$
C and R can be empirically calculated, if we have enough data. So all that remains is to find the value of v_p that best fits the function for sufficiently large n, and hope that all the approximations work out.
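One hedged way to do that fit (my own sketch, not the poster's code): approximate V by its expectation and use the closed form E[V(m, k)] = k(1 - (1 - 1/k)^m) checked above, with m = (1 - C)n and k = v_p - 1000 as in the post, then least-squares fit v_p. The data points, the value of C, and the starting guess below are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(n, v, C=0.8, common=1000.0):
    """Expected f_p(n) under the approximation above, with E[V(m, k)] = k*(1-(1-1/k)**m)."""
    k = v - common
    m = (1.0 - C) * n
    return k * (1.0 - (1.0 - 1.0 / k) ** m) + common

# Hypothetical data: simulated from the model with v = 9000 plus a little noise,
# standing in for an empirical f_p(n) sampled at a few values of n.
rng = np.random.default_rng(1)
n_values = np.array([1000, 5000, 10000, 20000, 50000, 100000], dtype=float)
f_observed = model(n_values, 9000.0) + rng.normal(0.0, 20.0, size=n_values.size)

# Fit only v; C and common stay at their assumed defaults.
(v_hat,), _ = curve_fit(model, n_values, f_observed, p0=[5000.0],
                        bounds=(1001.0, 1e7))
print("estimated vocabulary size:", v_hat)   # should land near 9000
```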
 
In order to use that, though, I'm going to need some kind of approximation to P(V(n,k)=j). I can use Stirling's approximation for the choice function but what can I use to approximate the Stirling number of the second kind?
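One option (my suggestion, not an answer from the thread): rather than approximating S(n, j), use the exact inclusion-exclusion formula S(n, j) = (1/j!) Σ_{i=0}^{j} (-1)^i C(j, i) (j - i)^n, whose i = 0 term also gives the crude large-n approximation S(n, j) ≈ j^n / j! for fixed j.

```python
# Stirling numbers of the second kind via the explicit inclusion-exclusion formula,
# which is often easier to evaluate numerically than the recurrence.
from math import comb, factorial

def stirling2_explicit(n, j):
    """S(n, j) = (1/j!) * sum_{i=0}^{j} (-1)^i * C(j, i) * (j - i)^n."""
    return sum((-1) ** i * comb(j, i) * (j - i) ** n for i in range(j + 1)) // factorial(j)

print(stirling2_explicit(3, 2))    # 3
print(stirling2_explicit(10, 4))   # 34105
print(4 ** 10 / factorial(4))      # ~43690.7, the crude j^n / j! approximation
```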
 

Do poor spellers have an unfair advantage?

Anyway, how about the following approach using maximum likelihood. Suppose each person uses the words in their vocabulary with the same relative frequency as the overall population, i.e.
$$P_p(W_i=w) = \frac{P(W_i=w)}{S_p},$$
where P(Wi=w) is estimated from all words ever used, and Sp (unknown) is the sum of P(Wi=w) over all words in person p's vocabulary. Similarly, let Sp(n) be the sum over the vocabulary observed after n words.

Now at each step, given the words used so far, the probability that fp(k+1) - fp(k) = 1 (i.e. a new word is used) is
$$1 - \frac{S_p(k)}{S_p}.$$

From this we get a likelihood function based on the sequence of observed values of fp(k), and maximize it (or its log) to estimate Sp.

Since we don't know which words account for the missing mass Sp - Sp(n) among those not yet used, take the most common words not observed so far as the most likely candidates, and count how many of them are needed before their probabilities sum past Sp - Sp(n). That count gives an estimate of (vp - fp(n)), the number of words in the vocabulary that haven't been used so far.

Hope this helps. I'd be very interested to see the results :)
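A minimal sketch of that maximum-likelihood step, as I read it (my own code, not from the thread): assume we already have, for one person, a boolean array marking whether each successive word was new, and the running sum Sp(k) of population frequencies over the distinct words used so far; then maximize the log-likelihood over Sp. All variable names and the toy numbers are made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(s_p, is_new, s_seen):
    """Negative log-likelihood of the new-word / repeat sequence for a candidate S_p."""
    ratio = np.clip(s_seen / s_p, 1e-12, 1.0 - 1e-12)        # S_p(k) / S_p at each step
    ll = np.where(is_new, np.log(1.0 - ratio), np.log(ratio))
    return -ll.sum()

def estimate_s_p(is_new, s_seen):
    lower = s_seen.max()          # S_p is at least the probability mass already observed
    res = minimize_scalar(neg_log_likelihood, bounds=(lower, 1.0),
                          args=(is_new, s_seen), method="bounded")
    return res.x

# Toy inputs (arbitrary numbers, for illustration only): is_new[k] marks whether word k+1
# was new, s_seen[k] is the summed population frequency of the words used in the first k.
is_new = np.array([True, True, False, True, False, False, True, False])
s_seen = np.array([0.00, 0.05, 0.09, 0.09, 0.12, 0.12, 0.12, 0.15])
print("estimated S_p:", estimate_s_p(is_new, s_seen))
```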
 
