# Estimating vocabulary

1. Feb 19, 2009

### mXSCNT

For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone had said "one two three four five six five one four" there would be six unique vocabulary words in that, and 9 total words.

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.

2. Feb 19, 2009

### John Creighto

fp(n) is a random variable. How about choosing an upper band where the probability of exceeding is about one in five billion. That is on average we expect at most once person on earth to break this upper bound. Now we can plot this error bar and if it looks like it follows some kind of function then we can preform regression on it. First though find an estimate of this error bar for each f(n) up to some value N. This will give estimates on the maximum vocabulary but not the total vocabulary. One problem I see is no matter how large you make N who is to say they are going to use their entire vocabulary?

What do you think the statistics are like for each f(n) Do you they follow a poisson distribution?
http://en.wikipedia.org/wiki/Poisson_distribution

Last edited: Feb 19, 2009
3. Feb 19, 2009

### mXSCNT

Let's solve the following problem first. Suppose that we have strings of n letters over an alphabet of size k. Let U(n,j,k) denote the number of such strings where exactly j _different_ letters appear. U(3,2,3) = 18, for example, because the strings abb, bab, bba, acc, cac, cca, baa, aba, aab, bcc, cbc, ccb, caa, aca, aac, cbb, bcb, bbc, are all the ways to form strings of length 3 where 2 letters are unique over an alphabet of 3 letters.

To form such a string with j letters different, first one can choose which j letters appear in the string, for C(k, j) ways (j > 0 and C means choose). Then one can choose which position in the string gets mapped to which letter. This is the number of onto functions from the n positions in the string to the j different letters, which is a Stirling number of the second kind, S(n, j). So,
$$U(n,j,k) = {k \choose j} S(n, j)$$
Let V(n,k) be the random variable giving the number of unique letters in a string of length n over an alphabet of size k. If each letter in the string is equally likely (and statistically independent of the other letters), then
$$P(V(n,k)=j) = {k \choose j} \frac{S(n, j)}{k^n}$$.
So the expected number of unique letters is
$$E(V(n,k)) = \sum_j j {k \choose j} \frac{S(n,j)}{k^n}$$

Now, back to the problem of estimating a person's vocabulary. It's clear that the probabilities of a person speaking the word "the" or the same person speaking the word "zygomatic" are not the same. But it might be an acceptable simplifying assumption to suppose that, after the first thousand or so most common words, all the remaining words the person knows have equal probability of being said. (If we assume that the person has a finite vocabulary, we can't give their probability of saying a word an exponential distribution or any other common distribution since that has infinite support). Let C be the probability that a word spoken is in the 1000 most common, and let R be the number of words spoken before the 1000 most common appear. The probability that a word is not in the 1000 most common is 1-C, so about (1-C)n words are uncommon in the first n. So then, for n > R we can approximate:
$$f_p(n) = V((1-C)n,v_p-1000) + 1000$$.
C and R can be empirically calculated, if we have enough data. So all that remains is to find the value of v_p that best fits the function for sufficiently large n, and hope that all the approximations work out.

4. Feb 19, 2009

### mXSCNT

In order to use that, though, I'm going to need some kind of approximation to P(V(n,k)=j). I can use Stirling's approximation for the choice function but what can I use to approximate the Stirling number of the second kind?

5. Feb 27, 2009

### bpet

Do poor spellers have an unfair advantage?

Anyway, how about the following approach using maximum likelihood. Suppose each person uses the words in their vocabulary with the same relative frequency as the overall population, i.e.
$$P_p(W_i=w)=\frac{P(W_i=w)}{S_p}$$
where P(Wi=w) is the estimated from all words ever used and Sp (unknown) is the sum of P(Wi=w) over all words in the vocabulary. Similarly let Sp(n) be the sum over the vocabulary observed so far.

Now at each step the probability that fp(k+1)-fp(k)=1 (i.e. a new word is used) is
$$1-S_p(k)/S_p$$

From this we get a likelihood function based on the sequence of observed values of fp(k), and maximize it (or its log) to estimate Sp.

Since we don't know which words are represented by Sp-Sp(n) out of those not already used, take the most common words not observed so far as the most likely candidates and the count the number of terms required to exceed the value. That gives an estimate of (vp-fp(n)), the number of words in the vocabulary that haven't been used so far.

Hope this helps. I'd be very interested to see the results :)