Estimating Vocab Size of IRC Users

  • Context: Graduate
  • Thread starter: mXSCNT

Discussion Overview

The discussion revolves around estimating the vocabulary size of individuals on IRC based on chat logs. Participants explore statistical methods and models to analyze the number of unique words spoken by users, considering biases introduced by varying amounts of speech. The conversation includes theoretical approaches, potential regression models, and statistical distributions relevant to vocabulary estimation.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant proposes a function fp(n) to represent the number of unique vocabulary words spoken by a person after n words, suggesting it should be bounded by a constant vp, which represents the vocabulary size to estimate.
  • Another participant suggests using an upper bound for fp(n) based on a probability threshold, indicating that this could help in plotting error bars for regression analysis.
  • A different perspective introduces the idea of using the Poisson distribution to model the statistics of unique words spoken, questioning if this is a valid approach.
  • One participant discusses the combinatorial aspect of forming strings with unique letters and relates it to the vocabulary estimation problem, proposing that after a certain number of common words, remaining words could be treated as equally probable.
  • There is a suggestion to use maximum likelihood estimation based on the relative frequency of words used in the overall population to refine vocabulary size estimates.
  • Concerns are raised about the potential biases introduced by poor spelling and how it might affect vocabulary estimation.

Areas of Agreement / Disagreement

Participants express various models and approaches to estimate vocabulary size, but there is no consensus on a single method or model. Multiple competing views and uncertainties remain regarding the best statistical methods to apply.

Contextual Notes

Limitations include assumptions about the independence of word usage, the distribution of vocabulary across different users, and the potential biases introduced by varying speech amounts and spelling abilities.

mXSCNT
For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone had said "one two three four five six five one four" there would be six unique vocabulary words in that, and 9 total words.

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.
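As a concrete starting point, fp(n) is easy to compute directly from a token stream. A minimal sketch (the helper name `vocab_growth` and the lowercasing step are my own choices, not from the thread):

```python
def vocab_growth(words):
    """Return f(n) for n = 1..len(words): the number of distinct
    words seen among the first n words of the stream."""
    seen = set()
    growth = []
    for w in words:
        seen.add(w.lower())
        growth.append(len(seen))
    return growth

# The example from the post: six unique words out of nine total.
sample = "one two three four five six five one four".split()
f = vocab_growth(sample)
# f is [1, 2, 3, 4, 5, 6, 6, 6, 6]: the curve flattens as words repeat.
```

The flattening tail of this curve is exactly what the regression would try to extrapolate toward vp.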
 
mXSCNT said:
My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.

fp(n) is a random variable. How about choosing an upper bound where the probability of exceeding it is about one in five billion? That is, on average we expect at most one person on Earth to break this upper bound. We can plot this bound as an error bar, and if it looks like it follows some kind of function we can perform regression on it. First, though, estimate this error bar for each f(n) up to some value N. This will give an estimate of the maximum vocabulary, but not the total vocabulary. One problem I see: no matter how large you make N, who is to say a person will ever use their entire vocabulary?

What do you think the statistics of each f(n) look like? Do they follow a Poisson distribution?
http://en.wikipedia.org/wiki/Poisson_distribution
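One way to probe the Poisson question is a quick Monte Carlo under the simplifying assumption that words are drawn uniformly and independently from a fixed vocabulary of size k. In that model the unique-word count comes out distinctly non-Poisson: its variance is far below its mean, since the count is capped at k and the per-word "already seen" indicators are negatively correlated. A rough simulation sketch (all parameters illustrative):

```python
import random

def simulate_unique(n, k, trials=2000, seed=0):
    """Monte Carlo: draw n words uniformly from a vocabulary of size k
    and record the number of distinct words seen, over many trials."""
    rng = random.Random(seed)
    return [len({rng.randrange(k) for _ in range(n)}) for _ in range(trials)]

counts = simulate_unique(n=200, k=100)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# A Poisson variable would have var ~= mean; here var is an order of
# magnitude smaller than mean, so the Poisson model is a poor fit.
```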
 
Let's solve the following problem first. Suppose that we have strings of n letters over an alphabet of size k. Let U(n,j,k) denote the number of such strings in which exactly j _different_ letters appear. U(3,2,3) = 18, for example, because the strings abb, bab, bba, acc, cac, cca, baa, aba, aab, bcc, cbc, ccb, caa, aca, aac, cbb, bcb, bbc, are all the length-3 strings over a 3-letter alphabet in which exactly 2 different letters appear.

To form such a string with exactly j different letters, first choose which j letters appear in the string, in C(k, j) ways (j > 0 and C means choose). Then assign each position in the string to one of those letters so that every chosen letter is used at least once. The number of onto functions from the n positions to the j letters is j!·S(n, j), where S(n, j) is a Stirling number of the second kind. So,
[tex]U(n,j,k) = {k \choose j} \, j! \, S(n, j)[/tex]
Let V(n,k) be the random variable giving the number of unique letters in a string of length n over an alphabet of size k. If each letter in the string is equally likely (and statistically independent of the other letters), then
[tex]P(V(n,k)=j) = {k \choose j} \frac{j! \, S(n, j)}{k^n}[/tex].
So the expected number of unique letters is
[tex]E(V(n,k)) = \sum_j j {k \choose j} \frac{j! \, S(n,j)}{k^n}[/tex]
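These quantities are easy to check numerically for small cases, e.g. against the U(3,2,3) = 18 enumeration above. A sketch using the standard recurrence for Stirling numbers of the second kind (helper names are mine):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, j):
    """Stirling number of the second kind S(n, j), via the
    recurrence S(n, j) = j*S(n-1, j) + S(n-1, j-1)."""
    if n == 0:
        return 1 if j == 0 else 0
    if j == 0:
        return 0
    return j * stirling2(n - 1, j) + stirling2(n - 1, j - 1)

def U(n, j, k):
    """Length-n strings over a k-letter alphabet using exactly j distinct letters."""
    return comb(k, j) * factorial(j) * stirling2(n, j)

def expected_unique(n, k):
    """E[V(n, k)] for uniform, independent letters."""
    return sum(j * U(n, j, k) for j in range(1, k + 1)) / k**n

# U(3, 2, 3) == 18, matching the enumeration, and summing U over j
# recovers all k**n = 27 strings.
```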

Now, back to the problem of estimating a person's vocabulary. Clearly the probability of a person saying the word "the" and the probability of the same person saying the word "zygomatic" are not the same. But it might be an acceptable simplifying assumption that, beyond the first thousand or so most common words, all the remaining words the person knows are equally likely to be said. (If we assume that the person has a finite vocabulary, we can't give their word probabilities an exponential distribution, or any other common distribution, since those have infinite support.)

Let C be the probability that a spoken word is among the 1000 most common, and let R be the number of words spoken before all 1000 of the most common words have appeared. A word fails to be among the 1000 most common with probability 1-C, so about (1-C)n of the first n words are uncommon. Then, for n > R, we can approximate:
[tex]f_p(n) = V((1-C)n,v_p-1000) + 1000[/tex].
C and R can be estimated empirically, if we have enough data. So all that remains is to find the value of v_p that best fits the observed curve for sufficiently large n, and hope that all the approximations work out.
 
In order to use that, though, I'm going to need some kind of approximation to P(V(n,k)=j). I can use Stirling's approximation for the binomial coefficient, but what can I use to approximate the Stirling number of the second kind?
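For the expectation, at least, one can sidestep Stirling-number asymptotics entirely: writing V(n,k) as a sum of per-letter indicator variables (letter i appears at least once with probability 1 - (1 - 1/k)^n), linearity of expectation gives E[V(n,k)] = k(1 - (1 - 1/k)^n). A small self-contained check against brute-force enumeration (helper names are mine):

```python
from itertools import product

def expected_unique_closed(n, k):
    """E[V(n, k)] by linearity of expectation: each of the k letters
    appears at least once with probability 1 - (1 - 1/k)**n."""
    return k * (1 - (1 - 1 / k) ** n)

def expected_unique_bruteforce(n, k):
    """Exact E[V(n, k)] by enumerating all k**n strings (small cases only)."""
    total = sum(len(set(s)) for s in product(range(k), repeat=n))
    return total / k**n

# The two agree; e.g. for n = 3, k = 3 both give 19/9.
```

The tail probabilities P(V(n,k)=j) would still need Stirling-number asymptotics, but for curve-fitting the mean the closed form may be all that's required.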
 
mXSCNT said:
So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

Do poor spellers have an unfair advantage?

Anyway, how about the following approach using maximum likelihood. Suppose each person uses the words in their vocabulary with the same relative frequency as the overall population, i.e.
[tex]P_p(W_i=w)=\frac{P(W_i=w)}{S_p}[/tex]
where P(Wi=w) is estimated from all words ever used, and Sp (unknown) is the sum of P(Wi=w) over all words in person p's vocabulary. Similarly, let Sp(n) be the sum over the vocabulary observed so far.

Now at each step the probability that fp(k+1)-fp(k)=1 (i.e. a new word is used) is
[tex]1-S_p(k)/S_p[/tex]

From this we get a likelihood function based on the sequence of observed values of fp(k), and maximize it (or its log) to estimate Sp.

Since we don't know which words make up the unobserved mass Sp-Sp(n), take the most common words not observed so far as the most likely candidates, and count how many of them are needed for their summed frequencies to exceed that value. That gives an estimate of vp-fp(n), the number of words in the vocabulary that haven't been used so far.
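The maximum-likelihood step described above can be sketched numerically. This is an illustrative grid search, not code from the thread; the function name, the search scheme, and the toy data are all my own:

```python
import math

def fit_total_mass(mass_before, is_new, grid_steps=400):
    """Grid-search MLE for S_p, the total population-frequency mass of
    a speaker's vocabulary.

    mass_before[i] is S_p(k): the summed population frequency of words
    the speaker has already used before word i; is_new[i] says whether
    word i is a first use. Each new word contributes log(1 - S_p(k)/S)
    to the log-likelihood, each repeat contributes log(S_p(k)/S)."""
    def log_lik(S):
        ll = 0.0
        for m, new in zip(mass_before, is_new):
            p_seen = m / S
            if new:
                if p_seen >= 1.0:
                    return -math.inf
                ll += math.log(1.0 - p_seen)
            else:
                ll += math.log(p_seen)
        return ll

    lo = max(mass_before) + 1e-9  # S must exceed the mass already observed
    grid = [lo * (1.02 ** i) for i in range(grid_steps)]
    return max(grid, key=log_lik)

# Toy stream: three vocabulary words with population masses 0.3, 0.2, 0.1.
mass_before = [0.0, 0.3, 0.3, 0.5, 0.5, 0.6, 0.6]
is_new      = [True, False, True, False, True, False, False]
est = fit_total_mass(mass_before, is_new)
# est lands a little above the observed mass 0.6: some vocabulary
# is inferred to remain unseen.
```

In practice one would replace the grid with a one-dimensional optimizer, but the likelihood shape is the point here.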

Hope this helps. I'd be very interested to see the results :)
 
