Estimating Vocab Size of IRC Users

  • Context: Graduate
  • Thread starter: mXSCNT

Discussion Overview

The discussion revolves around estimating the vocabulary size of individuals on IRC based on chat logs. Participants explore statistical methods and models to analyze the number of unique words spoken by users, considering biases introduced by varying amounts of speech. The conversation includes theoretical approaches, potential regression models, and statistical distributions relevant to vocabulary estimation.

Discussion Character

  • Exploratory
  • Technical explanation
  • Mathematical reasoning
  • Debate/contested

Main Points Raised

  • One participant proposes a function fp(n) to represent the number of unique vocabulary words spoken by a person after n words, suggesting it should be bounded by a constant vp, which represents the vocabulary size to estimate.
  • Another participant suggests using an upper bound for fp(n) based on a probability threshold, indicating that this could help in plotting error bars for regression analysis.
  • A different perspective introduces the idea of using the Poisson distribution to model the statistics of unique words spoken, questioning if this is a valid approach.
  • One participant discusses the combinatorial aspect of forming strings with unique letters and relates it to the vocabulary estimation problem, proposing that after a certain number of common words, remaining words could be treated as equally probable.
  • There is a suggestion to use maximum likelihood estimation based on the relative frequency of words used in the overall population to refine vocabulary size estimates.
  • Concerns are raised about the potential biases introduced by poor spelling and how it might affect vocabulary estimation.

Areas of Agreement / Disagreement

Participants express various models and approaches to estimate vocabulary size, but there is no consensus on a single method or model. Multiple competing views and uncertainties remain regarding the best statistical methods to apply.

Contextual Notes

Limitations include assumptions about the independence of word usage, the distribution of vocabulary across different users, and the potential biases introduced by varying speech amounts and spelling abilities.

mXSCNT
For fun, I'm trying a project where I estimate the total vocabularies of people on IRC. I have logs from the past few months, and I can easily produce the number of unique vocabulary words spoken by any given person, as well as the total number of words they have spoken. For example, if someone had said "one two three four five six five one four" there would be six unique vocabulary words in that, and 9 total words.

However, people speak different amounts. Simply recording the number of vocabulary words a person used would bias heavily in favor of people who talked more. Also, dividing the number of vocab words by the total number of words would bias heavily in favor of people who talked less.

So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.
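As a concrete starting point, fp(n) is easy to compute directly from a token stream. A minimal sketch (the helper name `vocab_growth` and the lowercasing step are my own choices, not from the thread):

```python
def vocab_growth(words):
    """Return f(n) for n = 1..len(words): the number of distinct
    words seen among the first n words of the stream."""
    seen = set()
    growth = []
    for w in words:
        seen.add(w.lower())
        growth.append(len(seen))
    return growth

# The example from the post: six unique words out of nine total.
sample = "one two three four five six five one four".split()
f = vocab_growth(sample)
# f is [1, 2, 3, 4, 5, 6, 6, 6, 6]: the curve flattens as words repeat.
```

The flattening tail of this curve is exactly what the regression would try to extrapolate toward vp.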
 
mXSCNT said:
My problem is now to fit a regression curve to fp(n) for each person, in order to estimate the max vocab. Any ideas about the form of this curve? One possibility is to start with the probability Pp(Wi=w) that the i'th word spoken by a person p is w. And one could assume that the Wi are independent. Then fp(n) could be estimated by the number of unique values of Wi, for i = 1,2,...,n. Where to go from here? I don't know.

fp(n) is a random variable. How about choosing an upper bound where the probability of exceeding it is about one in five billion? That is, on average we expect at most one person on Earth to break this upper bound. We can plot this bound as an error bar, and if it looks like it follows some kind of function we can perform regression on it. First, though, estimate this error bar for each f(n) up to some value N. This will give an estimate of the maximum vocabulary, but not the total vocabulary. One problem I see: no matter how large you make N, who is to say a person will ever use their entire vocabulary?

What do you think the statistics of each f(n) look like? Do they follow a Poisson distribution?
http://en.wikipedia.org/wiki/Poisson_distribution
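One way to probe the Poisson question is a quick Monte Carlo under the simplifying assumption that words are drawn uniformly and independently from a fixed vocabulary of size k. In that model the unique-word count comes out distinctly non-Poisson: its variance is far below its mean, since the count is capped at k and the per-word "already seen" indicators are negatively correlated. A rough simulation sketch (all parameters illustrative):

```python
import random

def simulate_unique(n, k, trials=2000, seed=0):
    """Monte Carlo: draw n words uniformly from a vocabulary of size k
    and record the number of distinct words seen, over many trials."""
    rng = random.Random(seed)
    return [len({rng.randrange(k) for _ in range(n)}) for _ in range(trials)]

counts = simulate_unique(n=200, k=100)
mean = sum(counts) / len(counts)
var = sum((c - mean) ** 2 for c in counts) / len(counts)
# A Poisson variable would have var ~= mean; here var is an order of
# magnitude smaller than mean, so the Poisson model is a poor fit.
```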
 
Let's solve the following problem first. Suppose that we have strings of n letters over an alphabet of size k. Let U(n,j,k) denote the number of such strings in which exactly j _different_ letters appear. U(3,2,3) = 18, for example, because the strings abb, bab, bba, acc, cac, cca, baa, aba, aab, bcc, cbc, ccb, caa, aca, aac, cbb, bcb, bbc, are all the length-3 strings over a 3-letter alphabet in which exactly 2 different letters appear.

To form such a string with exactly j different letters, first choose which j letters appear in the string, in C(k, j) ways (j > 0 and C means choose). Then assign each position in the string to one of those letters so that every chosen letter is used at least once. The number of onto functions from the n positions to the j letters is j!·S(n, j), where S(n, j) is a Stirling number of the second kind. So,
[tex]U(n,j,k) = {k \choose j} \, j! \, S(n, j)[/tex]
Let V(n,k) be the random variable giving the number of unique letters in a string of length n over an alphabet of size k. If each letter in the string is equally likely (and statistically independent of the other letters), then
[tex]P(V(n,k)=j) = {k \choose j} \frac{j! \, S(n, j)}{k^n}[/tex].
So the expected number of unique letters is
[tex]E(V(n,k)) = \sum_j j {k \choose j} \frac{j! \, S(n,j)}{k^n}[/tex]
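These quantities are easy to check numerically for small cases, e.g. against the U(3,2,3) = 18 enumeration above. A sketch using the standard recurrence for Stirling numbers of the second kind (helper names are mine):

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, j):
    """Stirling number of the second kind S(n, j), via the
    recurrence S(n, j) = j*S(n-1, j) + S(n-1, j-1)."""
    if n == 0:
        return 1 if j == 0 else 0
    if j == 0:
        return 0
    return j * stirling2(n - 1, j) + stirling2(n - 1, j - 1)

def U(n, j, k):
    """Length-n strings over a k-letter alphabet using exactly j distinct letters."""
    return comb(k, j) * factorial(j) * stirling2(n, j)

def expected_unique(n, k):
    """E[V(n, k)] for uniform, independent letters."""
    return sum(j * U(n, j, k) for j in range(1, k + 1)) / k**n

# U(3, 2, 3) == 18, matching the enumeration, and summing U over j
# recovers all k**n = 27 strings.
```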

Now, back to the problem of estimating a person's vocabulary. Clearly the probability of a person saying the word "the" and the probability of the same person saying the word "zygomatic" are not the same. But it might be an acceptable simplifying assumption that, beyond the first thousand or so most common words, all the remaining words the person knows are equally likely to be said. (If we assume that the person has a finite vocabulary, we can't give their word probabilities an exponential distribution, or any other common distribution, since those have infinite support.)

Let C be the probability that a spoken word is among the 1000 most common, and let R be the number of words spoken before all 1000 of the most common words have appeared. A word fails to be among the 1000 most common with probability 1-C, so about (1-C)n of the first n words are uncommon. Then, for n > R, we can approximate:
[tex]f_p(n) = V((1-C)n,v_p-1000) + 1000[/tex].
C and R can be estimated empirically, if we have enough data. So all that remains is to find the value of v_p that best fits the observed curve for sufficiently large n, and hope that all the approximations work out.
 
In order to use that, though, I'm going to need some kind of approximation to P(V(n,k)=j). I can use Stirling's approximation for the binomial coefficient, but what can I use to approximate the Stirling number of the second kind?
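For the expectation, at least, one can sidestep Stirling-number asymptotics entirely: writing V(n,k) as a sum of per-letter indicator variables (letter i appears at least once with probability 1 - (1 - 1/k)^n), linearity of expectation gives E[V(n,k)] = k(1 - (1 - 1/k)^n). A small self-contained check against brute-force enumeration (helper names are mine):

```python
from itertools import product

def expected_unique_closed(n, k):
    """E[V(n, k)] by linearity of expectation: each of the k letters
    appears at least once with probability 1 - (1 - 1/k)**n."""
    return k * (1 - (1 - 1 / k) ** n)

def expected_unique_bruteforce(n, k):
    """Exact E[V(n, k)] by enumerating all k**n strings (small cases only)."""
    total = sum(len(set(s)) for s in product(range(k), repeat=n))
    return total / k**n

# The two agree; e.g. for n = 3, k = 3 both give 19/9.
```

The tail probabilities P(V(n,k)=j) would still need Stirling-number asymptotics, but for curve-fitting the mean the closed form may be all that's required.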
 
mXSCNT said:
So for each person p, I have a function fp(n), which is the number of unique vocabulary words a person has spoken by the time they reach n words. Assuming that people have a limited vocabulary, fp(n) should be bounded from above by a constant vp for each person; vp is the vocabulary size that I wish to estimate.

Do poor spellers have an unfair advantage?

Anyway, how about the following approach using maximum likelihood. Suppose each person uses the words in their vocabulary with the same relative frequency as the overall population, i.e.
[tex]P_p(W_i=w)=\frac{P(W_i=w)}{S_p}[/tex]
where P(Wi=w) is estimated from all words ever used, and Sp (unknown) is the sum of P(Wi=w) over all words in person p's vocabulary. Similarly, let Sp(n) be the sum over the vocabulary observed so far.

Now at each step the probability that fp(k+1)-fp(k)=1 (i.e. a new word is used) is
[tex]1-S_p(k)/S_p[/tex]

From this we get a likelihood function based on the sequence of observed values of fp(k), and maximize it (or its log) to estimate Sp.

Since we don't know which words make up the unobserved mass Sp-Sp(n), take the most common words not observed so far as the most likely candidates, and count how many of them are needed for their summed frequencies to exceed that value. That gives an estimate of vp-fp(n), the number of words in the vocabulary that haven't been used so far.
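The maximum-likelihood step described above can be sketched numerically. This is an illustrative grid search, not code from the thread; the function name, the search scheme, and the toy data are all my own:

```python
import math

def fit_total_mass(mass_before, is_new, grid_steps=400):
    """Grid-search MLE for S_p, the total population-frequency mass of
    a speaker's vocabulary.

    mass_before[i] is S_p(k): the summed population frequency of words
    the speaker has already used before word i; is_new[i] says whether
    word i is a first use. Each new word contributes log(1 - S_p(k)/S)
    to the log-likelihood, each repeat contributes log(S_p(k)/S)."""
    def log_lik(S):
        ll = 0.0
        for m, new in zip(mass_before, is_new):
            p_seen = m / S
            if new:
                if p_seen >= 1.0:
                    return -math.inf
                ll += math.log(1.0 - p_seen)
            else:
                ll += math.log(p_seen)
        return ll

    lo = max(mass_before) + 1e-9  # S must exceed the mass already observed
    grid = [lo * (1.02 ** i) for i in range(grid_steps)]
    return max(grid, key=log_lik)

# Toy stream: three vocabulary words with population masses 0.3, 0.2, 0.1.
mass_before = [0.0, 0.3, 0.3, 0.5, 0.5, 0.6, 0.6]
is_new      = [True, False, True, False, True, False, False]
est = fit_total_mass(mass_before, is_new)
# est lands a little above the observed mass 0.6: some vocabulary
# is inferred to remain unseen.
```

In practice one would replace the grid with a one-dimensional optimizer, but the likelihood shape is the point here.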

Hope this helps. I'd be very interested to see the results :)
 
